In a previous blog post, I used a Bradley-Terry model to analyze Oscar Best Picture winners, based on the best picture dataset. In that post I presented the results of likelihood tests which showed 'significant' relationships between winning the Best Picture category and co-nomination for other awards, MovieLens ratings, and (spuriously) the number of IMDb votes. Effect sizes and \(t\) statistics from a Bradley-Terry model can be hard to interpret, so here I will instead estimate the probability of correctly guessing the Best Picture winner using this model.

There is no apparent direct translation from the coefficients of the model fit to the probability of correctly forecasting a winner; nor can you get there by transforming the maximized likelihood or an R-squared. Moreover, that probability depends on the number of nominees (traditionally there were only 5 Best Picture nominations; these days it is upwards of 9), and on how the nominees differ in the independent variables. Here I will keep it simple and use cross validation.
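To see why the number of nominees matters, note that in a Bradley-Terry setup of this kind the within-year win probabilities are, presumably, a softmax of the films' linear scores, so the same score advantage yields a lower top probability on a larger slate. A minimal illustration (the softmax function here is my own sketch, not a call into oslm):

# win probabilities for one year's slate of nominees, given linear scores eta
softmax <- function(eta) exp(eta - max(eta)) / sum(exp(eta - max(eta)))

# suppose the favorite sits one unit of score above an otherwise even field:
softmax(c(1, rep(0, 4)))[1]  # 5 nominees: the favorite wins ~40% of the time
softmax(c(1, rep(0, 8)))[1]  # 9 nominees: the same edge gives only ~25%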

I modified the oslm code to include a predict method. Below, I load the data and the code, remove duplicates, and restrict the data to the period after 1945. I then construct the model formula, based on co-nomination, and test it in three ways:

  • A purely 'in sample' validation, where all the data are used to build the model, which is then tested on those same data. (The film with the highest forecast probability of winning in each year is chosen as the predicted winner, of course.) This should give the most optimistic view of performance, even though the likelihood maximization problem does not directly select for this metric.
  • A walk-forward cross validation where the data up through year \(y-1\) are used to build the model, then it is used to forecast the winners in year \(y\). This is perhaps the most honest kind of cross validation for time series, as it uses no future information to build models. However, during the early years of the time series the models are built with very little data and may not be representative of how the model will perform when built with the entire series now at hand. As such, it will be more pessimistic than the in-sample test.
  • A leave one (year) out cross validation, where all years except year \(y\) are used to build a model which is then used to forecast the winners in year \(y\), for each value of \(y\). The estimated performance from this test should be between walk-forward and in-sample in terms of optimism.
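
In terms of data, the three schemes differ only in which years are used to fit the model before forecasting a given test year. Schematically (using the bpdf data frame defined below):

# training sets for a single test year `ayear` under each scheme
train_is  <- bpdf                           # in-sample: all years, test year included
train_wf  <- filter(bpdf, year <  ayear)    # walk-forward: strictly earlier years only
train_loo <- filter(bpdf, year != ayear)    # leave-one-year-out: all years but the test year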

The code is fairly simple:

library(readr)
library(dplyr)
library(knitr)   # for kable, used below
bpdf <- readr::read_csv('../data/best_picture_2.csv') %>%
    distinct(year,id,.keep_all=TRUE) %>%
    mutate(across(matches('^nominated_for_'),as.numeric)) %>%
    filter(year > 1945)
## Rows: 595 Columns: 55
## -- Column specification ----------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (5): film, category, etc, imdb_index, title
## dbl (36): year, id, movie_id, ttid, production_year, votes, vote_mean, vote_...
## lgl (14): nominated_for_Writing, nominated_for_BestOriginalScore, nominated_...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
source('../code/oslm.R')

# the formula we settled on, using co-nominations.
fmla <- winner:year ~ nominated_for_Writing + nominated_for_BestDirector + 
    nominated_for_BestActress + nominated_for_BestActor + 
    nominated_for_BestFilmEditing

test_yrs <- 1960:max(bpdf$year)

# fuzz up the probabilities to break ties, then mark each year's predicted winner
mark_winner <- function(prdf) {
    prdf %>%
        mutate(fuzzprob=prob+rnorm(length(prob),mean=0,sd=1e-9)) %>%
        group_by(year) %>%
            mutate(guess_winner=(fuzzprob==max(fuzzprob))) %>%
        ungroup() 
}

##### simulations

# in-sample
amod <- oslm(fmla,bpdf)
prd <- predict(amod,bpdf,has_y=TRUE)
iscv <- prd %>%
    filter(year %in% test_yrs) %>%
    mark_winner() %>%
    mutate(cv_pragma='is')

# walk forward cv
wfcv <- lapply(test_yrs,function(ayear) {
    amod <- oslm(fmla,bpdf %>% filter(year < ayear))
    predict(amod,bpdf %>% filter(year==ayear),has_y=TRUE)
}) %>%
    bind_rows() %>%
    mark_winner() %>%
    mutate(cv_pragma='wf')

# leave one year out
loocv <- lapply(test_yrs,function(ayear) {
    amod <- oslm(fmla,bpdf %>% filter(year != ayear))
    predict(amod,bpdf %>% filter(year==ayear),has_y=TRUE)
}) %>%
    bind_rows() %>%
    mark_winner() %>%
    mutate(cv_pragma='loo')

allcv <- rbind(iscv,wfcv,loocv)
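
For context, guessing uniformly at random picks the winner with probability 1 over the number of nominees in a given year, which comes to a bit under 20% over this period since most years have exactly 5 nominees. A quick way to compute it:

# baseline: mean probability of a uniformly random guess over the test years
bpdf %>%
    filter(year %in% test_yrs) %>%
    count(year) %>%
    summarize(baseline_prob=mean(1/n))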

Now let us look at the results. First the mean win probabilities for the three test pragmata:

allcv %>% 
    filter(guess_winner) %>% 
    group_by(cv_pragma) %>%
        summarize(mean_win_prob=mean(winner)) %>%
    ungroup() %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of forecasting winner, by simulation type',format='html')
Estimated probability of forecasting winner, by simulation type

cv_pragma   mean_win_prob
---------   -------------
is                   0.51
loo                  0.42
wf                   0.45

The simulated probabilities are in the 40-50% range, comfortably above the random-guessing baseline of around 20%. I had suspected they would be somewhat lower in the modern period, when more films were nominated, so I considered computing the mean win probability as a function of the number of nominees. Unfortunately there is very little data for that modern period (only 6 years in my dataset have more than 5 nominees). So instead I tabulated by decade (a plot might have been better), and see no obvious strong effect:

# to do it by # nominees, try this:
#allcv %>% 
#   group_by(year,cv_pragma) %>%
#       mutate(n_nominees=n()) %>%
#   ungroup() %>%
#   filter(guess_winner) %>% 
#   group_by(cv_pragma,n_nominees) %>%
#       summarize(n_year=n(),mean_win_prob=mean(winner)) %>%
#   ungroup() %>%
#   kable()

# by decade instead:
allcv %>% 
    mutate(decade=paste0((year %/% 10) * 10,"'s")) %>%
    filter(guess_winner) %>% 
    group_by(cv_pragma,decade) %>%
        summarize(n_year=n(),mean_win_prob=mean(winner)) %>%
    ungroup() %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of forecasting winner, by simulation type and decade',format='html')
## `summarise()` has grouped output by 'cv_pragma'. You can override using the `.groups` argument.
Estimated probability of forecasting winner, by simulation type and decade

cv_pragma   decade   n_year   mean_win_prob
---------   ------   ------   -------------
is          1960's       10             0.5
is          1970's       10             0.6
is          1980's       10             0.3
is          1990's       10             0.6
is          2000's       10             0.6
is          2010's        5             0.4
loo         1960's       10             0.5
loo         1970's       10             0.4
loo         1980's       10             0.4
loo         1990's       10             0.4
loo         2000's       10             0.5
loo         2010's        5             0.2
wf          1960's       10             0.7
wf          1970's       10             0.6
wf          1980's       10             0.3
wf          1990's       10             0.4
wf          2000's       10             0.5
wf          2010's        5             0.0

La La La, I can't hear you

My dataset did not have the 2016 award winners nor the 2017 nominees. Since the data requirements for the co-nomination model are simple enough to gather from the AMPAS webpage, I put them together here and run them through the fitted model to get estimated probabilities of winning Best Picture. It looks like 'La La Land' is the clear favorite under this model, with around a 55% chance of winning, followed by 'Hacksaw Ridge' and 'Manchester by the Sea' at around 10-15%. A quick scan of the betting markets confirms that 'La La Land' is the frontrunner, so no real surprises here. However, the 100 to 1 odds quoted for 'Hacksaw Ridge' are perhaps not warranted: the model's 15% estimate corresponds to fair odds of a bit under 6 to 1 against.

library(tibble)
new_data <- tibble::tribble(
        ~film, ~nominated_for_Writing,  ~nominated_for_BestDirector, ~nominated_for_BestActress, ~nominated_for_BestActor, ~nominated_for_BestFilmEditing,
        "Arrival", FALSE, TRUE, FALSE, FALSE, TRUE,
        "Fences",FALSE, FALSE, FALSE, TRUE, FALSE,
        "Hacksaw Ridge",FALSE, TRUE, FALSE, TRUE, TRUE,
        "Hell or High Water", TRUE,  FALSE, FALSE, FALSE, TRUE,
        "Hidden Figures",FALSE, FALSE, FALSE, FALSE, FALSE,
        "La La Land", TRUE, TRUE, TRUE, TRUE, TRUE,
        "Lion",FALSE, FALSE, FALSE, FALSE, FALSE,
        "Manchester by the Sea", TRUE, TRUE, FALSE, TRUE, FALSE,
        "Moonlight",FALSE, TRUE, FALSE, FALSE, TRUE
        ) %>%
    mutate(year=2017) %>%
    mutate(across(matches('^nominated_for_'),as.numeric))

amod <- oslm(fmla,bpdf)
newprd <- predict(amod,new_data,has_y=FALSE) %>% cbind(new_data %>% select(-year))

newprd %>% 
    select(year,film,prob) %>%
    arrange(desc(prob)) %>%
    rename(win_prob=prob) %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of winning 2017 Best Picture',format='html')
Estimated probability of winning 2017 Best Picture

year   film                    win_prob
----   ---------------------   --------
2017   La La Land                  0.56
2017   Hacksaw Ridge               0.15
2017   Manchester by the Sea       0.10
2017   Arrival                     0.08
2017   Moonlight                   0.08
2017   Hell or High Water          0.03
2017   Fences                      0.00
2017   Hidden Figures              0.00
2017   Lion                        0.00