In a previous blog post, I used a Bradley-Terry model to analyze Oscar Best Picture winners, based on the best picture dataset. In that post I presented the results of likelihood tests which showed 'significant' relationships between winning the Best Picture category and co-nomination for other awards, MovieLens ratings, and (spuriously) the number of IMDb votes. Effect sizes and \(t\) statistics from a Bradley-Terry model can be hard to interpret, so here I will instead estimate the probability of correctly guessing the Best Picture winner using this model.

There is no apparent direct translation from the coefficients of the model fit to the probability of correctly forecasting a winner; nor can you get there by transforming the maximized likelihood or an R-squared. Moreover, that probability depends on the number of nominees (traditionally there were only 5 Best Picture nominations; these days it is upwards of 9), and on how the nominees differ in the independent variables. Here I will keep it simple and use cross validation.
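To see why the number of nominees matters, note that in a Bradley-Terry setup of this kind the within-year win probabilities are, presumably, a softmax of the films' linear scores, so the same score advantage yields a lower top probability on a larger slate. A minimal illustration (the softmax function here is my own sketch, not a call into oslm):

# win probabilities for one year's slate of nominees, given linear scores eta
softmax <- function(eta) exp(eta - max(eta)) / sum(exp(eta - max(eta)))

# suppose the favorite sits one unit of score above an otherwise even field:
softmax(c(1, rep(0, 4)))[1]  # 5 nominees: the favorite wins ~40% of the time
softmax(c(1, rep(0, 8)))[1]  # 9 nominees: the same edge gives only ~25%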

I modified the oslm code to include a predict method. Below, I load the data and the code, remove duplicates, and restrict the data to the period after 1945. I then construct the model formula, based on co-nomination, and test it in three ways:

  • A purely 'in sample' validation, where all the data are used to build the model, which is then tested on those same data. (The film with the highest forecast probability of winning in each year is chosen as the predicted winner, of course.) This should give the most optimistic view of performance, even though the likelihood maximization problem does not directly select for this metric.
  • A walk-forward cross validation where the data up through year \(y-1\) are used to build the model, then it is used to forecast the winners in year \(y\). This is perhaps the most honest kind of cross validation for time series, as it uses no future information to build models. However, during the early years of the time series the models are built with very little data and may not be representative of how the model will perform when built with the entire series now at hand. As such, it will be more pessimistic than the in-sample test.
  • A leave one (year) out cross validation, where all years except year \(y\) are used to build a model which is then used to forecast the winners in year \(y\), for each value of \(y\). The estimated performance from this test should be between walk-forward and in-sample in terms of optimism.
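
In terms of data, the three schemes differ only in which years are used to fit the model before forecasting a given test year. Schematically (using the bpdf data frame defined below):

# training sets for a single test year `ayear` under each scheme
train_is  <- bpdf                           # in-sample: all years, test year included
train_wf  <- filter(bpdf, year <  ayear)    # walk-forward: strictly earlier years only
train_loo <- filter(bpdf, year != ayear)    # leave-one-year-out: all years but the test year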

The code is fairly simple:

library(readr)
library(dplyr)
library(knitr)   # for kable, used below
bpdf <- readr::read_csv('../data/best_picture_2.csv') %>%
    distinct(year,id,.keep_all=TRUE) %>%
    mutate(across(matches('^nominated_for_'),as.numeric)) %>%
    filter(year > 1945)
## Rows: 595 Columns: 55
## -- Column specification ----------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (5): film, category, etc, imdb_index, title
## dbl (36): year, id, movie_id, ttid, production_year, votes, vote_mean, vote_...
## lgl (14): nominated_for_Writing, nominated_for_BestOriginalScore, nominated_...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
source('../code/oslm.R')

# the formula we settled on, using co-nominations.
fmla <- winner:year ~ nominated_for_Writing + nominated_for_BestDirector + 
    nominated_for_BestActress + nominated_for_BestActor + 
    nominated_for_BestFilmEditing

test_yrs <- 1960:max(bpdf$year)

# fuzz up the probabilities to break ties, then mark each year's predicted winner
mark_winner <- function(prdf) {
    prdf %>%
        mutate(fuzzprob=prob+rnorm(length(prob),mean=0,sd=1e-9)) %>%
        group_by(year) %>%
            mutate(guess_winner=(fuzzprob==max(fuzzprob))) %>%
        ungroup() 
}

##### simulations

# in-sample
amod <- oslm(fmla,bpdf)
prd <- predict(amod,bpdf,has_y=TRUE)
iscv <- prd %>%
    filter(year %in% test_yrs) %>%
    mark_winner() %>%
    mutate(cv_pragma='is')

# walk forward cv
wfcv <- lapply(test_yrs,function(ayear) {
    amod <- oslm(fmla,bpdf %>% filter(year < ayear))
    predict(amod,bpdf %>% filter(year==ayear),has_y=TRUE)
}) %>%
    bind_rows() %>%
    mark_winner() %>%
    mutate(cv_pragma='wf')

# leave one year out
loocv <- lapply(test_yrs,function(ayear) {
    amod <- oslm(fmla,bpdf %>% filter(year != ayear))
    predict(amod,bpdf %>% filter(year==ayear),has_y=TRUE)
}) %>%
    bind_rows() %>%
    mark_winner() %>%
    mutate(cv_pragma='loo')

allcv <- rbind(iscv,wfcv,loocv)
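
For context, guessing uniformly at random picks the winner with probability 1 over the number of nominees in a given year, which comes to a bit under 20% over this period since most years have exactly 5 nominees. A quick way to compute it:

# baseline: mean probability of a uniformly random guess over the test years
bpdf %>%
    filter(year %in% test_yrs) %>%
    count(year) %>%
    summarize(baseline_prob=mean(1/n))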

Now let us look at the results. First the mean win probabilities for the three test pragmata:

allcv %>% 
    filter(guess_winner) %>% 
    group_by(cv_pragma) %>%
        summarize(mean_win_prob=mean(winner)) %>%
    ungroup() %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of forecasting winner, by simulation type',format='html')
Estimated probability of forecasting winner, by simulation type

cv_pragma   mean_win_prob
---------   -------------
is                   0.51
loo                  0.42
wf                   0.45

The simulated probabilities are in the 40-50% range, comfortably above the random-guessing baseline of around 20%. I had suspected they would be somewhat lower in the modern period, when more films were nominated, so I considered computing the mean win probability as a function of the number of nominees. Unfortunately there is very little data for that modern period (only 6 years in my dataset have more than 5 nominees). So instead I tabulated by decade (a plot might have been better), and see no obvious strong effect:

# to do it by # nominees, try this:
#allcv %>% 
#   group_by(year,cv_pragma) %>%
#       mutate(n_nominees=n()) %>%
#   ungroup() %>%
#   filter(guess_winner) %>% 
#   group_by(cv_pragma,n_nominees) %>%
#       summarize(n_year=n(),mean_win_prob=mean(winner)) %>%
#   ungroup() %>%
#   kable()

# by decade instead:
allcv %>% 
    mutate(decade=paste0((year %/% 10) * 10,"'s")) %>%
    filter(guess_winner) %>% 
    group_by(cv_pragma,decade) %>%
        summarize(n_year=n(),mean_win_prob=mean(winner)) %>%
    ungroup() %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of forecasting winner, by simulation type and decade',format='html')
## `summarise()` has grouped output by 'cv_pragma'. You can override using the `.groups` argument.
Estimated probability of forecasting winner, by simulation type and decade

cv_pragma   decade   n_year   mean_win_prob
---------   ------   ------   -------------
is          1960's       10             0.5
is          1970's       10             0.6
is          1980's       10             0.3
is          1990's       10             0.6
is          2000's       10             0.6
is          2010's        5             0.4
loo         1960's       10             0.5
loo         1970's       10             0.4
loo         1980's       10             0.4
loo         1990's       10             0.4
loo         2000's       10             0.5
loo         2010's        5             0.2
wf          1960's       10             0.7
wf          1970's       10             0.6
wf          1980's       10             0.3
wf          1990's       10             0.4
wf          2000's       10             0.5
wf          2010's        5             0.0

La La La, I can't hear you

My dataset did not have the 2016 award winners nor the 2017 nominees. Since the data requirements for the co-nomination model are simple enough to gather from the AMPAS webpage, I put them together here and run them through the fitted model to get estimated probabilities of winning Best Picture. It looks like 'La La Land' is the clear favorite under this model, with around a 55% chance of winning, followed by 'Hacksaw Ridge' and 'Manchester by the Sea' at around 10-15%. A quick scan of the betting markets confirms that 'La La Land' is the frontrunner, so no real surprises here. However, the 100 to 1 odds quoted for 'Hacksaw Ridge' are perhaps not warranted: the model's 15% estimate corresponds to fair odds of a bit under 6 to 1 against.

library(tibble)
new_data <- tibble::tribble(
        ~film, ~nominated_for_Writing,  ~nominated_for_BestDirector, ~nominated_for_BestActress, ~nominated_for_BestActor, ~nominated_for_BestFilmEditing,
        "Arrival", FALSE, TRUE, FALSE, FALSE, TRUE,
        "Fences",FALSE, FALSE, FALSE, TRUE, FALSE,
        "Hacksaw Ridge",FALSE, TRUE, FALSE, TRUE, TRUE,
        "Hell or High Water", TRUE,  FALSE, FALSE, FALSE, TRUE,
        "Hidden Figures",FALSE, FALSE, FALSE, FALSE, FALSE,
        "La La Land", TRUE, TRUE, TRUE, TRUE, TRUE,
        "Lion",FALSE, FALSE, FALSE, FALSE, FALSE,
        "Manchester by the Sea", TRUE, TRUE, FALSE, TRUE, FALSE,
        "Moonlight",FALSE, TRUE, FALSE, FALSE, TRUE
        ) %>%
    mutate(year=2017) %>%
    mutate(across(matches('^nominated_for_'),as.numeric))

amod <- oslm(fmla,bpdf)
newprd <- predict(amod,new_data,has_y=FALSE) %>% cbind(new_data %>% select(-year))

newprd %>% 
    select(year,film,prob) %>%
    arrange(desc(prob)) %>%
    rename(win_prob=prob) %>%
    kable(padding=10,digits=2,
        caption='Estimated probability of winning 2017 Best Picture',format='html')
Estimated probability of winning 2017 Best Picture

year   film                    win_prob
----   ---------------------   --------
2017   La La Land                  0.56
2017   Hacksaw Ridge               0.15
2017   Manchester by the Sea       0.10
2017   Arrival                     0.08
2017   Moonlight                   0.08
2017   Hell or High Water          0.03
2017   Fences                      0.00
2017   Hidden Figures              0.00
2017   Lion                        0.00