In a previous blog post, I described some data I had put together for predicting winners in the Best Picture category of the Oscars. Here I will use a Bradley-Terry model to describe this dataset.

To fit this kind of model, I wrote an R function called oslm. I have posted the code. It models the likelihood of winning an award as a function of some independent variables of each film, taking into account that one and only one film wins in a given year. The code supports computation of the likelihood function (and its gradient and Hessian), and lets the maxLik package do the heavy lifting. Suppose one has a data frame with a Boolean column winner denoting winners, a column year holding the award year, and some independent variables, say x1, x2, and so on. Then one can invoke the code as

modl <- oslm(winner:year ~ x1 + x2,data=my_dataframe)

This is a bit heterodox, using the colon on the left-hand side. However, I was not sure where else to put the year, and the code was not too vile to write. At the time I did not know the name of this model, so I did not know which existing packages supported this kind of analysis, and I wrote my own silly function.
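
For the curious, here is a minimal sketch of the conditional log-likelihood such a model maximizes; this is my own illustration, not the actual oslm internals. Within each year, the winner's 'protoprobability' exp(x' beta) is normalized over all of that year's nominees:

loglik <- function(beta,X,winner,year) {
    eta <- as.vector(X %*% beta)                          # linear predictor for each film
    sum(tapply(seq_along(eta),year,function(idx) {
        eta[idx][winner[idx]] - log(sum(exp(eta[idx])))   # log softmax of that year's winner
    }))
}
# maxLik::maxLik(loglik,start=rep(0,ncol(X)),X=X,winner=winner,year=year)
# would then perform the maximization, with maxLik doing the heavy lifting as above.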

Who's a winner?

Let's use these data and code instead of just talking about them. First, I load the data, rid it of duplicates (sorry about those), and convert some Boolean independent variables to numeric. I source the oslm code and then try a very simple model: looking at films from 1950 onward, can I predict the probability of winning Best Picture from the (log of the) number of votes a film receives on IMDb, stored in the votes variable:

library(readr)
library(dplyr)
# load the nominee data, drop accidental duplicates, and coerce the
# logical 'nominated_for_*' columns to numeric 0/1
bpdf <- readr::read_csv('../data/best_picture_2.csv') %>%
    distinct(year,id,.keep_all=TRUE) %>%
    mutate(across(matches('^nominated_for_'),as.numeric))
## Rows: 595 Columns: 55
## -- Column specification ----------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (5): film, category, etc, imdb_index, title
## dbl (36): year, id, movie_id, ttid, production_year, votes, vote_mean, vote_...
## lgl (14): nominated_for_Writing, nominated_for_BestOriginalScore, nominated_...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
source('../code/oslm.R')

mod0 <- oslm(winner:year ~ I(log10(votes)),bpdf %>% filter(year >= 1950))
print(summary(mod0))
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 12 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -83.0476 
## 1  free parameters
## Estimates:
##                 Estimate Std. error t value Pr(> t)
## I(log10(votes)) 2.080      0.354 5.873 0
## --------------------------------------------

That summary output gives the parameter estimate, standard error, marginal Wald statistic, and the p-value of the same. The value here, roughly 2, indicates that every 10x increase in a film's IMDb vote count adds about 2 to the log of its unnormalized probability of winning Best Picture, a multiplicative boost of roughly exp(2), or about 8. (To unpack this 'boost': assign each film a 'protoprobability' whose log is 2 times the log base 10 of its IMDb vote tally, then normalize those protoprobabilities within each year to sum to one. This is close to the traditional logistic 'linear increase in log odds', except that there is more than one possible outcome per year.)
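
To make that concrete, here is a toy calculation; the two films and their vote counts are made up, and only the coefficient comes from mod0:

beta  <- 2.08
votes <- c(filmA=1e5,filmB=1e6)       # filmB has 10x the IMDb votes of filmA
proto <- exp(beta * log10(votes))     # unnormalized 'protoprobabilities'
proto / sum(proto)                    # filmB's protoprobability is exp(2.08), about 8x filmA's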

Of course, that's a silly way to predict which film will win the award: for many of the films here, and certainly for all of those produced prior to 1990, the act of rating the film on IMDb was influenced by whether the film had won Best Picture. The causality is completely reversed. So while this is a good sanity check, it is not a good way to forecast award winners.

Other variables in my dataset which fall under the same prohibition include all of the IMDb vote information, and probably also the box office information (which is spotty even for recent films, and non-existent for older films). This leaves the 'also nominated for' data, which go back to the 1940s; the genre information, which seems weak; and the rating information from MovieLens. I used the MovieLens sample data to aggregate ratings for each film where the timestamp of the rating was prior to January 5th of the year in which the awards are nominated, which is to say prior to both the nominations and the awards ceremony. These data do not 'leak' future information, though unfortunately they only go back to about 1996, and are probably based on relatively few ratings in the early period.
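
Roughly, that aggregation might look like the following; the table names (ratings, films) and columns (movieId, rating, timestamp, nomination_year) are my assumptions for illustration, not the actual MovieLens files or my dataset:

library(lubridate)
ml_agg <- ratings %>%                                      # one row per (user, movie) rating
    inner_join(films,by='movieId') %>%                     # attach each film's nomination_year
    filter(as_datetime(timestamp) < make_datetime(nomination_year,1L,5L)) %>%  # keep only pre-nomination ratings
    group_by(movieId) %>%
    summarize(ml_rating=mean(rating),ml_count=n())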

Apparently the 'conominated' idea is not new. I only just now discovered Iain Pardoe's work in this area, which uses a Bradley-Terry model and conominations, plus fancier techniques. I wish I had seen it 18 months ago, as it would have saved me a lot of effort. Let us replicate Iain's most basic analysis:

mod1 <- oslm(winner:year ~ nominated_for_Writing +
    nominated_for_BestOriginalScore + nominated_for_BestDirector +
    nominated_for_BestActress + nominated_for_BestActor +
    nominated_for_BestFilmEditing + nominated_for_BestSupportingActress +
    nominated_for_BestSupportingActor,bpdf %>% filter(year >= 1950))
print(summary(mod1))
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 30 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -69.2196 
## 8  free parameters
## Estimates:
##                                     Estimate Std. error t value Pr(> t)
## nominated_for_Writing                1.390      0.838  1.658 0.10
## nominated_for_BestOriginalScore      0.546      0.330  1.653 0.10
## nominated_for_BestDirector           2.200      0.760  2.894 0.00
## nominated_for_BestActress            0.233      0.385  0.606 0.54
## nominated_for_BestActor              0.576      0.330  1.745 0.08
## nominated_for_BestFilmEditing        1.592      0.450  3.534 0.00
## nominated_for_BestSupportingActress -0.293      0.339 -0.863 0.39
## nominated_for_BestSupportingActor    0.759      0.340  2.231 0.03
## --------------------------------------------

Holy Stromboli, we got some significance here! Being conominated for Best Director, Best Film Editing, or Writing gives a large increase in protoprobability. In terms of significance, the film editing, director, and supporting actor nominations seem defensibly non-zero. It is regrettable that a Best Supporting Actress conomination appears to decrease your odds of winning Best Picture, though that estimate is nowhere near significant.
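
As a rough check on that 'defensibly non-zero' claim, here are back-of-the-envelope 95% Wald intervals built from the estimates and standard errors printed above:

est <- c(BestDirector=2.200,BestFilmEditing=1.592,BestSupportingActor=0.759)
se  <- c(BestDirector=0.760,BestFilmEditing=0.450,BestSupportingActor=0.340)
cbind(lower=est - 1.96*se,upper=est + 1.96*se)    # none of these intervals straddle zero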

The Hoi Polloi

Finally I get to the MovieLens data. Again, these only go back to about 1996, and some films may be missing. I subset to this period and replace missing values with the mean or median of that year's values. I then build a model using oslm:

mldf <- bpdf %>% 
    filter(year > 1996L) %>% 
    mutate(ml_count=as.numeric(ml_count)) %>%
    group_by(year) %>%
    # impute missing MovieLens ratings and counts with that year's mean and median
    mutate(ml_rating=ifelse(is.na(ml_rating),mean(ml_rating,na.rm=TRUE),ml_rating),
        ml_count=ifelse(is.na(ml_count),median(ml_count,na.rm=TRUE),ml_count)) %>%
    ungroup()

mod2 <- oslm(winner:year ~ ml_rating + I(log10(ml_count)),mldf)
print(summary(mod2))
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 9 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -28.2914 
## 2  free parameters
## Estimates:
##                    Estimate Std. error t value Pr(> t)
## ml_rating          3.5993     1.4019 2.5675 0.01
## I(log10(ml_count)) 0.0447     0.4606 0.0971 0.92
## --------------------------------------------

So the MovieLens average rating is a strong predictor of winning the Best Picture award, with a coefficient of 3.6 and a t-stat of around 2.6. The log of the ratings count did not predict winners, unlike the 'backwards' IMDb vote counts.
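
As an aside, if the object returned by oslm exposes its coefficients the way a maxLik fit does, and the coefficient vector is named as in the summary above (both assumptions on my part), one can turn mod2 into within-year win probabilities and see which nominee it favors each year:

b <- coef(mod2)
pred <- mldf %>%
    mutate(proto=exp(b['ml_rating'] * ml_rating + b['I(log10(ml_count))'] * log10(ml_count))) %>%
    group_by(year) %>%
    mutate(win_prob=proto / sum(proto)) %>%    # normalize protoprobabilities within each year
    ungroup() %>%
    arrange(year,desc(win_prob))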

An interesting question is whether the MovieLens ratings provide information independent of the conomination data. That is, if you throw all the spaghetti at the wall, will conominations still stick? Easy to check:

mod3 <- oslm(winner:year ~ ml_rating + nominated_for_Writing +
    nominated_for_BestOriginalScore + nominated_for_BestDirector +
    nominated_for_BestActress + nominated_for_BestActor +
    nominated_for_BestFilmEditing + nominated_for_BestSupportingActress +
    nominated_for_BestSupportingActor,mldf)
print(summary(mod3))
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 19 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -19.7589 
## 9  free parameters
## Estimates:
##                                     Estimate Std. error t value Pr(> t)
## ml_rating                            2.970      1.807  1.643 0.10
## nominated_for_Writing               -0.517      1.312 -0.394 0.69
## nominated_for_BestOriginalScore      0.631      0.701  0.900 0.37
## nominated_for_BestDirector           0.813      1.228  0.662 0.51
## nominated_for_BestActress            1.069      0.786  1.360 0.17
## nominated_for_BestActor              0.355      0.675  0.526 0.60
## nominated_for_BestFilmEditing        2.384      1.239  1.925 0.05
## nominated_for_BestSupportingActress  0.693      0.784  0.884 0.38
## nominated_for_BestSupportingActor    1.080      0.669  1.613 0.11
## --------------------------------------------

It appears that the MovieLens rating data 'explain away' the effect of the director, writing, and actor conominations, with perhaps film editing still a significant contributing factor. However, we have shortened our dataset from about 65 years of data to about 18. While these 18 years are more recent, and thus might better reflect the current preferences of Oscar voters, the sample size is cut dramatically and 'statistical significance' should be harder to achieve.
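
One way to formalize the 'explains away' direction is a likelihood ratio test for dropping ml_rating from mod3. This sketch assumes that whatever oslm returns supports logLik() the way a maxLik fit does, which is an assumption on my part:

mod3_noml <- oslm(winner:year ~ nominated_for_Writing +
    nominated_for_BestOriginalScore + nominated_for_BestDirector +
    nominated_for_BestActress + nominated_for_BestActor +
    nominated_for_BestFilmEditing + nominated_for_BestSupportingActress +
    nominated_for_BestSupportingActor,mldf)
lr_stat <- 2 * (as.numeric(logLik(mod3)) - as.numeric(logLik(mod3_noml)))
pchisq(lr_stat,df=1,lower.tail=FALSE)    # one restriction: the ml_rating coefficient is zero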

By Genre

Finally, is there any bias in the awards towards certain genres? Across the larger pool of all films this is almost certainly the case, but conditional on being nominated, one suspects perhaps not. This can easily be checked on the large sample. Note that here I have collected the IMDb genres for each film, then equally distributed a total weight of 1 across each film's genres (a rough sketch of this weighting appears after the output below). A pure drama would get a 1.0 in Drama and a 0 for all the other genres, while a romance-comedy would get 0.5 in both Comedy and Romance and 0 elsewhere. Survey says:

mod4 <- oslm(winner:year ~ Action + Adventure + Comedy + Drama +
     Fantasy + Musical + Romance,bpdf %>% filter(year >= 1950))
print(summary(mod4))
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 20 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -105.13 
## 7  free parameters
## Estimates:
##           Estimate Std. error t value Pr(> t)
## Action    -1.0714     2.2679 -0.4724 0.64
## Adventure -0.0353     1.4638 -0.0241 0.98
## Comedy     0.0925     0.9632  0.0960 0.92
## Drama     -0.4751     0.6051 -0.7851 0.43
## Fantasy   -4.2469     3.5293 -1.2033 0.23
## Musical    2.5902     1.6538  1.5662 0.12
## Romance   -0.8958     0.8736 -1.0254 0.31
## --------------------------------------------
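
For reference, the genre weights can be built with something like the following; the raw_films table, its genres column, and the comma-separated format are my assumptions for illustration, not objects used above:

library(tidyr)
# spread a total weight of 1 equally across each film's genres, one column per genre
genre_wt <- raw_films %>%
    select(id,genres) %>%
    separate_rows(genres,sep=',') %>%
    group_by(id) %>%
    mutate(weight=1 / n()) %>%
    ungroup() %>%
    pivot_wider(names_from=genres,values_from=weight,values_fill=0)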

While the effect sizes are large for some of these (and even larger for Horror and Sci-Fi, which I omitted here), the standard errors are also fairly large, and it is difficult to say whether there is a genre effect. I suspect it is weak at best, since most Best Picture nominees fall solidly into the Drama genre.

So there you have it: conominations are a strong indicator of winning the Best Picture award, viewer ratings can be helpful, genre information seems weak, and the IMDb vote data suffer from reversed causality.