Odds of Winning Your Oscar Pool.
Mon 30 January 2017
by
Steven E. Pav
In a previous blog post, I used a Bradley-Terry
model to analyze Oscar Best Picture winners, using the
best picture dataset.
In that post I presented the results of likelihood tests
which showed 'significant' relationships between winning the Best
Picture category and co-nomination for other awards, MovieLens ratings, and
(spuriously) number of IMDb votes. It can be hard to interpret the
effect sizes and \(t\) statistics from a Bradley-Terry model. So here
I will try to estimate the probability of correctly guessing the
Best Picture winner using this model.
There is no apparent direct translation from the coefficients
of the model fit to the probability of correctly forecasting
a winner. Nor can you transform the maximized likelihood, or an
R-squared. Moreover, it will depend on the number of nominees
(traditionally there were only 5 Best Picture nominations--these
days it's upwards of 9), and how they differ in the independent
variables. Here I will keep it simple and use cross validation.
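As a rough sketch of how such a validation might be scored: given a forecast probability for each nominee, pick the highest-probability film in each year and count how often it actually won. The helper name `hit_rate`, the column `prob`, and the toy numbers below are made up for illustration and are not taken from the original analysis.

```r
# Hypothetical helper: fraction of years in which the top-forecast film won.
# Assumes columns `year`, `winner` (logical) and `prob` (forecast probability).
hit_rate <- function(df) {
  hits <- vapply(split(df, df$year), function(yr) {
    # the predicted winner is the nominee with the highest forecast probability
    yr$winner[which.max(yr$prob)]
  }, logical(1))
  mean(hits)
}

# Toy data (invented): two award years with three and two nominees.
toy <- data.frame(
  year   = c(2001, 2001, 2001, 2002, 2002),
  winner = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  prob   = c(0.6, 0.3, 0.1, 0.7, 0.3)
)
hit_rate(toy)
```

A useful baseline for comparison is random guessing, which picks the winner with probability one over the number of nominees in that year.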
I modified the oslm code to include a predict method. So here, I load the
data and the code, remove duplicates, and restrict the data to the period
after 1945. I construct the model formula, based on co-nomination, then
test in three ways:
- A purely 'in sample' validation where all the data are
used to build the model, then tested. (The film with the
highest forecast probability of winning is chosen as
the predicted winner, of course.) This should give the
most optimistic view of performance, even though the
likelihood maximization problem does not directly
select for this metric.
- A walk-forward cross validation where the data
up through year \(y-1\) are used to build the model,
then it is used to forecast the winners in year \(y\).
This is perhaps the most honest kind of cross validation
for time …
Predicting Best Picture Winners.
Thu 26 January 2017
by
Steven E. Pav
In a previous blog post, I described some
data I had put together for
predicting winners in the Best Picture category of the Oscars.
Here I will use a
Bradley-Terry model to describe this dataset.
To test these models, I wrote an R function called oslm. I have
posted the code. This code allows one
to model the likelihood of winning an award as a function of
some independent variables on each film, taking into account
that one and only one film wins in a given year. The code supports
computation of the likelihood function (and gradient and Hessian),
and allows the maxLik package to do the heavy lifting. Suppose
one has a data frame with a Boolean column winner to denote
winners, a column year to hold the award year, and some independent
variables, say x1, x2, and so on. Then one can invoke this
code as
modl <- oslm(winner:year ~ x1 + x2,data=my_dataframe)
This is a bit heterodox, using the colon in the left hand side. However,
I wasn't sure where else to put it, and the code was not too vile to write.
Since I did not know the name of this model, I did not know what existing
packages supported this kind of analysis, so I wrote my own silly function.
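To make the "one and only one film wins in a given year" idea concrete, here is a minimal sketch of a log-likelihood of that shape. It assumes each film's win probability is a softmax of a linear predictor within its award year, and it uses base R's optim rather than maxLik. The function name `one_winner_loglik` and the simulated data are illustrative assumptions, not the posted oslm code.

```r
# Sketch only: log-likelihood for a one-winner-per-year model with softmax
# win probabilities. This is an assumed form, not the oslm source.
one_winner_loglik <- function(beta, X, winner, year) {
  eta <- as.vector(X %*% beta)  # linear predictor for each film
  # log of the per-year normalizing constant, computed stably
  log_denom <- ave(eta, year, FUN = function(e) max(e) + log(sum(exp(e - max(e)))))
  sum((eta - log_denom)[winner])  # log-probability of each actual winner
}

# Simulated data (invented): 40 years, 5 nominees each, one covariate.
set.seed(1)
year <- rep(1:40, each = 5)
x1 <- rnorm(length(year))
X <- cbind(x1 = x1)
# Draw exactly one winner per year, with odds proportional to exp(x1).
winner <- as.logical(ave(x1, year, FUN = function(x) {
  seq_along(x) == sample.int(length(x), 1, prob = exp(x))
}))

# Maximize the log-likelihood; fnscale = -1 turns optim into a maximizer.
fit <- optim(0, one_winner_loglik, X = X, winner = winner, year = year,
             method = "BFGS", control = list(fnscale = -1))
fit$par
```

With all coefficients at zero every nominee is equally likely, so the log-likelihood reduces to minus the sum over years of the log of the number of nominees, which makes a handy sanity check.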
Who's a winner?
Let's use this data and code instead of talking about it. First, I load the
data and then rid it of duplicates (sorry about those), and convert some
Boolean independent variables to numeric. I source the oslm code and then
try a very simple model: looking at films from 1950 onward, can I predict the
probability of winning Best Picture in terms of the (log of the) number of
votes it receives on IMDb, stored in the votes variable:
library(readr)
library(dplyr …
Best Picture?
Sun 22 January 2017
by
Steven E. Pav
For a brief time I found myself working in the field of film analytics. One of
our mad scientist type projects at the time was trying to predict which films
might win an award. As a training exercise, we decided to analyze the Oscars.
With such a great beginning, you might be surprised to find the story does
not end well. Collecting the data for such an analysis was a minor endeavor.
At the time we had scraped and cobbled together a number of different databases
about films, but connecting them to each other was a huge frustration. Around
the time we would have been predicting the Oscars, the floor fell out from
our funding and we were unemployed three weeks after they announced the
Oscar 2015 winners.
Our loss is your gain, as I am now releasing the first cut of the data frame
I was using. The data are available in a
CSV file here. The columns are as follows:
- year is the year of the Oscars.
- category should always be Best Picture here.
- film is the title.
- etc is extra information to identify the film.
- winner is a Boolean for whether the film won in that category.
- id and movie_id are internal IDs, and have no use for you.
- ttid is the best guess for the IMDb 'tt ID'.
- title and production_year are from the IMDb data.
- votes are the total number of votes in IMDb for the IMDb film. (This is an old cut of the data.)
- vote_mean, vote_sd are the mean and standard deviation of user votes for
the film in IMDb.
- vote1 and vote10 are the proportion of 1- and 10-star votes for the film
in IMDb.
- I do not remember what series is.
- total_gross is one estimate of gross receipts, and bom is …
Do You Want Ballot Stuffing With Your Turkey?
Wed 07 December 2016
by
Steven E. Pav
Rather than fade into ignominy as a lesser Ralph Nader, Jill Stein has managed
to twist the knife in the still-bleeding Left, fleecing a couple of million
from some disoriented voters for a recount of the election in Wisconsin,
Michigan, and Pennsylvania. While a recount seems like a less likely path
to victory for Clinton than, say, a revolt of the Electoral College, or
the Donald pulling an Andy Kaufman, perhaps it should be undertaken if
there is any evidence of fraud. Recall that prior to the election (and since!)
we were warned of the possibility of 'massive voter fraud'. I am not familiar
with the legal argument for a recount, but was curious if there is a
statistical argument for one. I pursue a simple analysis here.
The arguments that I have heard for a recount (other than the danger to our
republic from giving power to a mentally unstable blowhard, but I will try to keep my
political bias out of here) sounded pretty weak, as they could easily be
explained away by an omitted variable. For example, arguments of the form
"Trump outperformed Clinton in counties with electronic voting machines," even
if couched in a 'proper' statistical test, are likely to be assuming
independence of those events, when they need not be independent for numerous
reasons.
Instead, I will fall back here to a weaker analysis, based on
Benford's Law. Benford's Law,
which is more of a stylized fact, states that the leading digit of naturally
occurring collections of numbers should follow a certain distribution.
Apparently this method was used to detect suspicious patterns in the 2009
Iranian elections, so you would expect only an amateur ballot-stuffer would
expose themselves to this kind of diagnostic.
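For reference, Benford's Law puts the probability of leading digit d at log10(1 + 1/d). A minimal sketch of a check along these lines follows. The helper `leading_digit`, the toy counts, and the use of a chi-squared goodness-of-fit test are my own illustrative choices here, not necessarily the analysis in the full post.

```r
# Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)
benford_expected <- log10(1 + 1 / (1:9))

# Hypothetical helper: leading (most significant) digit of positive counts.
leading_digit <- function(x) {
  x <- x[is.finite(x) & x > 0]  # zero and missing counts have no leading digit
  floor(x / 10^floor(log10(x)))
}

# Toy data (invented): powers of two, whose leading digits roughly follow
# Benford's Law. Real use would pass ward-level vote counts instead.
counts <- 2^(0:49)
observed <- table(factor(leading_digit(counts), levels = 1:9))

# Compare observed digit frequencies with the Benford expectation.
# With few observations some expected cell counts are small, so
# chisq.test may warn that the approximation is rough.
suppressWarnings(chisq.test(observed, p = benford_expected))
```

A small p-value would flag a departure from the Benford pattern, though as a stylized fact the law need not hold for any particular set of vote counts even when nothing is amiss.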
First I grab the ward-by-ward Wisconsin voter data. This set is …