gilgamath

## Predicting Best Picture Winners

Thu 26 January 2017 by Steven E. Pav

In a previous blog post, I described some data I had put together for predicting winners in the Best Picture category of the Oscars. Here I will use a Bradley-Terry model to describe this dataset.

To test these models, I wrote an R function called `oslm`. I have posted the code. This code allows one to model the likelihood of winning an award as a function of some independent variables on each film, taking into account that one and only one film wins in a given year. The code supports computation of the likelihood function (along with its gradient and Hessian), and lets the `maxLik` package do the heavy lifting. Suppose one has a data frame with a Boolean column `winner` to denote winners, a column `year` to hold the award year, and some independent variables, say `x1`, `x2`, and so on. Then one can invoke this code as

```
modl <- oslm(winner:year ~ x1 + x2, data=my_dataframe)
```

This is a bit heterodox, putting a colon on the left-hand side of the formula. However, I wasn't sure where else to put the year, and the code was not too vile to write. Since I did not know the name of this model, I did not know which existing packages supported this kind of analysis, so I wrote my own silly function.
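The likelihood underlying this "one and only one film wins per year" setup is a conditional logit: within each award year, the probability that a film wins is a softmax of its linear predictor over that year's nominees. Here is a minimal sketch of that likelihood in Python for illustration only; the post's actual code is in R, and the names and toy data below are mine, not from the `oslm` source.

```python
import numpy as np

def oslm_loglik(beta, X, year, winner):
    """Log-likelihood of a per-year conditional-logit model:
    P(film i wins | year) = exp(x_i . beta) / sum_j exp(x_j . beta),
    where the sum runs over the films j nominated in the same year."""
    eta = X @ beta
    ll = 0.0
    for y in np.unique(year):
        mask = (year == y)
        e = eta[mask]
        e = e - e.max()                       # stabilize the softmax
        log_probs = e - np.log(np.exp(e).sum())
        ll += log_probs[winner[mask]].sum()   # exactly one winner per year
    return ll

# toy data: two award years, three nominees each, one winner per year
X = np.array([[1.0], [2.0], [0.5], [1.5], [0.2], [2.2]])
year = np.array([1, 1, 1, 2, 2, 2])
winner = np.array([False, True, False, False, False, True])
print(oslm_loglik(np.array([0.0]), X, year, winner))  # beta=0 gives 2*log(1/3)
```

With a zero coefficient, every nominee in a year is equally likely to win, so the log-likelihood is the number of years times log(1/number of nominees). Maximizing this function over `beta` (e.g. via `maxLik` in R) fits the model.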

## Who's a winner?

Let's use this data and code instead of talking about it. First, I load the data and then rid it of duplicates (sorry about those), and convert some Boolean independent variables to numeric. I source the `oslm` code and then try a very simple model: looking at films from 1950 onward, can I predict the probability of winning Best Picture in terms of the (log of the) number of votes it receives on IMDb, stored in the `votes` variable:

```
library(readr)
library(dplyr …
```
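The prep steps just described (dropping duplicates, converting Booleans to numeric, restricting to films from 1950 onward, and taking the log of the vote count) can be sketched as follows. This is in Python/pandas purely for illustration; the post itself uses R, and the tiny data frame below is made up, not the real CSV.

```python
import numpy as np
import pandas as pd

# made-up stand-in for the real data frame, including a duplicate row
df = pd.DataFrame({
    "year":   [1994, 1994, 1994, 1995],
    "film":   ["Forrest Gump", "Forrest Gump", "Pulp Fiction", "Braveheart"],
    "winner": [True, True, False, True],
    "votes":  [1_700_000, 1_700_000, 1_900_000, 1_000_000],
})

df = df.drop_duplicates()                # rid the data of duplicates
df["winner"] = df["winner"].astype(int)  # Boolean -> numeric
df = df[df["year"] >= 1950]              # films from 1950 onward
df["lvotes"] = np.log(df["votes"])       # log of the IMDb vote count
print(df)
```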

## Best Picture?

Sun 22 January 2017 by Steven E. Pav

For a brief time I found myself working in the field of film analytics. One of our mad scientist type projects at the time was trying to predict which films might win an award. As a training exercise, we decided to analyze the Oscars.

With such a great beginning, you might be surprised to find the story does not end well. Collecting the data for such an analysis was a minor endeavor. At the time we had scraped and cobbled together a number of different databases about films, but connecting them to each other was a huge frustration. Around the time we would have been predicting the Oscars, the bottom fell out of our funding, and we were unemployed three weeks after the 2015 Oscar winners were announced.

Our loss is your gain, as I am now releasing the first cut of the data frame I was using. The data are available in a CSV file here. The columns are as follows:

• `year` is the year of the Oscars.
• `category` should always be `Best Picture` here.
• `film` is the title.
• `etc` is extra information to identify the film.
• `winner` is a Boolean for whether the film won in that category.
• `id` and `movie_id` are internal IDs, and are of no use to you.
• `ttid` is the best guess for the IMDb 'tt ID'.
• `title` and `production_year` are from the IMDb data.
• `votes` is the total number of IMDb votes for the film. (This is an old cut of the data.)
• `vote_mean`, `vote_sd` are the mean and standard deviation of user votes for the film in IMDb.
• `vote1` and `vote10` are the proportion of 1- and 10-star votes for the film in IMDb.
• I do not remember what `series` is.
• `total_gross` is one estimate of gross receipts, and `bom` is …

## Do You Want Ballot Stuffing With Your Turkey?

Wed 07 December 2016 by Steven E. Pav

Rather than fade into ignominy as a lesser Ralph Nader, Jill Stein has managed to twist the knife in the still-bleeding Left, fleecing a couple of million from some disoriented voters for a recount of the election in Wisconsin, Michigan, and Pennsylvania. While a recount seems like a less likely path to victory for Clinton than, say, a revolt of the Electoral College, or the Donald pulling an Andy Kaufman, perhaps it should be undertaken if there is any evidence of fraud. Recall that prior to the election (and since!) we were warned of the possibility of 'massive voter fraud'. I am not familiar with the legal argument for a recount, but was curious if there is a statistical argument for one. I pursue a simple analysis here.

The arguments that I have heard for a recount (other than the danger to our republic of handing power to a mentally unstable blowhard, but I will try to keep my political bias out of this) sounded pretty weak, as they could easily be explained away by an omitted variable. For example, arguments of the form "Trump outperformed Clinton in counties with electronic voting machines," even if couched in a 'proper' statistical test, likely assume independence of those events, when they need not be independent for numerous reasons.

Instead, I will fall back here on a weaker analysis, based on Benford's Law, which is more of a stylized fact than a theorem: it states that the leading digits of naturally occurring collections of numbers should follow a certain distribution. Apparently this method was used to detect suspicious patterns in the 2009 Iranian elections, so presumably only an amateur ballot-stuffer would expose themselves to this kind of diagnostic.
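Under Benford's Law, the leading digit d occurs with probability log10(1 + 1/d), so roughly 30% of values should lead with a 1 and under 5% with a 9. The diagnostic can be sketched as follows, comparing observed leading-digit counts against the Benford distribution with a chi-square statistic. This is in Python for illustration only (the post's analysis is in R), and the vote totals below are made up, not the Wisconsin data.

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer, e.g. 4821 -> 4."""
    return int(str(abs(n))[0])

def benford_chisq(counts):
    """Chi-square statistic of observed leading digits vs. Benford's Law.
    counts: iterable of vote totals; non-positive values are skipped."""
    digits = [leading_digit(c) for c in counts if c > 0]
    n = len(digits)
    observed = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford probability of digit d
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat

# made-up ward totals, just to exercise the function
totals = [128, 1342, 19, 2301, 87, 114, 1560, 432, 276, 1048]
print(round(benford_chisq(totals), 2))
```

A large statistic relative to a chi-square distribution with 8 degrees of freedom would flag the digit distribution as un-Benford-like, though with real election data that alone is weak evidence of anything.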

First I grab the ward-by-ward Wisconsin voter data. This set is …

## Forty K

Fri 29 July 2016 by Steven

A milestone