Gilgamath

Best Picture?

Sun 22 January 2017 by Steven E. Pav

For a brief time I found myself working in the field of film analytics. One of our mad scientist type projects at the time was trying to predict which films might win an award. As a training exercise, we decided to analyze the Oscars.

With such a great beginning, you might be surprised to find the story does not end well. Collecting the data for such an analysis was a minor endeavor. At the time we had scraped and cobbled together a number of different databases about films, but connecting them to each other was a huge frustration. Around the time we would have been predicting the Oscars, the floor fell out from under our funding, and we were unemployed three weeks after they announced the Oscar 2015 winners.

Our loss is your gain, as I am now releasing the first cut of the data frame I was using. The data are available in a CSV file here. The columns are as follows:

year is the year of the Oscars.
category should always be Best Picture here.
film is the title.
etc is extra information to identify the film.
winner is a Boolean for whether the film won in that category.
id and movie_id are internal IDs, and have no use for you.
ttid is the best guess for the IMDb 'tt ID'.
title and production_year are from the IMDb data.
votes is the total number of votes for the film in IMDb. (This is an old cut of the data.)
vote_mean, vote_sd are the mean and standard deviation of user votes for the film in IMDb.
vote1 and vote10 are the proportion of 1- and 10-star votes for the film in IMDb.
I do not remember what series is.
total_gross is one estimate of gross receipts, and bom is …

Do You Want Ballot Stuffing With Your Turkey?

Wed 07 December 2016 by Steven E. Pav

Rather than fade into ignominy as a lesser Ralph Nader, Jill Stein has managed to twist the knife in the still-bleeding Left, fleecing a couple of million from some disoriented voters for a recount of the election in Wisconsin, Michigan, and Pennsylvania. While a recount seems like a less likely path to victory for Clinton than, say, a revolt of the Electoral College, or the Donald pulling an Andy Kaufman, perhaps it should be undertaken if there is any evidence of fraud. Recall that prior to the election (and since!) we were warned of the possibility of 'massive voter fraud'. I am not familiar with the legal argument for a recount, but was curious if there is a statistical argument for one. I pursue a simple analysis here.

The arguments that I have heard for a recount (other than the danger to our republic from giving power to a mentally unstable blowhard, but I will try to keep my political bias out of here) sounded pretty weak, as they could easily be explained away by an omitted variable. For example, arguments of the form "Trump outperformed Clinton in counties with electronic voting machines," even if couched in a 'proper' statistical test, are likely to be assuming independence of those events, when they need not be independent for numerous reasons.

Instead, I will fall back here to a weaker analysis, based on Benford's Law. Benford's Law, which is more of a stylized fact, states that the leading digit of naturally occurring collections of numbers should follow a certain distribution: the digit d should appear with probability log10(1 + 1/d), so a leading 1 shows up about 30% of the time and a leading 9 less than 5%. Apparently this method was used to detect suspicious patterns in the 2009 Iranian elections, so you would expect that only an amateur ballot-stuffer would expose themselves to this kind of diagnostic.
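To make the diagnostic concrete, here is a minimal sketch of a Benford check in base R. It is not the analysis from the post: the helper names leading_digit and benford_test are made up for illustration, and the 'vote counts' are simulated log-normal draws standing in for real ward-level data.

```r
# Benford's Law: P(leading digit = d) = log10(1 + 1/d) for d = 1..9.
benford_probs <- log10(1 + 1 / (1:9))

# Hypothetical helper: leading (most significant) digit of positive numbers.
# Non-positive and non-finite values are dropped first.
leading_digit <- function(x) {
  x <- x[is.finite(x) & x > 0]
  as.integer(floor(x / 10^floor(log10(x))))
}

# Hypothetical helper: chi-square goodness-of-fit test of the observed
# leading digits against the Benford proportions.
benford_test <- function(counts) {
  digits <- factor(leading_digit(counts), levels = 1:9)
  observed <- table(digits)
  chisq.test(observed, p = benford_probs)
}

# Simulated stand-in for per-ward vote counts (an assumption, not real data).
set.seed(1234)
fake_counts <- round(rlnorm(2000, meanlog = 5, sdlog = 2))
res <- benford_test(fake_counts)
print(res)
```

A small p-value here would only say that the leading digits depart from the Benford proportions. It would not by itself say why, so it is at best a prompt to look closer.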

First I grab the ward by ward Wisconsin voter data. This set is …

IMDb Rating by Sex

Thu 21 July 2016 by Steven E. Pav

The nerdosphere is in a minor tizzy over a putative bias in IMDb ratings for the new (2016) Ghostbusters film. It seems a bit odd to me, since IMDb ratings have always been horribly 'biased': If the question you are trying to answer is "If I am forced to watch this randomly selected movie, will I like it?", then IMDb ratings, and most aggregated movie ratings, are difficult to interpret and very likely 'biased'. The typical mechanism by which a rating ends up on IMDb is that a person somehow gains an awareness of the film (this has been the major problem for studios since the end of the studio-theatre model seventy years ago), enough so to view the film; they are then more likely to rate the movie if they liked it, or liked it more than they expected, or really hated it. Those who had low to middling opinions of the film are less likely to rate it, and so you have the problem of missing data, without the simplifying assumption of "missing at random."
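A toy simulation shows how this kind of selective rating can shift an average. This is only a sketch: the opinions are drawn uniformly from 1 to 10, and the response probabilities (30% for lovers, 20% for haters, 5% for everyone in between) are invented for illustration, not estimated from IMDb.

```r
set.seed(42)
n <- 100000

# Hypothetical 'true' opinions of everyone who saw the film, on a 1-10 scale.
opinion <- sample(1:10, n, replace = TRUE)

# Assumed probability of bothering to rate: high for those who loved it,
# fairly high for those who hated it, low for the middling opinions.
p_rate <- ifelse(opinion >= 8, 0.30,
                 ifelse(opinion <= 2, 0.20, 0.05))
rated <- runif(n) < p_rate

true_mean <- mean(opinion)              # what a random viewer thinks
observed_mean <- mean(opinion[rated])   # what the ratings site reports

cat(sprintf("true mean: %.2f, observed mean: %.2f, share rating: %.1f%%\n",
            true_mean, observed_mean, 100 * mean(rated)))
```

Under these made-up probabilities the observed mean sits above the true mean, because the enthusiasts are over-represented among those who rate. Different assumed probabilities would move the observed mean in other directions, which is exactly why the ratings are hard to interpret.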

The Ghostbusters argy bargy (or one of them) is that reviews are suspected to be coming from people who have not seen the movie. This is possibly a problem for all reviews on IMDb, though less so for reviews appearing in streaming services, which know when you have seen a film. (The other argy bargy is that sexist and racist jerks have been harassing stars of the new film.) The analysis on FiveThirtyEight is informative, but uses information (e.g. age and sex of the reviewers) that is not widely available, and which is volunteered by the reviewers. Given the IMDb mirror at my disposal, I can look for systematic biases for films based on sex, and will do so here.

I …

IMDb Rating by Actor Age

Wed 13 July 2016 by Steven E. Pav

I recently looked at IMDb ratings for Robert De Niro movies, finding slight evidence for a dip in ratings in his third act. I noted then that the data were subject to all kinds of selection biases, and that even in a perfect world would only reflect the ratings of movies that De Niro was in, not of his individual performance. I speculated that older actors might no longer be offered parts in good movies. This is something that can be explored via the IMDb mirror at my disposal, but only very weakly: if actors 'stopped caring' after a certain age, or declined in abilities, or even if IMDb raters simply liked movies with more young people, one might see the same patterns in the data. Despite these caveats, let us press on.

That struts and frets his hour upon the stage

First, I collect all movies which are not marked as Documentary in the data, and which have a production year between 1965 and 2015, and have at least 250 votes on IMDb. This does present a selection bias towards better movies in the earlier period we will have to correct for. I then collect actors and actresses with a known date of birth who have featured in at least 30 of these films. I bring them into R via dplyr, and then subselect to observations where the actor was between 18 and 90 in the production year of the film. This should look like a lot of blah blah blah, but you can follow along at home if you have the mirror, which you can install yourself.

library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port …