For a brief time I found myself working in the field of film analytics. One of our mad scientist type projects at the time was trying to predict which films might win an award. As a training exercise, we decided to analyze the Oscars.
With such a great beginning, you might be surprised to find the story does not end well. Collecting the data for such an analysis was a minor endeavor. At the time we had scraped and cobbled together a number of different databases about films, but connecting them to each other was a huge frustration. Around the time we would have been predicting the Oscars, the floor fell out from under our funding, and we were unemployed three weeks after the 2015 Oscar winners were announced.
Our loss is your gain, as I am now releasing the first cut of the data frame I was using. The data are available in a CSV file here. The columns are as follows:
- `year` is the year of the Oscars.
- `category` should always be `Best Picture` here.
- `film` is the title.
- `etc` is extra information to identify the film.
- `winner` is a Boolean for whether the film won in that category.
- `id` and `movie_id` are internal IDs, and have no use for you.
- `ttid` is the best guess for the IMDb 'tt ID'.
- `title` and `production_year` are from the IMDb data.
- `votes` is the total number of votes in IMDb for the film. (This is an old cut of the data.)
- `vote_mean` and `vote_sd` are the mean and standard deviation of user votes for the film in IMDb.
- `vote1` and `vote10` are the proportions of 1- and 10-star votes for the film in IMDb.
- I do not remember what `series` is.
- `total_gross` is one estimate of gross receipts, and `bom` is 'box office in millions'. Note these include receipts after a film has won an Oscar.
- Then we have partitions: each film is given weight one which is equally divided among its listed genres in IMDb.
- Then we have `ml_count` and `ml_rating`. These come from the MovieLens database, and are the total count and average rating of the films. I have filtered ratings by timestamps to have been submitted to the MovieLens site prior to January 5 of the year following the Oscar year, which is to say prior to the announcement of Oscar contenders.
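To make the genre partition concrete, here is a minimal sketch in base R of how such weights could be computed. This is not the code that produced the CSV: the toy films, the `genres` column, and the `|` delimiter are all assumptions made up for illustration.

```r
# Toy input: one row per film, with its IMDb genres as a delimited string.
# Titles, genres, and column names are hypothetical.
films <- data.frame(
  film   = c("Film A", "Film B", "Film C"),
  genres = c("Drama", "Drama|Romance", "Comedy|Drama|War"),
  stringsAsFactors = FALSE
)

# Split each genre string into a character vector of genres.
genre_list <- strsplit(films$genres, "|", fixed = TRUE)
all_genres <- sort(unique(unlist(genre_list)))

# One column per genre: a film with k listed genres gets weight 1/k
# in each of its genre columns and 0 elsewhere, so each row sums to one.
weights <- t(vapply(
  genre_list,
  function(g) as.numeric(all_genres %in% g) / length(g),
  numeric(length(all_genres))
))
colnames(weights) <- all_genres

genre_df <- data.frame(film = films$film, weights, check.names = FALSE)
print(genre_df)
```

A film listed only as a drama puts its whole weight on that genre, while a film with three listed genres puts one third on each.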
In a follow-up blog post, I will look at this data using a Bradley-Terry type model.
Edit: I realized there is some extra data I wanted to add to this set.
Using my database of Oscar nominations, I added fields to this CSV to indicate
whether the same film had nominations for other awards in the same Oscar cycle.
Thus there are now also the following Boolean columns:
`nominated_for_Writing`, `nominated_for_BestOriginalScore`,
`nominated_for_BestCinematography`, `nominated_for_BestSoundEditing`,
`nominated_for_BestArtDirection`, `nominated_for_BestDirector`,
`nominated_for_BestActress`, `nominated_for_BestActor`,
`nominated_for_BestOriginalSong`, `nominated_for_BestFilmEditing`,
`nominated_for_BestCostumeDesign`, `nominated_for_BestSupportingActress`, and
`nominated_for_BestSupportingActor`.
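As one way these flags might be used, the sketch below counts how many other categories each film was nominated in. The rows are made up, only three of the flag columns are included, and it assumes the columns were read in as logical values; if your CSV reader gives you strings, convert them first.

```r
# Toy rows using three of the nomination columns; the values are invented.
noms <- data.frame(
  year = c(2014, 2014),
  film = c("Film A", "Film B"),
  nominated_for_BestDirector    = c(TRUE, FALSE),
  nominated_for_BestActor       = c(TRUE, FALSE),
  nominated_for_BestFilmEditing = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Pick out every nomination flag by its shared prefix.
nom_cols <- grep("^nominated_for_", names(noms), value = TRUE)

# TRUE counts as 1, so the row sum is the number of other nominations.
noms$n_other_noms <- rowSums(noms[, nom_cols])
print(noms[, c("film", "n_other_noms")])
```

Selecting by prefix means the same two lines work unchanged on the full set of thirteen flag columns.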
Edit again: in adding the second dataset, I managed to duplicate some rows
during my join. I believe this may be due to some films receiving more than
one nomination for another award, and to my being sloppy. To remedy this, you can
select distinct rows by `year` and `id`. In `dplyr` it would look as follows:
library(dplyr)
# indf is the data frame read from the CSV; keep the first row for each (year, id)
out_df <- indf %>%
  distinct(year, id, .keep_all = TRUE)
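If you would rather not depend on dplyr, the same deduplication can be sketched in base R with `duplicated()`. The toy `indf` below stands in for the real data and its values are made up. Like `distinct()` with `.keep_all = TRUE`, this keeps the first row seen for each `year` and `id` pair.

```r
# Toy stand-in for the CSV, with one duplicated (year, id) pair.
indf <- data.frame(
  year = c(2013, 2013, 2013, 2014),
  id   = c(1, 1, 2, 1),
  film = c("Film A", "Film A", "Film B", "Film C"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row whose (year, id) pair has already appeared,
# so negating it keeps only the first occurrence of each pair.
key    <- indf[, c("year", "id")]
out_df <- indf[!duplicated(key), ]

# Sanity check: no (year, id) pair should remain more than once.
stopifnot(!any(duplicated(out_df[, c("year", "id")])))
print(out_df)
```

Comparing `nrow(indf)` with `nrow(out_df)` is a quick way to see how many rows the join duplicated.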