Analyze This: Robert De Niro and IMDb Average Ratings
Sat 09 July 2016
by Steven E. Pav
I recently saw a plot purporting to show the Rotten Tomatoes 'freshness'
rating of Robert De Niro movies over the years, with some chart fluff
suggesting 'Bobby' has essentially been phoning it in since 2002, at
age 59. As it happens, I wrote and maintain a
mirror of IMDb which is
well suited to exploring questions of this kind. Since I am inherently
a skeptical person, I decided to look for myself.
You talkin' to me?
First, we grab the 'acts in' table from the MariaDB via
dplyr. I found
that working with
dplyr allowed me to very quickly switch between in-database
processing and 'real' analysis in R, and I highly recommend it. Then we get
information about De Niro, and join with information about his movies,
and the votes for the same:
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# acts in relation
acts_in <- tbl(dbcon,'cast_info') %>%
  filter(role %regexp% 'actor|actress')
# Robert De Niro, as a person:
bobby <- tbl(dbcon,'name') %>%
  filter(name %regexp% 'De Niro, Robert$')
# all movies:
titles <- tbl(dbcon,'title')
# his movies (the join keys here are assumed; check them against your schema):
all_bobby_movies <- acts_in %>%
  inner_join(bobby,by='person_id') %>%
  inner_join(titles,by='movie_id')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
  filter(info %regexp% 'genres')
# get rid of _documentaries_ :
bobby_movies <- all_bobby_movies %>%
  anti_join(movie_genres %>%
    filter(info %regexp% 'Documentary'),by='movie_id')
# get votes for all movies:
vote_info <- tbl(dbcon,'movie_votes')
# votes for De Niro movies:
bobby_votes <- bobby_movies %>%
  inner_join(vote_info,by='movie_id')
# now collect them:
bv <- bobby_votes %>% collect()
# sort it (by year; the column name is assumed):
bv <- bv %>%
  arrange(production_year)
Is it Blockbuster Season?
Tue 28 June 2016
by Steven E. Pav
I recently released a
docker-compose-based 'solution' to creating
an IMDb mirror. This was one
by-product of my ill-fated foray into Hollywood. The ETL process
removes TV shows, straight-to-video titles, porn, and most hobby projects from
the larger IMDb FTP dump; uses
imdb2sql.py to stuff the data into a database;
and then converts some of the text-based data into numeric data.
For sanity checking, and to illustrate basic usage, I look here
at seasonality of gross box office receipts.
Seasonality is a good test case because it is not subtle: you should not need
a fancy statistical test to detect its existence. Seasonality was one of the
features of the industry that the crusty old industry folk (and I say that
with true admiration) could discuss in great detail, with its many subtleties.
To a first approximation, though, we expect to see a flurry of big budget blockbusters
in the early summer, right as college lets out, followed by higher sales throughout the summer,
then peaks in November and December (again, tied to college breaks).
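That expected shape can be illustrated with a toy computation: average the opening
weekend gross by calendar month and look for the summer and holiday bumps. The figures
below are invented for illustration only; the real analysis pulls these numbers from
the IMDb mirror.

```r
# toy seasonality check: mean opening-weekend gross by calendar month.
# these grosses and dates are made up; the real data come from the mirror.
toy <- data.frame(
  open_date = as.Date(c('2014-05-02','2014-06-20','2014-07-11',
                        '2014-02-07','2014-11-21','2014-12-19')),
  gross     = c(95e6, 70e6, 60e6, 20e6, 120e6, 110e6))
toy$month <- as.integer(format(toy$open_date,'%m'))
# one row per month, with the mean gross for that month:
aggregate(gross ~ month, data=toy, FUN=mean)
```

With real data one would expect the May through July and November through December rows
to stand out against the rest of the year.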
If you want to play along, you will have to go get the
IMDb mirror, and run it.
This can take upwards of an hour to download (and I suspect that the
bottleneck is not your local internet connection, but rather the
FTP server), and perhaps another hour for the ETL process. When
this was my bread and butter, I worked hard to cut down processing
time. It will not get much faster without a replacement of
imdb2sql.py, or a switch to a non-insane upstream format initiated
by the people at IMDb. (Good luck with that.)
Now, how many movies report opening weekend numbers in the USA, in dollars?
dbcon <- src_mysql(host='0.0.0 …
Overfit Like a Pro
Tue 24 May 2016
by Steven E. Pav
Earlier this year, I participated in the
Winton Stock Market Challenge
on Kaggle. I wanted to explore the freely available
tools in R for performing what I had routinely done in Matlab
in my previous career; I was curious how a large
investment management firm (and Kagglers)
approached this problem; and I wanted to be an eyewitness to a potential
overfitting disaster, should one occur.
The setup should be familiar: for selected (date, stock) pairs you are
given 25 state variables, the two previous days of returns, and the
first 120 minutes of returns. You are to predict the remaining 60 minutes
of returns of that day and the following two days of returns for the stock.
The metric used to score your predictions is a
weighted mean absolute error,
where presumably higher volatility names are downweighted in the final
error metric. The training data consist of 40K observations, while
the test data consist of 120K rows, for which one had to produce 744K
predictions. First prize was a cool $20K. In addition to the prizes,
Winton was explicitly looking for resumes.
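For concreteness, a weighted mean absolute error of this kind can be computed as
follows. This is a sketch of the general form of the metric; the competition supplied
its own per-observation weights with the data.

```r
# weighted mean absolute error: a weighted average of |actual - predicted|.
# w holds the per-observation weights (supplied with the data in the
# actual competition; made up here).
wmae <- function(actual, predicted, w) {
  sum(w * abs(actual - predicted)) / sum(w)
}
# toy example with invented returns and weights:
wmae(actual=c(0.01,-0.02,0.005), predicted=c(0,0,0), w=c(1,0.5,2))
```

Downweighting high volatility names in this way keeps a few wild stocks from
dominating the score.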
I suspected that this competition would provide valuable data
in my study of human overfitting of trading strategies. Towards
that end, let us gather the public and private leaderboards.
Recall that the public leaderboard is what participants see of
their submissions during the competition period, based on around
one quarter of the test set data, while the private leaderboard
is the score of predictions on the remaining part of the test data,
and is published in a big reveal at the close of the competition.
Let's gather the leaderboard data.
(Those of you who want to play along at home can download
my cut of the data.)
# a function to load and process a leaderboard …
R in Finance 2016
Fri 20 May 2016
Review of R in Finance 2016 conference