IMDb Rating by Actor Age
Wed 13 July 2016
by Steven E. Pav
I recently looked at IMDb ratings for Robert De Niro movies,
finding slight evidence for a dip in ratings in his third act. I noted then
that the data were subject to all kinds of selection biases, and that even in a
perfect world they would only reflect the ratings of movies that De Niro was in,
not of his individual performance. I speculated that older actors might no
longer be offered parts in good movies. This is something that can be explored
via the IMDb mirror at my disposal, but
only very weakly: if actors 'stopped caring' after a certain age, or declined
in abilities, or even if IMDb raters simply liked movies with more young
people, one might see the same patterns in the data. Despite these caveats,
let us press on.
That struts and frets his hour upon the stage
First, I collect all movies which are not marked as Documentary
in the data,
and which have a production year between 1965 and 2015, and have at least 250
votes on IMDb. This does present a selection bias towards better movies in the
earlier period, which we will have to correct for. I then collect actors and actresses
with a known date of birth who have featured in at least 30 of these films.
I bring them into R via dplyr, and then subset to observations where
the actor was between 18 and 90 in the production year of the film. This should
look like a lot of blah blah blah, but you can follow along at home if you
have the mirror, which you can install yourself.
library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port …
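The code excerpt is truncated here, but a rough sketch of the selection just
described follows. It reuses the acts_in and movie_genres tables constructed in
the De Niro post below, and assumes the title, movie_votes, and name tables
carry production_year, votes, and dob columns; treat it as a sketch, not the
exact query I ran.
# sketch of the selection described above; table and column names
# are assumptions borrowed from the De Niro post below
movies <- tbl(dbcon,'title') %>%
  filter(production_year >= 1965,production_year <= 2015) %>%
  inner_join(tbl(dbcon,'movie_votes') %>% filter(votes >= 250),
             by='movie_id') %>%
  anti_join(movie_genres %>% filter(info %regexp% 'Documentary'),
            by='movie_id')
# actors and actresses with a known date of birth appearing in at
# least 30 of these films
busy_ids <- acts_in %>%
  inner_join(movies,by='movie_id') %>%
  inner_join(tbl(dbcon,'name') %>% filter(!is.na(dob)),by='person_id') %>%
  group_by(person_id) %>%
  summarize(nfilms=n()) %>%
  filter(nfilms >= 30)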
read more
Analyze This: Robert De Niro and IMDb Average Ratings
Sat 09 July 2016
by Steven E. Pav
I recently saw a plot purporting to show the Rotten Tomatoes 'freshness'
rating of Robert De Niro movies over the years, with some chart fluff
suggesting 'Bobby' has essentially been phoning it in since 2002, at
age 59. Somehow I wrote and maintain a
mirror of IMDb which would be
well suited to explore questions of this kind. Since I am inherently
a skeptical person, I decided to look for myself.
You talkin' to me?
First, we grab the 'acts in' table from the MariaDB via dplyr. I found
that working with dplyr allowed me to very quickly switch between in-database
processing and 'real' analysis in R, and I highly recommend it. Then we get
information about De Niro, and join with information about his movies,
and the votes for the same:
library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# acts in relation
acts_in <- tbl(dbcon,'cast_info') %>%
  inner_join(tbl(dbcon,'role_type') %>%
               filter(role %regexp% 'actor|actress'),
             by='role_id')
# Robert De Niro, as a person:
bobby <- tbl(dbcon,'name') %>%
  filter(name %regexp% 'De Niro, Robert$') %>%
  select(name,gender,dob,person_id)
# all movies:
titles <- tbl(dbcon,'title')
# his movies:
all_bobby_movies <- acts_in %>%
  inner_join(bobby,by='person_id') %>%
  left_join(titles,by='movie_id')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
  inner_join(tbl(dbcon,'info_type') %>%
               filter(info %regexp% 'genres') %>%
               select(info_type_id),
             by='info_type_id')
# get rid of _documentaries_ :
bobby_movies <- all_bobby_movies %>%
  anti_join(movie_genres %>%
              filter(info %regexp% 'Documentary'),
            by='movie_id')
# get votes for all movies:
vote_info <- tbl(dbcon,'movie_votes') %>%
  select(movie_id,votes,vote_mean,vote_sd,vote_se)
# votes for De Niro movies:
bobby_votes <- bobby_movies %>%
  inner_join(vote_info,by='movie_id')
# now collect them:
bv <- bobby_votes %>% collect()
# sort it
bv <- bv %>%
  distinct(movie_id,.keep_all=TRUE …
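The excerpt cuts off here. One natural next step, shown below as a sketch,
is to plot the mean rating against production year with a smoother; this
assumes ggplot2, and that bv retains the production_year column from the
title table along with the vote fields selected above.
# sketch: mean IMDb rating of De Niro movies by production year,
# with a loess smoother to eyeball any late-career dip;
# production_year is assumed to come along from the title table
library(ggplot2)
ggplot(bv,aes(x=production_year,y=vote_mean)) +
  geom_point(aes(size=votes)) +
  stat_smooth(method='loess') +
  labs(x='production year',y='mean IMDb rating')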
read more
Overfit Like a Pro
Tue 24 May 2016
by Steven E. Pav
Earlier this year, I participated in the
Winton Stock Market Challenge
on Kaggle. I wanted to explore the freely available
tools in R for performing what I had routinely done in Matlab
in my previous career; I was curious how a large
investment management firm (and Kagglers)
approached this problem; and I wanted to be an eyewitness to a potential
overfitting disaster, should one occur.
The setup should be familiar: for selected (date, stock) pairs you are
given 25 state variables, the two previous days of returns, and the
first 120 minutes of returns. You are to predict the remaining 60 minutes
of returns of that day and the following two days of returns for the stock.
The metric used to score your predictions is a
weighted mean absolute error,
where presumably higher volatility names are downweighted in the final
error metric. The training data consist of 40K observations, while
the test data consist of 120K rows, for which one had to produce 744K
predictions. First prize was a cool $20K. In addition to the prizes,
Winton was explicitly looking for resumes.
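For concreteness, the metric looks something like the following sketch; the
actual weights Winton used are not reproduced here, so w is just an assumed
input vector.
# sketch of a weighted mean absolute error; the true weights used in
# the competition are not public here, so w is an assumed input
wmae <- function(y,yhat,w) {
  sum(w * abs(y - yhat)) / sum(w)
}
# toy usage: three predictions, the middle one downweighted
# wmae(y=c(0.1,-0.2,0.05),yhat=c(0,0,0),w=c(1,0.5,2))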
I suspected that this competition would provide valuable data
in my study of human overfitting of trading strategies. Towards
that end, let us gather the public and private
leaderboards.
Recall that the public leaderboard is what participants see of
their submissions during the competition period, based on around
one quarter of the test set data, while the private leaderboard
is the score of predictions on the remaining part of the test data,
and is published in a big reveal at the close of the competition.
Let's gather the leaderboard data.
(Those of you who want to play along at home can download
my cut of the data.)
library(dplyr)
library(rvest)
# a function to load and process a leaderboard …
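The function is truncated above; a minimal sketch of such a loader, assuming
the leaderboard is served as a plain HTML table (the real Kaggle page layout
may differ), might look like:
# sketch: fetch a leaderboard page and extract the first HTML table;
# the URL and page structure are assumptions, not the actual layout
load_leaderboard <- function(url) {
  read_html(url) %>%
    html_node('table') %>%
    html_table()
}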
read more
You Deserve Expensive Champagne ... If You Buy It.
Sat 26 December 2015
by Steven E. Pav
I received some taster ratings from the champagne party we
attended last week.
I joined the raw ratings with the bottle information to
create a single aggregated dataset.
This is a 'non-normal' form, but simplest to distribute. Here is a taste:
library(dplyr)
library(readr)
library(knitr)
champ <- read_csv('../data/champagne_ratings.csv')
champ %>%
  select(winery,purchase_price_per_liter,raternum,rating) %>%
  head(8) %>%
  kable(format='markdown')
|winery | purchase_price_per_liter| raternum| rating|
|:------|------------------------:|--------:|------:|
|Barons de Rothschild | 80.00000| 1| 10|
|Onward Petillant Naturel 2014 Malavasia Bianca | 33.33333| 1| 4|
|Chandon Rose Method Traditionnelle | 18.66667| 1| 8|
|Martini Prosecco from Italy | 21.32000| 1| 8|
|Roederer Estate Brut | 33.33333| 1| 8|
|Kirkland Asolo Prosecco Superiore | 9.32000| 1| 7|
|Champagne Tattinger Brute La Francaise | 46.66667| 1| 6|
|Schramsberg Reserver 2001 | 132.00000| 1| 6|
Recall that the rules of the contest dictate that the average rating of each
bottle is computed, then divided by the price plus 25 dollars (presumably for
a 750ml bottle). Depending on whether the average
ratings were compressed around the high end of the zero to ten scale,
or around the low end, one would wager on either the cheapest bottles, or more
moderately priced offerings. (Based on my
previous analysis, I brought the
Menage a Trois Prosecco, rated at 91 points, but available at Safeway for
10 dollars.) It is easy to compute the raw averages using dplyr:
avrat <- champ %>%
  group_by(winery,bottle_num,purchase_price_per_liter) %>%
  summarize(avg_rating=mean(rating)) %>%
  ungroup() %>%
  arrange(desc(avg_rating))
avrat %>% head(8) %>% kable(format='markdown')
|winery | bottle_num| purchase_price_per_liter| avg_rating|
|:------|----------:|------------------------:|----------:|
|Desuderi Jeio | 4| 22.66667| 6.750000|
|Gloria Ferrer Sonoma Brut | 19| 20.00000| 6.750000|
|Roederer Estate Brut | 12| 34.66667| 6.642857|
|Charles Collin Rose | 34| 33.33333| 6.636364|
|Roederer Estate Brut | 13| 33.33333| 6.500000|
|Gloria Ferrer Sonoma Brut | 11| …| …|
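From here the contest score is the average rating divided by the price plus
25 dollars. As a sketch (assuming the prices above are per liter, so a 750ml
bottle costs 0.75 times the listed figure):
# sketch: score each bottle per the contest rule; the 0.75 factor
# converts the assumed per-liter price to a 750ml bottle price
scored <- avrat %>%
  mutate(bottle_price=0.75 * purchase_price_per_liter,
         score=avg_rating / (bottle_price + 25)) %>%
  arrange(desc(score))
scored %>% head(8) %>% kable(format='markdown')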
read more