Gilgamath



IMDb Rating by Sex

Thu 21 July 2016 by Steven E. Pav

The nerdosphere is in a minor tizzy over a putative bias in IMDb ratings for the new (2016) Ghostbusters film. It seems a bit odd to me, since IMDb ratings have always been horribly 'biased': if the question you are trying to answer is "If I am forced to watch this randomly selected movie, will I like it?", then IMDb ratings, like most aggregated movie ratings, are difficult to interpret and very likely 'biased'. The typical mechanism by which a rating ends up on IMDb is that a person somehow gains an awareness of the film (this has been the major problem for studios since the end of the studio-theatre model seventy years ago), enough so to view the film; they are then more likely to rate the movie if they liked it, or liked it more than they expected, or really hated it. Those who had low to middling opinions of the film are less likely to rate it, and so you have the problem of missing data, without the simplifying assumption of "missing at random."
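The rating mechanism described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the opinion distribution and the response probabilities in `p_rate` are hypothetical and are not estimated from IMDb data. It shows how the average of the submitted ratings can drift away from the average opinion of all viewers when the decision to rate depends on the opinion itself.

```r
set.seed(2016)
n_viewers <- 100000
# hypothetical 'true' opinions on a 1-10 scale, clamped to the valid range
opinion <- pmin(pmax(round(rnorm(n_viewers, mean = 5.5, sd = 2)), 1), 10)
# assumed response model (not from the data): viewers who loved or hated
# the film are far more likely to rate it than those in the middle
p_rate <- ifelse(opinion >= 8, 0.30,
          ifelse(opinion <= 2, 0.20, 0.05))
rated <- runif(n_viewers) < p_rate
# average opinion of all viewers versus average of submitted ratings
true_mean <- mean(opinion)
observed_mean <- mean(opinion[rated])
print(c(true_mean = true_mean, observed_mean = observed_mean))
```

Under these assumed probabilities the observed mean sits above the mean over all viewers, because enthusiasts are over-represented among raters. Changing `p_rate` changes the size and even the direction of the gap, which is the point: without knowing the response mechanism, the published average is hard to interpret.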

The Ghostbusters argy bargy (or one of them) is that reviews are suspected to be coming from people who have not seen the movie. This is possibly a problem for all reviews on IMDb, though less so for reviews appearing on streaming services, which know when you have seen a film. (The other argy bargy is that sexist and racist jerks have been harassing stars of the new film.) The analysis on FiveThirtyEight is informative, but uses information (e.g. age and sex of the reviewers) that is not widely available, and which is volunteered by the reviewers. Given the IMDb mirror at my disposal, I can look for systematic biases for films based on sex, and will do so here.

I …

read more

IMDb Rating by Actor Age

Wed 13 July 2016 by Steven E. Pav

I recently looked at IMDb ratings for Robert De Niro movies, finding slight evidence for a dip in ratings in his third act. I noted then that the data were subject to all kinds of selection biases, and that even in a perfect world would only reflect the ratings of movies that De Niro was in, not of his individual performance. I speculated that older actors might no longer be offered parts in good movies. This is something that can be explored via the IMDb mirror at my disposal, but only very weakly: if actors 'stopped caring' after a certain age, or declined in abilities, or even if IMDb raters simply liked movies with more young people, one might see the same patterns in the data. Despite these caveats, let us press on.

That struts and frets his hour upon the stage

First, I collect all movies which are not marked as Documentary in the data, which have a production year between 1965 and 2015, and which have at least 250 votes on IMDb. This does present a selection bias towards better movies in the earlier period, which we will have to correct for. I then collect actors and actresses with a known date of birth who have featured in at least 30 of these films. I bring them into R via dplyr, and then subselect to observations where the actor was between 18 and 90 in the production year of the film. This should look like a lot of blah blah blah, but you can follow along at home if you have the mirror, which you can install yourself.

library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port …
read more

Analyze This: Robert De Niro and IMDb Average Ratings

Sat 09 July 2016 by Steven E. Pav

I recently saw a plot purporting to show the Rotten Tomatoes 'freshness' rating of Robert De Niro movies over the years, with some chart fluff suggesting 'Bobby' has essentially been phoning it in since 2002, at age 59. Somehow I wrote and maintain a mirror of IMDb which would be well suited to explore questions of this kind. Since I am inherently a skeptical person, I decided to look for myself.

You talkin' to me?

First, we grab the 'acts in' table from the MariaDB via dplyr. I found that working with dplyr allowed me to very quickly switch between in-database processing and 'real' analysis in R, and I highly recommend it. Then we get information about De Niro, and join with information about his movies, and the votes for the same:

library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# acts in relation
acts_in <- tbl(dbcon,'cast_info') %>%
    inner_join(tbl(dbcon,'role_type') %>% 
        filter(role %regexp% 'actor|actress'),
        by='role_id')
# Robert De Niro, as a person:
bobby <- tbl(dbcon,'name') %>%
    filter(name %regexp% 'De Niro, Robert$') %>%
    select(name,gender,dob,person_id)
# all movies:
titles <- tbl(dbcon,'title') 
# his movies:
all_bobby_movies <- acts_in %>%
    inner_join(bobby,by='person_id') %>%
    left_join(titles,by='movie_id')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
    inner_join(tbl(dbcon,'info_type') %>% 
        filter(info %regexp% 'genres') %>%
        select(info_type_id),
        by='info_type_id') 
# get rid of _documentaries_ :
bobby_movies <- all_bobby_movies %>% 
    anti_join(movie_genres %>% 
        filter(info %regexp% 'Documentary'),by='movie_id')
# get votes for all movies:
vote_info <- tbl(dbcon,'movie_votes') %>% 
    select(movie_id,votes,vote_mean,vote_sd,vote_se)
# votes for De Niro movies:
bobby_votes <- bobby_movies %>%
    inner_join(vote_info,by='movie_id')
# now collect them:
bv <- bobby_votes %>% collect() 
# sort it
bv <- bv %>% 
    distinct(movie_id,.keep_all=TRUE …
read more

Overfit Like a Pro

Tue 24 May 2016 by Steven E. Pav

Earlier this year, I participated in the Winton Stock Market Challenge on Kaggle. I wanted to explore the freely available tools in R for performing what I had routinely done in Matlab in my previous career, I was curious how a large investment management firm (and Kagglers) approached this problem, and I wanted to be eyewitness to a potential overfitting disaster, should one occur.

The setup should be familiar: for selected (date, stock) pairs you are given 25 state variables, the two previous days of returns, and the first 120 minutes of returns. You are to predict the remaining 60 minutes of returns of that day and the following two days of returns for the stock. The metric used to score your predictions is a weighted mean absolute error, where presumably higher-volatility names are downweighted in the final error metric. The training data consist of 40K observations, while the test data consist of 120K rows, for which one had to produce 744K predictions. First prize was a cool $20K. In addition to the prizes, Winton was explicitly looking for resumes.
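For readers unfamiliar with the metric, here is a minimal sketch of a weighted mean absolute error in R. The function name `wmae`, the normalization by the sum of the weights, and the toy numbers are all assumptions for illustration; the competition's exact weighting scheme is not reproduced here.

```r
# weighted mean absolute error: each absolute error is scaled by a
# caller-supplied non-negative weight, then normalized by the total weight
# (the normalization is an assumption, not the competition's definition)
wmae <- function(actual, predicted, weights) {
    stopifnot(length(actual) == length(predicted),
              length(actual) == length(weights),
              all(weights >= 0), sum(weights) > 0)
    sum(weights * abs(actual - predicted)) / sum(weights)
}
# toy example with made-up returns, an all-zero forecast, and made-up weights
actual    <- c(0.010, -0.020, 0.005)
predicted <- c(0.000,  0.000, 0.000)
weights   <- c(1, 0.5, 2)
print(wmae(actual, predicted, weights))
```

With equal weights this reduces to the ordinary mean absolute error. Giving a smaller weight to a noisier observation, as with the second entry above, keeps a few volatile names from dominating the score.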

I suspected that this competition would provide valuable data in my study of human overfitting of trading strategies. Recall that the public leaderboard is what participants see of their submissions during the competition period, based on around one quarter of the test set data, while the private leaderboard is the score of predictions on the remaining part of the test data, and is published in a big reveal at the close of the competition. Towards that end, let us gather the public and private leaderboards.
(Those of you who want to play along at home can download my cut of the data.)

library(dplyr)
library(rvest)

# a function to load and process a leaderboard …
read more