Gilgamath

Analyze This: Robert De Niro and IMDb Average Ratings

Sat 09 July 2016 by Steven E. Pav

I recently saw a plot purporting to show the Rotten Tomatoes' 'freshness' rating of Robert De Niro movies over the years, with some chart fluff suggesting 'Bobby' has essentially been phoning it in since 2002, at age 59. Somehow I wrote and maintain a mirror of IMDb which would be well suited to explore questions of this kind. Since I am inherently a skeptical person, I decided to look for myself.

You talkin' to me?

First, we grab the 'acts in' table from the MariaDB via dplyr. I found that working with dplyr allowed me to very quickly switch between in-database processing and 'real' analysis in R, and I highly recommend it. Then we get information about De Niro, and join with information about his movies, and the votes for the same:

library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# acts in relation
acts_in <- tbl(dbcon,'cast_info') %>%
    inner_join(tbl(dbcon,'role_type') %>% 
        filter(role %regexp% 'actor|actress'),
        by='role_id')
# Robert De Niro, as a person:
bobby <- tbl(dbcon,'name') %>%
    filter(name %regexp% 'De Niro, Robert$') %>%
    select(name,gender,dob,person_id)
# all movies:
titles <- tbl(dbcon,'title') 
# his movies:
all_bobby_movies <- acts_in %>%
    inner_join(bobby,by='person_id') %>%
    left_join(titles,by='movie_id')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
    inner_join(tbl(dbcon,'info_type') %>% 
        filter(info %regexp% 'genres') %>%
        select(info_type_id),
        by='info_type_id') 
# get rid of _documentaries_ :
bobby_movies <- all_bobby_movies %>% 
    anti_join(movie_genres %>% 
        filter(info %regexp% 'Documentary'),by='movie_id')
# get votes for all movies:
vote_info <- tbl(dbcon,'movie_votes') %>% 
    select(movie_id,votes,vote_mean,vote_sd,vote_se)
# votes for De Niro movies:
bobby_votes <- bobby_movies %>%
    inner_join(vote_info,by='movie_id')
# now collect them:
bv <- bobby_votes %>% collect() 
# sort it
bv <- bv %>% 
    distinct(movie_id,.keep_all=TRUE …

Overfit Like a Pro

Tue 24 May 2016 by Steven E. Pav

Earlier this year, I participated in the Winton Stock Market Challenge on Kaggle. I wanted to explore the freely available tools in R for performing what I had routinely done in Matlab in my previous career, I was curious how a large investment management firm (and Kagglers) approached this problem, and I wanted to be eyewitness to a potential overfitting disaster, should one occur.

The setup should be familiar: for selected date, stock pairs you are given 25 state variables, the two previous days of returns, and the first 120 minutes of returns. You are to predict the remaining 60 minutes of returns of that day and the following two days of returns for the stock. The metric used to score your predictions is a weighted mean absolute error, where presumably higher volatility names are downweighted in the final error metric. The training data consist of 40K observations, while the test data consist of 120K rows, for which one had to produce 744K predictions. First prize was a cool $20K. In addition to the prizes, Winton was explicitly looking for resumes.

I suspected that this competition would provide valuable data in my study of human overfitting of trading strategies. Towards that end, let us gather the public and private leaderboards. Recall that the public leaderboard is what participants see of their submissions during the competition period, based on around one quarter of the test set data, while the private leaderboard is the score of predictions on the remaining part of the test data, and is published in a big reveal at the close of the competition. Let's gather the leaderboard data.
(Those of you who want to play along at home can download my cut of the data.)

library(dplyr)
library(rvest)

# a function to load and process a leaderboard …

You Deserve Expensive Champagne ... If You Buy It.

Sat 26 December 2015 by Steven E. Pav

I received some taster ratings from the champagne party we attended last week. I joined the raw ratings with the bottle information to create a single aggregated dataset. This is a 'non-normal' form, but simplest to distribute. Here is a taste:

library(dplyr)
library(readr)
library(knitr)
champ <- read_csv('../data/champagne_ratings.csv')
champ %>% select(winery,purchase_price_per_liter,raternum,rating) %>% 
    head(8) %>% kable(format='markdown')

winery	purchase_price_per_liter	raternum	rating
Barons de Rothschild	80.00000	1	10
Onward Petillant Naturel 2014 Malavasia Bianca	33.33333	1	4
Chandon Rose Method Traditionnelle	18.66667	1	8
Martini Prosecco from Italy	21.32000	1	8
Roederer Estate Brut	33.33333	1	8
Kirkland Asolo Prosecco Superiore	9.32000	1	7
Champagne Tattinger Brute La Francaise	46.66667	1	6
Schramsberg Reserver 2001	132.00000	1	6

Recall that the rules of the contest dictate that the average rating of each bottle was computed, then divided by 25 dollars more than the price (presumably for a 750ml bottle). Depending on whether the average ratings were compressed around the high end of the zero to ten scale, or around the low end, one would wager on either the cheapest bottles, or more moderately priced offerings. (Based on my previous analysis, I brought the Menage a Trois Prosecco, rated at 91 points, but available at Safeway for 10 dollars.) It is easy to compute the raw averages using dplyr:

avrat <- champ %>% 
    group_by(winery,bottle_num,purchase_price_per_liter) %>%
    summarize(avg_rating=mean(rating)) %>%
    ungroup() %>%
    arrange(desc(avg_rating))
avrat %>% head(8) %>% kable(format='markdown')

winery	bottle_num	purchase_price_per_liter	avg_rating
Desuderi Jeio	4	22.66667	6.750000
Gloria Ferrer Sonoma Brut	19	20.00000	6.750000
Roederer Estate Brut	12	34.66667	6.642857
Charles Collin Rose	34	33.33333	6.636364
Roederer Estate Brut	13	33.33333	6.500000
Gloria Ferrer Sonoma Brut	11 …

Champagne Party

Thu 17 December 2015 by Steven E. Pav

We have been invited to a champagne tasting party and competition. The rules of the contest are as follows: partygoers bring a bottle of champagne to share. They taste, then rate the different champagnes on offer, with ratings on a scale of 1 through 10. The average rating is computed for each bottle, then divided by the price (plus some offset) to arrive at an adjusted quality score. The champagne with the highest score nets a prize, and considerable bragging rights, for its owner. Presumably the offset is introduced to prevent small denominators from dominating the rating, and is advertised to have a value of around $25. The 'price' is, one infers, for a standard 750 ml bottle.

I decided to do my homework for a change, rather than SWAG it. I have been doing a lot of web scraping lately, so it was pretty simple to gather some data on champagnes from wine dot com. This file includes the advertised and sale prices, as well as advertised ratings from Wine Spectator (WS), Wine Enthusiast (WE), and so on. Some of the bottles are odd sizes, so I compute the cost per liter as well. (By the way, many people would consider the data collection the hard part of the problem. rvest made it pretty easy, though.) Here's a taste:

library(dplyr)
library(magrittr)
champ <- read.csv('../data/champagne.csv')
champ %>% arrange(price_per_liter) %>% head(10) %>% kable(format='markdown')

name	price	sale_price	WS	WE	WandS	WW	TP	JS	ST	liters	price_per_liter
Pol Clement Rose Sec	8.99	NA	NA	NA	NA	NA	NA	NA	NA	0.75	12.0
Freixenet Carta Nevada Brut	8.99	NA	NA	NA	NA	NA	NA	NA	NA	0.75	12.0
Wolf Blass Yellow Label Brut	8.99	NA	NA	NA	NA	NA	NA	NA …