Analyze This: Robert De Niro and IMDb Average Ratings
Sat 09 July 2016
by
Steven E. Pav
I recently saw a plot purporting to show the Rotten Tomatoes 'freshness'
rating of Robert De Niro movies over the years, with some chart fluff
suggesting 'Bobby' has essentially been phoning it in since 2002, at
age 59. Somehow I wrote and maintain a mirror of IMDb which would be
well suited to exploring questions of this kind. Since I am inherently
a skeptical person, I decided to look for myself.
You talkin' to me?
First, we grab the 'acts in' table from the MariaDB database via dplyr. I found
that working with dplyr allowed me to very quickly switch between in-database
processing and 'real' analysis in R, and I highly recommend it. Then we get
information about De Niro, and join with information about his movies,
and the votes for the same:
library(RMySQL)
library(dplyr)
library(knitr)
# get the connection and set to UTF-8 (probably not necessary here)
dbcon <- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt <- dbGetQuery(dbcon$con,'SET NAMES utf8')
# acts in relation
acts_in <- tbl(dbcon,'cast_info') %>%
  inner_join(tbl(dbcon,'role_type') %>%
      filter(role %regexp% 'actor|actress'),
    by='role_id')
# Robert De Niro, as a person:
bobby <- tbl(dbcon,'name') %>%
  filter(name %regexp% 'De Niro, Robert$') %>%
  select(name,gender,dob,person_id)
# all movies:
titles <- tbl(dbcon,'title')
# his movies:
all_bobby_movies <- acts_in %>%
  inner_join(bobby,by='person_id') %>%
  left_join(titles,by='movie_id')
# genre information
movie_genres <- tbl(dbcon,'movie_info') %>%
  inner_join(tbl(dbcon,'info_type') %>%
      filter(info %regexp% 'genres') %>%
      select(info_type_id),
    by='info_type_id')
# get rid of _documentaries_ :
bobby_movies <- all_bobby_movies %>%
  anti_join(movie_genres %>%
      filter(info %regexp% 'Documentary'),by='movie_id')
# get votes for all movies:
vote_info <- tbl(dbcon,'movie_votes') %>%
  select(movie_id,votes,vote_mean,vote_sd,vote_se)
# votes for De Niro movies:
bobby_votes <- bobby_movies %>%
  inner_join(vote_info,by='movie_id')
# now collect them:
bv <- bobby_votes %>% collect()
# sort it
bv <- bv %>%
  distinct(movie_id,.keep_all=TRUE …
read more
Overfit Like a Pro
Tue 24 May 2016
by
Steven E. Pav
Earlier this year, I participated in the Winton Stock Market Challenge
on Kaggle. I had three motives: I wanted to explore the freely available
tools in R for performing what I had routinely done in Matlab
in my previous career; I was curious how a large
investment management firm (and Kagglers)
approached this problem; and I wanted to be an eyewitness to a potential
overfitting disaster, should one occur.
The setup should be familiar: for selected (date, stock) pairs you are
given 25 state variables, the two previous days of returns, and the
first 120 minutes of returns. You are to predict the remaining 60 minutes
of returns of that day and the following two days of returns for the stock.
The metric used to score your predictions is a weighted mean absolute error,
where presumably higher volatility names are downweighted in the final
error metric. The training data consist of 40K observations, while
the test data consist of 120K rows, for which one had to produce 7.44M
predictions (62 per row: 60 minutes plus two days). First prize was a cool $20K.
In addition to the prizes, Winton was explicitly looking for resumes.
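For concreteness, a weighted mean absolute error over \(n\) predictions is generically written as below. This is only the textbook form: the symbols \(w_i\), \(y_i\) and \(\hat{y}_i\) (weight, realized return and predicted return) are notation assumed for this sketch, and how the weights were actually chosen is not something the setup above tells us.

```latex
\mathrm{WMAE} = \frac{1}{n} \sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|
```

Under this form, downweighting higher volatility names would simply mean assigning them smaller \(w_i\).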
I suspected that this competition would provide valuable data
for my study of human overfitting of trading strategies. Toward
that end, let us gather the public and private leaderboards.
Recall that the public leaderboard is what participants see of
their submissions during the competition period, based on around
one quarter of the test set data, while the private leaderboard
is the score of predictions on the remaining part of the test data,
and is published in a big reveal at the close of the competition.
Let's gather the leaderboard data.
(Those of you who want to play along at home can download
my cut of the data.)
library(dplyr)
library(rvest)
# a function to load and process a leaderboard …
read more
CRAN check like a bot with docker.
Tue 08 March 2016
by
Steven
If you're like me, you just blindly check boxes when submitting packages to CRAN. (The
'submit' button should be labeled 'yolo' as far as I'm concerned.) After getting
burned yet again for not actually checking my package with the development build
of R, I decided to be slightly less stupid in the future. Rather than install
R-devel, I made a docker base image
for CRAN checking.
As an example, to check my sadists package,
I made essentially the following Dockerfile:
# preamble#
FROM shabbychef/crancheck
MAINTAINER Steven E. Pav, shabbychef@gmail.com
# tweak this to force re-install
ENV DOCKER_INSTALL_NONCE 97c22800_9f88_4830_806a_2614e06600f2
# rinstall somethings...
RUN /usr/local/bin/install2.r PDQutils hypergeo orthopolynom shiny testthat ggplot2 xtable knitr
It starts FROM the crancheck image on Docker Hub. The general recipe would be to install any
system packages via apt-get, then any CRAN packages via install2.r, then any GitHub packages
via /usr/local/bin/installGithub.r. The base image 'does the right thing' with respect to the
entrypoint and you give the package file as the command.
I built it via:
docker build --rm -t shabbychef/sadists-crancheck docker/
Once the image is built, checking a package is as 'simple' as attaching the local directory
as /srv in the container via a volume, and giving the name of the package file. (That is,
when the command to the container is sadists_0.2.2.5000.tar.gz, it will try to check, as CRAN,
the file /srv/sadists_0.2.2.5000.tar.gz. You had better make sure it is available there,
so attach the directory containing the package to /srv in the container.)
In summary, run it like this:
docker run -it --rm --volume $(pwd):/srv:ro shabbychef/sadists-crancheck sadists_0.2.2.5000.tar.gz
You get output as follows:
read more
It's Madness!
Sat 02 January 2016
by
Steven E. Pav
I recently released a package to CRAN called madness. The eponymous object
supports 'multivariate' automatic
differentiation by forward accumulation. By 'multivariate', I mean it allows
you to track (and automatically computes) the derivative of a scalar, or
vector, or matrix, or multidimensional array with respect to a scalar, vector,
matrix or multidimensional array.
The primary use case in mind is the multivariate delta method,
where one has an estimate of a population quantity and the variance-covariance
of the same, and wants to perform inference on some transform of that
population quantity. With the values stored in a madness object, one merely
performs the transforms directly on the estimate, and the derivatives are
computed automatically. A secondary use case would be for the automatic
computation of gradients when optimizing some complex function, e.g. in the
computation of the MLE of some quantity.
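To make the delta method concrete, here is the usual first-order statement, written as a sketch under the standard regularity assumptions (a differentiable transform and an approximately normal estimator). The symbols \(\hat{\theta}\), \(\Sigma\), \(f\) and \(J\) are notation introduced for this illustration, not names from the package.

```latex
\hat{\theta} \approx \mathcal{N}\left(\theta, \Sigma\right)
\quad \Longrightarrow \quad
f(\hat{\theta}) \approx \mathcal{N}\left(f(\theta),\, J \Sigma J^{\top}\right),
\qquad
J_{ij} = \frac{\partial f_i}{\partial \theta_j}.
```

In practice \(J\) is evaluated at \(\hat{\theta}\); it is exactly the derivative that forward accumulation carries along, so only the estimate and its variance-covariance have to be supplied by hand.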
A madness object contains a value, val, as well as the derivative of
val with respect to some \(X\), called dvdx. The derivative is stored
as a matrix in 'numerator layout' convention: if val holds
\(m\) values, and \(X\) holds \(n\) values, then dvdx is an \(m \times n\) matrix.
This unfortunately means that a gradient is stored as a row vector.
Numerator layout feels more natural (to me, at least) when propagating
derivatives via the chain rule.
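Spelled out, with \(v\) standing for val and a hypothetical further transform \(w = g(v)\) holding \(p\) values (both symbols are only for illustration, and \(v\), \(w\) and \(X\) are treated as flattened vectors), numerator layout and the chain rule read roughly as follows:

```latex
\left[\frac{\mathrm{d} v}{\mathrm{d} X}\right]_{ij} = \frac{\partial v_i}{\partial X_j},
\qquad
\underbrace{\frac{\mathrm{d} w}{\mathrm{d} X}}_{p \times n}
= \underbrace{\frac{\mathrm{d} w}{\mathrm{d} v}}_{p \times m}
\, \underbrace{\frac{\mathrm{d} v}{\mathrm{d} X}}_{m \times n}.
```

The factors multiply in the same order as the composition, with no transposes, which is one sense in which this layout is convenient. For a scalar \(v\) (that is, \(m = 1\)) the derivative is the \(1 \times n\) row vector mentioned above.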
For convenience, one can also store the 'tags' of the value and \(X\), in
vtag and xtag, respectively. The vtag will be modified when computations
are performed, which can be useful for debugging. One can also store
the variance-covariance matrix of \(X\) in varx.
Here is an example session showing the use of a madness object. Note that by
default if one does not feed in dvdx, the object constructor assumes that
the value is equal to \(X\), and so …
read more