gilgamath


Is it Blockbuster Season?

Tue 28 June 2016 by Steven E. Pav

I recently released a docker-compose-based 'solution' for creating an IMDb mirror. This was one by-product of my ill-fated foray into Hollywood. The ETL process removes TV shows, straight-to-video, porn, and most hobby projects from the larger IMDb FTP dump; uses imdb2sql.py to stuff the data into a database; and then converts some of the text-based data into numeric data. For sanity checking, and to illustrate basic usage, I look here at the seasonality of gross box office receipts.

Seasonality is a good test case because it is not subtle: you should not need a fancy statistical test to detect its existence. Seasonality was one of the features of the industry that the crusty old industry folk (and I say that with true admiration) could discuss in great detail, with its many subtleties. To a first-order approximation, though, we expect to see a flurry of big-budget blockbusters in the early summer, right as college lets out, higher sales throughout the summer, then peaks in November and December (again, tied to college breaks).
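As a rough illustration of this kind of sanity check, the sketch below totals gross receipts by calendar month and reports each month's share. The data frame `receipts`, its columns `release_date` and `gross_usd`, and the numbers in it are all made up for illustration; they are not the schema or contents of the IMDb mirror.

```r
# Toy data: hypothetical release dates and gross receipts (not real IMDb figures).
receipts <- data.frame(
  release_date = as.Date(c("2015-01-16", "2015-05-22", "2015-06-12",
                           "2015-07-03", "2015-11-20", "2015-12-18")),
  gross_usd    = c(10e6, 80e6, 120e6, 90e6, 100e6, 150e6)
)

# Calendar month of release as an integer in 1..12.
receipts$month <- as.integer(format(receipts$release_date, "%m"))

# Total gross per month. Using a factor with all twelve levels keeps
# months without releases in the result; tapply marks them NA, which
# we then set to zero.
monthly_gross <- tapply(receipts$gross_usd,
                        factor(receipts$month, levels = 1:12),
                        sum)
monthly_gross[is.na(monthly_gross)] <- 0

# Each month's share of the total gross.
monthly_share <- monthly_gross / sum(monthly_gross)
print(round(monthly_share, 3))
```

With real data, a summer bump and a November/December peak in `monthly_share` would be the crude signature of the seasonality described above.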

The Data

If you want to play along, you will have to go get the IMDb mirror, and run it. This can take upwards of an hour to download (and I suspect that the bottleneck is not your local internet connection, but rather the FTP server), and perhaps another hour for the ETL process. When this was my bread and butter, I worked hard to cut down processing time. It will not get much faster without a replacement of imdb2sql.py, or a switch to a non-insane upstream format initiated by the people at IMDb. (Good luck with that.)

Now, how many movies report opening weekend numbers in the USA, in dollars?

library(RMySQL)
library(dbplyr)
library(dplyr)
library(knitr)
dbcon <- src_mysql(host='0.0.0 …
read more

Overfit Like a Pro

Tue 24 May 2016 by Steven E. Pav

Earlier this year, I participated in the Winton Stock Market Challenge on Kaggle. I had three motives: I wanted to explore the freely available tools in R for performing what I had routinely done in Matlab in my previous career; I was curious how a large investment management firm (and Kagglers) approached this problem; and I wanted to be eyewitness to a potential overfitting disaster, should one occur.

The setup should be familiar: for selected (date, stock) pairs you are given 25 state variables, the two previous days of returns, and the first 120 minutes of returns. You are to predict the remaining 60 minutes of returns of that day and the following two days of returns for the stock, or 62 values per row. The metric used to score your predictions is a weighted mean absolute error, where presumably higher volatility names are downweighted in the final error metric. The training data consist of 40K observations, while the test data consist of 120K rows, for which one had to produce 7.44M predictions (62 for each row). First prize was a cool $20K. In addition to the prizes, Winton was explicitly looking for resumes.
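For concreteness, a weighted mean absolute error in its generic form is the weighted sum of absolute errors divided by the sum of the weights. The sketch below implements that generic form only; the function name `wmae` and the toy numbers are illustrative, and the competition's exact weights and normalization may differ.

```r
# Generic weighted mean absolute error: sum(w * |y - yhat|) / sum(w).
# This is a sketch of the general metric, not the competition's scorer.
wmae <- function(actual, predicted, weights) {
  stopifnot(length(actual) == length(predicted),
            length(actual) == length(weights),
            all(weights >= 0),
            sum(weights) > 0)
  sum(weights * abs(actual - predicted)) / sum(weights)
}

# Toy example with made-up returns and weights: score an all-zeros forecast.
actual    <- c(0.010, -0.020, 0.005)
predicted <- c(0.000,  0.000, 0.000)
weights   <- c(1, 0.5, 2)
print(wmae(actual, predicted, weights))
```

Observations with smaller weights contribute less to the score, which is how higher volatility names could be downweighted.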

I suspected that this competition would provide valuable data in my study of human overfitting of trading strategies. Towards that end, let us gather the public and private leaderboards. Recall that the public leaderboard is what participants see of their submissions during the competition period, based on around one quarter of the test set data, while the private leaderboard is the score of predictions on the remaining part of the test data, and is published in a big reveal at the close of the competition.
(Those of you who want to play along at home can download my cut of the data.)

library(dplyr)
library(rvest)

# a function to load and process a leaderboard …
read more

R in Finance 2016

Fri 20 May 2016 by Steven

Review of R in Finance 2016 conference

read more

Getting Hired as a Data Scientist

Thu 19 May 2016 by Steven

A few months back I wrote about my experiences trying to hire a data scientist. It took some amount of work on our part. When we finally found the right candidate, our parent company told us that there wasn't actually any money to pay a candidate. This came as rather a surprise to all of us at our three-person startup. This was the first indication that the wheels were coming off the bus, and two months later, we were all laid off and the company dissolved. Within just three months I went from hiring to scrambling for a job. Would I follow my own advice for job candidates? What's the startup climate like? Is it easy to find a job in the field?

getting a foot in the door

I decided to make a habit of (nearly) always writing a cover letter, although I quickly settled on two or three cover letter templates, depending on the job function and industry. I found the address of each company and included it in the letter, mostly to confirm that the office was in the city of San Francisco. When I submitted an application, I would save the letter in my (private) applications repo on GitHub, with a message. I had been warned by a friend who works in HR that cover letters were ignored in her office. In my experience, a cover letter, even a somewhat generic one, set the serious candidates apart from casual applicants.

I also made the not uncontroversial decision to send out my lengthy CV, rather than a one or two page resume. My thinking here was that it is easier for a hiring manager to read more details in your CV if they are interested than to try to infer details from a terse one page …

read more