## Overfit Like a Pro

Tue 24 May 2016
by

Steven E. Pav
Earlier this year, I participated in the
Winton Stock Market Challenge
on Kaggle. I wanted to explore the freely available
tools in R for performing what I had routinely done in Matlab
in my previous career, I was curious how a large
investment management firm (and Kagglers)
approached this problem, and I wanted to be eyewitness to a potential
overfitting disaster, should one occur.

The setup should be familiar: for selected date, stock pairs you are
given 25 state variables, the two previous days of returns, and the
first 120 minutes of returns. You are to predict the remaining 60 minutes
of returns of that day and the following two days of returns for the stock.
The metric used to score your predictions is a
weighted mean absolute error,
where presumably higher volatility names are downweighted in the final
error metric. The training data consist of 40K observations, while
the test data consist of 120K rows, for which one had to produce 744K
predictions. First prize was a cool $20K. In addition to the prizes,
Winton was explicitly looking for resumes.

I suspected that this competition would provide valuable data
in my study of human overfitting of trading strategies. Towards
that end, let us gather the public and private
leaderboards.
Recall that the public leaderboard is what participants see of
their submissions during the competition period, based on around
one quarter of the test set data, while the private leaderboard
is the score of predictions on the remaining part of the test data,
and is published in a big reveal at the close of the competition.
Let's gather the leaderboard data.

(Those of you who want to play along at home can download
my cut of the data.)

library(dplyr)
library(rvest)
# a function to load and process a leaderboard …

read more
## You Deserve Expensive Champagne ... If You Buy It.

Sat 26 December 2015
by

Steven E. Pav
I received some taster ratings from the champagne party we
attended last week.
I joined the raw ratings with the bottle information to
create a single aggregated dataset.
This is a 'non-normal' form, but simplest to distribute. Here is a taste:

library(dplyr)
champ <- read.csv('../data/champagne_ratings.csv',stringsAsFactors=FALSE)
champ %>% select(winery,purchase_price_per_liter,raternum,rating) %>%
head(8) %>% kable(format='markdown')

winery |
purchase_price_per_liter |
raternum |
rating |

Barons de Rothschild |
80.00000 |
1 |
10 |

Onward Petillant Naturel 2014 Malavasia Bianca |
33.33333 |
1 |
4 |

Chandon Rose Method Traditionnelle |
18.66667 |
1 |
8 |

Martini Prosecco from Italy |
21.32000 |
1 |
8 |

Roederer Estate Brut |
33.33333 |
1 |
8 |

Kirkland Asolo Prosecco Superiore |
9.32000 |
1 |
7 |

Champagne Tattinger Brute La Francaise |
46.66667 |
1 |
6 |

Schramsberg Reserver 2001 |
132.00000 |
1 |
6 |

Recall that the rules of the contest dictate that the average rating of each
bottle was computed, then divided by 25 dollars more than the
price (presumably for a 750ml bottle). Depending on whether the average
ratings were compressed around the high end of the zero to ten scale,
or around the low end, one would wager on either the cheapest bottles, or more
moderately priced offerings. (Based on my
previous analysis, I brought the
*Menage a Trois Prosecco*, rated at 91 points, but available at Safeway for
10 dollars.) It is easy to compute the raw averages using `dplyr`

:

avrat <- champ %>%
group_by(winery,bottle_num,purchase_price_per_liter) %>%
summarize(avg_rating=mean(rating)) %>%
ungroup() %>%
arrange(desc(avg_rating))
avrat %>% head(8) %>% kable(format='markdown')

winery |
bottle_num |
purchase_price_per_liter |
avg_rating |

Desuderi Jeio |
4 |
22.66667 |
6.750000 |

Gloria Ferrer Sonoma Brut |
19 |
20.00000 |
6.750000 |

Roederer Estate Brut |
12 |
34.66667 |
6.642857 |

Charles Collin Rose |
34 |
33.33333 |
6.636364 |

Roederer Estate Brut |
13 |
33.33333 |
6.500000 |

Gloria Ferrer Sonoma Brut |
11 |
21 … |

read more
## Champagne Party

Thu 17 December 2015
by

Steven E. Pav
We have been invited to a champagne tasting party and competition.
The rules of the contest are as follows: partygoers bring a bottle
of champagne to share. They taste, then rate the different
champagnes on offer, with ratings on a scale of 1 through 10.
The average rating is computed for each bottle, then
divided by the price (plus some offset) to arrive at an
adjusted quality score. The champagne with the highest score
nets a prize, and considerable bragging rights, for its owner.
Presumably the offset is introduced to prevent small denominators
from dominating the rating, and is advertised to have a value of
around $25. The 'price' is, one infers, for a standard 750 ml bottle.

I decided to do my homework for a change, rather than SWAG it.
I have been doing a lot of web scraping lately, so it was pretty
simple to gather some data on champagnes
from wine dot com. This file includes the advertised and sale prices,
as well as advertised ratings from Wine Spectator (WS), Wine Enthusiast
(WE), and so on. Some of the bottles are odd sizes, so I compute the
cost per liter as well. (By the way, many people would consider the data
collection the hard part of the problem. `rvest`

made it pretty easy, though.)
Here's a taste:

library(dplyr)
library(magrittr)
champ <- read.csv('../data/champagne.csv')
champ %>% arrange(price_per_liter) %>% head(10) %>% kable(format='markdown')

name |
price |
sale_price |
WS |
WE |
WandS |
WW |
TP |
JS |
ST |
liters |
price_per_liter |

Pol Clement Rose Sec |
8.99 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
0.75 |
11.99 |

Freixenet Carta Nevada Brut |
8.99 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
0.75 |
11.99 |

Wolf Blass Yellow Label Brut |
8.99 |
NA |
NA |
NA |
NA |
NA |
NA |
NA … |

read more