Gilgamath

Chess Tactics.

Thu 30 March 2017 by Steven E. Pav

I have become more interested in chess in the last year, though I'm still pretty much crap at it. Rather than play games, I am practicing tactics at chesstempo. Basically you are presented with a chess puzzle, which is selected based on your estimated tactical 'Elo' rating, and your rating (and the puzzle's) is adjusted based on whether you solve it correctly. (Without time limit for standard problems, though I believe one can also train in 'blitz' mode.) I decided to look at the data.

I have a few reasons for this exercise:

To see if I could do it. You cannot easily download your stats from the site unless pay for gold membership. (I skimped and bought a silver.) I wanted to practice my web scraping skills, which I have not exercised in a while.
To see if the site's rating system made sense as a logistic regression, and were consistent with the 'standard' definition of Elo rating.
To see if I was getting better.
To see if there was anything simple I could do to improve, like take longer for problems, or practice certain kinds of problems.
To look for 'hot hands' phenomenon, which would translate into autocorrelated residuals.

The bad and the ugly

Scraping my statistics into a CSV turned out to be fairly straightforward. The statistics page will look uninteresting if you are not a member. Even if you are, the data themselves are served via JavaScript, not in raw HTML. While this could in theory be solved via, say, phantomJS, I opted to work with the developer console in Chrome directly.

First go to your statistics page in Chrome. Then conjure the developer console by pressing <CTRL>-<SHIFT>-I. A frame should appear. Click on the 'Console' tab, then type in it: copy(document.body …

Lego Pricing.

Mon 06 March 2017 by Steven E. Pav

It is time to get kiddo a new Lego set, as he's been on a bender this week, building everything he can get his hands on. I wanted to optimize play time per dollar spent, so I set out to look for Lego pricing data.

Not surprisingly, there are a number of good sources for this data. The best I found was at brickset. Sign up for an account, then go to their query builder. I built a query requesting all sets from 2011 onwards, then selected the CSV option, copied the data into my clipboard, then dumped it via xclip -o > brickset_db.csv. The brickset data is updated over time, so there's no reason to prefer my file to one you download yourself.

First I load the data in R, filter based on availability of Piece and Price data, then remove certain themes (Books, Duplo, and so on). I then subselect themes based on having a large range of prices and of number of pieces:

library(readr)
library(dplyr)
indat <- readr::read_csv('../data/brickset_db.csv') %>%
    select(Number,Theme,Subtheme,Year,Name,Pieces,USPrice)

## Rows: 4843 Columns: 18

## -- Column specification ----------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (7): Number, Theme, Subtheme, Name, ImageURL, Owned, Wanted
## dbl (10): SetID, Variant, Year, Minifigs, Pieces, UKPrice, USPrice, CAPrice,...
## lgl  (1): Rating

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

subdat <- indat %>%
    filter(!is.na(Pieces),Pieces >= 10,
                 !is.na(USPrice),USPrice > 1,
                 !grepl('^(Books|Mindstorms|Duplo|.+Minifigures|Power Func|Games|Education|Serious)',Theme)) 

subok <- subdat %>%
    group_by(Theme) %>%
        summarize(many_sets=(sum(!is.nan(USPrice)) >= 10),
                     piece_spread=((max(Pieces) / min(Pieces)) >= 5),
                     price_spread=((max(USPrice) / min(USPrice)) >= 4)) %>%
    ungroup() %>%
    filter(many_sets & piece_spread & price_spread) %>% 
    select(-many_sets,-piece_spread,-price_spread)

subdat <- subdat %>%
    inner_join(subok,by='Theme …

Odds of Winning Your Oscar Pool.

Mon 30 January 2017 by Steven E. Pav

In a previous blog post, I used a Bradley-Terry model to analyze Oscar Best Picture winners, using the best picture dataset. In that post I presented the results of likelihood tests which showed 'significant' relationships between winning the Best Picture category and conomination for other awards, MovieLens ratings, and (spuriously) number of IMDb votes. It can be hard to interpret the effect sizes and \(t\) statistics from a Bradley-Terry model. So here I will try to estimate the probability of correctly guessing the Best Picture winner using this model.

There is no apparent direct translation from the coefficients of the model fit to the probability of correctly forecasting a winner. Nor can you transform the maximized likelihood, or an R-squared. Moreover, it will depend on the number of nominees (traditionally there were only 5 Best Picture nominations--these days it's upwards of 9), and how they differ in the independent variables. Here I will keep it simple and use cross validation.

I modified the oslm code to include a predict method. So here, I load the data and the code, and remove duplicates and restrict the data to the period after 1945. I construct the model formula, based on co-nomination, then test in three ways:

A purely 'in sample' validation where all the data are used to build the model, then tested. (The film with the highest forecast probability of winning is chosen as the predicted winner, of course.) This should give the most optimistic view of performance, even though the likelihood maximization problem does not directly select for this metric.
A walk-forward cross validation where the data up through year \(y-1\) are used to build the model, then it is used to forecast the winners in year \(y\). This is perhaps the most honest kind of cross validation for time …

Predicting Best Picture Winners.

Thu 26 January 2017 by Steven E. Pav

In a previous blog post, I described some data I had put together for predicting winners in the Best Picture category of the Oscars. Here I will use a Bradley-Terry model to describe this dataset.

To test these models, I wrote an R function called oslm. I have posted the code. This code allows one to model the likelihood of winning an award as a function of some independent variables on each film, taking into account that one and only one film wins in a given year. The code supports computation of the likelihood function (and gradient and Hessian), and allows the maxLik package to do the heavy lifting. Supposing one has a data frame with boolean column winner to denote winners and year to hold the award year, and some independent variables, say x1, x2, and so on. Then one can invoke this code as

modl <- oslm(winner:year ~ x1 + x2,data=my_dataframe)

This is a bit heterodox, using the colon in the left hand side. However, I wasn't sure where else to put it, and the code was not too vile to write. Since I did not know the name of this model, I did not know what existing packages supported this kind of analysis, so I wrote my own silly function.

Who's a winner?

Let's use this data and code instead of talking about it. First, I load the data and then rid it of duplicates (sorry about those), and convert some Boolean independent variables to numeric. I source the oslm code and then try a very simple model: looking at films from 1950 onward, can I predict the probability of winning Best Picture in terms of the (log of the) number of votes it receives on IMDb, stored in the votes variable:

library(readr)
library(dplyr …