It is time to get kiddo a new Lego set, as he's been on a bender this week, building everything he can get his hands on. I wanted to optimize play time per dollar spent, so I set out to look for Lego pricing data.

Not surprisingly, there are a number of good sources for this data. The best I found was at brickset. Sign up for an account, then go to their query builder. I built a query requesting all sets from 2011 onwards, then selected the CSV option, copied the data into my clipboard, then dumped it via xclip -o > brickset_db.csv. The brickset data is updated over time, so there's no reason to prefer my file to one you download yourself.

First I load the data in R, filter based on availability of Piece and Price data, then remove certain themes (Books, Duplo, and so on). I then subselect themes based on having a large range of prices and of number of pieces:

library(readr)
library(dplyr)
indat <- readr::read_csv('../data/brickset_db.csv') %>%
    select(Number,Theme,Subtheme,Year,Name,Pieces,USPrice) 
## Rows: 4843 Columns: 18
## -- Column specification ----------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (7): Number, Theme, Subtheme, Name, ImageURL, Owned, Wanted
## dbl (10): SetID, Variant, Year, Minifigs, Pieces, UKPrice, USPrice, CAPrice,...
## lgl  (1): Rating
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
subdat <- indat %>%
    filter(!is.na(Pieces),Pieces >= 10,
                 !is.na(USPrice),USPrice > 1,
                 !grepl('^(Books|Mindstorms|Duplo|.+Minifigures|Power Func|Games|Education|Serious)',Theme)) 

subok <- subdat %>%
    group_by(Theme) %>%
        summarize(many_sets=(sum(!is.nan(USPrice)) >= 10),
                     piece_spread=((max(Pieces) / min(Pieces)) >= 5),
                     price_spread=((max(USPrice) / min(USPrice)) >= 4)) %>%
    ungroup() %>%
    filter(many_sets & piece_spread & price_spread) %>% 
    select(-many_sets,-piece_spread,-price_spread)

subdat <- subdat %>%
    inner_join(subok,by='Theme')

set.seed(1234)
subdat %>% sample_n(10) %>% 
    kable(padding=10,digits=4,
                caption='Random sampling of some Lego sets',
                format='html')
Random sampling of some Lego sets
Number Theme Subtheme Year Name Pieces USPrice
10722 Juniors Ninjago 2016 Snake Showdown 92 14.99
41086 Friends Vet 2015 Vet Ambulance 89 9.99
70819 The LEGO Movie NA 2015 Bad Cop Car Chase 292 29.99
75073 Star Wars MicroFighters 2015 Vulture Droid 77 9.99
70704 Space Galaxy Squad 2013 Vermin Vaporizer 506 59.99
75052 Star Wars Episode IV 2014 Mos Eisley Cantina 615 69.99
75020 Star Wars Episode VI 2013 Jabba's Sail Barge 850 119.99
75058 Star Wars Episode I 2014 MTT 954 89.99
9467 Monster Fighters NA 2012 The Ghost Train 741 79.99
60046 City Police 2014 Helicopter Surveillance 528 74.99

Now I plot the results for a few different themes, in log scale, price versus pieces. The power law fits should have slope around 1, while the intercept is like the natural log of price per piece. I expected this to be about a dime, corresponding to a value of -2.303. Instead, we see larger values, with a higher premium for Star Wars themes (I knew it), than the old Creator line.

library(ggplot2)
library(ggpmisc)

fmla <- y~x

ph <- subdat %>%
    filter(Pieces > 50) %>%
    mutate(Year=factor(Year)) %>%
    mutate(Theme=forcats::fct_reorder(Theme,USPrice)) %>%
    filter(grepl('Star Wars|Minecraft|Disney|Creator',as.character(Theme))) %>%
    ggplot(aes(Pieces,USPrice,color=Theme)) + geom_point() +
    scale_x_log10() + scale_y_log10() +
    facet_wrap(~Theme) + geom_smooth(method='lm') + 
    ggpmisc::stat_poly_eq(formula=fmla, parse=TRUE, aes(label=..eq.label..)) + 
    labs(x='Number of pieces',y='Price ($ US)',title='LEGO sets 2015-2017\ndata courtesy of brickset.com')
ph

plot of chunk lego_plot1

However, usually I think of a theme premium as an increase in the intercept in a linear model. Let's use linear regression to look at a model for this. I fit the cost per piece slope as a function of release year, then fit an intercept term for each Theme in my set, then order by the estimated intercept terms. We see that the price per piece is indeed around a dime, and has held steady over time. There is a fixed premium for the branded themes (Star Wars, the Hobbit, and so on), but, somewhat to my surprise, the City themed sets also seem to be more expensive. Note the apparent discount for Lego Movie tie-in sets. Luckily for me, the kiddo likes the Creator 3-in-1 sets.

mod_linear <- lm(USPrice ~ Pieces:Year + Theme - 1,data=subdat %>% mutate(Year=factor(Year)))
mod_linear %>%
    summary() %>%
    coefficients() %>%
    as.data.frame() %>%
    add_rownames() %>% 
        mutate(is_slope = !grepl('Pieces',rowname)) %>% 
        arrange(is_slope,Estimate) %>%
        select(-is_slope) %>%
    kable(row.names=FALSE,
                padding=10,digits=3,
                caption='Linear Regression on Lego Set prices',
                format='html')
Linear Regression on Lego Set prices
rowname Estimate Std. Error t value Pr(>|t|)
Pieces:Year2017 0.090 0.002 50.925 0.000
Pieces:Year2016 0.094 0.001 82.774 0.000
Pieces:Year2012 0.098 0.002 52.276 0.000
Pieces:Year2014 0.098 0.002 63.621 0.000
Pieces:Year2015 0.099 0.001 69.229 0.000
Pieces:Year2013 0.103 0.002 63.469 0.000
Pieces:Year2011 0.106 0.002 55.747 0.000
ThemeAdvanced Models -33.749 2.814 -11.991 0.000
ThemeClassic -18.404 3.531 -5.212 0.000
ThemeBricks and More -8.464 2.985 -2.835 0.005
ThemeThe LEGO Movie -3.892 2.858 -1.362 0.173
ThemeCreator -1.936 1.290 -1.501 0.134
ThemeSpeed Champions -1.867 3.385 -0.551 0.581
ThemeElves 0.024 2.774 0.009 0.993
ThemeMonster Fighters 0.200 3.909 0.051 0.959
ThemeMiscellaneous 0.608 2.693 0.226 0.821
ThemeArchitecture 0.842 2.349 0.359 0.720
ThemeThe Lord of the Rings 1.037 3.323 0.312 0.755
ThemeCastle 1.112 3.666 0.303 0.762
ThemeUltra Agents 1.911 3.406 0.561 0.575
ThemeFriends 1.952 1.015 1.922 0.055
ThemeTeenage Mutant Ninja Turtles 2.189 3.181 0.688 0.491
ThemeCars 2.495 2.715 0.919 0.358
ThemeNexo Knights 2.759 1.774 1.555 0.120
ThemeSpace 3.198 2.851 1.122 0.262
ThemePirates of the Caribbean 3.495 3.892 0.898 0.369
ThemeLegends of Chima 3.686 1.298 2.840 0.005
ThemeNinjago 3.706 1.084 3.420 0.001
ThemeDC Comics Super Heroes 4.625 1.841 2.512 0.012
ThemeTechnic 4.950 1.570 3.152 0.002
ThemeMarvel Super Heroes 5.313 1.608 3.304 0.001
ThemeDisney 5.802 2.403 2.415 0.016
ThemeMinecraft 6.072 2.422 2.507 0.012
ThemeThe LEGO Batman Movie 6.265 2.896 2.164 0.031
ThemeHERO Factory 6.328 1.508 4.197 0.000
ThemeThe Hobbit 7.125 2.997 2.377 0.018
ThemeStar Wars 7.972 0.890 8.959 0.000
ThemeDC Super Hero Girls 8.028 4.037 1.989 0.047
ThemeJuniors 8.251 2.104 3.922 0.000
ThemeCity 10.600 0.884 11.988 0.000
ThemeDimensions 15.777 1.629 9.686 0.000