In a previous blog post I looked at opening moves and piece values in Antichess based on games data downloaded from Lichess. One peculiarity I noted there was that the Elo values (well, they are Glicko-2) on Lichess are miscalibrated in the sense that they exaggerate the probability of a win. Usually an Elo difference of 400 points is supposed to translate to 10-to-1 odds of the higher-rated player winning. However, I found that the odds were somewhat lower, maybe more like 9-to-1. While this seems like a minor point, it also means that the highest-rated players effectively have overinflated Elo scores (at least relative to the hoi polloi, since only differences in Elo are meaningful).

In this blog post, I will examine the miscalibration of Antichess Elo. As in previous blog posts, I view Elo through the lens of logistic regression. Let \(p\) be the probability that White wins an Antichess match, and let \(\Delta e\) be the difference in Antichess Elo (White's minus Black's). Then the Elos are properly calibrated if

$$ \left(\frac{p}{1-p}\right) = 10^\frac{\Delta e}{400}. $$

That is, the odds scale by a factor of 10 for each 400-point difference in Elo. The overall level of Elo is arbitrary, though I have schemes for fixing that.
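To make the scale concrete, a 400-point Elo edge corresponds to

$$ \frac{p}{1-p} = 10^\frac{400}{400} = 10, \qquad p = \frac{10}{11} \approx 0.909, $$

while a 200-point edge gives odds of \(10^{1/2} \approx 3.16\), or \(p \approx 0.76\).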

Taking the natural log of both sides, we have

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e. $$

This is a statistician's logistic regression. And since we are using logistic regression, we can add terms. Because Antichess has been shown to be a winning game for White, it would seem that there should be some boost to White's odds that is independent of the Elo difference. So we change the equation to

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e + b, $$

for some unknown \(b\), which represents White's tempo advantage. If instead we write this as

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e + \frac{\operatorname{log}(10)}{400}c, $$

then the constant \(c\) is in 'units' of Elo.

Given some data of pre-game Elos and outcomes, I will use logistic regression to estimate the constants \(c_1\) and \(c_2\) in the equation

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\left(c_1 \Delta e + c_2\right), $$

where \(\Delta e\) is the difference in measured Elos prior to the game. If \(c_1=1\), or is reasonably near it, then the Elos are properly calibrated. If \(c_1 < 1\), the Elos have too much spread; if \(c_1 > 1\), they have too little spread.
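As a rough sketch, fitting this in R looks something like the following; the data frame `games` and its columns (`white_win` as a 0/1 outcome with draws dropped, `white_elo`, `black_elo` as pre-game ratings) are illustrative names, not the exact schema of my CSV:

```r
library(dplyr)
library(broom)

scl <- log(10) / 400   # converts Elo points into log-odds

# one row per match; the Elo difference is pre-scaled so its slope is c1
fit_df <- games %>%
  mutate(elo_diff_scaled = scl * (white_elo - black_elo))

mod <- glm(white_win ~ elo_diff_scaled, family = binomial(), data = fit_df)

# the slope on elo_diff_scaled is c1 (dimensionless); rescaling the
# intercept by 1/scl expresses c2, the tempo term, in Elo points
tidy(mod) %>%
  mutate(across(c(estimate, std.error),
                ~ if_else(term == "(Intercept)", .x / scl, .x)))
```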

I had a number of theories on why the data might appear miscalibrated, including:

  • This is only an effect on tight time control matches.
  • The miscalibration is only for low Elo players.
  • The miscalibration is due to a bad implementation of Elo which has since been corrected.

To examine questions like these, I break out the regressions by groups. That is, to examine e.g. time controls, I classify the matches into four groups, then perform the regression with \(c_1, c_2\) for the "reference class", plus deltas to \(c_1, c_2\) for the other classes. The reference class for time controls might be games played at 3+ minutes, while the other classes might be games of 15 seconds or less, or 30-60 second games. This gives a way to see how far off \(c_1\) is for the different classes. As it turns out, there is not much difference across these dimensions, and I find that poor calibration is not due to any of these putative explanations.
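Schematically, one such grouped fit looks like the following in R, again with illustrative names (the `time_control_group` column and its level labels are made up here); the interaction terms carry the per-class deltas, and swapping in a different grouping column gives the other breakdowns. A plain glm fit labels the tempo terms "(Intercept)", "time_control..." and so on, rather than "Tempo" as in the tables below.

```r
library(dplyr)
library(broom)

scl <- log(10) / 400

fit_df <- games %>%
  mutate(elo_diff_scaled = scl * (white_elo - black_elo),
         # the longest time controls serve as the reference class
         time_control_group = relevel(factor(time_control_group), ref = "180-600"))

# main effects give c1, c2 for the reference class; the interaction
# terms give the deltas to c1, c2 for the other classes
mod <- glm(white_win ~ elo_diff_scaled * time_control_group,
           family = binomial(), data = fit_df)

# divide every term not involving the Elo difference by scl, so the
# tempo terms are reported in Elo points
tidy(mod) %>%
  mutate(across(c(estimate, std.error),
                ~ if_else(grepl("elo_diff_scaled", term), .x, .x / scl)))
```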

Results

As before, I pulled rated games data from Lichess using code I wrote to download and parse the data, turning it into a CSV file. My analysis here is based on v1 of this file, but please remember that Lichess is the ultimate copyright holder.

As in my previous analysis, I restrict attention to cases where the players already have 50 games in the database, to avoid burn-in issues. Except for the study on time controls, I will only look at matches played at 2+ minutes per side. I will generally restrict attention to matches between players with at least 1500 Elo pre-game.
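In code, these restrictions amount to a filter roughly like the following, again with illustrative column names that may not match the actual CSV:

```r
library(dplyr)

analysis_games <- games %>%
  filter(white_prior_games >= 50,    # avoid burn-in issues
         black_prior_games >= 50,
         base_time_seconds >= 120,   # 2+ minutes per side
         white_elo >= 1500,          # pre-game Elo floor (applied to both players here)
         black_elo >= 1500)
```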

First, the regressions for time controls. I classify matches by their initial time per side: 15 seconds or less; 30 to 60 seconds; 90 to 120 seconds; or 180 up to 600 seconds. The longest time controls are the reference class. In the table below the "estimate" is the estimated value of \(c_1\) or \(c_2\); the Tempo rows (\(c_2\)) are in units of Elo points, while the Elo rows (\(c_1\)) are dimensionless. The "std.error" is the standard error, and the statistic is a Wald statistic. The p-values for the main Tempo and Elo terms are exceedingly small. White's tempo advantage (\(c_2\)) is equal to around 15-20 Elo points. Note that "Elo" here refers to the pre-game Elo difference and corresponds to \(c_1\) for the reference class. We see that it falls rather short of the value 1. Terms like "Elo:time_control<=15" are the deltas to that reference-class value. Thus we see that for the ultrashort time-control matches, \(c_1\) is around 0.931 plus around 0.0171, for a total of around 0.948.

term estimate std.error statistic p.value
Tempo 15.90000 0.2300 70.026856 0.000000
Elo 0.93100 0.0013 690.521161 0.000000
Tempo:time_control<=15 1.82000 0.6000 3.059462 0.002217
Tempo:time_control30-60 1.59000 0.3800 4.153711 0.000033
Tempo:time_control90-120 0.90900 0.4200 2.172289 0.029834
Elo:time_control<=15 0.01710 0.0035 4.825546 0.000001
Elo:time_control30-60 0.02830 0.0023 12.462710 0.000000
Elo:time_control90-120 -0.00184 0.0024 -0.753743 0.451003

I thought that perhaps the issue was due to matches where there is a large difference in pre-game Elo. Perhaps a low-skill player can get lucky, and thus throw off the probabilities. I perform the same regression as above, grouping matches by the absolute difference in the players' Elos. A difference of 0 to 100 Elo is taken as the reference class. However, we see that Elo is still miscalibrated in this case. The effect is slightly worse when there is a 600+ difference in pre-game Elo, but still, the original hypothesis is not valid.

term estimate std.error statistic p.value
Tempo 16.1000 0.2200 71.720779 0.000000
Elo 0.9230 0.0040 229.052970 0.000000
Tempo:delta_elo(100,200] 0.3540 0.3600 0.977188 0.328476
Tempo:delta_elo(200,400] 1.8100 0.4200 4.347764 0.000014
Tempo:delta_elo(400,600] 1.2000 1.1000 1.137562 0.255304
Tempo:delta_elo(600,Inf] 10.0000 4.1000 2.422381 0.015419
Elo:delta_elo(100,200] 0.0162 0.0045 3.623650 0.000290
Elo:delta_elo(200,400] 0.0218 0.0042 5.164092 0.000000
Elo:delta_elo(400,600] 0.0129 0.0046 2.811665 0.004929
Elo:delta_elo(600,Inf] -0.0250 0.0074 -3.363104 0.000771

Perhaps the average Elo can explain the effect: maybe luck plays a greater role among lower-skilled players. I group matches by the average pre-game Elo of the players and run the regressions again. Here 2000+ is the reference class. Looking at the coefficients below, we see that we still have miscalibration. In fact, the effect is more muted for low-skill players, whose Elo is closer to its nominal value.

term estimate std.error statistic p.value
Tempo 23.0000 0.2800 82.3523 0
Elo 0.9100 0.0016 556.0643 0
Tempo:avg_elo(1500,1750] -16.3000 0.4300 -37.5902 0
Tempo:avg_elo(1750,2000] -5.9000 0.3600 -16.4339 0
Elo:avg_elo(1500,1750] 0.0355 0.0028 12.5513 0
Elo:avg_elo(1750,2000] 0.0460 0.0021 22.2081 0

Maybe this is a problem that has already been addressed by Lichess, some bug that affected how Elo (Glicko-2, really) was being computed, and is no longer an issue. I classify games by the year they were played, with 2021 as the reference class. Indeed, we now see that the coefficient on Elo is near 1 for the reference year, while it was much lower in 2014 and 2015. So my leading theory is that something in the computation was previously off, but has perhaps been fixed?

term estimate std.error statistic p.value
Tempo 17.2000 0.4000 42.480949 0.000000
Elo 0.9680 0.0026 372.815587 0.000000
Tempo:play_year2014 -16.8000 6.3000 -2.658131 0.007858
Tempo:play_year2015 -5.9500 0.7800 -7.625736 0.000000
Tempo:play_year2016 -4.9400 0.6100 -8.099793 0.000000
Tempo:play_year2017 0.0309 0.5800 0.053356 0.957449
Tempo:play_year2018 0.7330 0.5700 1.284990 0.198796
Tempo:play_year2019 1.6700 0.5700 2.958669 0.003090
Tempo:play_year2020 0.000934 0.5100 0.001816 0.998551
Elo:play_year2014 -0.0907 0.0400 -2.294870 0.021741
Elo:play_year2015 -0.1030 0.0046 -22.212239 0.000000
Elo:play_year2016 -0.0832 0.0036 -23.427809 0.000000
Elo:play_year2017 -0.0315 0.0035 -8.991584 0.000000
Elo:play_year2018 -0.0376 0.0035 -10.798132 0.000000
Elo:play_year2019 -0.0248 0.0035 -7.147420 0.000000
Elo:play_year2020 0.0152 0.0033 4.620739 0.000004

Going back to the original regression, if we now restrict our attention to matches played in 2020 and later, we see that Elo seems well calibrated at longer time controls, and is perhaps even better calibrated at shorter time controls.

term estimate std.error statistic p.value
Tempo 16.7000 0.3700 45.46794 0.000000
Elo 0.9700 0.0023 418.37293 0.000000
Tempo:time_control<=15 -0.0320 1.2000 -0.02778 0.977838
Tempo:time_control30-60 0.6470 0.5900 1.10287 0.270085
Tempo:time_control90-120 1.6400 0.6900 2.38268 0.017187
Elo:time_control<=15 0.0316 0.0076 4.17463 0.000030
Elo:time_control30-60 0.0118 0.0037 3.17373 0.001505
Elo:time_control90-120 0.0133 0.0044 3.05286 0.002267

One implication of this is that my study of piece values and opening values should be re-run with an adjustment to the pre-game Elos of games played prior to 2020. I don't think this will have a huge effect on the outcomes, however.