In a previous blog post I looked at opening moves and piece values in
Antichess based on games data downloaded from Lichess.
One peculiarity I noted there was that the Elo values (well, they are Glicko-2) on
Lichess are *miscalibrated* in the sense that they exaggerate the probability of a win.
Usually an Elo difference of 400 points is supposed to translate to 10-to-1 odds of
the higher rated player winning.
However, I found that the odds were somewhat lower, closer to 9-to-1.
While this seems like a minor point, it also means that the highest rated players effectively
have inflated Elo scores (at least relative to the *hoi polloi*, as only differences in Elo are meaningful).

In this blog post, I will examine the miscalibration of Antichess Elo. As in previous blog posts, I view Elo through the lens of logistic regression. Let \(p\) be the probability that White wins an Antichess match, and let \(\Delta e\) be the difference in Antichess Elo (White's minus Black's). Then the Elos are properly calibrated if

\[
\frac{p}{1-p} = 10^{\Delta e / 400}.
\]

That is, the odds scale by a factor of 10 for each 400 points of Elo difference. The overall level of Elo is arbitrary, though I have schemes for fixing that.
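Here is a minimal sketch of that calibration condition: odds of a White win scaling by 10 per 400 Elo points, converted to a probability. The function names are my own illustration and are not taken from any Lichess code.

```python
def calibrated_odds(delta_e: float) -> float:
    """Nominal odds of a White win: a factor of 10 per 400 Elo points."""
    return 10.0 ** (delta_e / 400.0)


def calibrated_win_prob(delta_e: float) -> float:
    """Convert the nominal odds into a probability, p = odds / (1 + odds)."""
    odds = calibrated_odds(delta_e)
    return odds / (1.0 + odds)


# Nominal win probabilities for a few Elo differences (White's minus Black's).
for delta_e in (0, 100, 200, 400):
    print(delta_e, round(calibrated_win_prob(delta_e), 4))
```

Under this nominal model a 400 point favorite wins 10 games for every 1 it loses, which is the benchmark the rest of the post tests against.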

Taking the natural log of both sides, we have

\[
\log\left(\frac{p}{1-p}\right) = \frac{\log(10)}{400} \Delta e.
\]

This is a statistician's logistic regression. And while we are using logistic regression, we can add terms. Because Antichess has been shown to be a winning game for White, it would seem that there should be some boost to White's odds that is independent of the Elo difference. So we change the equation to

\[
\log\left(\frac{p}{1-p}\right) = \frac{\log(10)}{400} \Delta e + b,
\]

for some unknown \(b\), which represents White's tempo advantage. If instead we write this as

\[
\log\left(\frac{p}{1-p}\right) = \frac{\log(10)}{400} \left(\Delta e + c\right),
\]

then the constant \(c\) is in 'units' of Elo.
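As a worked step: writing the tempo term as a log-odds offset \(b\) or as an Elo offset \(c\) describes the same model, so the two constants differ only by the scale factor \(\log(10)/400\):

\[
b = \frac{\log(10)}{400}\, c
\quad\Longleftrightarrow\quad
c = \frac{400}{\log(10)}\, b \approx 173.7\, b.
\]

For example, a hypothetical log-odds boost of \(b = 0.1\) would correspond to roughly 17 Elo points.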

Given some data of pre-game Elos and outcomes, I will use logistic regression to estimate the constants \(c_1\) and \(c_2\) in the equation

\[
\log\left(\frac{p}{1-p}\right) = \frac{\log(10)}{400} \left(c_1 \Delta e + c_2\right),
\]

where \(\Delta e\) is the difference in measured Elos prior to the game. If \(c_1 = 1\), or is reasonably near it, then the Elos are properly calibrated. If \(c_1 < 1\) then the Elos have too much spread; if \(c_1 > 1\), they have too little spread.
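To make the estimation concrete, here is a sketch of fitting \(c_1\) and \(c_2\) by maximum likelihood. It is not the code behind the tables below: it is a stand-in that uses a hand-rolled Newton's method from the standard library, and the games are simulated with made-up "true" values of \(c_1 = 0.93\) and \(c_2 = 16\). The name `fit_calibration` is my own.

```python
import math
import random

K = math.log(10) / 400.0  # converts Elo units to natural-log odds


def fit_calibration(delta_elos, outcomes, max_iter=50, tol=1e-10):
    """Estimate (c1, c2) in log-odds = K * (c1 * delta_e + c2) by Newton's method.

    outcomes are 1.0 for a White win and 0.0 otherwise.
    """
    c1, c2 = 1.0, 0.0  # start from calibrated Elo and no tempo advantage
    for _ in range(max_iter):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for d, y in zip(delta_elos, outcomes):
            x1, x2 = K * d, K  # regressors multiplying c1 and c2
            p = 1.0 / (1.0 + math.exp(-(c1 * x1 + c2 * x2)))
            w = p * (1.0 - p)
            g1 += (y - p) * x1  # gradient of the log-likelihood
            g2 += (y - p) * x2
            h11 += w * x1 * x1  # Fisher information (negative Hessian)
            h12 += w * x1 * x2
            h22 += w * x2 * x2
        det = h11 * h22 - h12 * h12
        step1 = (h22 * g1 - h12 * g2) / det  # solve the 2x2 Newton system
        step2 = (h11 * g2 - h12 * g1) / det
        c1, c2 = c1 + step1, c2 + step2
        if abs(step1) + abs(step2) < tol:
            break
    return c1, c2


# Simulated games with a known miscalibration, purely for illustration.
rng = random.Random(0)
true_c1, true_c2 = 0.93, 16.0
deltas = [rng.gauss(0.0, 200.0) for _ in range(20000)]
wins = [
    1.0 if rng.random() < 1.0 / (1.0 + math.exp(-K * (true_c1 * d + true_c2))) else 0.0
    for d in deltas
]
print(fit_calibration(deltas, wins))
```

Scaling both regressors by `K` is what puts the fitted `c2` directly in Elo units and makes `c1` a unitless multiplier on the Elo difference.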

I had a number of theories on why the data might appear miscalibrated, including:

- This is only an effect on tight time control matches.
- The miscalibration is only for low Elo players.
- The miscalibration is due to a bad implementation of Elo which has since been corrected.

To examine questions like these, I break out the regressions by groups.
That is, to examine *e.g.* time controls, I classify the matches into four groups,
then perform the regression with \(c_1, c_2\) for the "reference class",
plus deltas to \(c_1, c_2\) for each of the other classes. The reference
class for time controls might be games played at 3+ minutes, while
the other classes might be games of 15 seconds or less, or 30 to 60 second games.
This gives a way to see how far off \(c_1\) is for the different classes.
As it turns out, there is not much difference across these dimensions,
and I find poor calibration is not due to any of these putative explanations.
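The "reference class plus deltas" parametrization can be sketched as follows. This is an illustration under my own assumptions: the helper names `design_row` and `class_coefficients` are hypothetical, and the term names simply mimic the labels used in the regression tables.

```python
import math

K = math.log(10) / 400.0  # converts Elo units to natural-log odds


def design_row(delta_e, group, reference, factor="time_control"):
    """Regressors for one game, keyed by term name; absent terms are zero."""
    row = {"Tempo": K, "Elo": K * delta_e}
    if group != reference:
        # Non-reference games also load on their own class's delta terms.
        row[f"Tempo:{factor}{group}"] = K
        row[f"Elo:{factor}{group}"] = K * delta_e
    return row


def class_coefficients(coefs, group, reference, factor="time_control"):
    """Recover (c1, c2) for a class: reference value plus that class's delta."""
    c1, c2 = coefs["Elo"], coefs["Tempo"]
    if group != reference:
        c1 += coefs[f"Elo:{factor}{group}"]
        c2 += coefs[f"Tempo:{factor}{group}"]
    return c1, c2


# A 120 point favorite playing White, in a non-reference and the reference class.
print(design_row(120.0, "<=15", reference="180-600"))
print(design_row(120.0, "180-600", reference="180-600"))
```

With this encoding, a delta coefficient near zero says that class behaves like the reference class, which is what makes the tables easy to scan for differences.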

## Results

As before, I pulled rated games data from Lichess using code I wrote to download and parse the data, turning it into a CSV file. My analysis here is based off of v1 of this file, but please remember that Lichess is the ultimate copyright holder.

As in my previous analysis, I restrict attention to cases where the players already have 50 games in the database, to avoid burn-in issues. Except for the study on time controls, I will only look at matches played at 2+ minutes per side. I will generally restrict attention to matches between players with at least 1500 Elo pre-game.
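The sample restrictions above amount to a simple filter. The sketch below assumes a record layout of my own invention: the `Game` class and all of its field names are hypothetical and do not correspond to the actual columns of the CSV file.

```python
from dataclasses import dataclass


@dataclass
class Game:
    # All field names here are hypothetical, not the columns of the actual CSV.
    white_elo: float
    black_elo: float
    white_prior_games: int
    black_prior_games: int
    initial_seconds: int


def keep_game(game, min_prior_games=50, min_elo=1500.0, min_seconds=120):
    """Apply the three sample restrictions described in the text."""
    return (
        min(game.white_prior_games, game.black_prior_games) >= min_prior_games
        and min(game.white_elo, game.black_elo) >= min_elo
        and game.initial_seconds >= min_seconds
    )


print(keep_game(Game(1600.0, 1700.0, 60, 80, 180)))
```

For the time-control study, where the short games are the point, the same filter could be called with `min_seconds=0`.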

First, the regressions for time controls. I classify matches by their initial time into four classes: 15 seconds or less; 30 to 60 seconds; 90 to 120 seconds; and 180 to 600 seconds per side. The longest time controls are the reference class. In the table below the "estimate" is the estimated value of \(c_1\) (a unitless multiplier) or of \(c_2\) (in Elo units). The "std.error" is the standard error, and the "statistic" is a Wald statistic. The p-values are mostly exceedingly small. White's tempo advantage (\(c_2\)) is around 15 to 20 Elo points. Note that the "Elo" term here refers to the pre-game Elo difference and corresponds to \(c_1\) for the reference class. We see that it falls rather short of the value 1. Terms like "Elo:time_control<=15" are the deltas to that reference class value for pre-game Elo. Thus we see that for the ultrashort time control matches, \(c_1\) is around 0.931 plus around 0.0171, for a total value of around 0.9481.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Tempo | 15.90000 | 0.2300 | 70.026856 | 0.000000 |
| Elo | 0.93100 | 0.0013 | 690.521161 | 0.000000 |
| Tempo:time_control<=15 | 1.82000 | 0.6000 | 3.059462 | 0.002217 |
| Tempo:time_control30-60 | 1.59000 | 0.3800 | 4.153711 | 0.000033 |
| Tempo:time_control90-120 | 0.90900 | 0.4200 | 2.172289 | 0.029834 |
| Elo:time_control<=15 | 0.01710 | 0.0035 | 4.825546 | 0.000001 |
| Elo:time_control30-60 | 0.02830 | 0.0023 | 12.462710 | 0.000000 |
| Elo:time_control90-120 | -0.00184 | 0.0024 | -0.753743 | 0.451003 |
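To read the table in terms of odds, each class's \(c_1\) (reference value plus delta) can be turned into the empirical odds at a 400 point gap, which are \(10^{c_1}\) when the tempo term is ignored. The snippet below is a small arithmetic sketch using the rounded estimates from the table. The label "180-600" for the reference class is my own. For the reference class it gives roughly 8.5-to-1 rather than the nominal 10-to-1.

```python
# Estimates copied from the time-control table; "180-600" is the reference class.
elo_reference = 0.931
elo_deltas = {"<=15": 0.0171, "30-60": 0.0283, "90-120": -0.00184}

# Per-class c1 is the reference value plus that class's delta.
c1_by_class = {"180-600": elo_reference}
for name, delta in elo_deltas.items():
    c1_by_class[name] = elo_reference + delta

# Empirical odds at a 400 point gap are 10 ** c1 (ignoring the tempo term).
odds_at_400 = {name: 10.0 ** c1 for name, c1 in c1_by_class.items()}

for name in c1_by_class:
    print(f"{name:>8}: c1 = {c1_by_class[name]:.4f}, odds at +400 = {odds_at_400[name]:.2f}-to-1")
```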

I thought that perhaps the issue was due to matches where there is a large difference in pre-game Elo. Perhaps a low skill player can get lucky, and thus throw off the probabilities. I perform the same regression as above, grouping matches by the absolute difference in pre-game Elo between the two players. A difference of 0 to 100 Elo is taken as the reference class. However, we see that Elo is still miscalibrated in the reference class. The effect is slightly worse when there is a 600+ difference in pre-game Elo, but still, the original hypothesis is not valid.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Tempo | 16.1000 | 0.2200 | 71.720779 | 0.000000 |
| Elo | 0.9230 | 0.0040 | 229.052970 | 0.000000 |
| Tempo:delta_elo(100,200] | 0.3540 | 0.3600 | 0.977188 | 0.328476 |
| Tempo:delta_elo(200,400] | 1.8100 | 0.4200 | 4.347764 | 0.000014 |
| Tempo:delta_elo(400,600] | 1.2000 | 1.1000 | 1.137562 | 0.255304 |
| Tempo:delta_elo(600,Inf] | 10.0000 | 4.1000 | 2.422381 | 0.015419 |
| Elo:delta_elo(100,200] | 0.0162 | 0.0045 | 3.623650 | 0.000290 |
| Elo:delta_elo(200,400] | 0.0218 | 0.0042 | 5.164092 | 0.000000 |
| Elo:delta_elo(400,600] | 0.0129 | 0.0046 | 2.811665 | 0.004929 |
| Elo:delta_elo(600,Inf] | -0.0250 | 0.0074 | -3.363104 | 0.000771 |

Perhaps the *average* Elo can explain the effect: maybe luck plays a greater role among
lower skilled players.
I group matches by the average pre-game Elo of the players and run the regressions again.
Here 2000+ is the reference class.
Looking at the coefficients below, we see that we still have miscalibration.
In fact, the effect is more muted for low skill players, whose \(c_1\) is closer to the nominal
value of 1.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Tempo | 23.0000 | 0.2800 | 82.3523 | 0 |
| Elo | 0.9100 | 0.0016 | 556.0643 | 0 |
| Tempo:avg_elo(1500,1750] | -16.3000 | 0.4300 | -37.5902 | 0 |
| Tempo:avg_elo(1750,2000] | -5.9000 | 0.3600 | -16.4339 | 0 |
| Elo:avg_elo(1500,1750] | 0.0355 | 0.0028 | 12.5513 | 0 |
| Elo:avg_elo(1750,2000] | 0.0460 | 0.0021 | 22.2081 | 0 |

Maybe this is a problem that has already been addressed by Lichess: some bug that affected how Elo (Glicko-2, really) was being computed, and which is no longer an issue. I classify games by the year they were played, with 2021 as the reference class. Indeed, we now see that the value of \(c_1\) is near 1 for 2021, while it was much lower in 2014 and 2015. So my leading theory is that something in the computation was previously off, but has perhaps been fixed?

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Tempo | 1.72e+01 | 0.4000 | 42.480949 | 0.000000 |
| Elo | 9.68e-01 | 0.0026 | 372.815587 | 0.000000 |
| Tempo:play_year2014 | -1.68e+01 | 6.3000 | -2.658131 | 0.007858 |
| Tempo:play_year2015 | -5.95e+00 | 0.7800 | -7.625736 | 0.000000 |
| Tempo:play_year2016 | -4.94e+00 | 0.6100 | -8.099793 | 0.000000 |
| Tempo:play_year2017 | 3.09e-02 | 0.5800 | 0.053356 | 0.957449 |
| Tempo:play_year2018 | 7.33e-01 | 0.5700 | 1.284990 | 0.198796 |
| Tempo:play_year2019 | 1.67e+00 | 0.5700 | 2.958669 | 0.003090 |
| Tempo:play_year2020 | 9.34e-04 | 0.5100 | 0.001816 | 0.998551 |
| Elo:play_year2014 | -9.07e-02 | 0.0400 | -2.294870 | 0.021741 |
| Elo:play_year2015 | -1.03e-01 | 0.0046 | -22.212239 | 0.000000 |
| Elo:play_year2016 | -8.32e-02 | 0.0036 | -23.427809 | 0.000000 |
| Elo:play_year2017 | -3.15e-02 | 0.0035 | -8.991584 | 0.000000 |
| Elo:play_year2018 | -3.76e-02 | 0.0035 | -10.798132 | 0.000000 |
| Elo:play_year2019 | -2.48e-02 | 0.0035 | -7.147420 | 0.000000 |
| Elo:play_year2020 | 1.52e-02 | 0.0033 | 4.620739 | 0.000004 |
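The statistics in the table appear to test each coefficient against zero, whereas calibration is a question of how far \(c_1\) is from 1. A rough sketch of that comparison follows, assuming the rounded table values and a normal approximation. The helper `wald_vs_one` is my own name, not part of the original analysis.

```python
import math


def wald_vs_one(estimate, std_error):
    """Wald z-statistic and two-sided normal p-value for the null c1 = 1."""
    z = (estimate - 1.0) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# 2021 reference-class values copied (rounded) from the table.
z, p = wald_vs_one(0.968, 0.0026)
print(f"z = {z:.1f}, p = {p:.3g}")
```

With this many games even a small gap from 1 will register as statistically significant, so the practical size of the gap is probably the more useful thing to look at.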

Going back to the original regression on time controls, if we now restrict our attention to matches played in 2020 and later, we see that Elo seems well calibrated at longer time controls, and is perhaps even better calibrated at shorter time controls.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Tempo | 16.7000 | 0.3700 | 45.46794 | 0.000000 |
| Elo | 0.9700 | 0.0023 | 418.37293 | 0.000000 |
| Tempo:time_control<=15 | -0.0320 | 1.2000 | -0.02778 | 0.977838 |
| Tempo:time_control30-60 | 0.6470 | 0.5900 | 1.10287 | 0.270085 |
| Tempo:time_control90-120 | 1.6400 | 0.6900 | 2.38268 | 0.017187 |
| Elo:time_control<=15 | 0.0316 | 0.0076 | 4.17463 | 0.000030 |
| Elo:time_control30-60 | 0.0118 | 0.0037 | 3.17373 | 0.001505 |
| Elo:time_control90-120 | 0.0133 | 0.0044 | 3.05286 | 0.002267 |

One implication of this is that my study of piece values and opening values should be re-run with an adjustment for pre-game Elos recorded prior to 2020. I don't think this will have a huge effect on the outcomes, however.