Antichess Elo Problems

Sun 19 September 2021 by Steven

In a previous blog post I looked at opening moves and piece values in Antichess based on games data downloaded from Lichess. One peculiarity I noted there was that the Elo values (well, they are Glicko-2) on Lichess are miscalibrated in the sense that they exaggerate the probability of a win. Usually an Elo difference of 400 points is supposed to translate to 10-to-1 odds of the higher rated player winning. However, I found that the odds were somewhat less, perhaps closer to 9-to-1. While this seems like a minor point, it also means that the highest rated players effectively have inflated Elo scores (at least compared to the hoi polloi, as only differences in Elo are meaningful).

In this blog post, I will examine the miscalibration of Antichess Elo. As in previous blog posts I view Elo through the lens of logistic regression. Let $p$ be the probability that White wins an Antichess match, and let $\Delta e$ be the difference in Antichess Elo (White's minus Black's). Then the Elos are properly calibrated if

$$ \left(\frac{p}{1-p}\right) = 10^\frac{\Delta e}{400}. $$

That is, the odds scale by a factor of 10 for each 400 point difference in Elo. The overall level of Elo is arbitrary, though I have schemes for fixing that.
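As a quick sanity check on this relation, here is a minimal Python sketch that turns an Elo difference into odds and a win probability under perfect calibration, and goes the other way from observed odds to the implied multiplier on the Elo difference. The function names are hypothetical, chosen only for illustration.

```python
import math

def calibrated_odds(delta_e: float) -> float:
    """Odds that White wins under perfect calibration: 10 ** (delta_e / 400)."""
    return 10.0 ** (delta_e / 400.0)

def calibrated_win_probability(delta_e: float) -> float:
    """Convert the calibrated odds to a probability, p = odds / (1 + odds)."""
    odds = calibrated_odds(delta_e)
    return odds / (1.0 + odds)

def slope_from_odds(observed_odds: float, delta_e: float) -> float:
    """Multiplier on delta_e that would reproduce the observed odds."""
    return 400.0 * math.log10(observed_odds) / delta_e

print(calibrated_odds(400))             # 10-to-1 at a 400 point gap
print(calibrated_win_probability(400))  # about 0.909
print(slope_from_odds(9.0, 400))        # about 0.954 for 9-to-1 odds
```

Under this reading, observed odds of 9-to-1 at a 400 point gap correspond to a multiplier of about 0.954 rather than 1.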

Taking the natural log of both sides, we have

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e. $$

This is a statistician's logistic regression. And while we are using logistic regression, we can add terms. Because Antichess has been shown to be a winning game for White, it would seem that there should be some boost to White's odds that is independent of the Elo difference. So we change the equation to

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e + b, $$

for some unknown $b$, which represents White's tempo advantage. If instead we write this as

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\Delta e + \frac{\operatorname{log}(10)}{400}c, $$

then the constant $c$ is in 'units' of Elo.
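
To spell out the conversion (a small worked step added for clarity), equating the two forms of the intercept gives

$$ b = \frac{\operatorname{log}(10)}{400} c \quad\Longleftrightarrow\quad c = \frac{400}{\operatorname{log}(10)} b \approx 173.7\, b, $$

so, for example, an intercept of $b = 0.1$ on the log-odds scale is worth about 17.4 Elo points.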

Given some data of pre-game Elos and outcomes, I will use logistic regression to estimate the constants $c_1$ and $c_2$ in the equation

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\left(c_1 \Delta e + c_2\right), $$

where $\Delta e$ is the difference in measured Elos prior to the game. If $c_1=1$, or is reasonably near it, then the Elos are properly calibrated. If $c_1 < 1$ then the Elos have too much spread; if $c_1 > 1$, they have too little spread.
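
To make the estimation concrete, here is a minimal, self-contained sketch of fitting $c_1$ and $c_2$ by logistic regression. It is not the code used for the results in this post: it fits the two-parameter model with a hand-rolled Newton's method, and it runs on synthetic games generated with made-up "true" values ($c_1 = 0.93$, $c_2 = 16$), ignoring draws. All names in it are hypothetical.

```python
import math
import random

K = math.log(10) / 400.0  # converts Elo points to natural log-odds

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_calibration(delta_e, white_won, iterations=20):
    """Fit log-odds = K * (c1 * delta_e + c2) by Newton's method.

    Returns (c1, c2), with c2 expressed in Elo units.
    """
    xs = [K * d for d in delta_e]  # rescale so the slope on x is c1 itself
    b0, b1 = 0.0, 0.0              # b0 = K * c2, b1 = c1
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, white_won):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, b0 / K

# Synthetic games with a known miscalibration (made-up values, not Lichess data).
random.seed(2021)
TRUE_C1, TRUE_C2 = 0.93, 16.0
deltas = [random.gauss(0.0, 200.0) for _ in range(50_000)]
wins = [1 if random.random() < sigmoid(K * (TRUE_C1 * d + TRUE_C2)) else 0
        for d in deltas]

c1_hat, c2_hat = fit_calibration(deltas, wins)
print(f"c1 = {c1_hat:.3f}, c2 = {c2_hat:.1f} Elo")
```

Rescaling the Elo difference by $\operatorname{log}(10)/400$ before fitting means the fitted slope is $c_1$ directly, and dividing the fitted intercept by the same factor puts $c_2$ in Elo units.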

I had a number of theories on why the data might appear miscalibrated, including:

This is only an effect on tight time control matches.
The miscalibration is only for low Elo players.
The miscalibration is due to a bad implementation of Elo which has since been corrected.

To examine questions like these, I break out the regressions by groups. That is, to examine e.g. time controls, I classify the matches into four groups, then perform the regression with $c_1, c_2$ for the "reference class", and then deltas to the $c_1, c_2$ for the other classes. The reference class for time controls might be games played at 3+ minutes, while the other classes might be games of 15 seconds or less, or 30 to 60 second games. This gives a way to see how far off $c_1$ is for the different classes. As it turns out, there is not much difference across time controls or player skill, so the first two explanations do not account for the poor calibration; the year of play tells a different story.
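
The "reference class plus deltas" setup is ordinary treatment coding with interaction terms. The sketch below shows one way the regressors for a single game could be laid out; the class labels, the coefficient ordering, and the placeholder coefficient values are all assumptions made for illustration, not the actual fitting code.

```python
import math

K = math.log(10) / 400.0  # converts Elo units to natural log-odds

# Hypothetical class labels; the first entry plays the role of the reference class.
CLASSES = ["180-600", "<=15", "30-60", "90-120"]

def design_row(delta_e, label, classes=CLASSES):
    """Regressors: [tempo, elo, tempo:class..., elo:class...] for one game."""
    others = classes[1:]  # the reference class gets no indicator of its own
    indicators = [1.0 if label == c else 0.0 for c in others]
    return [1.0, delta_e] + indicators + [delta_e * i for i in indicators]

def log_odds(row, coefs):
    """Log-odds of a White win; coefs follow the row order, in Elo units."""
    return K * sum(x * c for x, c in zip(row, coefs))

# Placeholder coefficients: reference c2, reference c1, then the deltas.
coefs = [16.0, 0.93, 2.0, 1.5, 1.0, 0.02, 0.03, 0.0]

row = design_row(100.0, "<=15")
print(row)
print(log_odds(row, coefs))
```

For a game in the reference class every indicator is zero, so only the first two coefficients contribute; for any other class the matching deltas are added on top.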

Results

As before, I pulled rated games data from Lichess using code I wrote to download and parse the data, turning it into a CSV file. My analysis here is based off of v1 of this file, but please remember that Lichess is the ultimate copyright holder.

As in my previous analysis, I restrict attention to cases where the players already have 50 games in the database, to avoid burn-in issues. Except for the study on time controls, I will only look at matches played at 2+ minutes per side. I will generally restrict attention to matches between players with at least 1500 Elo pre-game.

First the regressions for time controls. I classify matches by their initial time into four time controls: 15 seconds or less; 30 to 60 seconds; 90 to 120 seconds; or 180 seconds up to 600 seconds per side. The longest time controls are the reference class. In the table below the "estimate" is the estimated value of $c_1$ (a unitless multiplier) or $c_2$ (in Elo units). The "std.error" is the standard error, and the "statistic" is a Wald statistic. Most of the p-values are exceedingly small, though the deltas for the 90 to 120 second class are only weakly significant (Tempo) or not significant (Elo). White's tempo advantage ($c_2$) is equal to around 15-20 Elo points. Note that the "Elo" here refers to the pre-game Elo difference and corresponds to $c_1$ for the reference class. We see that it falls rather short of the value 1. Terms like "Elo:time_control<=15" are the deltas to that reference class value for pre-game Elo. Thus we see that for the ultrashort time control matches, the $c_1$ is around 0.931 plus around 0.0171 for a total value of around 0.9481.

term	estimate	std.error	statistic	p.value
Tempo	15.90000	0.2300	70.026856	0.000000
Elo	0.93100	0.0013	690.521161	0.000000
Tempo:time_control<=15	1.82000	0.6000	3.059462	0.002217
Tempo:time_control30-60	1.59000	0.3800	4.153711	0.000033
Tempo:time_control90-120	0.90900	0.4200	2.172289	0.029834
Elo:time_control<=15	0.01710	0.0035	4.825546	0.000001
Elo:time_control30-60	0.02830	0.0023	12.462710	0.000000
Elo:time_control90-120	-0.00184	0.0024	-0.753743	0.451003
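
The same addition can be done for every class at once. This short sketch copies the "Elo" estimates from the table above and adds each delta to the reference value; the helper name and the "180-600" label for the reference class are mine.

```python
# Estimates copied from the time-control table above.
reference_c1 = 0.931
c1_deltas = {"<=15": 0.0171, "30-60": 0.0283, "90-120": -0.00184}

def class_slopes(reference, deltas, reference_label):
    """Add each class delta to the reference estimate to get per-class c1."""
    slopes = {reference_label: reference}
    for label, delta in deltas.items():
        slopes[label] = reference + delta
    return slopes

slopes = class_slopes(reference_c1, c1_deltas, "180-600")
for label, c1 in slopes.items():
    print(f"{label:>8}: c1 = {c1:.4f}")
```

Every class ends up with a $c_1$ below 1, which is the sense in which time control does not explain the miscalibration.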

I thought that perhaps the issue was due to matches where there is a large difference in pre-game Elo. Perhaps a low skill player can get lucky, and thus throw off the probabilities. I perform the same regression as above, grouping matches by the absolute difference in Elo between the players. A difference of 0 to 100 Elo is taken as the reference class. However, we see that Elo is still miscalibrated in this case. The effect is slightly worse when there is 600+ difference in pre-game Elo, but still, the original hypothesis is not valid.

term	estimate	std.error	statistic	p.value
Tempo	16.1000	0.2200	71.720779	0.000000
Elo	0.9230	0.0040	229.052970	0.000000
Tempo:delta_elo(100,200]	0.3540	0.3600	0.977188	0.328476
Tempo:delta_elo(200,400]	1.8100	0.4200	4.347764	0.000014
Tempo:delta_elo(400,600]	1.2000	1.1000	1.137562	0.255304
Tempo:delta_elo(600,Inf]	10.0000	4.1000	2.422381	0.015419
Elo:delta_elo(100,200]	0.0162	0.0045	3.623650	0.000290
Elo:delta_elo(200,400]	0.0218	0.0042	5.164092	0.000000
Elo:delta_elo(400,600]	0.0129	0.0046	2.811665	0.004929
Elo:delta_elo(600,Inf]	-0.0250	0.0074	-3.363104	0.000771
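
For readers who want to check the "statistic" and "p.value" columns, here is a small sketch. It assumes the p-values are two-sided tail probabilities of a standard normal evaluated at the Wald statistic, which appears consistent with the rows in these tables; the function names are hypothetical.

```python
import math

def wald_statistic(estimate: float, std_error: float) -> float:
    """Wald statistic: the estimate divided by its standard error."""
    return estimate / std_error

def two_sided_p_value(z: float) -> float:
    """Two-sided tail probability of a standard normal, via erfc."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Row "Tempo:delta_elo(600,Inf]" from the table above.
print(two_sided_p_value(2.422381))  # close to the tabulated 0.015419
print(wald_statistic(10.0, 4.1))    # about 2.44, near the tabulated 2.42
```

The recomputed statistic differs slightly from the tabulated one, presumably because the estimate and standard error shown in the table are rounded for display.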

Perhaps the average Elo can explain the effect: maybe luck plays a greater role among lower skilled players. I group matches by the average pre-game Elo of the players and run the regressions again. Here 2000+ is the reference class. Looking at the coefficients below we see that we still have miscalibration. In fact, the effect is more muted for lower skilled players, whose $c_1$ is closer to the nominal value of 1.

term	estimate	std.error	statistic
Tempo	23.0000	0.2800	82.3523
Elo	0.9100	0.0016	556.0643
Tempo:avg_elo(1500,1750]	-16.3000	0.4300	-37.5902
Tempo:avg_elo(1750,2000]	-5.9000	0.3600	-16.4339
Elo:avg_elo(1500,1750]	0.0355	0.0028	12.5513
Elo:avg_elo(1750,2000]	0.0460	0.0021	22.2081

Maybe this is a problem that has already been addressed by Lichess: some bug that affected how Elo (Glicko-2, really) was being computed, and is no longer an issue. I classify games by the year they were played, with 2021 as the reference class. Indeed we see now that the value of Elo is near 1, while it was much lower in 2014 and 2015. So my leading theory is that something in the computation was previously off, but has perhaps been fixed?

term	estimate	std.error	statistic	p.value
Tempo	1.72e+01	0.4000	42.480949	0.000000
Elo	9.68e-01	0.0026	372.815587	0.000000
Tempo:play_year2014	-1.68e+01	6.3000	-2.658131	0.007858
Tempo:play_year2015	-5.95e+00	0.7800	-7.625736	0.000000
Tempo:play_year2016	-4.94e+00	0.6100	-8.099793	0.000000
Tempo:play_year2017	3.09e-02	0.5800	0.053356	0.957449
Tempo:play_year2018	7.33e-01	0.5700	1.284990	0.198796
Tempo:play_year2019	1.67e+00	0.5700	2.958669	0.003090
Tempo:play_year2020	9.34e-04	0.5100	0.001816	0.998551
Elo:play_year2014	-9.07e-02	0.0400	-2.294870	0.021741
Elo:play_year2015	-1.03e-01	0.0046	-22.212239	0.000000
Elo:play_year2016	-8.32e-02	0.0036	-23.427809	0.000000
Elo:play_year2017	-3.15e-02	0.0035	-8.991584	0.000000
Elo:play_year2018	-3.76e-02	0.0035	-10.798132	0.000000
Elo:play_year2019	-2.48e-02	0.0035	-7.147420	0.000000
Elo:play_year2020	1.52e-02	0.0033	4.620739	0.000004

Going back to the original regression, if we now restrict our attention to matches in 2020 and later, we see that Elo seems well calibrated at longer time controls, and is perhaps even more so at shorter time controls.

term	estimate	std.error	statistic	p.value
Tempo	16.7000	0.3700	45.46794	0.000000
Elo	0.9700	0.0023	418.37293	0.000000
Tempo:time_control<=15	-0.0320	1.2000	-0.02778	0.977838
Tempo:time_control30-60	0.6470	0.5900	1.10287	0.270085
Tempo:time_control90-120	1.6400	0.6900	2.38268	0.017187
Elo:time_control<=15	0.0316	0.0076	4.17463	0.000030
Elo:time_control30-60	0.0118	0.0037	3.17373	0.001505
Elo:time_control90-120	0.0133	0.0044	3.05286	0.002267

One implication of this is that my study of piece values and opening values should be re-run with an adjustment for pre-game Elo from prior to 2020. I don't think this will have a huge effect on the outcomes, however.

Antichess Piece Values

Fri 10 September 2021 by Steven E. Pav

In a previous blog post I used logistic regression on games played data to estimate the value of pieces in Atomic chess. Since then I have been playing less Atomic and more Antichess. In Antichess, you win by losing all your pieces. To facilitate this, capturing is compulsory when possible; when multiple captures are possible you may select among them. There is no castling, and a pawn may promote to a king, but otherwise it is like traditional chess. (For more on Antichess, I highly recommend Andrejić's book, The Ultimate Guide to Antichess.)

A king is a relatively powerful piece in Antichess: it cannot as easily be turned into a "loose cannon", yet it can move in any direction. In general you want to keep your king on the board and remove your opponent's king. In that spirit, I wanted to estimate piece values in Antichess. I will use logistic regression for the analysis, as I did in my analysis of Atomic chess.

For the analysis I pulled rated games data from Lichess. I wrote some code that will download and parse this data, turning it into a CSV file. I am sharing v1 of this file, but please remember that Lichess is the ultimate copyright holder.

The games in the dataset end in one of three conditions: Normal, Time forfeit, and Abandoned (game terminated before it began). The last category is very rare, and I omit these from my processing. The majority of games end in the Normal way, and I will consider only those. Also, games are played at various time controls, and players can make suboptimal moves when pressed for time, so I will restrict to games played with at least two minutes per side.

The game data includes Elo scores (well, Glicko scores, but …

Atomic Piece Values, Again

Mon 31 May 2021 by Steven E. Pav

In a previous blog post I used logistic regression to estimate the values of pieces in Atomic chess. In that study I computed material differences between the two players using a snapshot 8 plies before the end of the match. (A "ply" is a move by a single player.) That choice of snapshot was arbitrary, but it is typically late enough in the match so there is some material difference to measure, and also near enough to the end to estimate the "power" of each piece to bring victory. However, this valuation is rather late in the game, and is probably not representative of the average value of the pieces. That is, a knight advantage early in the game could be parlayed into a queen advantage later, which could then prove decisive.

To fix that issue, I will re-perform that analysis on other snapshots. Recall that I am working from 9 million rated Atomic games that I downloaded from Lichess. For each match I selected a pseudo-random ply after the second and before the last ply of each game, uniformly. (There is no material difference before the third ply.) I also selected pseudo-random snapshots in the first third, the second third, and the last third of each match. I compute the difference in material as well as differences in passed pawn counts for each snapshot. You can download v2 of the data, and the code.
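
The sampling scheme can be sketched in a few lines of Python. This is only an illustration with hypothetical function names. In particular, I am assuming plies are numbered from 1 and that the "thirds" split the eligible plies (third through second-to-last) into three nearly equal runs; the exact boundaries used for the dataset are not stated here.

```python
import random

def sample_ply(n_plies, rng):
    """Uniform ply strictly after the second and strictly before the last."""
    if n_plies < 4:
        return None  # nothing lies between ply 2 and the final ply
    return rng.randint(3, n_plies - 1)  # randint includes both endpoints

def sample_ply_in_third(n_plies, third, rng):
    """Uniform ply from third 0, 1, or 2 of the eligible plies 3..n_plies-1."""
    eligible = list(range(3, n_plies))
    lo = third * len(eligible) // 3
    hi = (third + 1) * len(eligible) // 3
    chunk = eligible[lo:hi]
    return rng.choice(chunk) if chunk else None

rng = random.Random(31)  # fixed seed so the draws are reproducible
print(sample_ply(40, rng), [sample_ply_in_third(40, t, rng) for t in range(3)])
```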

Recall that I am using logistic regression to estimate coefficients in the model

$$ \operatorname{log}\left(\frac{p}{1-p}\right) = \frac{\operatorname{log}(10)}{400}\left[\Delta e + c_P \Delta P + c_K \Delta K + c_B \Delta B + c_R \Delta R + c_Q \Delta Q \right], $$

where $\Delta e$ is the difference in Elo, and $\Delta P, \Delta K, \Delta B, \Delta R, \Delta Q$ are the …

Atomic Piece Values

Mon 10 May 2021 by Steven E. Pav

Most chess playing computer programs use forward search over the tree of possible moves. Because such a search cannot examine every branch to termination of the game, leaf nodes in the tree are usually scored by a "static" evaluation that combines a number of scoring rules. These typically include a term for the material balance of the position.
In traditional chess the pieces are usually assigned scores of 1 point for pawns, around 3 points for knights and bishops, 5 for rooks, and 9 for queens. Human players often use this heuristic when considering exchanges.
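
As a concrete illustration of a material balance term, here is a minimal sketch using the traditional point values quoted above. Representing each side's pieces as a string of letters is a simplification of my own, and knights and bishops are both set to exactly 3.

```python
# Traditional point values; kings are not scored.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def material_balance(white_pieces: str, black_pieces: str) -> int:
    """White's material minus Black's, given strings of piece letters."""
    def total(pieces: str) -> int:
        # Unknown letters (such as K for king) contribute nothing.
        return sum(PIECE_VALUES.get(piece.upper(), 0) for piece in pieces)
    return total(white_pieces) - total(black_pieces)

# White has a full set of pieces; Black is missing the queen.
print(material_balance("KQRRBBNNPPPPPPPP", "KRRBBNNPPPPPPPP"))
```

A positive balance favors White and a negative one favors Black; a full engine would add this to its other scoring rules.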

I recently started playing a chess variant called Atomic chess. In Atomic, when a piece captures another, both are removed from the board, along with all non-pawn pieces in the up to eight adjacent squares. The idea is that a capture causes an 'explosion'. Lichess plays a delightful explosion noise when this happens.

The traditional scoring heuristic is apparently based on mobility of the pieces. While movement of pieces is the same in the Atomic variant, I suspect that traditional scoring is not well calibrated for Atomic: A piece can capture only once in Atomic; a piece can remove multiple pieces from the board in one capture; pieces have value as protective 'chaff'; Kings cannot capture pieces, so solo mates are possible; pawns on the seventh rank can trap high-value pieces by threatening promotion; there are numerous fools' mates involving knights, etc. Can we create a scoring heuristic calibrated for Atomic?

The problem would seem intractable from first principles, because piece value is so different from average piece mobility. Instead, perhaps we can infer a kind of average value for pieces. In a previous blog post I performed a quick analysis of Atomic openings on a database of around 9 million games played on Lichess …
