Browser engines vary somewhat in how they render MathML and I have been unable to compensate entirely. In particular, browsers that use Blink (such as Chromium) ignore ink when computing a box’s inline size, which causes spacing problems, and the vertical offsets of some subscripts are incorrect. Browsers that use WebKit (such as Safari) exhibit more serious problems, but I do not have access to a browser with the developer tools to effectively debug them.
I recommend reading this page with a browser that uses Gecko (such as Firefox) , which will render the page correctly.
Sections method i Game Pairs & the Likelihood Function Sequential Probability Ratio Testing method ii Matches & the Pentanomial Distribution The Posterior Distribution, Uniform Priors, & Use of the Posterior method iii Sequential Bayesian Testing foundation Outcome Models Appendix
When games are scored 1–0, ½ – ½, or 0–1, there are five outcomes for the score of a game pair: 2–0, 1½ – ½, ..., and 0–2. Let’s number these outcomes 1, 2, ..., 5 and write , , ..., for the expectations of each respectively. That is, the score of a game pair is a random variable X with a probability mass function .
Given a particular outcome x and a model of the score of a game pair as a function of rating difference , the likelihood function is
For example, if the outcome is 1½ – ½, then . Note that although is a probability distribution, is not; in other words, for , whereas from to can be any nonnegative value.
Given two particular hypotheses, and , the likelihood ratio is
Commonly , so is the null hypothesis: “the players have the same rating”.
For a sequence , , ... of outcomes, define the series S with and
Then sprt is the following procedure:
Given two thresholds a and b (where and ), starting from time , observe each outcome , calculate , and compare it to a and b.
- While , continue observing.
- When , reject and accept .
- When , reject and accept .
The thresholds a and b that correspond to the probability of a false positive and the probability of a false negative are
When 0 . 05 and 0 . 05 , their values are 2 . 94 and 2 . 94 .
to·do The generalized sequential probability ratio test deserves a section.
Sprt requires two hypotheses to compare, but we may only care that a player is rated higher at all, not that a player is rated higher by a particular amount (more likely to be rated higher by that amount than being rated equal, anyway). It is also agnostic to any prior beliefs about the rating difference, assuming implicitly that one hypothesis is (initially) just as plausible as the other.
As far as I am aware, this first concern is not significant; sprt is perfectly sufficient, although the results of testing may be difficult to interpret. The second concern is rare, as our knowledge prior to testing may be relatively weak. However, when we do want results that are more directly interpretable or we do have some significant bias, we can use another method.
For any random variable with five possible outcomes, where is the probability of outcome i, the pentanomial distribution gives the probability of sampling the variable n times and observing each of the outcomes times:
Instead of representing the score of a single game pair, X now represents the score of a match, and an outcome x is now a collection . For example, is the number of game pairs where the outcome was 2–0.
Given a particular match outcome x and a model of the score of a game pair as a function of rating difference ,
Given a particular match outcome x, we can find the probability
or more explicitly,
where
leaving only the prior to be determined.
is the probability that the difference in rating is given the observed match outcome. Unlike , the function is a distribution.
We can use as a prior a uniform distribution
over some interval , and in particular, we can set and for some L so that over the support.
Then
and cancels, giving us
which is simply normalized to have unit area over the support.
We can then consider the limit
This converges for the outcome models given below, so exists. And in this sense, then, it is reasonable to talk about having a “uniform prior over all the integers”.
to·do Conditions for convergence and a proof of convergence deserve an appendix.
It is, of course, not necessary to use a uniform prior, but in the absence of any reason to assume a different prior it is perhaps the best default.
Perhaps more useful than is its cumulative distribution function,
As an alternative to sprt, one might stop a test when the probability that the rating difference is negative, , is less than some threshold (say, 5% to accept a change) or greater than some threshold (say, 95% to reject a change).
to-do This section needs further elaboration.
to-do A better stopping condition might be when the dispersion is low enough (say, the variance, or the width of the hdi where , or the width where and ).
to-do Sequential Bayesian analysis.
For a game between two players, a and b, let’s write w for the expectation of a win for player a, and d for the expectation of a draw, and l for the expectation of a loss for player a, so that . The expected composite score for player a is .
For our purposes, a rating system is a map from
to an expected composite score s.
A game model is a map from
to a distribution .
A game model can induce a rating system because s is determined by w and d, but the reverse does not hold; w and d cannot be teased apart from s alone. But in any case, we usually think of rating systems as existing independently of game models.
In the Elo rating system, the expected composite score of a player rated points higher than their opponent is . This ignores, for example, the drawishness of an opening, any advantage conferred by an opening, and the effect of time control; an Elo rating therefore indicates a player’s performance averaged over some particular gamut of conditions.
For a rating system, this simplicity is perfectly alright so long as we are aware of the conditions for which a rating is applicable. A game model, however, is greatly improved by additional parameters, because they allow us to account for variation within the gamut to estimate a player’s rating more quickly.
The logistic function happens to be the only component of the Elo rating system, , but it also appears in the models presented in the following sections, for which , and thus it is given a separate name.
Conventionally, an evaluation of +100 centipawn (where “centipawn” is an engine unit) corresponds to an expectation of a composite score of ¾ , implying that a 100 centipawn advantage ought to be equivalent to a difference of about 191 Elo points:
In fact, we might simply define the metric for opening advantage to be identical to rating difference, but scaled so that 1 unit of advantage is equivalent to 1.91 rating points, and also name this unit a “centipawn”.
This allows us to account for opening advantage when using a rating system or game model that does not otherwise consider it. For example, from the Elo rating system we could construct a rating system where the expected composite score is
when the opening advantage is centipawn as just defined.
Rather than taking this perspective, however, let’s continue to use and instead fold the opening advantage into the rating difference, and let’s refer to as the effective rating difference, .
A better rating system might depend upon in a more complicated way. We defined the metric for opening advantage as we did for expediency, but it would probably be much preferable for to closely follow the relationship between an engine’s evaluation and expected composite score, because querying an engine is the most straightforward way to estimate the advantage of an opening position. The metric we are adopting here, however, may only agree with an engine at .
The following models have two parameters, and q .
Note that q is not the effective rating difference for which the expected composite score is ¾ . For any particular value of , however, the parameter q can be adjusted so that, for example, an effective rating difference of +191 does map exactly to an expected composite score of ¾ .
In the formulas for these two models, instead of writing “” like we ought, let’s write “” to make things easier to read.
You can compare and interact with these models on Desmos.
to-do This section needs to be finished.
For any game model, there is a game pair model.
For a game pair where the players alternate (playing once as white/sente and once as black/gote), let’s write for when a is the first player and for when a is the second player.
Then the expectation of each outcome of the game pair is
Given a constant , this is model in the first and second sections.
This is a little weird, b/c it’s unlikely every opening in the book has the same advantage, but we approximate by taking the average advantage.
The pentanomial distribution is difficult to compute exactly when n is large because the factorials grow very quickly.
Instead, we can rewrite this expression as
and then approximate factorial as
so that
We can then write a subroutine like the following:
define multinomial(x : Array(u64), p : Array(f64)) mutable log-power : f64 mutable int-coeff : u64 mutable tau-coeff : u64 mutable log-sqrt : f64 n = sum(x) if n > 20 ※ n! is large so we approximate log-power = n as f64 × log(n as f64) int-coeff = −n tau-coeff = 1 log-sqrt = log(n as f64 + 1/6) else ※ n! is less than 2⁶⁴ log-power = log(factorial(n) as f64) int-coeff = 0 tau-coeff = 0 log-sqrt = 0 end for i in 0, ..., length(x) − 1 if x[i] > 20 log-power <- log-power − x[i] as f64 × log(x[i] as f64) int-coeff <- int-coeff + x[i] tau-coeff <- tau-coeff − 1 log-sqrt <- log-sqrt − log(x[i] as f64 + 1/6) else log-power <- log-power − log(factorial(x[i]) as f64) end log-power <- log-power + x[i] as f64 × log(p[i]) end terms = log-power + int-coeff + tau-coeff × 1/2 × log(tau) + 1/2 × log-sqrt return exp(terms) end
Statistical Methods and Algorithms in Fishtest
Stockfish Testing Framework sprt Calculator
Generalized Sequential Probability Ratio Test
Normalized Elo
Comments on Normalized Elo
From the Draw Ratio to Normalized Elo
The Accounting Identity
Bayesian Elo Rating
The Match Score as a Statistic for Comparing Engines
Modeling Game Outcomes in Chess
The Stockfish wdl Model