Estimating Differences in Rating

Browser engines vary somewhat in how they render MathML, and I have been unable to compensate entirely. In particular, browsers that use Blink (such as Chromium) ignore ink when computing a box’s inline size, which causes spacing problems, and the vertical offsets of some subscripts are incorrect. Browsers that use WebKit (such as Safari) exhibit more serious problems, but I do not have access to a browser with the developer tools needed to debug them effectively.

I recommend reading this page with a browser that uses Gecko (such as Firefox), which will render the page correctly.

Sections
  • method i: Game Pairs & the Likelihood Function · Sequential Probability Ratio Testing
  • method ii: Matches & the Pentanomial Distribution · The Posterior Distribution, Uniform Priors, & Use of the Posterior
  • method iii: Sequential Bayesian Testing
  • foundation: Outcome Models
  • Appendix

§ Game Pairs & the Likelihood Function

When games are scored 1–0, ½–½, or 0–1, there are five outcomes for the score of a game pair: 2–0, 1½–½, 1–1, ½–1½, and 0–2. Let’s number these outcomes 1, 2, ..., 5 and write $p_1, p_2, \ldots, p_5$ for the expectations of each respectively. That is, the score of a game pair is a random variable $X$ with a probability mass function $p(i) = p_i$.

Given a particular outcome $x$ and a model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ of the score of a game pair as a function of rating difference 𝜗, the likelihood function is

$$\ell_x(\vartheta) = P(x \mid \vartheta) = m(\vartheta)[x].$$

For example, if the outcome is 1½–½, then $\ell_2(\vartheta) = p_2$. Note that although $p(\cdot)$ is a probability distribution, $\ell_x(\cdot)$ is not; in other words, $\sum_{i=1}^{5} p_i = 1$, whereas $\sum_{\vartheta=-\infty}^{+\infty} \ell_x(\vartheta)$ can be any nonnegative value.
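To make the bookkeeping concrete, here is a minimal sketch in Python; the stand-in model and its numbers are mine, not a claim about any particular outcome model.

from typing import Sequence

def likelihood(x: int, theta: float, model) -> float:
    """P(x | theta): pick out the probability of outcome x (1..5) from the
    model's pentanomial distribution at rating difference theta."""
    p = model(theta)        # <p1, ..., p5>
    return p[x - 1]

# Stand-in model for illustration only: it ignores theta and returns
# fixed probabilities that sum to 1.
def dummy_model(theta: float) -> Sequence[float]:
    return (0.10, 0.25, 0.30, 0.25, 0.10)

print(likelihood(2, 0.0, dummy_model))   # p2 = 0.25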

§ Sequential Probability Ratio Testing

Given two particular hypotheses, $H_0: \vartheta = \vartheta_0$ and $H_1: \vartheta = \vartheta_1$, the likelihood ratio is

$$\Lambda(x) = \frac{\ell_x(\vartheta_1)}{\ell_x(\vartheta_0)}.$$

Commonly $\vartheta_0 = 0$, so $H_0$ is the null hypothesis: “the players have the same rating”.

For a sequence $z_1, z_2, \ldots$ of outcomes, define the series $S$ with $S_0 = 0$ and

$$S_t = S_{t-1} + \log \Lambda(z_t).$$

Then sprt is the following procedure:

Given two thresholds $a$ and $b$ (where $a < 0$ and $0 < b$), starting from time $t = 1$, observe each outcome $z_t$, calculate $S_t$, and compare it to $a$ and $b$.

  • While $a < S_t < b$, continue observing.
  • When $S_t \le a$, reject $H_1$ and accept $H_0$.
  • When $b \le S_t$, reject $H_0$ and accept $H_1$.

The thresholds $a$ and $b$ that correspond to the probability $\alpha$ of a false positive and the probability $\beta$ of a false negative are

$$a = \log \frac{\beta}{1 - \alpha} \quad \text{and} \quad b = \log \frac{1 - \beta}{\alpha}.$$

When $\alpha = 0.05$ and $\beta = 0.05$, their values are $a \approx -2.94$ and $b \approx +2.94$.
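As a concrete sketch, here is the procedure in Python; the outcome model is passed in as a function, and all of the names are mine.

import math

def sprt_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    """Wald's thresholds: a (accept H0) and b (accept H1)."""
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    return a, b

def sprt(outcomes, model, theta0: float, theta1: float,
         alpha: float = 0.05, beta: float = 0.05):
    """Run the sequential probability ratio test over a stream of game-pair
    outcomes (each in 1..5). Returns "H0", "H1", or None if the stream ends
    before either threshold is crossed."""
    a, b = sprt_thresholds(alpha, beta)
    s = 0.0
    for z in outcomes:
        p0 = model(theta0)[z - 1]   # likelihood of z under H0
        p1 = model(theta1)[z - 1]   # likelihood of z under H1
        s += math.log(p1 / p0)
        if s <= a:
            return "H0"
        if s >= b:
            return "H1"
    return None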

to·do The generalized sequential probability ratio test deserves a section.

Sprt requires two hypotheses to compare, but we may only care whether a player is rated higher at all, not whether the player is rated higher by some particular amount (or, more precisely, whether being rated higher by that amount is more likely than being rated equal). It is also agnostic to any prior beliefs about the rating difference, implicitly assuming that one hypothesis is (initially) just as plausible as the other.

As far as I am aware, the first concern is not significant; sprt is perfectly sufficient, although the results of testing may be difficult to interpret. The second concern rarely matters, as our knowledge prior to testing is often relatively weak. However, when we do want results that are more directly interpretable, or we do have some significant bias, we can use another method.

§ Matches & the Pentanomial Distribution

For any random variable with five possible outcomes, where $p_i$ is the probability of outcome $i$, the pentanomial distribution gives the probability of sampling the variable $n$ times and observing outcome $i$ exactly $x_i$ times:

$$f(x_1, \ldots, x_5, p_1, \ldots, p_5) = \frac{n!}{x_1! \cdots x_5!} \, p_1^{x_1} \cdots p_5^{x_5}$$

Instead of representing the score of a single game pair, $X$ now represents the score of a match, and an outcome $x$ is now a collection $x_1, \ldots, x_5$. For example, $x_1$ is the number of game pairs where the outcome was 2–0.

Given a particular match outcome $x$ and a model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ of the score of a game pair as a function of rating difference 𝜗,

$$P(x \mid \vartheta) = f(x, m(\vartheta)).$$
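To evaluate this in practice, one option is SciPy's multinomial distribution; a sketch, where match_likelihood and the example counts are mine:

from scipy.stats import multinomial

def match_likelihood(x, probs):
    """P(x | theta) for a match outcome x = (x1, ..., x5), given the
    game-pair probabilities probs = m(theta) = (p1, ..., p5)."""
    return multinomial.pmf(x, n=sum(x), p=probs)

# e.g. 100 game pairs: 10 scored 2-0, 25 scored 1.5-0.5, 30 scored 1-1, ...
print(match_likelihood((10, 25, 30, 25, 10), (0.10, 0.25, 0.30, 0.25, 0.10)))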

§ The Posterior Distribution

Given a particular match outcome x, we can find the probability

$$P(\vartheta \mid x) = P(x \mid \vartheta) \times P(\vartheta) \,/\, P(x),$$

or more explicitly,

$$P(\Theta = \vartheta \mid x) = P(x \mid \Theta = \vartheta) \times P(\Theta = \vartheta) \,/\, P(x),$$

where

$$P(x) = \sum_{\rho} P(x \mid \Theta = \rho) \times P(\Theta = \rho),$$

leaving only the prior $P(\vartheta)$ to be determined.

$P(\vartheta \mid x)$ is the probability that the difference in rating is 𝜗 given the observed match outcome. Unlike $\ell_x(\cdot) = P(x \mid \cdot)$, the function $P(\cdot \mid x)$ is a distribution.

§ Uniform Priors

We can use as a prior a uniform distribution

$$P(\vartheta) = \begin{cases} 1/(b - a + 1) & \text{if } a \le \vartheta \le b \\ 0 & \text{otherwise} \end{cases}$$

over some interval $[a, b]$, and in particular, we can set $a = -L/2$ and $b = +L/2$ for some $L$ so that $P(\vartheta) = 1/(L + 1) = \lambda$ over the support.

Then

$$P(\vartheta \mid x) = P(x \mid \vartheta) \times \lambda \,/\, P(x)$$
$$P(x) = \sum_{\rho = -L/2}^{+L/2} P(x \mid \rho) \times \lambda$$

and $\lambda$ cancels, giving us

$$P(\vartheta \mid x) = \frac{P(x \mid \vartheta)}{\sum_{\rho = -L/2}^{+L/2} P(x \mid \rho)},$$

which is simply $P(x \mid \vartheta)$ normalized to have unit area over the support.

We can then consider the limit

$$\lim_{L \to \infty} \sum_{\rho = -L/2}^{+L/2} P(x \mid \rho).$$

This converges for the outcome models given below, so $\lim_{L \to \infty} P(\vartheta \mid x)$ exists. And in this sense, then, it is reasonable to talk about having a “uniform prior over all the integers”.
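As a sketch, the following Python computes this normalized posterior on a finite grid of integer rating differences; the grid half-width and the model argument are placeholders of mine.

import numpy as np
from scipy.stats import multinomial

def posterior(x, model, half_width: int = 400):
    """P(theta | x) over integer rating differences in
    [-half_width, +half_width], assuming a uniform prior."""
    thetas = np.arange(-half_width, half_width + 1)
    likes = np.array([multinomial.pmf(x, n=sum(x), p=model(t)) for t in thetas])
    return thetas, likes / likes.sum()   # normalize to unit total mass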

to·do Conditions for convergence and a proof of convergence deserve an appendix.

It is, of course, not necessary to use a uniform prior, but in the absence of any reason to assume a different prior it is perhaps the best default.

§ Use of the Posterior

Perhaps more useful than $P(\vartheta \mid x)$ is its cumulative distribution function,

$$C(t) = P(\Theta < t \mid x) = \sum_{\rho \le t - 1} P(\Theta = \rho \mid x).$$

As an alternative to sprt, one might stop a test when the probability that the rating difference is negative, C ( 0 ) , is less than some threshold (say, 5% to accept a change) or greater than some threshold (say, 95% to reject a change).
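Continuing the posterior sketch above, such a stopping rule might look like the following; the function name and argument defaults are mine.

def stop_decision(thetas, post, accept_below: float = 0.05, reject_above: float = 0.95):
    """C(0) = P(Theta < 0 | x): accept the change when this probability is
    small, reject it when it is large, and otherwise keep testing."""
    c0 = post[thetas < 0].sum()
    if c0 < accept_below:
        return "accept"
    if c0 > reject_above:
        return "reject"
    return "continue"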

to-do This section needs further elaboration.

to-do A better stopping condition might be when the dispersion is low enough (say, the variance, or the width $\vartheta_{\mathrm{hi}} - \vartheta_{\mathrm{lo}}$ of the hdi where $P(\vartheta_{\mathrm{lo}}) = P(\vartheta_{\mathrm{hi}}) = 0.05$, or the width $\vartheta_{\mathrm{hi}} - \vartheta_{\mathrm{lo}}$ where $C(\vartheta_{\mathrm{lo}}) = 0.05$ and $C(\vartheta_{\mathrm{hi}}) = 0.95$).

§ Sequential Bayesian Testing

to-do Sequential Bayesian analysis.

§ Outcome Models

For a game between two players, a and b, let’s write w for the expectation of a win for player a, d for the expectation of a draw, and l for the expectation of a loss for player a, so that w + d + l = 1. The expected composite score for player a is s = w + ½d.

For our purposes, a rating system is a map from a difference in rating 𝜗 to an expected composite score s.

A game model is a map from a difference in rating 𝜗 to a distribution ⟨w, d, l⟩.

A game model can induce a rating system because s is determined by w and d, but the reverse does not hold; w and d cannot be teased apart from s alone. But in any case, we usually think of rating systems as existing independently of game models.

In the Elo rating system, the expected composite score of a player rated 𝜗 points higher than their opponent is $1/(1 + 10^{-\vartheta/400})$. This ignores, for example, the drawishness of an opening, any advantage conferred by an opening, and the effect of time control; an Elo rating therefore indicates a player’s performance averaged over some particular gamut of conditions.

For a rating system, this simplicity is perfectly alright so long as we are aware of the conditions for which a rating is applicable. A game model, however, is greatly improved by additional parameters, because they allow us to account for variation within the gamut to estimate a player’s rating more quickly.

The logistic function $ſ(x) = 1/(1 + 10^{-x/400})$ happens to be the only component of the Elo rating system, $s_e(\vartheta) = ſ(\vartheta)$, but it also appears in the models presented in the following sections, for which $s(\vartheta) \ne ſ(\vartheta)$, and thus it is given a separate name.

Conventionally, an evaluation of +100 centipawn (where “centipawn” is an engine unit) corresponds to an expectation of a composite score of ¾, implying that a 100 centipawn advantage ought to be equivalent to a difference of about 191 Elo points:

$$1/(1 + 10^{-\vartheta/400}) = 3/4$$
$$1 + 10^{-\vartheta/400} = 4/3$$
$$10^{-\vartheta/400} = 1/3$$
$$\vartheta/400 = \log_{10} 3$$
$$\vartheta = 400 \log_{10} 3 \approx 191$$

In fact, we might simply define the metric for opening advantage to be identical to rating difference, but scaled so that 1 unit of advantage is equivalent to 1.91 rating points, and also name this unit a “centipawn”.
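A quick numerical check of this conversion in Python (the names are mine):

import math

def logistic(x: float) -> float:
    """The Elo logistic: expected composite score at rating difference x."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

elo_per_centipawn = 400.0 * math.log10(3.0) / 100.0
print(elo_per_centipawn)                  # ~1.91 rating points per centipawn
print(logistic(100 * elo_per_centipawn))  # ~0.75, an expected score of 3/4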

This allows us to account for opening advantage when using a rating system or game model that does not otherwise consider it. For example, from the Elo rating system $s_e(\vartheta)$ we could construct a rating system $s_e^{+}(\vartheta, \xi)$ where the expected composite score is

$$s_e^{+}(\vartheta, \xi) = s_e(\vartheta + 1.91 \times \xi) = ſ(\vartheta + 1.91 \times \xi)$$

when the opening advantage is 𝜉 centipawn as just defined.

Rather than taking this perspective, however, let’s continue to use $s_e(\cdot)$ and instead fold the opening advantage into the rating difference, and let’s refer to $\vartheta + 1.91 \times \xi$ as the effective rating difference, $\vartheta'$.

A better rating system might depend upon 𝜉 in a more complicated way. We defined the metric for opening advantage as we did for expediency, but it would probably be much preferable for s ( 0 , 𝜉 ) to closely follow the relationship between an engine’s evaluation and expected composite score, because querying an engine is the most straightforward way to estimate the advantage of an opening position. The metric we are adopting here, however, may only agree with an engine at s ( 0 , + 100 ) = ¾ .

Models for Individual Games

The following models have two parameters, $d_0$ and $q$.

Note that $q$ is not the effective rating difference for which the expected composite score s = w + ½d is ¾. For any particular value of $d_0$, however, the parameter $q$ can be adjusted so that, for example, an effective rating difference of +191 does map exactly to an expected composite score of ¾.

In the formulas for these two models, instead of writing “$\vartheta'$” like we ought, let’s write “$\vartheta$” to make things easier to read.

Rao-Kupper Model

$$c = 400 \log_{10} \frac{1 + d_0}{1 - d_0}$$
$$c_q = c / q$$
$$w(\vartheta) = ſ(+c_q(\vartheta - q))$$
$$l(\vartheta) = ſ(-c_q(\vartheta + q))$$
$$d(\vartheta) = 1 - w(\vartheta) - l(\vartheta)$$
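A sketch of this model in Python, following my reading of the formulas above (in particular the scaling $c_q = c/q$, chosen so that $d(0) = d_0$):

import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def rao_kupper(theta: float, d0: float, q: float):
    """Win/draw/loss expectations: d0 is the draw rate at theta = 0 and
    q is the rating difference at which the win expectation reaches 1/2."""
    c = 400.0 * math.log10((1.0 + d0) / (1.0 - d0))
    cq = c / q
    w = logistic(+cq * (theta - q))
    l = logistic(-cq * (theta + q))
    return w, 1.0 - w - l, l

print(rao_kupper(0.0, d0=0.35, q=191.0))   # the draw expectation should be 0.35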

Davidson Model

$$u = \frac{d_0}{1 - d_0}$$
$$c = 400 \times 2 \log_{10}\left(u + \sqrt{u^2 + 1}\right)$$
$$c_q = c / q$$
$$k = 2u \times \sqrt{ſ(+c_q \vartheta)\, ſ(-c_q \vartheta)}$$
$$w(\vartheta) = ſ(+c_q \vartheta) / (1 + k)$$
$$l(\vartheta) = ſ(-c_q \vartheta) / (1 + k)$$
$$d(\vartheta) = k / (1 + k)$$
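And a matching sketch of this model, continuing from the code above (again, $c_q = c/q$ and the square root in $k$ are my reading, chosen so that $d(0) = d_0$):

def davidson(theta: float, d0: float, q: float):
    """Win/draw/loss expectations: d0 is the draw rate at theta = 0 and
    q is the rating difference at which the win expectation reaches 1/2."""
    u = d0 / (1.0 - d0)
    c = 400.0 * 2.0 * math.log10(u + math.sqrt(u * u + 1.0))
    cq = c / q
    k = 2.0 * u * math.sqrt(logistic(+cq * theta) * logistic(-cq * theta))
    w = logistic(+cq * theta) / (1.0 + k)
    l = logistic(-cq * theta) / (1.0 + k)
    return w, k / (1.0 + k), l

print(davidson(0.0, d0=0.35, q=191.0))     # the draw expectation should be 0.35
print(davidson(191.0, d0=0.35, q=191.0))   # the win expectation should be 0.5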

You can compare and interact with these models on Desmos.

Models for Pairs of Games

to-do This section needs to be finished.

For any game model, there is a corresponding game pair model.

For a game pair where the players alternate (playing once as white/sente and once as black/gote), let’s write $\langle w_1, d_1, l_1 \rangle$ for the game in which a is the first player and $\langle w_2, d_2, l_2 \rangle$ for the game in which a is the second player:

$$\langle w_1, d_1, l_1 \rangle = \mathrm{wdl}(\vartheta + \xi)$$
$$\langle w_2, d_2, l_2 \rangle = \mathrm{wdl}(\vartheta - \xi)$$

Then the expectation of each outcome of the game pair is

$$p_1 = w_1 w_2$$
$$p_2 = w_1 d_2 + d_1 w_2$$
$$p_3 = w_1 l_2 + d_1 d_2 + l_1 w_2$$
$$p_4 = d_1 l_2 + l_1 d_2$$
$$p_5 = l_1 l_2.$$

Given a constant 𝜉, this is the model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ used in the first and second sections.

This is a little strange, because it is unlikely that every opening in the book has the same advantage, but we approximate by taking the average advantage.
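As a sketch, composing a game model into a game-pair model in Python (building on the functions above; the helper name is mine):

def game_pair_model(theta: float, xi: float, wdl):
    """Pentanomial probabilities <p1, ..., p5> for a game pair, given a game
    model wdl(effective rating difference) -> (w, d, l)."""
    w1, d1, l1 = wdl(theta + xi)   # the game where a moves first
    w2, d2, l2 = wdl(theta - xi)   # the game where a moves second
    return (w1 * w2,
            w1 * d2 + d1 * w2,
            w1 * l2 + d1 * d2 + l1 * w2,
            d1 * l2 + l1 * d2,
            l1 * l2)

# e.g. with the Davidson sketch above and an average opening advantage
# equivalent to 30 rating points:
print(game_pair_model(20.0, 30.0, lambda t: davidson(t, d0=0.35, q=191.0)))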

§ Appendix

The pentanomial distribution is difficult to compute exactly when n is large because the factorials grow very quickly.

$$\frac{n!}{x_1! \cdots x_5!} \, p_1^{x_1} \cdots p_5^{x_5}$$

Instead, we can rewrite this expression as

$$\exp\left(\log n! - \log x_1! - \cdots - \log x_5! + x_1 \log p_1 + \cdots + x_5 \log p_5\right)$$

and then approximate the factorial (where $\tau = 2\pi$) as

$$n! \approx \sqrt{\tau \left(n + \tfrac{1}{6}\right)} \left(\frac{n}{e}\right)^{n}$$

so that

$$\log n! \approx n \log n - n + \tfrac{1}{2} \log \tau + \tfrac{1}{2} \log\left(n + \tfrac{1}{6}\right).$$

We can then write a subroutine like the following:

define multinomial(x : Array(u64), p : Array(f64))
  ※ Accumulators for the pieces of log(n!/(x₁!···x₅!) × p₁^x₁···p₅^x₅):
  mutable log-power : f64   ※ exact log-factorials, the n·log n and −xᵢ·log xᵢ terms, and Σ xᵢ·log pᵢ
  mutable int-coeff : u64   ※ n minus the sum of the xᵢ whose factorials were approximated; subtracted at the end
  mutable tau-coeff : i64   ※ net count of ½·log τ terms; may be negative, so it is signed
  mutable log-sqrt  : f64   ※ net sum of the ½·log(· + 1/6) terms

  n = sum(x)
  if n > 20
    ※ n! is large, so approximate: log n! ≈ n log n − n + ½ log τ + ½ log(n + 1/6)
    log-power = n as f64 × log(n as f64)
    int-coeff = n
    tau-coeff = 1
    log-sqrt  = log(n as f64 + 1/6)
  else
    ※ n! is less than 2⁶⁴, so compute it exactly
    log-power = log(factorial(n) as f64)
    int-coeff = 0
    tau-coeff = 0
    log-sqrt  = 0
  end

  for i in 0, ..., length(x) − 1
    if x[i] > 20
      ※ subtract the approximation of log x[i]!
      log-power <- log-power − x[i] as f64 × log(x[i] as f64)
      int-coeff <- int-coeff − x[i]
      tau-coeff <- tau-coeff − 1
      log-sqrt  <- log-sqrt  − log(x[i] as f64 + 1/6)
    else
      ※ subtract log x[i]! exactly
      log-power <- log-power − log(factorial(x[i]) as f64)
    end
    log-power <- log-power + x[i] as f64 × log(p[i])
  end

  terms = log-power
        − (int-coeff as f64)
        + (tau-coeff as f64) × 1/2 × log(tau)
        + 1/2 × log-sqrt

  return exp(terms)
end
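For comparison, here is a Python sketch that sidesteps the piecewise approximation by using the log-gamma function, since log n! = lgamma(n + 1); the function name is mine.

import math

def multinomial_pmf(x, p):
    """Pentanomial (multinomial) pmf computed in log space via lgamma."""
    n = sum(x)
    log_pmf = math.lgamma(n + 1)
    for xi, pi in zip(x, p):
        log_pmf -= math.lgamma(xi + 1)
        if xi:                       # avoid 0 * log(0) when some pi is 0
            log_pmf += xi * math.log(pi)
    return math.exp(log_pmf)

print(multinomial_pmf((10, 25, 30, 25, 10), (0.10, 0.25, 0.30, 0.25, 0.10)))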

§ Further Reading

Statistical Methods and Algorithms in Fishtest
Stockfish Testing Framework sprt Calculator
Generalized Sequential Probability Ratio Test
Normalized Elo
Comments on Normalized Elo
From the Draw Ratio to Normalized Elo
The Accounting Identity
Bayesian Elo Rating
The Match Score as a Statistic for Comparing Engines
Modeling Game Outcomes in Chess
The Stockfish wdl Model