Estimating Differences in Rating

Browser engines vary somewhat in how they render MathML, and I have been unable to compensate entirely. In particular, browsers that use Blink (such as Chromium) ignore ink when computing a box’s inline size, which causes spacing problems, and the vertical offsets of some subscripts are incorrect. Browsers that use WebKit (such as Safari) exhibit more serious problems, but I do not have access to a browser with the developer tools needed to debug them effectively.

I recommend reading this page with a browser that uses Gecko (such as Firefox), which will render the page correctly.

Sections
  • method i: Game Pairs & the Likelihood Function · Sequential Probability Ratio Testing
  • method ii: Matches & the Pentanomial Distribution · The Posterior Distribution, Uniform Priors, & Use of the Posterior
  • method iii: Sequential Bayesian Testing
  • foundation: Outcome Models
  • Appendix

§ Game Pairs & the Likelihood Function

When games are scored 1–0, ½–½, or 0–1, there are five outcomes for the score of a game pair: 2–0, 1½–½, 1–1, ½–1½, and 0–2. Let’s number these outcomes 1, 2, ..., 5 and write $p_1, p_2, \ldots, p_5$ for the expectations of each respectively. That is, the score of a game pair is a random variable $X$ with a probability mass function $p(i) = p_i$.

Given a particular outcome $x$ and a model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ of the score of a game pair as a function of rating difference 𝜗, the likelihood function is

$$\ell_x(\vartheta) = P(x \mid \vartheta) = m(\vartheta)[x].$$

For example, if the outcome is 1½–½, then $\ell_2(\vartheta) = p_2$. Note that although $p(\cdot)$ is a probability distribution, $\ell_x(\cdot)$ is not; in other words, $\sum_{i=1}^{5} p_i = 1$, whereas $\sum_{\vartheta=-\infty}^{+\infty} \ell_x(\vartheta)$ can be any nonnegative value.
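To make the bookkeeping concrete, here is a minimal sketch in Python; the stand-in model and its numbers are mine, not a claim about any particular outcome model.

from typing import Sequence

def likelihood(x: int, theta: float, model) -> float:
    """P(x | theta): pick out the probability of outcome x (1..5) from the
    model's pentanomial distribution at rating difference theta."""
    p = model(theta)        # <p1, ..., p5>
    return p[x - 1]

# Stand-in model for illustration only: it ignores theta and returns
# fixed probabilities that sum to 1.
def dummy_model(theta: float) -> Sequence[float]:
    return (0.10, 0.25, 0.30, 0.25, 0.10)

print(likelihood(2, 0.0, dummy_model))   # p2 = 0.25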

§ Sequential Probability Ratio Testing

Given two particular hypotheses, $H_0: \vartheta = \vartheta_0$ and $H_1: \vartheta = \vartheta_1$, the likelihood ratio is

$$\Lambda(x) = \frac{\ell_x(\vartheta_1)}{\ell_x(\vartheta_0)}.$$

Commonly $\vartheta_0 = 0$, so $H_0$ is the null hypothesis: “the players have the same rating”.

For a sequence $z_1, z_2, \ldots$ of outcomes, define the series $S$ with $S_0 = 0$ and

$$S_t = S_{t-1} + \log \Lambda(z_t).$$

Then sprt is the following procedure:

Given two thresholds $a$ and $b$ (where $a < 0$ and $0 < b$), starting from time $t = 1$, observe each outcome $z_t$, calculate $S_t$, and compare it to $a$ and $b$.

  • While $a < S_t < b$, continue observing.
  • When $S_t \le a$, reject $H_1$ and accept $H_0$.
  • When $b \le S_t$, reject $H_0$ and accept $H_1$.

The thresholds $a$ and $b$ that correspond to the probability $\alpha$ of a false positive and the probability $\beta$ of a false negative are

$$a = \log \frac{\beta}{1 - \alpha} \quad \text{and} \quad b = \log \frac{1 - \beta}{\alpha}.$$

When $\alpha = 0.05$ and $\beta = 0.05$, their values are $a \approx -2.94$ and $b \approx +2.94$.
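As a concrete sketch, here is the procedure in Python; the outcome model is passed in as a function, and all of the names are mine.

import math

def sprt_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    """Wald's thresholds: a (accept H0) and b (accept H1)."""
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    return a, b

def sprt(outcomes, model, theta0: float, theta1: float,
         alpha: float = 0.05, beta: float = 0.05):
    """Run the sequential probability ratio test over a stream of game-pair
    outcomes (each in 1..5). Returns "H0", "H1", or None if the stream ends
    before either threshold is crossed."""
    a, b = sprt_thresholds(alpha, beta)
    s = 0.0
    for z in outcomes:
        p0 = model(theta0)[z - 1]   # likelihood of z under H0
        p1 = model(theta1)[z - 1]   # likelihood of z under H1
        s += math.log(p1 / p0)
        if s <= a:
            return "H0"
        if s >= b:
            return "H1"
    return None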

to·do The generalized sequential probability ratio test deserves a section.

Sprt requires two hypotheses to compare, but we may only care whether a player is rated higher at all, not whether the player is rated higher by some particular amount (or, more precisely, whether being rated higher by that amount is more likely than being rated equal). It is also agnostic to any prior beliefs about the rating difference, implicitly assuming that one hypothesis is (initially) just as plausible as the other.

As far as I am aware, the first concern is not significant; sprt is perfectly sufficient, although the results of testing may be difficult to interpret. The second concern rarely matters, as our knowledge prior to testing is often relatively weak. However, when we do want results that are more directly interpretable, or we do have some significant bias, we can use another method.

§ Matches & the Pentanomial Distribution

For any random variable with five possible outcomes, where $p_i$ is the probability of outcome $i$, the pentanomial distribution gives the probability of sampling the variable $n$ times and observing outcome $i$ exactly $x_i$ times:

$$f(x_1, \ldots, x_5, p_1, \ldots, p_5) = \frac{n!}{x_1! \cdots x_5!} \, p_1^{x_1} \cdots p_5^{x_5}$$

Instead of representing the score of a single game pair, $X$ now represents the score of a match, and an outcome $x$ is now a collection $x_1, \ldots, x_5$. For example, $x_1$ is the number of game pairs where the outcome was 2–0.

Given a particular match outcome $x$ and a model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ of the score of a game pair as a function of rating difference 𝜗,

$$P(x \mid \vartheta) = f(x, m(\vartheta)).$$
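To evaluate this in practice, one option is SciPy's multinomial distribution; a sketch, where match_likelihood and the example counts are mine:

from scipy.stats import multinomial

def match_likelihood(x, probs):
    """P(x | theta) for a match outcome x = (x1, ..., x5), given the
    game-pair probabilities probs = m(theta) = (p1, ..., p5)."""
    return multinomial.pmf(x, n=sum(x), p=probs)

# e.g. 100 game pairs: 10 scored 2-0, 25 scored 1.5-0.5, 30 scored 1-1, ...
print(match_likelihood((10, 25, 30, 25, 10), (0.10, 0.25, 0.30, 0.25, 0.10)))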

§ The Posterior Distribution

Given a particular match outcome x, we can find the probability

$$P(\vartheta \mid x) = P(x \mid \vartheta) \times P(\vartheta) \,/\, P(x),$$

or more explicitly,

$$P(\Theta = \vartheta \mid x) = P(x \mid \Theta = \vartheta) \times P(\Theta = \vartheta) \,/\, P(x),$$

where

$$P(x) = \sum_{\rho} P(x \mid \Theta = \rho) \times P(\Theta = \rho),$$

leaving only the prior $P(\vartheta)$ to be determined.

$P(\vartheta \mid x)$ is the probability that the difference in rating is 𝜗 given the observed match outcome. Unlike $\ell_x(\cdot) = P(x \mid \cdot)$, the function $P(\cdot \mid x)$ is a distribution.

§ Uniform Priors

We can use as a prior a uniform distribution

$$P(\vartheta) = \begin{cases} 1/(b - a + 1) & \text{if } a \le \vartheta \le b \\ 0 & \text{otherwise} \end{cases}$$

over some interval $[a, b]$, and in particular, we can set $a = -L/2$ and $b = +L/2$ for some $L$ so that $P(\vartheta) = 1/(L + 1) = \lambda$ over the support.

Then

$$P(\vartheta \mid x) = P(x \mid \vartheta) \times \lambda \,/\, P(x)$$
$$P(x) = \sum_{\rho = -L/2}^{+L/2} P(x \mid \rho) \times \lambda$$

and $\lambda$ cancels, giving us

$$P(\vartheta \mid x) = \frac{P(x \mid \vartheta)}{\sum_{\rho = -L/2}^{+L/2} P(x \mid \rho)},$$

which is simply $P(x \mid \vartheta)$ normalized to have unit area over the support.

We can then consider the limit

$$\lim_{L \to \infty} \sum_{\rho = -L/2}^{+L/2} P(x \mid \rho).$$

This converges for the outcome models given below, so $\lim_{L \to \infty} P(\vartheta \mid x)$ exists. And in this sense, then, it is reasonable to talk about having a “uniform prior over all the integers”.
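As a sketch, the following Python computes this normalized posterior on a finite grid of integer rating differences; the grid half-width and the model argument are placeholders of mine.

import numpy as np
from scipy.stats import multinomial

def posterior(x, model, half_width: int = 400):
    """P(theta | x) over integer rating differences in
    [-half_width, +half_width], assuming a uniform prior."""
    thetas = np.arange(-half_width, half_width + 1)
    likes = np.array([multinomial.pmf(x, n=sum(x), p=model(t)) for t in thetas])
    return thetas, likes / likes.sum()   # normalize to unit total mass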

to·do Conditions for convergence and a proof of convergence deserve an appendix.

It is, of course, not necessary to use a uniform prior, but in the absence of any reason to assume a different prior it is perhaps the best default.

§ Use of the Posterior

Perhaps more useful than $P(\vartheta \mid x)$ is its cumulative distribution function,

$$C(t) = P(\Theta < t \mid x) = \sum_{\rho \le t - 1} P(\Theta = \rho \mid x).$$

As an alternative to sprt, one might stop a test when the probability that the rating difference is negative, C ( 0 ) , is less than some threshold (say, 5% to accept a change) or greater than some threshold (say, 95% to reject a change).
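Continuing the posterior sketch above, such a stopping rule might look like the following; the function name and argument defaults are mine.

def stop_decision(thetas, post, accept_below: float = 0.05, reject_above: float = 0.95):
    """C(0) = P(Theta < 0 | x): accept the change when this probability is
    small, reject it when it is large, and otherwise keep testing."""
    c0 = post[thetas < 0].sum()
    if c0 < accept_below:
        return "accept"
    if c0 > reject_above:
        return "reject"
    return "continue"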

to-do This section needs further elaboration.

to-do A better stopping condition might be when the dispersion is low enough (say, the variance, or the width $\vartheta_{\mathrm{hi}} - \vartheta_{\mathrm{lo}}$ of the hdi where $P(\vartheta_{\mathrm{lo}}) = P(\vartheta_{\mathrm{hi}}) = 0.05$, or the width $\vartheta_{\mathrm{hi}} - \vartheta_{\mathrm{lo}}$ where $C(\vartheta_{\mathrm{lo}}) = 0.05$ and $C(\vartheta_{\mathrm{hi}}) = 0.95$).

§ Sequential Bayesian Testing

to-do Sequential Bayesian analysis.

§ Outcome Models

For a game between two players, a and b, let’s write w for the expectation of a win for player a, d for the expectation of a draw, and l for the expectation of a loss for player a, so that w + d + l = 1. The expected composite score for player a is s = w + ½d.

For our purposes, a rating system is a map from a difference in rating 𝜗 to an expected composite score s.

A game model is a map from a difference in rating 𝜗 to a distribution ⟨w, d, l⟩.

A game model can induce a rating system because s is determined by w and d, but the reverse does not hold; w and d cannot be teased apart from s alone. But in any case, we usually think of rating systems as existing independently of game models.

In the Elo rating system, the expected composite score of a player rated 𝜗 points higher than their opponent is $1/(1 + 10^{-\vartheta/400})$. This ignores, for example, the drawishness of an opening, any advantage conferred by an opening, and the effect of time control; an Elo rating therefore indicates a player’s performance averaged over some particular gamut of conditions.

For a rating system, this simplicity is perfectly alright so long as we are aware of the conditions for which a rating is applicable. A game model, however, is greatly improved by additional parameters, because they allow us to account for variation within the gamut to estimate a player’s rating more quickly.

The logistic function $ſ(x) = 1/(1 + 10^{-x/400})$ happens to be the only component of the Elo rating system, $s_e(\vartheta) = ſ(\vartheta)$, but it also appears in the models presented in the following sections, for which $s(\vartheta) \ne ſ(\vartheta)$, and thus it is given a separate name.

Conventionally, an evaluation of +100 centipawn (where “centipawn” is an engine unit) corresponds to an expectation of a composite score of ¾, implying that a 100 centipawn advantage ought to be equivalent to a difference of about 191 Elo points:

$$1/(1 + 10^{-\vartheta/400}) = 3/4$$
$$1 + 10^{-\vartheta/400} = 4/3$$
$$10^{-\vartheta/400} = 1/3$$
$$\vartheta/400 = \log_{10} 3$$
$$\vartheta = 400 \log_{10} 3 \approx 191$$

In fact, we might simply define the metric for opening advantage to be identical to rating difference, but scaled so that 1 unit of advantage is equivalent to 1.91 rating points, and also name this unit a “centipawn”.
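A quick numerical check of this conversion in Python (the names are mine):

import math

def logistic(x: float) -> float:
    """The Elo logistic: expected composite score at rating difference x."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

elo_per_centipawn = 400.0 * math.log10(3.0) / 100.0
print(elo_per_centipawn)                  # ~1.91 rating points per centipawn
print(logistic(100 * elo_per_centipawn))  # ~0.75, an expected score of 3/4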

This allows us to account for opening advantage when using a rating system or game model that does not otherwise consider it. For example, from the Elo rating system $s_e(\vartheta)$ we could construct a rating system $s_e^{+}(\vartheta, \xi)$ where the expected composite score is

$$s_e^{+}(\vartheta, \xi) = s_e(\vartheta + 1.91 \times \xi) = ſ(\vartheta + 1.91 \times \xi)$$

when the opening advantage is 𝜉 centipawn as just defined.

Rather than taking this perspective, however, let’s continue to use $s_e(\cdot)$ and instead fold the opening advantage into the rating difference, and let’s refer to $\vartheta + 1.91 \times \xi$ as the effective rating difference, $\vartheta'$.

A better rating system might depend upon 𝜉 in a more complicated way. We defined the metric for opening advantage as we did for expediency, but it would probably be much preferable for s ( 0 , 𝜉 ) to closely follow the relationship between an engine’s evaluation and expected composite score, because querying an engine is the most straightforward way to estimate the advantage of an opening position. The metric we are adopting here, however, may only agree with an engine at s ( 0 , + 100 ) = ¾ .

Models for Individual Games

The following models have two parameters, $d_0$ and $q$.

Note that $q$ is not the effective rating difference for which the expected composite score s = w + ½d is ¾. For any particular value of $d_0$, however, the parameter $q$ can be adjusted so that, for example, an effective rating difference of +191 does map exactly to an expected composite score of ¾.

In the formulas for these two models, instead of writing “$\vartheta'$” like we ought, let’s write “$\vartheta$” to make things easier to read.

Rao-Kupper Model

$$c = 400 \log_{10} \frac{1 + d_0}{1 - d_0}$$
$$c_q = c / q$$
$$w(\vartheta) = ſ(+c_q(\vartheta - q))$$
$$l(\vartheta) = ſ(-c_q(\vartheta + q))$$
$$d(\vartheta) = 1 - w(\vartheta) - l(\vartheta)$$
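A sketch of this model in Python, following my reading of the formulas above (in particular the scaling $c_q = c/q$, chosen so that $d(0) = d_0$):

import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def rao_kupper(theta: float, d0: float, q: float):
    """Win/draw/loss expectations: d0 is the draw rate at theta = 0 and
    q is the rating difference at which the win expectation reaches 1/2."""
    c = 400.0 * math.log10((1.0 + d0) / (1.0 - d0))
    cq = c / q
    w = logistic(+cq * (theta - q))
    l = logistic(-cq * (theta + q))
    return w, 1.0 - w - l, l

print(rao_kupper(0.0, d0=0.35, q=191.0))   # the draw expectation should be 0.35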

Davidson Model

$$u = \frac{d_0}{1 - d_0}$$
$$c = 400 \times 2 \log_{10}\left(u + \sqrt{u^2 + 1}\right)$$
$$c_q = c / q$$
$$k = 2u \times \sqrt{ſ(+c_q \vartheta)\, ſ(-c_q \vartheta)}$$
$$w(\vartheta) = ſ(+c_q \vartheta) / (1 + k)$$
$$l(\vartheta) = ſ(-c_q \vartheta) / (1 + k)$$
$$d(\vartheta) = k / (1 + k)$$
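And a matching sketch of this model, continuing from the code above (again, $c_q = c/q$ and the square root in $k$ are my reading, chosen so that $d(0) = d_0$):

def davidson(theta: float, d0: float, q: float):
    """Win/draw/loss expectations: d0 is the draw rate at theta = 0 and
    q is the rating difference at which the win expectation reaches 1/2."""
    u = d0 / (1.0 - d0)
    c = 400.0 * 2.0 * math.log10(u + math.sqrt(u * u + 1.0))
    cq = c / q
    k = 2.0 * u * math.sqrt(logistic(+cq * theta) * logistic(-cq * theta))
    w = logistic(+cq * theta) / (1.0 + k)
    l = logistic(-cq * theta) / (1.0 + k)
    return w, k / (1.0 + k), l

print(davidson(0.0, d0=0.35, q=191.0))     # the draw expectation should be 0.35
print(davidson(191.0, d0=0.35, q=191.0))   # the win expectation should be 0.5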

You can compare and interact with these models on Desmos.

Models for Pairs of Games

to-do This section needs to be finished.

For any game model, there is a corresponding game pair model.

For a game pair where the players alternate (playing once as white/sente and once as black/gote), let’s write $\langle w_1, d_1, l_1 \rangle$ for the game in which a is the first player and $\langle w_2, d_2, l_2 \rangle$ for the game in which a is the second player:

$$\langle w_1, d_1, l_1 \rangle = \mathrm{wdl}(\vartheta + \xi)$$
$$\langle w_2, d_2, l_2 \rangle = \mathrm{wdl}(\vartheta - \xi)$$

Then the expectation of each outcome of the game pair is

$$p_1 = w_1 w_2$$
$$p_2 = w_1 d_2 + d_1 w_2$$
$$p_3 = w_1 l_2 + d_1 d_2 + l_1 w_2$$
$$p_4 = d_1 l_2 + l_1 d_2$$
$$p_5 = l_1 l_2.$$

Given a constant 𝜉, this is the model $m(\vartheta) = \langle p_1, \ldots, p_5 \rangle$ used in the first and second sections.

This is a little strange, because it is unlikely that every opening in the book has the same advantage, but we approximate by taking the average advantage.
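As a sketch, composing a game model into a game-pair model in Python (building on the functions above; the helper name is mine):

def game_pair_model(theta: float, xi: float, wdl):
    """Pentanomial probabilities <p1, ..., p5> for a game pair, given a game
    model wdl(effective rating difference) -> (w, d, l)."""
    w1, d1, l1 = wdl(theta + xi)   # the game where a moves first
    w2, d2, l2 = wdl(theta - xi)   # the game where a moves second
    return (w1 * w2,
            w1 * d2 + d1 * w2,
            w1 * l2 + d1 * d2 + l1 * w2,
            d1 * l2 + l1 * d2,
            l1 * l2)

# e.g. with the Davidson sketch above and an average opening advantage
# equivalent to 30 rating points:
print(game_pair_model(20.0, 30.0, lambda t: davidson(t, d0=0.35, q=191.0)))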

§ Appendix

The pentanomial distribution is difficult to compute exactly when n is large because the factorials grow very quickly.

$$\frac{n!}{x_1! \cdots x_5!} \, p_1^{x_1} \cdots p_5^{x_5}$$

Instead, we can rewrite this expression as

$$\exp\left(\log n! - \log x_1! - \cdots - \log x_5! + x_1 \log p_1 + \cdots + x_5 \log p_5\right)$$

and then approximate the factorial (where $\tau = 2\pi$) as

$$n! \approx \sqrt{\tau \left(n + \tfrac{1}{6}\right)} \left(\frac{n}{e}\right)^{n}$$

so that

$$\log n! \approx n \log n - n + \tfrac{1}{2} \log \tau + \tfrac{1}{2} \log\left(n + \tfrac{1}{6}\right).$$

We can then write a subroutine like the following:

define multinomial(x : Array(u64), p : Array(f64))
  ※ Accumulators for the pieces of log(n!/(x₁!···x₅!) × p₁^x₁···p₅^x₅):
  mutable log-power : f64   ※ exact log-factorials, the n·log n and −xᵢ·log xᵢ terms, and Σ xᵢ·log pᵢ
  mutable int-coeff : u64   ※ n minus the sum of the xᵢ whose factorials were approximated; subtracted at the end
  mutable tau-coeff : i64   ※ net count of ½·log τ terms; may be negative, so it is signed
  mutable log-sqrt  : f64   ※ net sum of the ½·log(· + 1/6) terms

  n = sum(x)
  if n > 20
    ※ n! is large, so approximate: log n! ≈ n log n − n + ½ log τ + ½ log(n + 1/6)
    log-power = n as f64 × log(n as f64)
    int-coeff = n
    tau-coeff = 1
    log-sqrt  = log(n as f64 + 1/6)
  else
    ※ n! is less than 2⁶⁴, so compute it exactly
    log-power = log(factorial(n) as f64)
    int-coeff = 0
    tau-coeff = 0
    log-sqrt  = 0
  end

  for i in 0, ..., length(x) − 1
    if x[i] > 20
      ※ subtract the approximation of log x[i]!
      log-power <- log-power − x[i] as f64 × log(x[i] as f64)
      int-coeff <- int-coeff − x[i]
      tau-coeff <- tau-coeff − 1
      log-sqrt  <- log-sqrt  − log(x[i] as f64 + 1/6)
    else
      ※ subtract log x[i]! exactly
      log-power <- log-power − log(factorial(x[i]) as f64)
    end
    log-power <- log-power + x[i] as f64 × log(p[i])
  end

  terms = log-power
        − (int-coeff as f64)
        + (tau-coeff as f64) × 1/2 × log(tau)
        + 1/2 × log-sqrt

  return exp(terms)
end
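For comparison, here is a Python sketch that sidesteps the piecewise approximation by using the log-gamma function, since log n! = lgamma(n + 1); the function name is mine.

import math

def multinomial_pmf(x, p):
    """Pentanomial (multinomial) pmf computed in log space via lgamma."""
    n = sum(x)
    log_pmf = math.lgamma(n + 1)
    for xi, pi in zip(x, p):
        log_pmf -= math.lgamma(xi + 1)
        if xi:                       # avoid 0 * log(0) when some pi is 0
            log_pmf += xi * math.log(pi)
    return math.exp(log_pmf)

print(multinomial_pmf((10, 25, 30, 25, 10), (0.10, 0.25, 0.30, 0.25, 0.10)))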

§ Further Reading

Statistical Methods and Algorithms in Fishtest
Stockfish Testing Framework sprt Calculator
Generalized Sequential Probability Ratio Test
Normalized Elo
Comments on Normalized Elo
From the Draw Ratio to Normalized Elo
The Accounting Identity
Bayesian Elo Rating
The Match Score as a Statistic for Comparing Engines
Modeling Game Outcomes in Chess
The Stockfish wdl Model