The Math Behind ELO

If you’ve ever participated in some form of ranked sport or video game, chances are that you have heard about the ELO Rating System. Originally designed for ranking professional chess players, it has been adopted to many other domains such as college football, Scrabble, and video games. ELO’s popularity has been primarily driven by its impressive accuracy considering its simplicity, making it an ideal solution for someone that wants to add a ranking system to their own projects. even the initial version of the “Facebook Algorithm” was implemented using ELO!

elo-facebook.jpg


I’ll give a brief overview of the ELO algorithm, its applications, and how it can be extended to better model the skill distributions of various playerbases.


The Algorithm #

ELO is generally used to rank a set of players by using a set of historical match results. Suppose we have 4 competitive ping-pong players; Amy, Brad, Cindy, and Dirk.

Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 1200 1200 1200 1200
Amy Brad 20 1220 1180 1200 1200
Dirk Cindy 20 1220 1180 1180 1220
Amy Cindy 18 1238 1180 1162 1220
Dirk Cindy 17 1238 1180 1162 1237

Above, the ELO Change column is the result that ELO algorithm generates for us. Given three values - PlayerA ELO, PlayerB ELO, and PlayerA’s Score (whether they win or lost) - the algorithm calculates the change amount, which represents the amount of ELO points that playerA will take from playerB.

The concept of an ELO implementation can be defined in a few steps:

  1. Every player in a league has an “Elo Rating”, which is a point total representing their skill level relative to the rest of the league.
  2. New players are automatically assigned an initial rating.
  3. When matches are completed between two players, their ELO points are compared. The winner of the match will “take” a certain number of ELO points from the loser.
  4. Players with higher ELOs are expected to win, so winning will net them few points. Conversely, when an underdog wins, they take many points from the winner.

Performance in an ELO system is not measured absolutely; rather, it is inferred from wins, losses, and draws against other players.


The actual ELO algorithm consists of 5 parts:

Rn = Ro + K * (S - E)

Rn and Ro should be self-explanatory; these are simply the previous ELOs of each player. To fully understand the algorithm, lets take a deeper look at S, E, and K.


S (Actual Score) #

This is the actual score of the match. It will always be 0 <= x <= 1 . In the case of chess:

In each case, this is the outcome for the player whose ELO is currently being calculated (So in our example above where Amy beat Brad, Amy would have a score of 1 for this calculation, while Brad would have a score of 0).

Note: Some elo implementations will account for the margin of victory (One example being 538’s NFL Ratings). This is non-standard, and will not be covered it in this post.


E (Expected Outcome) #

This is the meat of the algorithm. It takes two inputs: Ra and Rb (The old ratings for player A and player B), then returns an expected win percentage for player A. It uses a Cumulative Distribution Function for Score Estimation.

While you don’t need to know what a CDF is, it is helpful to understand: 1 2

While any CDF can be used as a score estimation function, the most popular choices are the Normal and Logistic functions. While Arpad Elo’s original algorithm utilized the Normal CDF, the World Chess Federation has since transitioned to use the Logistic implementation.

lognormal.png(Logistic is blue, Normal is red)

The core difference is that the default Logistic distribution has thicker tails, meaning that “upsets” are assumed to be more likely (which helps account for inconsistency in player ability). I’ll show how to implement the algorithm using a Logistic CDF, but note that a Normal CDF (or any other CDF) could be easily substituted for the score estimation function.


Score Estimation - Logistic CDF

The Logistic CDF looks like this:

logistic-cdf.png

I’ve implemented it in pseudocode below .
(You can see that we use mean μ=0)

x = Ra - Rb
exponent = -(x/scale)
expected = 1/(1 + e^exponent)

There are a few parts here:

The scale affects the “width” of the distribution. Larger scales will stretch the distribution, increasing the ELOs of the best players, while decreasing the ELOs of the worst players (Conversely, smaller scales will compress the distribution).

Screen Shot 2017-02-02 at 3.55.58 PM.png

However, the scale parameter by itself is difficult to reason about. How much more is a scale of 2 than a scale of 1? We can make some substitutions to this scale parameter to allow us to better reason about its relationship with player skill.

The FIDE Chess implementation transforms this scale into a factor which represents a “10X” skill differential. By substituting s --> n/ln(10) we are now effectively saying that if PlayerA has n more ELO points than PlayerB, they are 10 times better (and should win 10/11, ~90.1% of the time).
[See footnote for in-depth explanation of this substitution]1

Our formula becomes:

x = Ra - Rb
s = n/ln(10)
exponent = -(x/s)
E = 1/(1+e^exponent)

Which can be further simplified into:

x = Ra - Rb
exponent = -(x/n)
E = 1/(1+10^exponent)

Now, after selecting a value for n, all you need to do is pass in Ra and Rb, and the Score estimation result will be returned. (The FIDE implementation uses n=400, but any positive value can be used). Let’s take a look at a few examples.

# When players are equal strength (n = 100)
A = 1000
B = 1000

x = (1000 - 1000) = 0
exponent = -(x/100) = 0
expected = 1/(1+10^exponent) = 0.5
# ∴ Score estimation for player A = 50%
# When player A is 10x better (n = 100)
A = 1100
B = 1000

x = (1100 - 1000) = 100
exponent = -(100/100) = -1
expected = 1/(1+10^exponent) = 0.9090...
# ∴ Score estimation for player A = 90.90%
# When player A is 40 ELO points better (n = 100)
A = 1234
B = 1184

x = (1234 - 1194) = 40
exponent = -(40/100) = -0.4
expected = 1/(1+10^exponent) = 0.7152
# ∴ Score estimation for player A = 71.52%

# Calculation for player B
x2 = (1184- 1234) = -50
exponent2 = -(-50/100) = 0.5
expected2 = 1/(1+10^exponent) = 0.2847
# ∴ Score estimation for player B = 28.47%
# Notice how it equals (1-estimation(A))

Now, we can make predictions of match outcomes - we just need to figure out the K-factor multiplier that is required to calculate the post-match ELO.


K-factor #

An effective K-factor is an extremely important part of any ELO implementation. The K-factor determines the “sensitivity” of how wins and losses impact your ELO.

Suppose we use the following model:

If K is large (eg. 70), then unexpected wins (or losses) can have a huge impact on a player’s ELO, and cause the rankings to be volatile.

playerA  = 1000
playerB = 1300

# playerA is expected to lose against playerB
E = 1/(1+10^-(-300/400)) = 0.150979557

# playerA wins against playerB
playerA = 1000 + 70 * (1 - 0.150979557)

# ∴ playerA = 1059.43143
# ∴ playerB = 1240.56857

This is a huge shift in ELO - even though playerB is expected to be ~5.6 (0.849/0.151) times better than playerA, a few lucky wins for playerA (or a few unlucky losses for playerB) could easily close the gap between their ELO rankings.

However, small K factors (eg. 5) can result in stagnant rankings, and would make it very difficult for players to gain ELO - even if they have been practicing/training, and are much more skilled than they had previously been.

playerA  = 1000
playerB = 1300

# playerA is expected to lose against playerB
E = 1/(1+10^-(-300/400)) = 0.150979557

# playerA wins against playerB
playerA = 1000 + 5 * (1 - 0.150979557)

# ∴ playerA = 1004.2451
# ∴ playerB = 1295.7549

It could take hundreds of games for playerA to match playerB’s ELO, even if playerA has trained and is now of similar skill [to playerB].


K-factor and Score Estimation

One last thing to consider is that K-factor should be appropriate for the current value of n (for score estimation).

The K-factor essentially represents the “Maximum” number of points that a player can gain during a match (If the players are of equal skill level, they will gain k/2 ELO points).

The n score estimation parameter represents the ELO point differential for a player that is 10x better than another player (meaning that a player with an ELO of 1400 is 10x better than a 1000 ELO player).

This means that, if a player always plays against players of equal ELO level, they would have to win n/(k/2) games in order to be 10 times better than they originally were. For a typical online chess player, this would typically be 400/(15/2) = 53.33 games, or K = 800/53.33

The official World Chess Federation ratings actually use a tiered K-factor system, meaning that players could have different K-factors (and recall that they use an n of 400 for score estimation):

This system assumes an amount of uncertainty for the ratings of new players, making it easier for them to reach their appropriate ELO quickly. It and also makes it more difficult for extremely skilled players to lose ELO (eg. due to unlucky losses). Similar implementations can be seen in video games such as League of Legends or Overwatch, where you must play 10 “placement matches” before your ELO becomes publicly available.

Note that when players have differing ELOs:

  • playerA gains/loses Rn = Ro + K_a*(S-E)
  • playerB gains/loses Rn = Ro + K_b*(S-E)

You can see that this differs from our analogy of “taking” points from the other player (because each player would gain/lose a different amount). That is okay and is intentional, but should be taken into consideration when deciding to implement variable K-factors.

Soccer/Football rankings also use the K-factor in interesting ways. Sometimes, models will give matches of high importance a higher K-value (also with n = 400):

In matches such as the World Cup Finals, the players are typically trying as hard as possible, so it is an accurate representation of their skill level. However, in friendly matches the team might not be trying as hard as they can - so upsets may be more likely - and thus why the ranking systems assign less importance to those matches.

To wrap up my recommendation for K-factors:

(Recall that the denominator in the above calculations is the “Number of wins to become 10 times better”).

Some leagues will start with a base K-factor for the first season or two of play, and then refine the K-factor once they have more data. While K-factor tuning is outside the scope of this post, you can take a look here, here, or here.

Some algorithms even use a variable K factor for every player, based on the current “confidence” of their ELO ranking. One fantastic example is the Glicko Rating System. Another example is the Sonas Rating System.

You can also read some more posts about the K-Factor here, here, or here


Example Implementation #

At this point, we know how to design, tune, and implement an ELO algorithm from the ground up. Lets walk through the implementation and application of ELO for our initial ping-pong matches.

First, we need to make 3 decisions.

Now, we can see our initial table rankings (no games have been played yet.

Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 100 100 100 100



Game 1
We have the first game coming up: Amy vs. Brad. The score estimations are easy:

# x = Ra - Rb
# E = 1/(1+10^-(x/n))
Amy_E = 1/(1+10^-(0/50)) = 0.5
Brad_E = 1/(1+10^-(0/50)) = 0.5

The game is played, and Amy wins! Lets calculate their new ELOs.

# Rn = Ro + K*(S - E)
Amy = 100 + 5*(1-Amy_E) = 102.5
Brad = 100 + 5*(0-Brad_E) = 97.5

The resulting table of rankings:

Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 100 100 100 100
Amy Brad 2.5 102.5 97.5 100 100



Game 2
In the next game Dirk plays against Cindy, and wins. Using similar calculations to those above, we will get the following result:

Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 100 100 100 100
Amy Brad 2.5 102.5 97.5 100 100
Dirk Cindy 2.5 102.5 97.5 97.5 102.5



Game 3
Now, Amy plays against Cindy. They have differing ELOs, so this calculation is a little more interesting than the last two:

Amy_E = 1/(1+10^-(5/50)) = 0.557311634
Cindy_E = 1/(1+10^-(-5/50)) = 0.442688366

You can see that the algorithm has given Amy a slight edge. She has won one game, and Cindy has lost one game, so we know that Amy is better - but at this point, it is tough to say how much better she really is.

The game is played, and Amy wins again. Lets calculate their new ELOs.

Amy = 102.5 + 5*(1-Amy_E) = 104.713442
Cindy = 97.5 + 5*(0-Cindy_E) = 95.2865582

Some algorithms will round to a full integer, and others will leave it as a full float value. In our case, lets round to two decimal points.

Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 100 100 100 100
Amy Brad 2.5 102.5 97.5 100 100
Dirk Cindy 2.5 102.5 97.5 97.5 102.5
Amy Cindy 2.5 104.71 97.5 95.29 102.5



Game 4
Lets play one more match. Dirk vs. Cindy.

Cindy_E = 1/(1+10^-(-7.21/50)) = 0.417746101
Dirk_E = 1/(1+10^-(7.21/50)) = 0.582253899

We can see that Dirk has the advantage. The game is played, and Dirk wins against Cindy. Their resulting ELOs:

Cindy = 95.29 + 5*(0-Cindy_E) = 93.201269495
Dirk = 102.5 + 5*(1-Dirk_E) = 104.588730505
Winner Loser ELO Change Amy Brad Cindy Dirk
- - - 100 100 100 100
Amy Brad 2.5 102.5 97.5 100 100
Dirk Cindy 2.5 102.5 97.5 97.5 102.5
Amy Cindy 2.5 104.71 97.5 95.29 102.5
Dirk Cindy 2.5 104.71 97.5 93.20 104.59


A really interesting point to note is that Amy actually ends up with a higher ELO than Dirk - even though they started with the same ELOs before they played against Cindy (which they both ended up winning). The reason for this is that Amy played Cindy before Dirk did. At that point, Cindy had a higher ELO (and was expected to be better). By the time Dirk played her, Cindy had a lower ELO, and thus he received less points for winning (as his expected win chance was higher). This is an interesting side effect of ELO, and shows how ELO rankings can be relatively unreliable until many games are played (which is when the ELO rankings start to become a better representation of each player’s relative skill).



Wrap-Up #

All things considered, ELO is a fantastic ranking system. It’s relative simplicity makes it easy to implement and understand, while also making it easy to tweak the algorithm and apply it to many different situations.

The next time you need to add a ranking system to your app or game, I hope that you consider using ELO.





Footnotes #

1 We are trying to make the scale parameter more meaningful for our model. The goal will be to find a substitution ny for s such that, when playerA has n more ELO points than playerB, they will be expected to be 10 times better than playerB (Meaning that our of 11 games, they are expected to win 10 and lose 1). In this case we have arbitrarily chosen 10x - we could have also chosen 5x or 2x (and ended up with a different resulting model). The unknown that we are solving for is y.

# Original Logistic CDF
10*(1/(1+e^-(-x/s))) == 1/(1+e^-(x/s))

# Substitute s --> ny
10*(1/(1+e^-(-x/ny))) == 1/(1+e^-(x/ny))

10*(1/(1+e^-(-n/(ny)))) == 1/(1+e^-(n/(ny)))
10*(1/(1+e^-(-1/y))) == 1/(1+e^-(1/y))
10*(1/(1+e^(1/y))) == 1/(1+e^-(1/y))

# Reciprocals of both sides 
(1/10)*(1+e^(1/y)) == 1+e^-(1/y)
1+e^(1/y) == 10*(1+e^-(1/y))
1+e^(1/y) == 10*+10*e^-(1/y)
1+e^(1/y) == 10*+10/e^(1/y)
1+e^(1/y) == (10*(e^(1/y))*+10)/e^(1/y)
e^(1/y)(1+e^(1/y))== 10*(e^(1/y))*+10
e^(1/y)(1+e^(1/y))== 10 + 10*(e^(1/y))
e^(1/y)(1+e^(1/y))== 10(1 + e^(1/y))
e^(1/y)== 10
1/y== ln(10)
y == 1/ln(10)

# Therefore, s --> n*(1/ln(10)) is the required substitution.




 
115
Kudos
 
115
Kudos