The Math Behind ELO
If you’ve ever participated in some form of ranked sport or video game, chances are that you have heard of the ELO Rating System. Originally designed for ranking professional chess players, it has been adapted to many other domains such as college football, Scrabble, and video games. ELO’s popularity has been primarily driven by its impressive accuracy considering its simplicity, making it an ideal solution for anyone who wants to add a ranking system to their own projects. Even the initial version of the “Facebook Algorithm” was implemented using ELO!
I’ll give a brief overview of the ELO algorithm, its applications, and how it can be extended to better model the skill distributions of various playerbases.
The Algorithm #
ELO is generally used to rank a set of players using a set of historical match results. Suppose we have 4 competitive ping-pong players: Amy, Brad, Cindy, and Dirk.
Above, the ELO Change column is the result that the ELO algorithm generates for us. Given three values - playerA’s ELO, playerB’s ELO, and playerA’s score (whether they won or lost) - the algorithm calculates the change amount, which represents the number of ELO points that playerA will take from playerB.
The concept of an ELO implementation can be defined in a few steps:
- Every player in a league has an “Elo Rating”, which is a point total representing their skill level relative to the rest of the league.
- New players are automatically assigned an initial rating.
- When matches are completed between two players, their ELO points are compared. The winner of the match will “take” a certain number of ELO points from the loser.
- Players with higher ELOs are expected to win, so winning nets them only a few points. Conversely, when an underdog wins, they take many points from the winner.
Performance in an ELO system is not measured absolutely; rather, it is inferred from wins, losses, and draws against other players.
The actual ELO algorithm consists of 5 parts:
- Rn - The new rating of the player (After the match)
- Ro - The old rating of the player (Before the match)
- S - The actual score/outcome of the match
- E - The expected score/outcome of the match
- K - The K-factor
Rn = Ro + K * (S - E)
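As a minimal sketch of this update rule in Python (the function name is my own, and the expected score E is assumed to be computed separately, as described below):

```python
def update_rating(old_rating, k, score, expected):
    """Apply the ELO update rule: Rn = Ro + K * (S - E)."""
    return old_rating + k * (score - expected)

# A 1000-rated player with K = 32 wins (S = 1) a match
# they were expected to split evenly (E = 0.5):
print(update_rating(1000, 32, 1, 0.5))  # 1016.0
```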
Rn and Ro should be self-explanatory; they are simply the player’s rating after and before the match. To fully understand the algorithm, let’s take a deeper look at S, E, and K.
S (Actual Score) #
This is the actual score of the match. It will always satisfy 0 <= S <= 1. In the case of chess:
- a win counts as 1,
- a tie counts as 0.5,
- and a loss counts as 0.
In each case, this is the outcome for the player whose ELO is currently being calculated (So in our example above where Amy beat Brad, Amy would have a score of 1 for this calculation, while Brad would have a score of 0).
Note: Some ELO implementations will account for the margin of victory (one example being 538’s NFL ratings). This is non-standard, and will not be covered in this post.
E (Expected Outcome) #
This is the meat of the algorithm. It takes two inputs: Ra and Rb (The old ratings for player A and player B), then returns an expected win percentage for player A. It uses a Cumulative Distribution Function for Score Estimation.
You don’t need to know exactly what a CDF is, but a basic intuition for it is helpful.
While any CDF can be used as a score estimation function, the most popular choices are the Normal and Logistic functions. While Arpad Elo’s original algorithm utilized the Normal CDF, the World Chess Federation has since transitioned to use the Logistic implementation.
(Logistic is blue, Normal is red)
The core difference is that the default Logistic distribution has thicker tails, meaning that “upsets” are assumed to be more likely (which helps account for inconsistency in player ability). I’ll show how to implement the algorithm using a Logistic CDF, but note that a Normal CDF (or any other CDF) could be easily substituted for the score estimation function.
Score Estimation - Logistic CDF
The Logistic CDF looks like this:

F(x) = 1/(1 + e^(-(x - μ)/s))

I’ve implemented it in pseudocode below (you can see that we use mean μ = 0):
x = Ra - Rb
exponent = -(x/scale)
expected = 1/(1 + e^exponent)
There are a few parts here:
- Ra - Player A’s old elo
- Rb - Player B’s old elo
- x - The difference of Ra and Rb
- e - Euler’s Number (A mathematical constant)
- scale - The preselected standard deviation for the scoring system
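Here is that pseudocode as a runnable Python sketch (the function name is my own):

```python
import math

def expected_score(rating_a, rating_b, scale):
    """Logistic CDF (mean 0) evaluated at the rating difference.

    Returns player A's expected score against player B.
    """
    x = rating_a - rating_b
    return 1 / (1 + math.exp(-x / scale))

# Equal ratings give a 50% expected score, regardless of scale:
print(expected_score(1000, 1000, 173.72))  # 0.5
```

Note that the two players’ expected scores always sum to 1, which is what lets the winner “take” exactly what the loser gives up.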
The scale affects the “width” of the distribution. Larger scales will stretch the distribution, increasing the ELOs of the best players, while decreasing the ELOs of the worst players (Conversely, smaller scales will compress the distribution).
However, the scale parameter by itself is difficult to reason about. How much more is a scale of 2 than a scale of 1? We can make some substitutions to this scale parameter to allow us to better reason about its relationship with player skill.
The FIDE Chess implementation transforms this scale into a factor which represents a “10X” skill differential. By substituting s --> n/ln(10), we are now effectively saying that if playerA has n more ELO points than playerB, they are 10 times better (and should win 10 out of 11 games, ~90.9% of the time).
[See footnote for in-depth explanation of this substitution]1
Our formula becomes:
x = Ra - Rb
s = n/ln(10)
exponent = -(x/s)
E = 1/(1+e^exponent)
Which can be further simplified into:
x = Ra - Rb
exponent = -(x/n)
E = 1/(1+10^exponent)
Now, after selecting a value for n, all you need to do is pass in Ra and Rb, and the Score estimation result will be returned. (The FIDE implementation uses n=400, but any positive value can be used). Let’s take a look at a few examples.
# When players are equal strength (n = 100)
A = 1000
B = 1000
x = (1000 - 1000) = 0
exponent = -(x/100) = 0
expected = 1/(1+10^exponent) = 0.5
# ∴ Score estimation for player A = 50%
# When player A is 10x better (n = 100)
A = 1100
B = 1000
x = (1100 - 1000) = 100
exponent = -(100/100) = -1
expected = 1/(1+10^exponent) = 0.9090...
# ∴ Score estimation for player A = 90.90%
# When player A is 40 ELO points better (n = 100)
A = 1234
B = 1194
x = (1234 - 1194) = 40
exponent = -(40/100) = -0.4
expected = 1/(1+10^exponent) = 0.7152
# ∴ Score estimation for player A = 71.52%

# Calculation for player B
x2 = (1194 - 1234) = -40
exponent2 = -(-40/100) = 0.4
expected2 = 1/(1+10^exponent2) = 0.2847
# ∴ Score estimation for player B = 28.47%
# Notice how it equals (1 - estimation(A))
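The simplified base-10 form translates directly to Python (the function name is my own); it reproduces the worked examples above:

```python
def expected_score(rating_a, rating_b, n):
    """Expected score for player A, where n ELO points = a 10x skill gap."""
    x = rating_a - rating_b
    return 1 / (1 + 10 ** (-x / n))

print(expected_score(1000, 1000, 100))  # 0.5 (equal strength)
print(expected_score(1100, 1000, 100))  # 0.909... (10x better)
print(expected_score(1234, 1194, 100))  # ≈ 0.7152 (the 71.52% example)
```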
Now, we can make predictions of match outcomes - we just need to figure out the K-factor multiplier that is required to calculate the post-match ELO.
K (K-Factor) #
An effective K-factor is an extremely important part of any ELO implementation. The K-factor determines the “sensitivity” of how wins and losses impact your ELO.
Suppose we use the following model:
- everyone starts with an ELO of 1200
- we use a score estimation function where n=400.
If K is large (eg. 70), then unexpected wins (or losses) can have a huge impact on a player’s ELO, and cause the rankings to be volatile.
playerA = 1000
playerB = 1300
# playerA is expected to lose against playerB
E = 1/(1+10^-(-300/400)) = 0.150979557
# playerA wins against playerB
playerA = 1000 + 70 * (1 - 0.150979557)
# ∴ playerA = 1059.43143
# ∴ playerB = 1240.56857
This is a huge shift in ELO - even though playerB is expected to be ~5.6 (0.849/0.151) times better than playerA, a few lucky wins for playerA (or a few unlucky losses for playerB) could easily close the gap between their ELO rankings.
However, small K factors (eg. 5) can result in stagnant rankings, and would make it very difficult for players to gain ELO - even if they have been practicing/training, and are much more skilled than they had previously been.
playerA = 1000
playerB = 1300
# playerA is expected to lose against playerB
E = 1/(1+10^-(-300/400)) = 0.150979557
# playerA wins against playerB
playerA = 1000 + 5 * (1 - 0.150979557)
# ∴ playerA = 1004.2451
# ∴ playerB = 1295.7549
It could take hundreds of games for playerA to match playerB’s ELO, even if playerA has trained and is now of similar skill [to playerB].
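The two scenarios above can be compared side by side in Python (function names are mine):

```python
def expected_score(rating_a, rating_b, n=400):
    """Expected score for player A, with FIDE-style n = 400."""
    return 1 / (1 + 10 ** (-(rating_a - rating_b) / n))

def new_rating(old, k, score, expected):
    return old + k * (score - expected)

e = expected_score(1000, 1300)  # underdog's expected score, ≈ 0.151

# playerA upsets playerB; compare a volatile K to a stagnant K:
for k in (70, 5):
    print(k, round(new_rating(1000, k, 1, e), 2))
# 70 1059.43  (volatile)
# 5 1004.25   (stagnant)
```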
K-factor and Score Estimation
One last thing to consider is that K-factor should be appropriate for the current value of n (for score estimation).
The K-factor essentially represents the “Maximum” number of points that a player can gain during a match (If the players are of equal skill level, they will gain k/2 ELO points).
The n score estimation parameter represents the ELO point differential for a player that is 10x better than another player (meaning that a player with an ELO of 1400 is 10x better than a 1000 ELO player).
This means that, if a player always plays against players of equal ELO, they would have to win n/(K/2) games in order to become 10 times better than they originally were. For a typical online chess player (n = 400, K = 15), this works out to 400/(15/2) ≈ 53.33 games; rearranged, K = 800/53.33 ≈ 15.
The official World Chess Federation ratings actually use a tiered K-factor system, meaning that players could have different K-factors (and recall that they use an n of 400 for score estimation):
- K=40 for new players, until they have played 30 games
- K=20 for players with more than 30 games who have never had an ELO above 2400
- K=10 for players with more than 30 games who have had an ELO above 2400
This system assumes an amount of uncertainty in the ratings of new players, making it easier for them to reach their appropriate ELO quickly. It also makes it more difficult for extremely skilled players to lose ELO (eg. due to unlucky losses). Similar implementations can be seen in video games such as League of Legends or Overwatch, where you must play 10 “placement matches” before your ELO becomes publicly available.
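The tiers above can be expressed as a small selector function. This is only a sketch (the function, its arguments, and the under-30-games boundary handling are my own):

```python
def fide_k_factor(games_played, peak_rating):
    """Pick a K-factor using tiered, FIDE-style rules."""
    if games_played < 30:
        return 40   # new players: let ratings move quickly
    if peak_rating > 2400:
        return 10   # elite players: keep ratings stable
    return 20       # everyone else

print(fide_k_factor(10, 1500))   # 40
print(fide_k_factor(100, 2200))  # 20
print(fide_k_factor(100, 2600))  # 10
```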
Note that when players have differing K-factors, each player’s change is computed with their own K:
- playerA gains/loses Rn = Ro + K_a*(S - E)
- playerB gains/loses Rn = Ro + K_b*(S - E)
You can see that this differs from our analogy of “taking” points from the other player (because each player would gain/lose a different amount). That is intentional, but it should be taken into consideration when deciding to implement variable K-factors.
Soccer/Football rankings also use the K-factor in interesting ways. Sometimes, models will give matches of high importance a higher K-value (also with n = 400):
- K=60 for World Cup Finals
- K=40 for World Cup Qualifiers
- K=30 for other tournaments
- K=20 for friendly matches
In matches such as the World Cup Finals, the players are typically trying as hard as possible, so the result is an accurate representation of their skill level. However, in friendly matches the teams might not be trying as hard as they can - so upsets are more likely - which is why these ranking systems assign less importance to those matches.
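A match-importance scheme like the one above is just a lookup before the usual update (the dictionary keys and function names are my own; the K values are from the list above):

```python
MATCH_K = {
    "world_cup_final": 60,
    "world_cup_qualifier": 40,
    "other_tournament": 30,
    "friendly": 20,
}

def new_rating(old, match_type, score, expected):
    """Scale the rating change by how seriously the match was contested."""
    return old + MATCH_K[match_type] * (score - expected)

# The same upset moves a rating 3x more in a final than in a friendly:
print(new_rating(1500, "world_cup_final", 1, 0.5))  # 1530.0
print(new_rating(1500, "friendly", 1, 0.5))         # 1510.0
```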
To wrap up, here are my recommendations for K-factors:
- If few matches will be played (such as American Football, with ~16 games), then the rankings should be more volatile. Higher K-factors [relative to n] would be better, so a K-factor from K = (2*n)/20 to K = (2*n)/25 is more appropriate.
- If many matches will be played (such as Baseball, with ~162 games), then the rankings should be less volatile. Lower K-factors [relative to n] would be better, so a K-factor from K = (2*n)/30 to K = (2*n)/45 is more appropriate.
(Recall that the denominator in the above calculations is the “Number of wins to become 10 times better”).
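These rules of thumb can be captured in a small helper (this is entirely my own encoding of the ranges above, not a standard function):

```python
def k_factor_range(n, short_season):
    """Suggested (low, high) K-factor range, per the rules of thumb above.

    The denominator is the "number of wins to become 10 times better".
    """
    if short_season:                  # e.g. ~16 NFL games
        return (2 * n / 25, 2 * n / 20)
    return (2 * n / 45, 2 * n / 30)   # e.g. ~162 MLB games

print(k_factor_range(400, True))   # (32.0, 40.0)
print(k_factor_range(400, False))  # (17.77..., 26.66...)
```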
Some leagues will start with a base K-factor for the first season or two of play, and then refine the K-factor once they have more data. While K-factor tuning is outside the scope of this post, you can take a look here, here, or here.
Some algorithms even use a variable K factor for every player, based on the current “confidence” of their ELO ranking. One fantastic example is the Glicko Rating System. Another example is the Sonas Rating System.
You can also read some more posts about the K-factor here, here, or here.
Example Implementation #
At this point, we know how to design, tune, and implement an ELO algorithm from the ground up. Let’s walk through the implementation and application of ELO for our initial ping-pong matches.
First, we need to make 3 decisions:
- n for score estimation. Let’s (arbitrarily) select 50.
- K-factor. Because it is unlikely that we will have a lot of matches, let’s select a higher K-factor [relative to n]: K = (2*50)/20 = 5.
- Initial ELO. Let’s (arbitrarily) select 100.
Now, we can see our initial table rankings (no games have been played yet).
We have the first game coming up: Amy vs. Brad. The score estimations are easy:
# x = Ra - Rb
# E = 1/(1+10^-(x/n))
Amy_E = 1/(1+10^-(0/50)) = 0.5
Brad_E = 1/(1+10^-(0/50)) = 0.5
The game is played, and Amy wins! Let’s calculate their new ELOs.
# Rn = Ro + K*(S - E)
Amy = 100 + 5*(1-Amy_E) = 102.5
Brad = 100 + 5*(0-Brad_E) = 97.5
The resulting table of rankings:
In the next game Dirk plays against Cindy, and wins. Using similar calculations to those above, we will get the following result:
Now, Amy plays against Cindy. They have differing ELOs, so this calculation is a little more interesting than the last two:
Amy_E = 1/(1+10^-(5/50)) = 0.557311634
Cindy_E = 1/(1+10^-(-5/50)) = 0.442688366
You can see that the algorithm has given Amy a slight edge. She has won one game, and Cindy has lost one game, so we know that Amy is better - but at this point, it is tough to say how much better she really is.
The game is played, and Amy wins again. Let’s calculate their new ELOs.
Amy = 102.5 + 5*(1-Amy_E) = 104.713442
Cindy = 97.5 + 5*(0-Cindy_E) = 95.2865582
Some algorithms will round to a full integer, and others will keep the full float value. In our case, let’s round to two decimal places.
Let’s play one more match: Dirk vs. Cindy.
Cindy_E = 1/(1+10^-(-7.21/50)) = 0.417746101
Dirk_E = 1/(1+10^-(7.21/50)) = 0.582253899
We can see that Dirk has the advantage. The game is played, and Dirk wins against Cindy. Their resulting ELOs:
Cindy = 95.29 + 5*(0-Cindy_E) = 93.201269495
Dirk = 102.5 + 5*(1-Dirk_E) = 104.588730505
A really interesting point to note is that Amy ends up with a higher ELO than Dirk - even though they had identical ELOs (102.5) before each of them played Cindy (and both of them beat her). The reason is that Amy played Cindy before Dirk did. At that point, Cindy had a higher ELO (and was expected to be better). By the time Dirk played her, Cindy had a lower ELO, so he received fewer points for winning (as his expected win chance was higher). This is an interesting side effect of ELO, and shows how ELO rankings can be relatively unreliable until many games are played (at which point the rankings become a better representation of each player’s relative skill).
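To tie the whole example together, here is a sketch that replays all four matches, rounding to two decimal places after each update as we did above (all names are mine):

```python
def expected_score(ra, rb, n=50):
    """Expected score for the ra-rated player, with our chosen n = 50."""
    return 1 / (1 + 10 ** (-(ra - rb) / n))

def play(ratings, winner, loser, k=5):
    """Update both players' ratings after `winner` beats `loser`."""
    e_w = expected_score(ratings[winner], ratings[loser])
    e_l = expected_score(ratings[loser], ratings[winner])
    ratings[winner] = round(ratings[winner] + k * (1 - e_w), 2)
    ratings[loser] = round(ratings[loser] + k * (0 - e_l), 2)

ratings = {p: 100 for p in ("Amy", "Brad", "Cindy", "Dirk")}
play(ratings, "Amy", "Brad")
play(ratings, "Dirk", "Cindy")
play(ratings, "Amy", "Cindy")
play(ratings, "Dirk", "Cindy")
print(ratings)
# {'Amy': 104.71, 'Brad': 97.5, 'Cindy': 93.2, 'Dirk': 104.59}
```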
All things considered, ELO is a fantastic ranking system. Its relative simplicity makes it easy to implement and understand, while also making it easy to tweak the algorithm and apply it to many different situations.
The next time you need to add a ranking system to your app or game, I hope that you consider using ELO.
1 We are trying to make the scale parameter more meaningful for our model. The goal is to find a substitution n*y for s such that, when playerA has n more ELO points than playerB, playerA is expected to be 10 times better than playerB (meaning that out of 11 games, they are expected to win 10 and lose 1). In this case we have arbitrarily chosen 10x - we could have also chosen 5x or 2x (and ended up with a different resulting model). The unknown that we are solving for is y.
# Original Logistic CDF: playerB's expected score, scaled by 10,
# should equal playerA's expected score
10*(1/(1+e^-(-x/s))) == 1/(1+e^-(x/s))
# Substitute s --> n*y
10*(1/(1+e^-(-x/(n*y)))) == 1/(1+e^-(x/(n*y)))
# Set x = n (playerA has n more ELO points)
10*(1/(1+e^-(-n/(n*y)))) == 1/(1+e^-(n/(n*y)))
10*(1/(1+e^-(-1/y))) == 1/(1+e^-(1/y))
10*(1/(1+e^(1/y))) == 1/(1+e^-(1/y))
# Take reciprocals of both sides
(1/10)*(1+e^(1/y)) == 1+e^-(1/y)
1+e^(1/y) == 10*(1+e^-(1/y))
1+e^(1/y) == 10 + 10*e^-(1/y)
1+e^(1/y) == 10 + 10/e^(1/y)
# Multiply both sides by e^(1/y)
e^(1/y)*(1+e^(1/y)) == 10*e^(1/y) + 10
e^(1/y)*(1+e^(1/y)) == 10*(1 + e^(1/y))
# Divide both sides by (1 + e^(1/y))
e^(1/y) == 10
1/y == ln(10)
y == 1/ln(10)
# Therefore, s --> n*(1/ln(10)) = n/ln(10) is the required substitution.