Note: I ran through all of this and then thought – my god, this must have been done to death already – a quick google brought me here: http://www.diplom.org/Zine/S1998R/Nichols/ratings2.html. It’s been done before but I’m sure with our great minds we can come up with some appropriate tweaks.
The Elo rating has been a very successful and widely used system, so I’m not sure why we’d try to reinvent the wheel. My idea is to figure out how best to fit the Elo system to diplomacy. I’m first only going to consider WTA and then give a few words on how I’d probably adjust it for PPSC.
First, let’s look at what the formula is for the Elo rating:
R_i’ = R_i + K * (S_i – E_i)
R_i is the rating of player i
K (the ‘K-value’) is a variable that affects the rate at which ratings change
S_i is the score of player i during a period (single game, month, tournament, etc.)
E_i is the expected score of the player i during that period
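To make the update rule concrete, here is a minimal Python sketch of it. The function name and the default K of 32 are my own placeholders, not something settled in this thread.

```python
def elo_update(rating, score, expected, k=32.0):
    """Return the new rating R_i' = R_i + K * (S_i - E_i).

    k=32.0 is only a placeholder value for illustration.
    """
    return rating + k * (score - expected)


# A player rated 1500 who was expected to score 0.5 but scored 1.0
new_rating = elo_update(1500.0, score=1.0, expected=0.5, k=32.0)
```

Scoring above expectation pushes the rating up by K times the surplus, and scoring below pulls it down by the same rule.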
In chess, the score of a player is defined as 1 for a win, 0.5 for a draw, and 0 for a loss. In diplomacy the appropriate analogy, imho, would be 1/n where n is the number of players in the final draw/solo. Of course, you could come up with a different weighting where solos and smaller draws are more highly rated, as long as the scores of all players sum to 1. However, for the sake of this argument, let’s leave it at 1/n.
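A rough Python sketch of the 1/n scoring, assuming (my reading of the above) that anyone not in the final draw/solo scores 0. The function name and the power labels are just for illustration.

```python
def wta_scores(players, in_final_result):
    """Give 1/n to each player in the final draw/solo and 0 to the rest."""
    share = 1.0 / len(in_final_result)
    return {p: (share if p in in_final_result else 0.0) for p in players}


powers = ["Austria", "England", "France", "Germany", "Italy", "Russia", "Turkey"]
solo = wta_scores(powers, {"France"})                       # France gets 1
three_way = wta_scores(powers, {"England", "France", "Turkey"})  # 1/3 each
```

Either way the scores across the board add up to 1, which is the property any alternative weighting would have to keep.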
The next step is to determine what the expected score between two differently rated players would be. This can be arbitrarily chosen. In chess, a spread of 200 points results in an expected score of about .75 (i.e. if you played two games, a player rated 200 points higher than another would be expected to win one and draw one). This can be tweaked to adjust the spread but, for the sake of simplicity, let’s just leave it there, although an expected score of .75 in diplomacy is very high.
Writing formulae in the forum isn’t pretty so it may be best to follow the wiki page to understand what the hell I’m trying to type. http://en.wikipedia.org/wiki/Elo_rating_system#Mathematical_details
For Elo rating, E_a is given by
E_a = 1/{ 1 + 10 ^ [ ( R_b – R_a) / 400] }
Where a is the player we’re evaluating and b is the opponent.
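For anyone who would rather read code than forum-formatted math, here is the same two-player formula as a small Python sketch (the function name is mine).

```python
def expected_score(r_a, r_b):
    """Expected score of player a against player b under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


gap_200 = expected_score(1700.0, 1500.0)  # roughly 0.76, the ".75" above
even = expected_score(1500.0, 1500.0)     # equal ratings give 0.5
```

Note that the two players’ expectations always add up to 1, which is what makes the two-player system zero-sum.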
This is where the analogy breaks down and we need to make a decision. In diplomacy there are, of course, 6 opponents and not 1. There are two ways apparent to me for how to deal with this.
Option 1:
R_b = (R_b1 + R_b2 + … R_b6) / 6
Where b1 through 6 are your opponents
Option 2:
E_a = ( 1/{ 1 + 10 ^ [ ( R_b1 – R_a) / 400] } + 1/{ 1 + 10 ^ [ ( R_b2 – R_a) / 400] } … ) / 6
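Strangely, I don’t own a pen or pencil in my house (Jesus that’s pretty weird, eh?) so I can’t work through the math any further. One thing that does need checking is whether E_a+E_b1+..+E_b6 still equals 1, to match the scores. As written it doesn’t in either case: with seven equally rated players both options give everyone 0.5, for a total of 3.5, so the expected scores would need rescaling.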
I see no obvious advantage to either one but maybe with some more digging there’ll be something. At very least, I’d lean towards the first option because of the simplicity.
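Here is a minimal Python sketch of the two options, mainly to check the sum numerically. The function names are mine, and `option2_rescaled` is a hypothetical tweak rather than anything proposed above: it divides by the number of pairings on the board (21 for seven players) instead of 6, which makes the expected scores total 1.

```python
def expected_score(r_a, r_b):
    """Standard two-player Elo expectation for a against b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def option1(ratings, i):
    """Option 1: one virtual opponent rated at the average of the others."""
    others = ratings[:i] + ratings[i + 1:]
    return expected_score(ratings[i], sum(others) / len(others))


def option2(ratings, i):
    """Option 2: average of the pairwise expectations against each opponent."""
    others = ratings[:i] + ratings[i + 1:]
    return sum(expected_score(ratings[i], r) for r in others) / len(others)


def option2_rescaled(ratings, i):
    """Hypothetical variant: divide by the number of pairings, n(n-1)/2."""
    n = len(ratings)
    others = ratings[:i] + ratings[i + 1:]
    pairings = n * (n - 1) / 2
    return sum(expected_score(ratings[i], r) for r in others) / pairings


equal_board = [1500.0] * 7
total1 = sum(option1(equal_board, i) for i in range(7))
total2 = sum(option2(equal_board, i) for i in range(7))

mixed_board = [1700.0, 1600.0, 1550.0, 1500.0, 1450.0, 1400.0, 1300.0]
total_rescaled = sum(option2_rescaled(mixed_board, i) for i in range(7))
```

On the equal board both options total 3.5 rather than 1. The rescaled version totals 1 on any board because each pairing contributes exactly 1 between its two players; I don’t see an equally tidy fix for Option 1 short of dividing by the actual sum.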
The next and final step is choosing K. K determines how quickly ratings can fluctuate. A K that is too large will cause ratings to fluctuate wildly and unpredictably, and a K that is too low will make it take forever for players to reach their true ratings. Furthermore, K does not have to be a constant. In chess, K often depends on a player’s rating. The thinking is that highly rated chess players don’t change as much while weaker players can alter their skill level quite significantly. Another way is to make K dependent on the number of games played. I think the latter would work best for diplomacy. We could make it exceptionally large for the first few games and then bring it back down to Earth after, say, 5 games or so. Really, this is something that has plenty of room for great discussion.
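A games-played K could be as simple as the Python sketch below. The cutoff of 5 games comes from the suggestion above; the two K values (64 and 24) are placeholders I made up and would need tuning.

```python
def k_factor(games_played, provisional_games=5, k_provisional=64.0, k_settled=24.0):
    """Large K for a player's first few games, smaller K afterwards.

    The K values are illustrative placeholders, not tuned numbers.
    """
    if games_played < provisional_games:
        return k_provisional
    return k_settled


k_new_player = k_factor(0)   # still provisional
k_veteran = k_factor(40)     # settled
```

A smoother schedule (K shrinking gradually with each game) would also work; the step version is just the easiest to reason about.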
Other ideas that could be floated around are:
- Vaft can easily calculate the expected scores of each power compared to the others. We could take these ratings to tweak the expected scores.
- A means for injecting or removing points to maintain a constant average rating
- A rating floor
- How to incorporate this into a combined rating system with FP, GB, etc. Adjusting the K value would be the obvious answer
- A penalty for resigns. You could count them as double losses or something along that line.
- Figuring out an equivalent expected score for PPSC. I expect that it would be difficult to find a range of values that would exactly match WTA expected scores.
- Keeping provisional scores (<n games) off the leaderboard
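Two of the ideas above (holding the average rating constant and a rating floor) are easy to sketch in Python. Everything here is an assumption for illustration: the function names, the 1500 target average and the 1000 floor. Shifting everyone by the same amount is only one possible way of injecting or removing points.

```python
def recenter(ratings, target_mean=1500.0):
    """Shift every rating by the same amount so the average hits target_mean."""
    shift = target_mean - sum(ratings.values()) / len(ratings)
    return {player: r + shift for player, r in ratings.items()}


def apply_floor(rating, floor=1000.0):
    """Never let a rating drop below the floor."""
    return max(rating, floor)


pool = {"alice": 1400.0, "bob": 1500.0}   # hypothetical players
recentered = recenter(pool)               # average moves from 1450 to 1500
floored = apply_floor(950.0)              # lifted to the floor
```

Note that the two interact: a floor stops points leaving the pool at the bottom, so it nudges the average upward over time, which is one reason a re-centering step might be wanted in the first place.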
Meh, that’s all for now. Looks like I Obi’d all over the page.