This is nothing people don’t know (I’ve told people myself before), but there were good reasons for doing what I did.
“Wouldn't the actual Elo system be more accurate?”
No, for two reasons:
1. The small number of games people play is a serious problem. Madmarx’s skill level would still be higher than his Ghost-Rating would indicate if I didn’t put in the scaling factor.
Originally the system didn’t have this scaling factor, and its absence made a mockery of the top places in the system: MM and TMG were swapping places not based on who had played better that month but based on who had played *more* that month. One of the most important features of G-R is that you can still be highly rated even if you’ve only played 30 games.
Of course, you could have a steadily reducing volatility, but that would either introduce non-zero-sum issues (if done individually) or defeat the point of the system and leave players unable to advance much against experienced players (if done on a per-game basis).
2. The rationale behind Elo’s expected-result formula doesn’t work either. Elo argued that, in chess, if you play better than the other guy, you will win, and if you play worse than the other guy, you will lose. He modelled the standard of play of each player as a Normal distribution, and then looked at the probability that one player played better than the other.
In Diplomacy this argument doesn’t work. Playing better than everyone else only gives you a better *chance* of winning, because a weaker player might have the game more or less thrown to him by a real fool, in a manner that you can do nothing about.
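For comparison, Elo’s model described above can be sketched in a few lines: each player’s performance is drawn from a Normal distribution centred on their rating, and the expected result is the probability that one draw beats the other. The `sigma=200.0` value is an illustrative assumption, not a figure from the original post.

```python
from statistics import NormalDist

def elo_expected(r_a, r_b, sigma=200.0):
    """P(player A outperforms player B) under Elo's model:
    each performance ~ Normal(rating, sigma).
    sigma=200.0 is an arbitrary illustrative choice."""
    # The difference of two independent Normals is itself Normal,
    # with mean r_a - r_b and standard deviation sigma * sqrt(2).
    diff = NormalDist(mu=r_a - r_b, sigma=sigma * 2 ** 0.5)
    # A outperforms B exactly when the difference is positive.
    return 1.0 - diff.cdf(0.0)
```

Note that this is inherently a two-player expectation: it answers "who played better, A or B?", which is exactly the question that doesn’t decide a Diplomacy game.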
Therefore I designed a rating system whereby players’ ratings reflect their relative average scores, and ratios of ratings are used to calculate the expected result.
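A minimal sketch of what a ratio-based expectation could look like, assuming the simplest form (each player’s expected share of a game’s score is their rating’s share of the combined ratings); this is an illustration of the idea, not the exact Ghost-Rating formula:

```python
def expected_shares(ratings):
    """Hypothetical ratio-based expectation: each player's expected
    share of the game's total score is their rating divided by the
    sum of all players' ratings. Illustrative only, not the actual
    G-R formula."""
    total = sum(ratings)
    return [r / total for r in ratings]
```

Unlike the two-player Elo expectation, this handles a seven-player Diplomacy board directly, and a player who merely outscores their expected share gains rating, whether or not they won outright.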