Hello all,
It’s great that you’re looking into improving the rating system. I’ve wanted to do that myself for a long time, but haven’t had the opportunity; I’m certainly not offended that after 8 years the week of work I did as a teenager is being critically reviewed! I’ve a few things to discuss here.
1. Comments / Discussion of Ghost Rating
2. Rating Sum of Squares games
3. Rating Categories
4. Trueskill
5. Having an official rating system
=============Ghost Rating=============
The reason losses always cost the same amount while wins gain varying amounts is that I chose to scale the learning rate with the ratings of the players. This was pretty arbitrary, but the reasoning was that we needed ratings to move fast for strong players, so that they would converge to near their actual rating in a similar number of games as average players: in pre-1.0 versions of GR, the top ranks measured “who’s the good player who has played the most games”, and since the target audience was always the top players, that needed fixing.
A_Tin_Can, when you tested having the learning rate vary with players’ ratings against keeping it constant, did you re-tune the hyperparameter I set at 17.5 in the initial version?
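For illustration only, here is a minimal Elo-style update in which the learning rate scales with the player’s rating. The scaling form and the constants (`base_k`, `scale`) are hypothetical stand-ins, not GR’s actual formula:

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, result, base_k=32.0, scale=1500.0):
    """Update player A's rating after a game against B.

    result: 1.0 for a win, 0.0 for a loss.
    The learning rate grows with A's rating so that strong players'
    ratings move faster and converge in fewer games (hypothetical
    scaling; GR's real formula differs).
    """
    k = base_k * max(r_a / scale, 1.0)
    return r_a + k * (result - expected_score(r_a, r_b))
```

With a constant learning rate, every player needs roughly the same number of games to converge; scaling `k` with rating trades some stability at the top for faster convergence there.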
Of course, that priority is only appropriate for an off-site system, and a measure such as squared error is a good one for general users. I personally prefer the log-likelihood measure SIGMA(R_i * log(E_i)) (equivalently SIGMA(log(E_i^R_i))), where R_i is the result for player i, and E_i is their expected result.
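As a sketch, the two evaluation measures compare like this (function names are mine; `expected` and `results` are parallel lists of expected and actual results):

```python
import math

def squared_error(expected, results):
    """Sum of squared prediction errors; lower is better."""
    return sum((e - r) ** 2 for e, r in zip(expected, results))

def log_likelihood(expected, results):
    """SIGMA(R_i * log(E_i)), i.e. SIGMA(log(E_i^R_i));
    higher (closer to zero) is better."""
    return sum(r * math.log(e) for e, r in zip(expected, results))
```

The log-likelihood measure punishes confident wrong predictions much more harshly than squared error does, which is one reason to prefer it for comparing rating systems.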
Diplomacy is difficult to rate because you get relatively little data, and it’s quite noisy too. Furthermore, players play vastly different numbers of games, whereas Elo assumes that the number of games played by each player is similar. There is definitely room for improvement, so I won’t spend more time discussing this version.
=============SoS games=============
On webDiplomacy, players are assumed to try to maximise their points return. In situations where the return cannot be 100% of the pot, this presents problems. You need some sort of model which will give a sensible expected result. In PPSC, the way that was achieved was essentially to decompose the game into two: A fight for the win, and then a fight for second. I can’t work out a good model for SoS games.
From the point of view of a ratings designer, I hate the SoS scoring system. I think it makes sense as a tournament scorer, but it is really impossible to use directly for rating players, because just writing down a sensible model is difficult.
It also behaves oddly, with your score moving up or down in ways that seem counterintuitive. Is a solo really less valuable if there are 2 other players left than if there are 4? That is more a question about how Diplomacy works, but I would think not. More importantly, is that difference reflective of the player’s skill?
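To make that concrete, here is pure proportional SoS scoring (a sketch; I’m assuming the variant where even a solo is scored proportionally rather than taking the whole pot):

```python
def sos_shares(sc_counts):
    """Each player's share of the pot under pure Sum of Squares:
    share_i = SC_i^2 / SIGMA(SC_j^2)."""
    total = sum(n ** 2 for n in sc_counts)
    return [n ** 2 / total for n in sc_counts]

# A solo on 18 centres with two survivors holding 8 centres each...
two_left = sos_shares([18, 8, 8])[0]       # ~0.717 of the pot
# ...versus four survivors holding 4 centres each.
four_left = sos_shares([18, 4, 4, 4, 4])[0]  # ~0.835 of the pot
# The same solo is worth less when fewer, larger survivors remain.
assert two_left < four_left
```

The solo’s own 18 centres are identical in both cases; only how the remaining 16 are split among opponents changes the score.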
=============Rating Categories=============
Again, we want to be able to have category ratings, but we lack the data to do so effectively. When the first category ratings came out, I tested their prediction accuracy versus the accuracy of the general ratings on the subcategory, and the category restricted versions were all worse.
That is to say, if you wanted to bet on a Mediterranean map, public press game, the best method was to look up the standard GR, not any subcategory GR.
Obviously, that is a bad situation to be in. I also know a simple method I would like to use to solve it, but I don’t know whether it integrates with out-of-the-box Bayesian methods:
In brief, a player i has a rating vector r_i in R^k, where k is a hyperparameter. Each possible game setting s has a vector associated with it, w_s. The player’s rating in that particular game setting is then the dot product of r_i and w_s. Thus when versions of the game are similar, the weight vectors align, and the dot products are also similar. Utterly unrelated games would have perpendicular weight vectors.
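A minimal sketch of that idea (the helper names are mine, not an existing implementation):

```python
import math

def setting_rating(player_vec, setting_vec):
    """Player i's effective rating in game setting s:
    the dot product of r_i and w_s."""
    return sum(p * w for p, w in zip(player_vec, setting_vec))

def setting_similarity(w_a, w_b):
    """Cosine similarity between two settings' weight vectors:
    1.0 when the vectors align (similar game variants),
    0.0 when they are perpendicular (unrelated games)."""
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm = math.hypot(*w_a) * math.hypot(*w_b)
    return dot / norm
```

Under this model a player’s skill is shared automatically across related settings, so the sparse per-category data problem is pooled rather than split.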
=============Trueskill=============
Trueskill is a good system. However, when I looked into it, I decided that it wouldn’t be appropriate for webDiplomacy, particularly as an official integration. The reason is that it relies only on the ranking of the players: under a ranking-only system, Sum of Squares, for instance, collapses into just being about who has the most SCs. That is a serious misalignment between the points objective and the rating-system objective. In creating GR, I held that alignment sacrosanct, because you don’t want different players attempting to achieve different win conditions (there’s enough bitching about that in Diplomacy already!)
Secondly, I want the only things that affect your new rating to be the ratings of your opponents and your points return. This is essentially the same concern: I don’t want a 3-way draw in DSS to score any differently for a member of it based on who the other two players are. From memory, Trueskill doesn’t do this.
My plan (if I had ever had the time to do it) was to take the same models for expected return in PPSC vs WTA (because this was the scoring back then) and do a maximum a posteriori fit. Then I would use a Laplace approximation to fit a normal (or possibly a thicker-tailed, e.g. t-) distribution to each player’s rating, and report the conservative estimates (as used by Trueskill). The online version, where you find the maximum a posteriori for a single game, fit a normal around it, and then move on to the next game, was my intended GR 2.0.
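A toy version of that online update, using a one-dimensional logistic win model and a Gaussian prior as stand-ins for the actual expected-return models:

```python
import math

def laplace_online_update(mu, sigma, opp_rating, result, iters=50):
    """One online update: find the MAP rating for a single game, then
    fit a normal around it via the Laplace approximation (new variance
    is the negative inverse Hessian of the log-posterior at the MAP).

    Model (a stand-in, not GR's): P(win) = sigmoid(r - opp_rating),
    prior r ~ N(mu, sigma^2). result: 1.0 for a win, 0.0 for a loss.
    Returns (new_mu, new_sigma).
    """
    r = mu
    for _ in range(iters):
        e = 1.0 / (1.0 + math.exp(-(r - opp_rating)))   # expected result
        grad = -(r - mu) / sigma ** 2 + (result - e)    # d/dr log-posterior
        hess = -1.0 / sigma ** 2 - e * (1.0 - e)        # d2/dr2 log-posterior
        r -= grad / hess                                # Newton step toward MAP
    e = 1.0 / (1.0 + math.exp(-(r - opp_rating)))
    new_sigma = math.sqrt(-1.0 / (-1.0 / sigma ** 2 - e * (1.0 - e)))
    return r, new_sigma

def conservative(mu, sigma, k=3.0):
    """Conservative reported rating, Trueskill-style (mu - k*sigma)."""
    return mu - k * sigma
```

Each game shifts the mean toward the observed result and shrinks the uncertainty, and the shrunken sigma is exactly what makes the conservative estimate climb as a player accumulates games.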
=============Official Rating Systems=============
Kestas didn’t want to have a new rating system on the website, and he had valid concerns.
Once you have the rating system on the site, the distortive effects will be much more pronounced than in the status quo. Points actively encourage playing weaker players, but many rating systems struggle with overconfidence in the tails, meaning that with large rating disparities, high-rated players stand to lose rating in expectation by playing at all.
If points continue to exist, then the alignment problem that I was discussing above becomes even more important. I don’t think you can have an official rating system where the strategy for maximising points return for a game, and the strategy for maximising the rating you receive after a game, are different.
Finally, I would think carefully before switching from a monthly publishing cycle. I don’t know what it is like now, but I remember it being something which actually added to the popularity of the ratings, at least among forum-goers. Live updates certainly lose some of the drama, while the immediate feedback may only encourage gaming the system more. (That said, I obviously understand the appeal of updating live, too; I don’t know where I would stand if I were an active player!)