02 Sep 2023

Analyzing rating systems for foosball outcomes

Color me disappointed

I’m in the possession of data from around 26’000 matches of foosball played over a span of 9 years. This data was captured with a web application that also maintains an Elo score for each player, ranking them on a leaderboard. Naturally, one has to wonder if Elo is the optimal rating system for this particular game. But first, what is a good rating system in the first place?

A rating systems affords a player a score. This score should in some way reflect on the skill of the player: A player with a better score is expected to beat a player with a worse score more often than not. The ability to compare scores then implies that we can build a ranking. Since a ranking of players should accurately predict the outcome of a match, we can call a rating system better, if it can correctly predict a higher proportion of games than a different system.

With an extensive data set at hand, it should be easy to explore the performance of a rating system. For every historic match in the data set, we compare the prediction of the ranking up to that match against the real outcome, then enter this match into the rating system and continue. At the end we can calculate how many matches have been predicted correctly by the ranking.

Let’s take a simple rating system as an example: We compute for each player a simple win rate of every match they played. A player with no matches played starts out with a win rate of 0. When predicting the outcome of a match, we simple average the win rate of both players on a team and weigh them against each other: The team with the higher average win rate is expected to win. With our dataset this yields a correct prediction rate of 72.7%.

The twist

Unsurprisingly, there are more sophisticated (and supposedly more accurate) rating systems out there. I looked at TrueSkill and Weng-Lin in particular due to their ability to treat 2v2 games, rather than just 1v1 for systems like Elo.

Imagine my shock when these also yield a correct prediction in just 73.3% and 71.2% of all matches. Tuning their parameters did not achieve significant improvements.

My best guess is that there is just so much randomness in the outcomes of our foosball games that more sophisticated systems just can’t produce a better result than the primitive win rate. Sorry to leave you hanging on such an unsatisfying conclusion. Ok bye.