"Dirty Data" in the downloadable FIDE Rating List (XML) - duplicate IDs

The very latest International round up of English news.
Brian Towers
Posts: 1266
Joined: Tue Nov 18, 2014 7:23 pm

"Dirty Data" in the downloadable FIDE Rating List (XML) - duplicate IDs

Post by Brian Towers » Mon Feb 15, 2016 10:38 pm

A quick explanation of what "dirty data" is.
It would be perfectly acceptable to have several "John Smiths" in a ratings database but it would not be acceptable to have several players with the same ID (eg FIDE ID in the FIDE database, ECF Grading ID in the ECF database for instance). The standard approach in a relational database is to have something called a "constraint" to prevent this kind of thing happening. In this case it would be a "unique index" constraint on the ID field or column. This would generate an error if an attempt to add a record with a duplicate ID were made and prevent this from happening. Not having such a constraint would very likely lead over time to "dirty data" getting into the database.

Other examples of dirty data would be negative ratings (no sign of these), values other than "M" and "F" in the sex field (check! although anything other than "F" is treated as "M" when it comes to the website), values other than empty or "i" in the flag field (also check).

This comes up because I've downloaded the February 2016 FRL (https://ratings.fide.com/download.phtml) with a view to loading it into a relational database and doing some data mining. Before doing that I need to design my database structure and to do that effectively I need first to sift through the data to see what assumptions I can safely make.

I've downloaded the full list, new format, zipped XML file which includes all players, ratings, titles, Arena titles but not Arena grades.

Of course there are many fairly obvious (illegitimate) duplicate names but there are also 3 cases of duplicate IDs!

Two of them are duplicate entries with the same name in the file and involve players with Arena titles where one record has the Arena title and the other doesn't but on the FIDE website there is only one listing. The third, however, has two versions of the same name and results in two listings on the FIDE website!

Here are the offending records -

809810 -- Solaroli, Riccardo AIM, NI
9002316 -- Benaddi, Redouane AIM
150035553 -- Mr., Jonathan Rose
150035553 -- Rose, Jonathan
and also note
150269679 -- Rose, Jonathan
150269937 -- Rose, Jonathan

Jonathan Rose (who may very well be fictitious) is very well endowed with listings, although he has no titles, no rating history and, in a sense, no federation with FIDE listed as the federation.

Interesting to note also Jonathan Rose's unusual 9 digit FIDE ID. Glancing through the file I see many of the FIDE federated players have 9 digit IDs and quite a high percentage have Arena titles. Looks like this is part of the Arena implementation.

If I had to guess I would suggest that "Jonathan Rose" is a result of Arena related testing on the live database and the other two anomalies are a result of glitches in the introduction of Arena.

I really would recommend a unique index constraint on the Fide ID field!
Ah, but I was so much older then. I'm younger than that now.