GRADING ANOMALIES

General discussions about ratings.
User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Wed May 20, 2009 7:04 pm

Hello Sean, in a nutshell, let us assume that player 'A' played player 'B' and that 'A' scored less than the grade difference would suggest.

Then, according to GS, when calculating the grade of player 'A', it is assumed that player 'A' scored less than the grade difference would suggest exclusively because player 'A' played below the level his or her grade would suggest (while player 'B' played exactly at the level his or her grade would suggest); but when calculating the grade of player 'B', it is assumed that player 'B' scored more than the grade difference would suggest exclusively because player 'B' played above the level his or her grade would suggest (while player 'A' played exactly at the level his or her grade would suggest). The two assumptions contradict each other. :-)

On the other hand, according to AGS, when calculating the grades of players 'A' and 'B', it is assumed that player 'A' scored less than the grade difference would suggest because player 'A' played below the level his or her grade would suggest and, at the same time, player 'B' played above the level his or her grade would suggest. Player 'B' is assumed to have played better by exactly as much as player 'A' played worse, i.e. both players contributed equally to the difference between actual and expected performance.

Therefore, AGS should be better than GS.

In ÉGS2 a further improvement has been made: the two players need not contribute equally to the difference between actual and expected performance. How much each player contributes is calculated from an estimate of how much his or her grade can be trusted (which is in turn based on the number of games played in the last season); the less a grade is trusted, the larger its share of the difference between actual and expected performance. Consequently, less trusted grades change more rapidly than more trusted grades.
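To make this concrete, here is a minimal Python sketch (my own, not the official formulae) of one season's update for a single player under the three schemes: GS moving the grade all the way to the season's performance, AGS splitting the discrepancy equally, and a trust-weighted variant in the spirit of ÉGS2. The GS per-game rule (opponent's grade plus or minus 50, opponents capped to within 40 points) is the published ECF rule; the AGS and trust-weighted functions only follow the verbal description above, and the trust function in particular is a hypothetical stand-in.

def gs_performance(old_grade, results):
    """results: list of (opponent_grade, score) with score 1, 0.5 or 0."""
    total = 0.0
    for opp, score in results:
        opp = max(min(opp, old_grade + 40), old_grade - 40)   # 40-point rule
        total += opp + 100.0 * (score - 0.5)                  # +50 win, -50 loss
    return total / len(results)

def gs_new_grade(old_grade, results):
    return gs_performance(old_grade, results)                 # GS: performance alone

def ags_new_grade(old_grade, results):
    # player and opponents share the discrepancy equally, so the grade
    # moves only halfway towards the performance figure
    return 0.5 * old_grade + 0.5 * gs_performance(old_grade, results)

def trust_weighted_new_grade(old_grade, results, own_games, opp_games):
    # hypothetical trust function: the fewer games behind a grade, the less
    # it is trusted and the larger its share of the adjustment
    k = opp_games / float(own_games + opp_games)              # 0 <= k <= 1
    return old_grade + k * (gs_performance(old_grade, results) - old_grade)

season = [(150, 0.5)] * 30            # 30 draws against a 150-graded field
print(gs_new_grade(180, season))      # 150.0
print(ags_new_grade(180, season))     # 165.0
print(trust_weighted_new_grade(180, season, own_games=30, opp_games=30))  # 165.0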

Another improvement in ÉGS2 over AGS is in the definition of 'p = f(d)', which corresponds to the red line in figure 1 below (the curve used in the Élo system, regarded as a very accurate description of the relationship between differences in chess ability and expected performance).

Figure 1: Relationship between expected performance 'p' and grade difference 'd' as defined in GS (green line), CGS and AGS (blue line), and ÉGS and ÉGS2 (red line). Expected performance 'p' is a function of grade difference 'd', i.e. 'p = f(d)'.
Sean Hewitt wrote: I haven't seen that suggested before.

The first consequence of it is that two players with identical results would get different grades. eg

Player A (graded 180) plays 30 games against opponents whose average grade is 150 and scores 50%. Under this suggestion he would be graded (180+150)/2 = 165.

Player B (graded 170) plays the same 30 opponents and also scores 50%. His new grade would be (170+150)/2 = 160. So graded 5 points less for an identical performance. This does not seem logical to me.

Under the ECF system both players would get a grade of 150 in the above scenario.
If player 'A' (graded 180) and player 'B' (graded 170) both play a pool of 30 players with the same grade (say 150) and score the same (say 50%), it looks as if player 'A' (whose grade is the higher of the two) should be penalized more, i.e. lose more grading points than player 'B', because more was expected of player 'A' than of player 'B'.

Indeed, this is the case under both systems: under AGS (David Shepherd's proposal is AGS with s=50; AGS proper has s=40) player 'A' is penalized 15 points and player 'B' 10 points, while under GS player 'A' is penalized 30 points and player 'B' 20 points.

In my opinion, GS penalizes players 'A' and 'B' too much: as much as if their under-performance were exclusively due to them playing worse than their grades would suggest (i.e. assuming that the pool of players was playing exactly as its grades would suggest). In fact, we cannot tell for sure why players 'A' and 'B' under-performed, so AGS penalizes them only as much as if their under-performance were due both to them playing worse than their grades would suggest and to the players in the pool playing better than their grades would suggest (it is assumed that the pool played better by exactly as much as players 'A' and 'B' played worse). Also, when calculating the grades of the players from the pool, GS assumes that they over-performed exclusively because they played better (i.e. that players 'A' and 'B' were performing exactly as their grades would suggest), which contradicts the assumption made when calculating the grades of players 'A' and 'B'.
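The penalties quoted above can be checked in a couple of lines of Python (illustrative only; with a 150-graded field and a 50% score the performance figure is 150 for both players, and AGS as described halves the GS adjustment):

for old in (180, 170):
    gs_penalty  = old - 150              # GS moves the grade all the way to 150
    ags_penalty = (old - 150) / 2.0      # AGS splits the shortfall with the field
    print(old, gs_penalty, ags_penalty)  # 180 -> 30 and 15.0; 170 -> 20 and 10.0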
Robert Jurjevic
Vafra

User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Thu May 21, 2009 10:19 am

Roger de Coverly wrote:A premise of the ECF grading system is equal grade for equal work. So if two players get the same results in the same period against the same average field, they get the same grade provided they've both played more than the 30 game qualification and provided their previous grade is not so far away from their performance that the 40 point cut off comes into play. So over 30 games, a player graded 175 scores 50% against an average field of 150, a player graded 125 does the same. They will both have the same grade at the next cut off being 150. If the performance is only over 15 games, then they still have different grades of 137.5 v 162.5

If you wanted to "remember" the previous grade in some manner even for players playing more than 30 games , then you've got Elo based systems. John Nunn called it "rating lag" at http://www.chessbase.com/newsdetail.asp?newsid=5418
Hello Roger,

Surely, taking at least 30 games into the calculation makes grades, so to speak, statistically significant, or established, or trusted; and of course, the more games you take into the calculation, the more trusted the grade.

I proposed that ÉGS2 take into account only games played in the season (even if fewer than 30, or indeed only 1 or 0), and that should be (more or less) fine, as the system introduces a notion of grade trust (the grade's statistical significance) and changes less trusted grades more rapidly than more trusted ones. Nevertheless, one could use the same scheme for ÉGS2 as is used for GS, i.e. if a player played fewer than 30 games in a season, games from the previous season are taken into account (there is enough information in the current data to do the calculation; the only data ÉGS2 needs that GS does not is the number of games played in the season, and that is known). But even if the same scheme were used for ÉGS2 as for GS (taking at least 30 games where possible), in my opinion ÉGS2 would be a better system than GS, as besides its other advantages it would cope better with players who have not yet played 30 games (newcomers to the system, players who play say 3 games per season, etc.).
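As an illustration of the grade-trust idea (ÉGS2's actual trust function is not spelled out in this thread, so the weighting below is a plausible stand-in of my own), one could split an adjustment between two players in inverse proportion to the number of games behind their current grades:

def adjustment_shares(games_a, games_b, floor=1):
    """Split a rating discrepancy between two players: the less trusted grade
    (fewer games behind it) takes the larger share of the adjustment."""
    trust_a = max(games_a, floor)        # floor avoids division by zero for
    trust_b = max(games_b, floor)        # ungraded players with 0 games
    share_a = (1.0 / trust_a) / (1.0 / trust_a + 1.0 / trust_b)
    return share_a, 1.0 - share_a

print(adjustment_shares(30, 30))  # (0.5, 0.5): equally trusted, AGS-like behaviour
print(adjustment_shares(3, 30))   # ~(0.91, 0.09): the 3-game grade moves much more
print(adjustment_shares(0, 30))   # ~(0.97, 0.03): a newcomer's grade absorbs nearly all of it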

The 40 point cut off rule is, in my opinion, there to better approximate the logistic curve 'p = f(d)' (red line in figure 1 below): you can see how cutting off the green line at 'd = 40' makes it approximate the red line better. It is a fairly good approximation for grade differences up to roughly 65 (not many games are played with a grade difference greater than 65).

Figure 1: Relationship between expected performance 'p' and grade difference 'd' as defined in GS (green line) and ÉGS and ÉGS2 (red line).
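For readers without the figure, here is a short numerical comparison of the two curves. The logistic scale below assumes the conventional rough conversion of one ECF point to eight Elo points; the exact curve used in ÉGS/ÉGS2 is not stated in this thread, so treat the numbers as illustrative:

def p_linear_capped(d):
    d = max(min(d, 40.0), -40.0)         # the 40 point cut off rule
    return 0.5 + d / 100.0               # GS: 25 points of grade difference ~ 75%

def p_logistic(d, scale=50.0):           # 1/(1 + 10^(-d/50)), an Elo-style curve in ECF units
    return 1.0 / (1.0 + 10.0 ** (-d / scale))

for d in (0, 10, 25, 40, 65, 100):
    print(d, round(p_linear_capped(d), 3), round(p_logistic(d), 3))
# At d = 40 the capped line gives 0.900 vs ~0.863 for the logistic; at d = 65 it is
# 0.900 vs ~0.952; beyond that the capped line stays flat while the logistic keeps rising.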
Robert Jurjevic
Vafra

User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Thu May 21, 2009 11:14 am

Brian Valentine wrote:I think that the chessbase debate on ratings is even more relevant to this ecf thread. For instance Robert's assertion on the logistic curve is not universally accepted. see:http://www.chessbase.com/newsdetail.asp?newsid=562. I think that the adjustment curve is a very small part of the perceived issues with the current system and needs to be settled after the ecf has set some objectives on what the grading system is to achieve, sets some measures of success and then does some solid research on how to achieve these (something along the lines of Sonas' work).

Stretch will reappear and next year we will get substantial grade inflation which defeats one of the usual objectives - to measure personal progress over time.
Hello Brian, Sonas' graph (where he claims that 'p = f(d)' is better approximated by a straight line, as used in GS, than by a logistic curve, as used in ÉGS and ÉGS2) is shown for a relatively small range of grades (i.e. 'd' is relatively small), and it is a known fact that many nonlinear relationships can be approximated by a straight line over a relatively small range of values. It would be interesting to see the analysis as 'd' tends to infinity.

My understanding is that Mr Welch's finding (Mr David Welch is an ECF official) can be summarized as follows:

It is impossible to measure chess ability independently of chess performance (there is no device one can put on the head of a chess player and read off a measure of chess ability). However, assuming that for small differences in chess ability (say 'd <= 30') the relationship between performance and ability difference is linear, one can find from experimental data that the relationship closely matches a logistic curve: treat GS grades for 'd <= 30' as measures of chess ability rather than mere grades, plot the discrete experimental points '(d > 30, q)', and observe that they follow the logistic curve 'p = f(d)' closely. If one then defines 'p = f(d)', where 'd' is the grade difference, to be a logistic curve, the grade should be a good measure of chess ability.

Figure 1: Mr Welch's finding. The '(d > 30, q)' discrete experimental points match the logistic curve 'p = f(d)' closely (please note that the discrete points shown are for illustration purposes only; they are not the result of an actual analysis of the experimental data).
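A sketch of that calibration idea, on purely synthetic data (as the caption notes, no real experimental points are available here either): generate noisy score fractions from a logistic curve for d > 30 and recover the scale with a simple least-squares grid search. The scale of 50 is an arbitrary choice for the illustration.

import random

random.seed(1)
TRUE_SCALE = 50.0
diffs = list(range(35, 160, 10))                        # the d > 30 region
obs = [1.0 / (1.0 + 10 ** (-d / TRUE_SCALE)) + random.gauss(0, 0.02)
       for d in diffs]                                   # synthetic "experimental" q values

def sse(scale):
    return sum((q - 1.0 / (1.0 + 10 ** (-d / scale))) ** 2
               for d, q in zip(diffs, obs))

best_scale = min(range(30, 81), key=sse)                 # coarse grid over the scale
print(best_scale)                                        # close to 50 on this toy data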

I believe that ÉGS2 has a few advantages over GS (not only 'p = f(d)' being logistic curve), but that was discussed in other posts.
Robert Jurjevic
Vafra

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 11:32 am

Surely, taking at least 30 games in calculation makes grades, so to speak, statistically significant, or established, or trusted, but of course, the more games you take into calculation the more trusted the grade
The difference between the ECF system (for greater than 30 games) and the Elo system is that the ECF system ruthlessly measures recent performance regardless of past history. If you want a system which gives you a current ranking order for team competitions, qualification for restricted events and a seeding order for swiss pairings, then this provides it. If it's important that, of two players with 150 grades established over 30 games, one used to be 175 and the other used to be 125, then you've got that from the grading histories. I don't see a reason why you should give the fading superstar 163 and the improver 137.

User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Thu May 21, 2009 11:49 am

Roger de Coverly wrote:
Surely, taking at least 30 games in calculation makes grades, so to speak, statistically significant, or established, or trusted, but of course, the more games you take into calculation the more trusted the grade
The difference between the ECF system (for greater than 30 games) and the Elo system is that the ECF system ruthlessly measures recent performance regardless of past history. If you want a system which gives you a current ranking order for team competitions, qualification for restricted events and a seeding order for swiss pairings, then this gives this. If it's important that of two players with 150 grades established over 30 games, one used to be 175 and the other used to be 125 then you've got it from the grading histories. I don't see a reason why you should give the fading superstar 163 and the improver 137.
I am not quite sure I understand your meaning. May I ask whether you are trying to establish an optimum number of games to take into account when calculating one's grade? That is, taking too many games, reaching too far into the past, may be as problematic as taking only the few games played in the current season. Thanks.
Robert Jurjevic
Vafra

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 12:17 pm

I am not quite sure if I understand your meaning.
As I understood it, you are advocating a grading system where the published grade at the end of period reflects not just the performance over the period but the published grade at the start of the period. This is a characteristic of Elo type systems but not (with some limitations) the ECF type. Could you clarify - under the grading system you are proposing you have players with start grades of 125 and 175. They both play the same field over 30 games (but not each other) and score a performance of 150. I understand ECF - the new grade is 150 for both. I understand Elo - the new grade is old grade + function of k factor so for low k they don't get grade equality. It was my understanding that you propose a system in which that performance is an average of last published grade for both the player and the opponent. Is this correct?

Brian Valentine
Posts: 578
Joined: Fri Apr 03, 2009 1:30 pm

Re: GRADING ANOMALIES

Post by Brian Valentine » Thu May 21, 2009 12:38 pm

Robert,
I doubt that we are going to reach an agreement on this. I think that the logistic curve fit has to be substantially superior to the linear approach to be worth adopting, as simple systems are better understood and accepted.

It is possible that the logistic curve is better at the extremes. However, most games are played with a relatively narrow grading difference and this sector of the curve should be "optimised". The Sonas work suggests that linear might in fact be better here. Furthermore, I tried to demonstrate in my earlier statistical post that it is a feature of any grading system (that I have seen) that higher graded players must underperform (i.e. there will always be some stretch). It is this design weakness that may be what Mr Welch was observing.

I am not advocating that the linear system is perfect, but I don't think it is the major defect in the current system.

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 1:01 pm

Stretch will reappear and next year we will get substantial grade inflation which defeats one of the usual objectives - to measure personal progress over time.
The allegation of stretch is that top players (we may as well use Keith as an example) have moved "too" far away from the "average" player. Thus on the existing system Keith is 232 and the "average" (in the sense of as many above as below) is at 111.

Thus there's a spread of 121 points. On v5 of the "revised" grades, Keith is 229 and the "average" player 134. I'd suspect 134 is really high by historic precedent, though. The spread is reduced to 95 points. It can be difficult to interpret such extremes, but let's try this.

We take Keith or his fellow GMs at 232. They should score 75% against a player rated 207 who in turn should score 75% against a player rated 182. The 182 player should score 75% against players rated 157 who in turn should score 75% against players rated 132. Our players rated 132 should score a bit under 75% against "mr average" at 111.

On the revised grades, Keith is 229. So that's 75% against 204 who scores 75% against 179. The 179 should score 75% against 154 who should score 75% against 129. So there's a layer missed out.
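The layer count is just arithmetic, e.g.:

# 25 grade points correspond to roughly a 75% expected score under the current rule
print((232 - 111) / 25.0)   # ~4.8 "layers" between Keith and "mr average" on the existing grades
print((229 - 134) / 25.0)   # 3.8 layers on the v5 revised grades - about one layer fewer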

It might take a little while for the inflation to work its way through the system up to the top level. It remains a plausible hypothesis that Keith and friends are in fact 120 points better than the "average" player and that over time the grading system will correct the "incorrect" grades. My theory is that Keith and friends have got better over the last 20 years or so and the "average" player hasn't (or not to the same extent). So the gap between them was "right" even if the ECF grading team believe otherwise.

We should also remember that there are no absolute values in the "revised" grades. They could equivalently have kept the "average" player at around 111 and taken 25/26 points off the GMs. GMs of 205 would be ridiculed but that's what they've done in terms of relative values.

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 1:11 pm

it is a feature of any grading system (that I have seen) that higher graded players must underperform (ie there will always be some stretch)
Could you try and refute a piece of general reasoning which implies the opposite?

Consider a model where every player plays as many players rated above them as below them. So 175 players play 150 fields and 200 fields. They are supposed to score 75% and 25% to maintain their grades. They still come out at 175 even if they score 70% and 30%. The only players who change are those at the very top (say 250 ) who can only meet a field (225) against whom they score 70% and the very bottom (say 0) who meet a field (25) where they score 30%. Thus the bottom players go up and the top players come down thereby compressing the player range.
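That reasoning can be checked with the GS per-game rule (opponent's grade plus or minus 50; the 40-point cap never bites here because every gap is 25 points). A small Python sketch:

def performance(games):                       # games: list of (opponent_grade, score)
    return sum(opp + 100.0 * (s - 0.5) for opp, s in games) / len(games)

# A 175 player: half the games against a 150 field at 70%, half against a 200
# field at 30% - the deviations cancel and the grade stays at 175.
print(performance([(150, 0.7)] * 15 + [(200, 0.3)] * 15))   # 175.0

# The very top (250) only meets 225s and scores 70%; the very bottom (0) only
# meets 25s and scores 30% - only the extremes move, compressing the range.
print(performance([(225, 0.7)] * 30))                        # 245.0
print(performance([(25, 0.3)] * 30))                         # 5.0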

User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Thu May 21, 2009 1:32 pm

Roger de Coverly wrote:As I understood it, you are advocating a grading system where the published grade at the end of period reflects not just the performance over the period but the published grade at the start of the period. This is a characteristic of Elo type systems but not (with some limitations) the ECF type.
No, the difference between ÉGS2 (the grading system I am advocating) and GS (the current grading system) is only in the formulae used; you still grade at the end of each season (once a year) and may use games from the previous season (if fewer than 30 games were played). It is true that both ÉGS2 and Élo (the FIDE rating system) use a logistic curve for 'p = f(d)', and that FIDE decided to rate after every event (tournament, etc.), but as the choice of 'p = f(d)' is independent of how often you grade, you can use a logistic curve for 'p = f(d)' and still grade once a year.
Roger de Coverly wrote:Could you clarify - under the grading system you are proposing you have players with start grades of 125 and 175. They both play the same field over 30 games (but not each other) and score a performance of 150. I understand ECF - the new grade is 150 for both. I understand Elo - the new grade is old grade + function of k factor so for low k they don't get grade equality. It was my understanding that you propose a system in which that performance is an average of last published grade for both the player and the opponent. Is this correct?
ÉGS2 will give 138 to the 125 player (who scored 65%) and 162 to the 175 player (who scored 35%), and this is due to the different formulae used; you still grade once a year. It is true that ÉGS2's 'k = 1/2' factor is smaller than GS's 'k = 1', but this is not a consequence of grading more often than once a year; it is because of the following...

GS rewards the 125 player (who scored 65%) with 25 points and penalizes the 175 player (who scored 35%) by 25 points; ÉGS2 rewards and penalizes them by 13 (in fact 12.5) points. This is because ÉGS2 assumes that the 125 player over-performed (and the 175 player under-performed) both because the 125 (175) player played above (below) his grade and because the pool of players played below (above) its grade; it is assumed that the pool played worse (better) by exactly as much as the 125 (175) player played better (worse), i.e. that the player and the pool contributed equally to the difference between actual and expected performance. In contrast, GS assumes that the 125 player over-performed (the 175 player under-performed) exclusively because the 125 (175) player played above (below) his grade, and that the pool of players was playing exactly according to its grades.

In ÉGS2 a further improvement has been made: the two players need not contribute equally to the difference between actual and expected performance. How much each player contributes is calculated from an estimate of how much his or her grade can be trusted (which is in turn based on the number of games played in the last season); the less a grade is trusted, the larger its share of the difference between actual and expected performance. Consequently, less trusted grades change more rapidly than more trusted grades. (In the above calculation I assumed that the grades of all players were equally trusted.)

In ÉGS2, for example, if an ungraded player plays an established player, 'k = 0' for the established player and 'k = 1' for the ungraded player, i.e. the game result does not affect the grade of the established player, as the ungraded player's grade is essentially guessed and cannot be trusted. In ÉGS2 'k' is in the range '0 <= k <= 1', and this has nothing to do with how often you grade (you still grade once a year), but rather with how much the players' grades are trusted (relative to each other).
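Putting those numbers in one place (illustrative only: the k values are supplied by hand here, whereas ÉGS2 would derive them from the grade-trust estimate described above, and the ungraded-player grades are invented for the example):

def new_grade(old, performance, k):
    return old + k * (performance - old)

# Both players perform at 150 against the same field.
print(new_grade(125, 150, 1.0), new_grade(175, 150, 1.0))   # GS-like, k = 1:       150.0 150.0
print(new_grade(125, 150, 0.5), new_grade(175, 150, 0.5))   # equal trust, k = 1/2: 137.5 162.5

# Ungraded newcomer against an established player: the newcomer's grade takes
# the whole adjustment (k = 1), the established player's grade none of it (k = 0).
print(new_grade(100, 160, 1.0), new_grade(180, 120, 0.0))   # 160.0 180.0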
Last edited by Robert Jurjevic on Thu May 21, 2009 1:54 pm, edited 2 times in total.
Robert Jurjevic
Vafra

Brian Valentine
Posts: 578
Joined: Fri Apr 03, 2009 1:30 pm

Re: GRADING ANOMALIES

Post by Brian Valentine » Thu May 21, 2009 1:46 pm

Roger,
I am almost in agreement. My maths (as posted) suggests that stretch will exist and that therefore the "incorrect" grades will not be "corrected". They might have "improved" stretch compared to the old grades, but stretch will appear. If it had been recognised that stretch is a feature of the system, then the formula for the "correction" would have been different.

Inflation will definitely take place from this change.

Keep up the good work on unearthing the historical facts; getting something robust to assess inflation/deflation is far more important than this minor anomaly.

Brian Valentine
Posts: 578
Joined: Fri Apr 03, 2009 1:30 pm

Re: GRADING ANOMALIES

Post by Brian Valentine » Thu May 21, 2009 2:01 pm

Roger,
I missed your second post. The issue is quite deep and I will need to work on a short response. The essential feature that causes the effect is that all grades are based on a random performance and are right only on average. But if a player with a published grade of 200 plays one with a published grade of 175, the "true underlying" strengths might be anywhere in the ranges 196-204 and 171-179, and this changes the situation.

What happens is that higher published-rated players underperform and lower published-rated players overperform. To correct for this it appears that one needs to narrow the grading gap, but, other things being equal, this is a fallacy.
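A toy simulation of that effect (all numbers invented): if published grades are true strengths plus noise, then in any pairing the player with the higher published grade tends to hold an over-estimate, and so scores below the published expectation even though nobody's true strength has changed.

import random

random.seed(0)
N, NOISE = 20000, 8.0
gaps = []
for _ in range(N):
    true_a, true_b = random.gauss(150, 25), random.gauss(150, 25)
    pub_a = true_a + random.gauss(0, NOISE)        # published grade = true strength + error
    pub_b = true_b + random.gauss(0, NOISE)
    # expected score from published grades vs from true strengths (GS linear rule)
    exp_pub  = 0.5 + max(min(pub_a - pub_b, 40), -40) / 100.0
    exp_true = 0.5 + max(min(true_a - true_b, 40), -40) / 100.0
    if pub_a > pub_b:
        gaps.append(exp_true - exp_pub)            # negative: higher-published player under-performs

print(sum(gaps) / len(gaps))                       # consistently below zero in this toy model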

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 2:05 pm

GS rewards the 125 player (who scored 65%) with 25 points and penalizes the 175 player (who scored 35%) by 25 points; ÉGS2 rewards and penalizes them by 13 (in fact 12.5) points. This is because ÉGS2 assumes that the 125 player over-performed (and the 175 player under-performed) both because the 125 (175) player played above (below) his grade and because the pool of players played below (above) its grade; it is assumed that the pool played worse (better) by exactly as much as the 125 (175) player played better (worse), i.e. that the player and the pool contributed equally to the difference between actual and expected performance. In contrast, GS assumes that the 125 player over-performed (the 175 player under-performed) exclusively because the 125 (175) player played above (below) his grade, and that the pool of players was playing exactly according to its grades

In my example both players score 50% against the field they played. A hypothesis underlying the ECF system is that, for a sufficiently high number of games (currently 30), they are in fact the same strength, as evidenced by the fact that they get the same results. That's a straightforward principle and not one that should be discarded lightly. So no, I do not want to see these players with equal results given ratings of 138 and 162. In fact I would suggest that the primary purpose of a rating system is to rank players in order of strength (using results as a proxy), with the predictability of results a secondary aim.


In an Elo type system it takes a few more games (depending on the K factor) to discard the hypothesis that the ex-175 player is better than an ex-125 player. There are advantages to Elo type systems not least the relative ease with which one can publish frequently. A downside is the effect of rating lag where for a player whose strength has changed, the published rating takes a while to catch up.

User avatar
Robert Jurjevic
Posts: 207
Joined: Wed May 16, 2007 1:31 pm
Location: Surrey

Re: GRADING ANOMALIES

Post by Robert Jurjevic » Thu May 21, 2009 2:23 pm

Brian Valentine wrote:Robert,
I doubt that we are going to reach an agreement on this. I think that the logistic curve fit has to be substantially superior to the linear approach as simple systems are better understood and accepted.

It is possible that the logistic curve is better at the extremes. However most games are played with a relatively narrow grading difference and this sector of the curve should be "optimised". The Sonas work suggests that in fact linear might be better here. Furthermore I tried to demonstrate in my earlier statistical post that it is a feature of any grading system (that I have seen) that higher graded players must underperform (ie there will always be some stretch). It is this design weakness that may have been what Mr Welch was observing.

I am not advocating that the linear system is perfect, but I don't think it is the major defect in the current system
Yes, it would appear that finding the true reason for anomalies in the grading system may be quite a difficult task.

Nevertheless (please note that ÉGS2 seems to have other advantages over GS besides using the logistic curve), in Popperian style (after Karl Popper and his definition of a scientific theory) one might try the following. Karl Popper claimed that, basically, "a scientific theory is one which can eventually be refuted", and that when a new theory appears the first thing people should do is try to refute it; of course, I do not claim that my 'theory' is so great as to deserve this...

1) Recalculate grades using the ÉGS2 formulae starting from, say, season 2003/04 and see whether ÉGS2 grades end up as stretched as GS grades. The problem with this is that the 2003/04 grades (the initial grades) might already have been stretched.

2) Starting from season 2009/10 (when the corrected grades take effect), keep calculating grades using both the GS formulae (official grades) and the ÉGS2 formulae (unofficial grades kept for the sake of the experiment) for, say, five years, then see which set of grades needs more correction.
Robert Jurjevic
Vafra

Roger de Coverly
Posts: 21336
Joined: Tue Apr 15, 2008 2:51 pm

Re: GRADING ANOMALIES

Post by Roger de Coverly » Thu May 21, 2009 2:30 pm

But if a published 200 grade player plays a published 175 then the "true underlying" strengths might be in the ranges 196-204 against 171-179 and this changes the situation.
So you've got both a distribution of playing strength for each player and the distribution of results. So one day it's 204 v 171 and the next day 196 v 179.

Is there a belief that there's some non-linear effect that causes the stronger players to be over rewarded for their wins? This would have the effect of widening the gap between them and the "average" player. Certainly the Elo tables are marginally non-linear. I do wonder how you disentangle this effect from the observation that perhaps players are getting stronger but that top players are getting stronger more rapidly. So GMs are widening the gap between themselves and the "average" player whilst the "average" player is gaining a bit on the newcomer to competitive chess. This would also stretch the grading range.