4NCL Online

Venues, fixtures, teams and related matters.
John McKenna

Re: 4NCL Online

Post by John McKenna » Tue Jun 16, 2020 11:34 pm

My take on it (based on A. Elo's original system) -

In the Elo system the interval of a category is defined as one Standard Deviation (1 sigma) [corresponding to a band 200 Elo points wide, with 2000 "as a reference point" - both "already steeped in tradition" when Elo chose to use them. JM]

Rating categories (based on a 200 point interval and with 2000 as midpoint) -

3000+ Artificial Intelligences
2800+ Human World Champion & contenders
2600+ Strong GMs
2400+ IMs & weaker GMs
2200+ FMs & National Masters
2000+ Candidate Masters & other experts
1800+ Category 1 amateurs
1600+ Category 2 amateurs
1400+ Category 3 amateurs
1200+ Category 4 amateurs
1000+ Novices

Each step up/down from the 2000 midpoint is 200 points and thus 1 sigma (one Standard Deviation).

For a player rated exactly 2000, a -

2200+ performance is 1+ sigma
2400+ performance is 2+ sigma
2600+ performance is 3+ sigma
2800+ performance is 4+ sigma
3000+ performance is 5+ sigma
3200+ performance is 6+ sigma

And so on...

The theory underpinning it is given by A. Elo -

"A player will perform around some average level... Deviations occur... (with) large deviations less frequently than small ones. These facts suggest the basic assumption of the Elo system -

The many performances of an individual will be normally distributed..." [as shown by the well-known symmetrical bell curve, as given further above in this thread. JM]

"Extensive investigation (Elo 1965, McClintock 1977) bore out the validity of this assumption. Alternative assumptions are discussed..." [elsewhere - JM]

"Statistical and probability theory provides a widely used measure of these performance spreads [deviations from the average - JM] a measure which has worked quite well for many other natural phenomena... Standard Deviation." [1 SD is denoted by 1 sigma - JM]

"The central bulk [of the bell curve] - about two-thirds [68% - JM] - of an individual's performances lie within 2 Standard Deviations" [i.e. from -1 sigma to +1 sigma. That leaves 32% of the player's performances outside that central bulk: 16% higher than +1 sigma and the remaining 16% lower than -1 sigma. JM]
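The percentages quoted above follow straight from the normal distribution's CDF; a quick check in Python (standard library only, nothing chess-specific):

```python
from statistics import NormalDist

# Standard normal: performances measured in sigmas from a player's mean.
z = NormalDist()  # mean 0, sigma 1

within_1_sigma = z.cdf(1) - z.cdf(-1)   # the "central bulk"
above_1_sigma = 1 - z.cdf(1)            # upper tail
below_1_sigma = z.cdf(-1)               # lower tail

print(f"within +/-1 sigma: {within_1_sigma:.1%}")  # ~68.3%
print(f"above +1 sigma:    {above_1_sigma:.1%}")   # ~15.9%
print(f"below -1 sigma:    {below_1_sigma:.1%}")   # ~15.9%
```

The same call with larger arguments reproduces the rarer bands further up the list (e.g. `1 - z.cdf(2)` is about 2.3%).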

Q.E.D. (At least I'd like to think so.)
IM Jack Rudd wrote:
Tue Jun 16, 2020 10:40 pm
Be careful with this analysis: five-round tournaments and nine-round tournaments will probably show significantly different numbers when it comes to variation around a TPR.
Thanks for the warning, Jack.

Is this the kind of thing you are warning about?
Matthew Turner wrote:
Mon Jun 15, 2020 4:22 pm
DavidWalker wrote:
Mon Jun 15, 2020 3:38 pm
Does a z-score of 4 really equate to an 800-point rating boost? I know that the Elo standard deviation is supposed to be 200, but this z-score is in a different domain (basically measuring actual moves matched against expected matches, I believe). There is a draft paper by Dr Regan in which he gives rating estimates for players based on moves played in their games. Randomly checking a few examples from this paper shows a SD of the rating estimates lower than 200.
So, the 200 points per one of z is a rule of thumb, but generally it relates pretty accurately to the Regan tests that we are talking about. The difference from the numbers in the paper you quote is down to sample size.
So for example we have:

Steinitz, World Championship match 1886 - Est. performance 2352, 2-sigma range 2150-2553, no. of moves 593 (that is from 20 games)

If we take an example from fewer games we have:

Larsen, Candidates 1971 - Est. performance 2187, 2-sigma range 1702-2602, no. of moves 181 (that is from 6 games)
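The widening of the 2-sigma range with fewer moves is what you would expect if the standard error of the estimate scales roughly as 1/sqrt(number of moves). A sketch of that comparison, under the simplifying assumption (mine, not the paper's) that per-move variance is constant:

```python
from math import sqrt

# Half-widths of the 2-sigma ranges quoted above.
steinitz_halfwidth = (2553 - 2150) / 2   # ~201.5 Elo over 593 moves
larsen_halfwidth = (2602 - 1702) / 2     # 450 Elo over 181 moves

# If the standard error scales as 1/sqrt(n moves), the ratio of
# half-widths should be roughly sqrt(593 / 181).
predicted_ratio = sqrt(593 / 181)                      # ~1.81
observed_ratio = larsen_halfwidth / steinitz_halfwidth  # ~2.23

print(f"predicted: {predicted_ratio:.2f}, observed: {observed_ratio:.2f}")
```

The observed ratio is somewhat larger than the square-root rule predicts, which is exactly the sort of short-event extra variability Jack's warning points at.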
Last edited by John McKenna on Wed Jun 17, 2020 9:27 am, edited 3 times in total.

IM Jack Rudd
Posts: 4826
Joined: Tue Apr 17, 2007 1:13 am
Location: Bideford

Re: 4NCL Online

Post by IM Jack Rudd » Tue Jun 16, 2020 11:57 pm

It is, yes. Good example.

DavidWalker
Posts: 29
Joined: Tue May 05, 2020 4:01 pm

Re: 4NCL Online

Post by DavidWalker » Wed Jun 17, 2020 8:48 am

MartinCarpenter wrote:Yes, so long as the tool knows that you're in known theory..... Which must be a genuinely very difficult task these days.
True. I remember playing the following line against Francis Rayner at Scarborough in 2015.


The position after 11. f4 is quite striking, which is why (during the game) I remembered that it was "theory". It is analysed in volume 1, chapter 19 of Mihail Marin's English Opening trilogy. However, even 8. d5 is not mentioned in Megabase 2019, and Marin's main line continues to move 16, so presumably none of this would count as theory for anti-cheat software.

Michael Farthing
Posts: 2069
Joined: Fri Apr 04, 2014 1:28 pm
Location: Morecambe, Europe

Re: 4NCL Online

Post by Michael Farthing » Wed Jun 17, 2020 8:49 am

Joseph Conlon wrote:
Tue Jun 16, 2020 6:53 pm
The graph above is the real deal; saying something is 2 sigma, or 3 sigma, or 4 sigma, is a measure of how often something is likely to happen.
Let me put another (identical, but differently phrased) scale to this:

Suppose chess player Knightie McKnightFace plays six tournaments a year and has a stable rating. Then

1 sigma outperformance = happens 1 time in 6, the best tournament performance they play that year

2 sigma outperformance = happens 1 time in 50, a once-in-a-decade performance, one of the few best tournaments of their life

etc
This is not accurate. The probabilities apply to the entire population not to an individual. As it happens Knightie McKnightFace is a battling alcoholic and throughout his life has struggled manfully with the demon drink. He has periods of success when he doesn't touch a drop and performs magnificently and other times when some personal trauma knocks him back, he drinks profusely and under-performs. Thus, his results are usually 1 sigma and frequently 2 sigma away from the mean based on the population distribution curve. If his own personal curve were drawn it would be nowhere near a normal curve but would be rather like the one Mike Gunn* described for himself - two peaks at a distance either side of the mean.
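Michael's point - that an individual's own curve can be two-humped even when the population curve is normal - can be illustrated with a toy mixture model (all the numbers here are invented purely for illustration):

```python
import random

random.seed(1)

# Hypothetical player: half of their results centred 150 points above
# their nominal mean, half 150 below (two humps), each hump narrow.
def sample_performance(mean=2000, offset=150, hump_sd=60):
    centre = mean + offset if random.random() < 0.5 else mean - offset
    return random.gauss(centre, hump_sd)

results = [sample_performance() for _ in range(100_000)]
avg = sum(results) / len(results)
sd = (sum((r - avg) ** 2 for r in results) / len(results)) ** 0.5

# The long-run average still sits near 2000, but the overall spread is
# far wider than either hump's own 60-point sigma (analytically,
# sqrt(150**2 + 60**2) ~ 162).
print(f"mean ~ {avg:.0f}, sd ~ {sd:.0f}")
```

So a rating system sees a stable mean, while individual events are "usually 1 sigma away" by the population yardstick - consistent with the scenario described above.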

These innumerable possibilities for personal differences are one reason why a right of appeal is so essential - quite apart from the fundamental principle that, even if the evidence is perfect, it should always be subject to public scrutiny so that justice is seen to be done.

*In an earlier post in this very long thread (I think it's this thread though I haven't attempted to find it). My post here does not offer evidence and does not imply that Mike resembles Knightie. I know him (Mike not Knightie) personally and can vouch for him. I'm sure Mike would not dream that I was maligning him, but sadly others might!

Adam Raoof
Posts: 2720
Joined: Sat Oct 04, 2008 4:16 pm
Location: NW4 4UY

Re: 4NCL Online

Post by Adam Raoof » Wed Jun 17, 2020 9:22 am

Michael Farthing wrote:
Wed Jun 17, 2020 8:49 am
Joseph Conlon wrote:
Tue Jun 16, 2020 6:53 pm
The graph above is the real deal; saying something is 2 sigma, or 3 sigma, or 4 sigma, is a measure of how often something is likely to happen.
Let me put another (identical, but differently phrased) scale to this:

Suppose chess player Knightie McKnightFace plays six tournaments a year and has a stable rating. Then

1 sigma outperformance = happens 1 time in 6, the best tournament performance they play that year

2 sigma outperformance = happens 1 time in 50, a once-in-a-decade performance, one of the few best tournaments of their life

etc

These innumerable possibilities for personal differences are one reason why a right of appeal is so essential - quite apart from the fundamental principle that, even if the evidence is perfect, it should always be subject to public scrutiny so that justice is seen to be done.
So you are in favour of publishing the results of the analysis - after every event?

It is actually possible to detect engine use during a game; this is being done regularly. On the larger playing sites there is no human monitoring of games, but on a site like Tornelo it should be possible for games to be flagged as suspicious as an event progresses. Then an arbiter can check the games.
Adam Raoof IA, IO
Chess England Events - https://chessengland.com/
The Chess Circuit - https://chesscircuit.substack.com/
Don’t stop playing chess!

Roger de Coverly
Posts: 21315
Joined: Tue Apr 15, 2008 2:51 pm

Re: 4NCL Online

Post by Roger de Coverly » Wed Jun 17, 2020 9:33 am

Adam Raoof wrote:
Wed Jun 17, 2020 9:22 am
It is actually possible to detect engine use during a game, this is being done regularly.
Isn't what is actually being detected that the player's choice of moves is similar or identical to that of an engine? That does not of itself prove that an engine was consulted during the game, as alternative explanations are pre-game investigation, skill or chance.

John McKenna

Re: 4NCL Online

Post by John McKenna » Wed Jun 17, 2020 9:54 am

Michael Farthing>*In an earlier post in this very long thread (I think it's this thread though I haven't attempted to find it).<

It's in the "Some thoughts on anti-cheating systems" thread (only 14 pages long), somewhat earlier than my post here -

viewtopic.php?f=2&t=10832&p=245746&hili ... is#p245746

Matthew Turner
Posts: 3604
Joined: Fri May 16, 2008 11:54 am

Re: 4NCL Online

Post by Matthew Turner » Wed Jun 17, 2020 10:37 am

DavidWalker wrote:
Wed Jun 17, 2020 8:48 am
MartinCarpenter wrote:Yes, so long as the tool knows that you're in known theory..... Which must be a genuinely very difficult task these days.
True. I remember playing the following line against Francis Rayner at Scarborough in 2015.


The position after 11. f4 is quite striking, which is why (during the game) I remembered that it was "theory". It is analysed in volume 1, chapter 19 of Mihail Marin's English Opening trilogy. However, even 8. d5 is not mentioned in Megabase 2019, and Marin's main line continues to move 16, so presumably none of this would count as theory for anti-cheat software.
David,
This is a really good example. Let's use a simplified version of the Regan test (like his initial screening), where you either match a computer move or not. Let's assume that your opponent plays down this line until move 16. We'll make some assumptions and say you are an average player, so your expected move-matching rate would be 50%. Let's also say this is an average game with 30 usable moves. I hope that is all very conservative, because you are in fact a strong player and I don't know if Francis actually played down the line to move 16.

Over your 30 moves you would expect to match 15 (50%), but with your 9 perfect matches at the start your expectation leaps to 19.5 (9 + 21 x 0.5), which is 65%. Anything in the 40-60% range is 'normal', so your 65% would be of interest, but people caught in ECF events are generally around the 70% mark.
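Matt's arithmetic can be sketched as a small function. This is the simplified screen described above, not Professor Regan's actual model; the 50% baseline, 30-move game and 9 book moves are the assumptions stated in the post, and the function name is mine:

```python
def expected_match_rate(total_moves, book_moves, baseline=0.5):
    """Expected engine-match rate when `book_moves` opening moves match
    perfectly (engine-checked theory) and the rest match at `baseline`."""
    expected_matches = book_moves + (total_moves - book_moves) * baseline
    return expected_matches / total_moves

# 30 usable moves, 9 of them perfect "theory" matches, 50% baseline:
print(f"{expected_match_rate(30, 9):.0%}")  # 65%, against a 50% baseline
```

With no book moves the function returns the 50% baseline, confirming that the whole lift to 65% comes from the opening theory.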

If you had 3 or 4 games like this in a row (not the same variation, but similar lengths of unknown theory) then you would probably be caught by a Regan test. However, I think the chances of that are billions to one.

Matt

Mike Gunn
Posts: 1025
Joined: Wed Apr 11, 2007 4:45 pm

Re: 4NCL Online

Post by Mike Gunn » Wed Jun 17, 2020 10:41 am

Here is my post about the two-humped distribution of my exam marks: viewtopic.php?f=2&t=10832&p=245701#p245701. Coincidentally (during a rather boring section of a Skype meeting yesterday), when catching up with my university emails I found one stating that attempts to "normalise" this semester's marks had been abandoned.

As Michael said, the standard deviation of the performance of some players is much higher than the standard "sigma" value of 200 Elo points (or 25 ECF grading points). A friend of mine at the chess club is always pointing out that he often loses to much weaker players and defeats much stronger ones. Indeed on Monday evening I drew with a player 2 sigmas above me in the club championship, but it is only the second time I have achieved this result in 26 years.

Grading theory works well most of the time but Elo acknowledges in his book that the tails of the normal distribution may not be too good for chess performance (and says the logistic distribution would be better for extreme performances). However he then goes on to say that the normal distribution is good enough for most purposes. When it comes down to it the grading system is based on averages and a probability function related to the class interval. Both Elo and Clarke recognised that their systems produced (very close to) equivalent results when fed the same data.

As I posted in the other thread a career's worth of having to grapple with the normal distribution in my professions as an engineer and teacher immediately makes me suspicious when talking about 4 sigmas or 5 sigmas. We are in Taleb's "Black Swan" territory here.


Re: 4NCL Online

Post by Roger de Coverly » Wed Jun 17, 2020 11:10 am

Matthew Turner wrote:
Wed Jun 17, 2020 10:37 am

If you had 3 or 4 games like this in a row (not the same variation, but similar lengths of unknown theory) then you would probably be caught by a Regan test. However, I think the chances of that run into the billions to one.
If you know lots of obscure theory, the chances that your games reach positions where such knowledge is important are not independent. So ask the Bayesian question: what are the chances of another game of sharp theory, given that such knowledge has already been demonstrated?
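Roger's Bayesian point can be made concrete with a one-line update. Every number below is invented purely for illustration (none come from the thread or from any anti-cheating system):

```python
# Prior: fraction of honest players who know this much obscure theory.
p_knows = 0.02                 # invented for illustration

# Likelihood a game reaches a sharp theoretical line, given that the
# player does / does not know the theory. Games are correlated through
# the player's repertoire, which is the whole point.
p_sharp_given_knows = 0.5      # invented
p_sharp_given_not = 0.05       # invented

# Bayes' rule after observing one sharp-theory game:
numer = p_sharp_given_knows * p_knows
denom = numer + p_sharp_given_not * (1 - p_knows)
posterior = numer / denom

print(f"posterior that the player knows the theory: {posterior:.1%}")
```

With these toy numbers the probability jumps from 2% to roughly 17% after a single such game, so a second sharp-theory game is far less surprising than naive independence would suggest.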


Re: 4NCL Online

Post by Matthew Turner » Wed Jun 17, 2020 11:33 am

Roger,
We are talking about knowing theory that no 2300 has ever played (in a game in a database), our opponent playing down the whole line, and then that being repeated again and again. We don't even know if David's example meets those criteria (because we don't know how the game continued). Please try to give examples where this happened if you want to, but you simply won't be able to.
False positives happen, but the idea that you can systematically generate false positives by knowing lots of obscure opening theory is just wrong.


Re: 4NCL Online

Post by Roger de Coverly » Wed Jun 17, 2020 11:44 am

Matthew Turner wrote:
Wed Jun 17, 2020 11:33 am
False positives happen, but the idea that you can systematically generate false positive by knowing lots of obscure opening theory is just wrong.
That appears to have been Justin Horton's experience when playing "Daily" on chess.com. In his case, the theory wasn't that obscure, living as it did in his bookcase. It was enough for the chess.com computer to make allegations on a relative handful of games. It should not be any surprise that recently written opening books would contain a high level of engine matching as you would hope at the very least that outright tactical errors had been screened out.
Last edited by Roger de Coverly on Wed Jun 17, 2020 12:06 pm, edited 1 time in total.

Joseph Conlon
Posts: 339
Joined: Thu Jun 06, 2019 4:18 pm

Re: 4NCL Online

Post by Joseph Conlon » Wed Jun 17, 2020 12:06 pm

Michael: I don't know whether the distribution used by Ken Regan (for move matching rates) is normal. But here I don't think this is key, because what matters is the probabilities of rare events, and given he has analysed pretty much every major tournament that ever existed (or claims to) he can take his distribution directly from data - what really matters is how much something is an outlier, not whether the distribution is normal.

My use of 1-sigma, 2-sigma refers more to the probabilities they represent, and less to a claim as to whether the distribution of move-matching is normal.

But in terms of performance fluctuations, I do think one can regard 2-sigma fluctuations *in underlying strength* as much more common than they actually are. In reference to another thread, a real 2400, even if drunk, unprepared and suffering with flu, does not play like a 1200. Magnus Carlsen, even when actually drunk, is still one of the few best bullet players in the world (DrDrunkenstein).

If we take 25 ECF, or 200 Elo, as a standard deviation in playing strength, the claim that there are players with stable ratings who can fluctuate in their raw strength by 2 standard deviations between tournaments is equivalent to the claim that there are stable 2200s who - when they are on form - could be expected to hold their own in e.g. the Grand Swiss and get 50% or better. Personally I do not believe this for a moment.

Certainly one needs appeals processes with experienced humans; but 4-sigma (about 1 in 30,000) performances really are extraordinary and should be scrutinised.
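The "1 in 30,000" figure is the one-tailed probability of a result at least 4 sigma above the mean, and it is easy to check with the standard library:

```python
from statistics import NormalDist

# One-tailed probability of a performance >= 4 sigma above the mean.
p = 1 - NormalDist().cdf(4)

# Works out close to the quoted "1 in 30,000" (slightly rarer, in fact).
print(f"P(>= 4 sigma) = {p:.3g}, i.e. about 1 in {1 / p:,.0f}")
```

The exact value is about 1 in 31,600, so the round figure in the post is, if anything, slightly generous to the 4-sigma performer.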


Re: 4NCL Online

Post by Joseph Conlon » Wed Jun 17, 2020 12:11 pm

Roger de Coverly wrote:
Wed Jun 17, 2020 11:44 am

That appears to have been Justin Horton's experience when playing "Daily" on chess.com. In his case, the theory wasn't that obscure, living as it did in his bookcase. It was enough for the chess.com computer to make allegations on a relative handful of games. It should not be any surprise that recently written opening books would contain a high level of engine matching as you would hope at the very least that outright tactical errors had been screened out.
Online correspondence chess is entirely sui generis, though, in that you are allowed to consult any published resource but not allowed to turn on an engine to analyse the game. So if you are playing down some previous game for which you printed out 36 hours of Stockfish analysis using a huge number of cores, you can use the print-out and all the lines; you just can't turn the engine on while you are playing. Likewise, following down a 40-move Stockfish vs Stockfish game contained in your favourite opening book is perfectly fine.

Yes, it makes no sense to me either.

But for games played under normal rules, what Matt says makes perfect sense - if something has never happened over the board, then why (without assistance) should it happen in an online game?


Re: 4NCL Online

Post by Roger de Coverly » Wed Jun 17, 2020 12:14 pm

Joseph Conlon wrote:
Wed Jun 17, 2020 12:06 pm
the claim that there are players with stable ratings who can fluctuate on their raw strength by 2 standard deviations between tournaments is equivalent to the claim that there are stable 2200s who - when they are on form - could be expected to hold their own in e.g. the Grand Swiss and get 50% or better. Personally I do not believe this for a moment.
At the 2200 level, then no. At the 1400 level, why not? It's not as if 1800 players are that good.