- Typo Detection
- Typo Detection Part II Other Writers
- Typo Detection Part III Comparing Matrices with the Vector Distance
- Typo Detection Part IV: Comparing Matrices with The Pearson coefficient.
- Typo Detection Part V: Comparing Matrices
- Typo Detection Part VI: End of the comparisons, for now
I cut the last post short: I was looking at ways to quantify the similarity between two matrices, and the post was getting too long-winded and math-ey. This one might be a little too (we’ll see)…but hopefully, it’ll be short.
I covered using the matrix norm for comparing my probability matrices in the last post, but I came across another way: a version of the Pearson correlation coefficient, which also boils the comparison down to a single scalar number. I also found that similarity matrices were another option. Since I do not know what I am doing, I thought I would put all my options in the wash and see what came out the other end.
I am calling it corr2 because there is a MATLAB function called that, part of the Image Processing Toolbox. The equation for it (from the MathWorks website) is:
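For two matrices A and B of the same size, with Ā and B̄ being the means of all the elements of each matrix, it can be written out as follows (the notation here is my own transcription, so treat it as a sketch rather than a quote):

$$
r = \frac{\sum_m \sum_n (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\left(\sum_m \sum_n (A_{mn} - \bar{A})^2\right)\left(\sum_m \sum_n (B_{mn} - \bar{B})^2\right)}}
$$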
I wrote a version in Java here (turns out I needn’t have done, read on).
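For anyone who does not want to follow the link, here is a minimal sketch of how the corr2 calculation might look in Java: subtract each matrix's overall mean from its elements, sum the products of the paired deviations, and divide by the square root of the product of the two sums of squared deviations. The names `Corr2Sketch`, `corr2` and `mean` are made up for illustration, and it assumes both matrices are non-empty `double[][]` arrays of identical shape, so it is not necessarily the same as the linked code.

```java
final class Corr2Sketch {

    private Corr2Sketch() {}

    /** Mean of every element in the matrix (assumes at least one element). */
    static double mean(double[][] m) {
        double sum = 0.0;
        int count = 0;
        for (double[] row : m) {
            for (double v : row) {
                sum += v;
                count++;
            }
        }
        return sum / count;
    }

    /**
     * Pearson-style correlation between two equally sized matrices.
     * Returns NaN if either matrix is constant, because the
     * denominator is then zero.
     */
    static double corr2(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Matrices must have the same number of rows");
        }
        double meanA = mean(a);
        double meanB = mean(b);

        double cross = 0.0; // sum of (a - meanA) * (b - meanB)
        double sqA = 0.0;   // sum of (a - meanA)^2
        double sqB = 0.0;   // sum of (b - meanB)^2

        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("Row " + i + " differs in length");
            }
            for (int j = 0; j < a[i].length; j++) {
                double da = a[i][j] - meanA;
                double db = b[i][j] - meanB;
                cross += da * db;
                sqA += da * da;
                sqB += db * db;
            }
        }
        return cross / Math.sqrt(sqA * sqB);
    }

    public static void main(String[] args) {
        // Two small made-up row-stochastic matrices, purely for illustration.
        double[][] a = {{0.1, 0.9}, {0.6, 0.4}};
        double[][] b = {{0.2, 0.8}, {0.5, 0.5}};
        System.out.println("corr2(a, a) = " + corr2(a, a));
        System.out.println("corr2(a, b) = " + corr2(a, b));
    }
}
```

A matrix compared with itself, or with a scaled copy of itself, should come out at (or within rounding error of) 1, and a matrix compared with its mirror image around the mean should come out at -1.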
Reading around on what exactly that formula is/does, I have concluded it’s ‘just’ the Pearson correlation coefficient, the ubiquitous ‘r’. Microsoft Excel has the Pearson correlation coefficient as the functions PEARSON() and CORREL(), but more familiar to me is the R² you get when you do those scatter chart lines-of-best-fit. We all added trendlines to our graphs at school and uni, right, and got the R² as the judge of how well things line up…which was invariably an indicator of whether we were going to get a good mark. Well, for a straight-line trendline, THAT R² is the Pearson correlation coefficient (squared).
From what I understand (I do not ‘get’ stats and probabilities generally), the Pearson correlation coefficient is a normalised measure of how closely two datasets follow a straight-line relationship: it is the covariance between them divided by the product of their standard deviations, which pins the result between -1 and 1. In my case the two datasets are the matrices I am thinking about. On this basis it seems to be a prime contender as Miss Right-Way-of-Doing-It for comparing my transition matrices.
‘Incidentally’ footnote: No 1
In this post, Microsoft kind of admit that, pre-Excel 2003, the PEARSON function and a few other statistical functions were (in some circumstances) buggy and ‘a little off’. It’s a strange notion, right?
‘Incidentally’ footnote: No 2
Karl Pearson: It seems a little sad that he has been reduced to an ‘r’, and an uncredited one at that. He got busy in a lot of fields. Unfortunately, one of those was eugenics: if there were ever a way to get your name scrubbed out of the history books, an interest in eugenics would be it. Someone did write his biography; maybe one day I’ll read it…