Typo Detection Part IV: Comparing Matrices with The Pearson coefficient.


This entry is part 4 of 6 in the series Typo detection

I cut the last post short, where I was looking at ways to quantify the similarity between two matrices, because it was getting too long-winded and math-ey. This one might be a little too (we’ll see)…but hopefully, it’ll be short.

I covered using the matrix norm for comparing my probability matrices in the last post, but I have since come across another way of boiling the comparison down to a single scalar number: a version of the Pearson correlation coefficient. I also found that similarity matrices are another option. Since I do not know what I am doing, I thought I would put all my options in the wash and see what came out the other end.

Corr2 Correlation

I am calling it corr2 because that is the name of a MATLAB function, part of the Image Processing Toolbox. The equation for it (from the MathWorks website) is:
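Written out, and assuming the standard definition where A and B are two matrices of the same size and the barred symbols are the means taken over all elements of each matrix, the formula reads:

$$
r = \frac{\sum_{m}\sum_{n} \left(A_{mn} - \bar{A}\right)\left(B_{mn} - \bar{B}\right)}
         {\sqrt{\left(\sum_{m}\sum_{n} \left(A_{mn} - \bar{A}\right)^{2}\right)\left(\sum_{m}\sum_{n} \left(B_{mn} - \bar{B}\right)^{2}\right)}}
$$

The result is a single number between -1 and 1.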

I wrote a version in Java here (turns out I needn’t have done, read on).
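For anyone who wants something to poke at, here is a minimal sketch of what a Java version could look like. It is not the code linked above: the class name Corr2Sketch and its method names are made up for illustration, and it assumes both matrices are non-empty and have the same shape.

```java
// Hypothetical illustration only: a 2-D correlation coefficient in plain Java.
final class Corr2Sketch {

    /** Mean of every element in the matrix. */
    static double mean(double[][] m) {
        double sum = 0.0;
        int count = 0;
        for (double[] row : m) {
            for (double v : row) {
                sum += v;
                count++;
            }
        }
        return sum / count;
    }

    /**
     * Sum of products of deviations from the mean, divided by the
     * square root of the product of the two sums of squared deviations.
     * Returns NaN if either matrix is constant (zero variance).
     */
    static double corr2(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Matrices must have the same number of rows");
        }
        double meanA = mean(a);
        double meanB = mean(b);
        double sumAB = 0.0;
        double sumAA = 0.0;
        double sumBB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("Row " + i + " differs in length");
            }
            for (int j = 0; j < a[i].length; j++) {
                double da = a[i][j] - meanA;
                double db = b[i][j] - meanB;
                sumAB += da * db;
                sumAA += da * da;
                sumBB += db * db;
            }
        }
        return sumAB / Math.sqrt(sumAA * sumBB);
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{2, 4}, {6, 8}}; // b is just a scaled copy of a
        System.out.println(corr2(a, b));
    }
}
```

Comparing a matrix with a scaled copy of itself should give 1, and comparing it with its negation should give -1, which makes for a quick sanity check.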

Reading around on what exactly that formula is/does, I have concluded it’s ‘just’ the Pearson correlation coefficient, the ubiquitous ‘r’. Microsoft Excel has the Pearson correlation coefficient as the functions PEARSON() and CORREL(), but more familiar to me is the R² you get when you do those scatter chart lines-of-best-fit. We all added trendlines to our graphs at school and uni, right, and got the R² as the judge of how well things line up…which was invariably an indicator of whether we were going to get a good mark. Well, for a straight-line trendline, THAT R² is the Pearson correlation coefficient (squared).
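The R² claim is easy to check numerically. The sketch below is an illustration under my own assumptions, not anything Excel-specific: it fits an ordinary least-squares straight line, computes R² as 1 minus (residual sum of squares / total sum of squares), and compares that with the square of Pearson’s r. The class name TrendlineRSquared and the sample data are made up.

```java
// Hypothetical illustration: R² of a straight-line fit versus Pearson's r squared.
final class TrendlineRSquared {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Pearson correlation coefficient of two equal-length series. */
    static double pearson(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** R² of the least-squares line y = intercept + slope * x. */
    static double rSquaredOfLineFit(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;
        double ssRes = 0.0, ssTot = 0.0;
        for (int i = 0; i < x.length; i++) {
            double predicted = intercept + slope * x[i];
            ssRes += (y[i] - predicted) * (y[i] - predicted);
            ssTot += (y[i] - my) * (y[i] - my);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1}; // made-up, roughly linear data
        double r = pearson(x, y);
        System.out.println("r squared = " + (r * r));
        System.out.println("R squared = " + rSquaredOfLineFit(x, y));
    }
}
```

The two printed values should agree to within floating-point rounding. The equality only holds for a straight-line fit; polynomial or other trendline types have their own R².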

This snapshot of a quick Excel chart I made was an epiphany for me.

From what I understand (I do not ‘get’ stats and probabilities generally), the Pearson correlation coefficient is a normalised measure (the covariance divided by the two standard deviations) of how closely the two datasets, in my case the matrices I am thinking about, follow a straight-line relationship with each other. On this basis it seems to be a prime contender as Miss Right-Way-of-Doing-It for comparing my transition matrices.
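In symbols, and treating each matrix as one long list of numbers X and Y (my notation, not taken from any particular reference), that description amounts to:

$$
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i} (x_i - \bar{x})^{2}} \, \sqrt{\sum_{i} (y_i - \bar{y})^{2}}}
$$

The 1/n (or 1/(n-1)) factors in the covariance and the standard deviations cancel, which is why only the raw sums are left. A value of 1 means the points sit exactly on a rising straight line, -1 on a falling one, and 0 means no linear relationship at all.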

‘Incidentally’ footnote: No 1

In this post, Microsoft kind of admit that, pre-Excel 2003, the PEARSON function and a few other statistical functions were (in some circumstances) buggy and ‘a little off’. It’s a strange notion, right?

I find ‘improved’ a strange word to be using. “Oh, 2+3 was coming out at 3.5, but we’ve since improved it…it comes out as 4 and a bit now”

‘Incidentally’ footnote: No 2

Karl Pearson on the left; Francis Galton, Darwin’s (half) cousin, on the right. The two of them, unfortunately, pioneered a slightly iffy social Darwinism (Galton coined the word Eugenics).

Karl Pearson: It seems a little sad he has been reduced to an ‘r’, and an uncredited one at that. He got busy in a lot of fields. Unfortunately, one of those was Eugenics: if there were ever a way to get your name scrubbed out of history books, an interest in Eugenics would be it. Someone did write his biography; maybe one day I’ll read it…

 

 
