I have been playing with something to try and detect annoying typos. It uses a natural language processor (at the moment I’m using Stanford Parser) to form a transition matrix for the Parts of Speech (POS) I use. For input data, I used my writing from this blog.
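To make the idea of a POS transition matrix concrete, here is a minimal sketch in Java. It is not the code used for this project: it assumes the sentences have already been tagged (by Stanford Parser or anything else) and arrive as arrays of tag strings, and it assumes each row is normalised so that it sums to 1. The class name `PosTransitionMatrix` and the method `build` are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts tag-to-tag transitions and normalises each row.
class PosTransitionMatrix {

    // taggedSentences: one String[] of POS tags per sentence (assumed input format).
    // tagSet: the tags to track; its order fixes the row and column order.
    static double[][] build(List<String[]> taggedSentences, List<String> tagSet) {
        int n = tagSet.size();
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < n; i++) {
            index.put(tagSet.get(i), i);
        }

        double[][] matrix = new double[n][n];
        for (String[] tags : taggedSentences) {
            // Each sentence is treated independently: no transition crosses a sentence boundary.
            for (int i = 0; i + 1 < tags.length; i++) {
                Integer from = index.get(tags[i]);
                Integer to = index.get(tags[i + 1]);
                if (from == null || to == null) {
                    continue; // skip tags outside the tracked set
                }
                matrix[from][to] += 1.0;
            }
        }

        // Assumption: turn counts into per-row probabilities. Rows with no
        // observations are left as all zeros.
        for (double[] row : matrix) {
            double sum = 0.0;
            for (double c : row) {
                sum += c;
            }
            if (sum > 0.0) {
                for (int j = 0; j < n; j++) {
                    row[j] /= sum;
                }
            }
        }
        return matrix;
    }
}
```

With row normalisation, cell (i, j) reads as "given tag i, how likely is tag j to come next". Normalising over the whole matrix instead would give joint probabilities; either would work for a heatmap.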
At the end of the last post, I brought data from Java into R to draw a heatmap, to visualise the transition matrix. This process was a little annoying though (and I find R a little annoying anyway). I found JHeatChart, a Java heat map implementation, which smooths the process a little bit. JHeatChart did not quite give the result I was after, so I made a few modifications here:
- There are quite a few POS combinations that never occur, so the corresponding cell in the matrix is zero. I wanted these to be clearly distinguishable from combinations that do occur, but rarely.
- The distribution of probabilities in the matrix is quite imbalanced: some combinations (e.g. ‘the’ in front of a noun) occur orders of magnitude more often than the majority. In the standard implementation, most of the cells came out as indistinguishable shades of the ‘low’ colour, with a couple in the ‘high’ colour and nothing in between. To address this I added the ability to plot log values of the data, to even things out a little in the render.
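The two modifications above can be sketched without touching JHeatChart itself. The following is an illustration, not the actual patch: it assumes the matrix holds non-negative probabilities, maps every positive cell to an intensity in [0, 1] on a log10 scale, and marks zero cells with NaN so a renderer can paint them in a separate colour. The names `LogScale` and `toLogIntensity` are hypothetical.

```java
// Hypothetical helper: log-scales a probability matrix for rendering.
class LogScale {

    // Returns intensities in [0, 1] for positive cells; zero cells become NaN
    // so they can be drawn differently from rare-but-present combinations.
    static double[][] toLogIntensity(double[][] p) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        // First pass: find the range of log10 values over the positive cells only.
        for (double[] row : p) {
            for (double v : row) {
                if (v > 0.0) {
                    double l = Math.log10(v);
                    min = Math.min(min, l);
                    max = Math.max(max, l);
                }
            }
        }

        // Second pass: rescale each positive cell into [0, 1].
        double[][] out = new double[p.length][];
        for (int i = 0; i < p.length; i++) {
            out[i] = new double[p[i].length];
            for (int j = 0; j < p[i].length; j++) {
                double v = p[i][j];
                if (v <= 0.0) {
                    out[i][j] = Double.NaN;          // never-observed combination
                } else if (max == min) {
                    out[i][j] = 1.0;                 // only one distinct positive value
                } else {
                    out[i][j] = (Math.log10(v) - min) / (max - min);
                }
            }
        }
        return out;
    }
}
```

On a linear scale, cells of 0.001 and 0.01 are both almost invisible next to a cell of 0.1; on the log scale they land at 0, 0.5 and 1 respectively, which is what spreads the colours out.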
Other Writers’ Maps
I have since written a new preprocessor that takes (txt format) e-books from Project Gutenberg and formats them for my processor. It is all a bit crude, so in keeping, the new preprocessor strips out all the paragraphs with speech in (around 50% of them in The Old Man and the Sea). I reasoned that I don’t really write direct speech, and parsing gets a little trickier if I try to include speech. The remaining text is then split into sentences, one per line, same as I did for my blog text. The parser considers each sentence independently without trying to get (grammatical) context from the surrounding text – I discussed this in the last post.
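A crude preprocessor along these lines might look like the sketch below. It is a guess at the approach rather than the real thing: it assumes paragraphs are separated by blank lines, treats any paragraph containing a double quotation mark (straight or curly) as speech, and uses the JDK's `BreakIterator` to find sentence boundaries. The class name `GutenbergPreprocessor` and method `toSentences` are invented for the example, and it ignores the Project Gutenberg header and licence text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: drops paragraphs with speech, returns one sentence per entry.
class GutenbergPreprocessor {

    static List<String> toSentences(String text) {
        List<String> sentences = new ArrayList<>();

        // Assumption: paragraphs are separated by one or more blank lines.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            // Assumption: a double quote anywhere in the paragraph means direct speech.
            boolean hasSpeech = paragraph.indexOf('"') >= 0
                    || paragraph.indexOf('\u201C') >= 0
                    || paragraph.indexOf('\u201D') >= 0;
            if (hasSpeech) {
                continue;
            }

            // Undo the hard line wrapping used in plain-text e-books.
            String flat = paragraph.replaceAll("\\s+", " ").trim();
            if (flat.isEmpty()) {
                continue;
            }

            // Walk the sentence boundaries found by the JDK's BreakIterator.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(flat);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = flat.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }
}
```

Writing each entry of the returned list on its own line gives the sentence-per-line format described above. Dropping whole paragraphs is blunt, since it also loses the narration around each quote, but it matches the "a bit crude" spirit.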
The maps below are for:
- Hemingway’s The Old Man and the Sea,
- The Wonderful Wizard of Oz by L. Frank Baum, and
- Darwin’s On the Origin of Species
At a cursory glance, all four maps (these three plus the one for my blog) are relatively similar. This is not unexpected: for prose to make ‘sense’ (however badly written it is in my case, or well in the others) it has to follow the rules of grammar. It’s difficult (for me anyway, knowing little technically about languages) to compare the four maps and derive anything particularly insightful, though if you do look at them more closely they are all subtly different. The Wizard of Oz one seems to be sparser than the others. Perhaps this is because the grammar is simpler? It is a children’s book. Don’t know.
The Old Man and the Sea