
Introduction to MONK Analytics

Who this is for

This tutorial is written for users in Literary Studies or cognate disciplines. We assume that you have no background in statistics but may be interested in exploring some quantitative routines. You are probably not very interested in the mathematical formulae that underlie these routines. You would like to know how to run an experiment and how to evaluate the results. In this regard you are like many researchers in many other disciplines. Before computers or pocket calculators, statistical analysis was a very time-consuming business, and you could not really do it without a good understanding of how to prepare data and do calculations. Nowadays you can take a spreadsheet, click on one or two buttons, and get results within seconds. The statistical routines have become faster and more complex, but they are also much more likely to be a black box to you. You put in something and something else comes out, but you have no idea what happened inside. This is an unavoidable part of modern life, but as a user of any black box you still need to know what it is you put into it, and how to make sense of what comes out of it.

Texts and bags of words

The texts you are interested in consist of words in a certain order, which you 'read'. You do not read everything from beginning to end. In fact, in any research-focused environment, reading from beginning to end is the exception, and various forms of partial or discontinuous reading are the norm. But whether you read from the first to the last page or skip, to read means to process words in the order in which they appear in the text.

In the various 'analytics' or statistical routines in MONK, the reading order is deliberately ignored. A text is reduced to a list of words with their counts. This is the 'bag of words' model. In a digital version of a text all the letters or characters are replaced by numbers. The space between words is just another character with a number of its own. The computer has no understanding of what a word is, but it follows instructions to 'count as' a word any string of alphanumeric characters that is not interrupted by non-alphanumeric characters, notably blank space, but also punctuation marks and some other symbols. 'Tokenization' is the name for the fundamental procedure in which the text is reduced to an inventory of its 'tokens' or character strings that count as words. This is an extraordinarily reductive procedure. It is very important to have a grasp of just how reductive it is in order to understand what kinds of inquiry are disabled and enabled by it.
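To make the reduction concrete, here is a minimal sketch in Python of tokenization followed by counting. It is not MONK's own tokenizer: the function names, the lowercasing step, and the sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Count as a word any unbroken run of letters or digits.
    # Blank space and punctuation only separate tokens.
    # Lowercasing is an extra assumption, not something the text prescribes.
    return re.findall(r"[^\W_]+", text.lower())

def bag_of_words(text):
    # The bag keeps each token and its count; the order is thrown away.
    return Counter(tokenize(text))

bag = bag_of_words("The old man and the sea. The sea, the sea!")
print(bag.most_common(2))

# Two sentences with opposite meanings end up as the same bag.
print(bag_of_words("man bites dog") == bag_of_words("dog bites man"))
```

The last line shows how much is lost: once the order is gone, the two sentences cannot be told apart.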

A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'. Take something like 'hee louyd hir depely'. This comes to exist in the MONK textbase as something like

hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep

Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love' the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word. The 'words' in the 'bag' are therefore analyzable at different levels of abstraction.
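The three levels of abstraction can be sketched in Python as follows. The underscore-separated layout is copied from the example above, which is itself only 'something like' the stored form, so the parsing rule and all names here are assumptions rather than a description of the MONK textbase.

```python
from collections import Counter, namedtuple

# Hypothetical record for one annotated token.
Token = namedtuple("Token", ["spelling", "pos", "lemma"])

def parse_line(line):
    # Assumes each item looks like spelling_pos_lemma, as in the example.
    tokens = []
    for item in line.split():
        spelling, pos, lemma = item.split("_")
        tokens.append(Token(spelling, pos, lemma))
    return tokens

line = "hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep"
tokens = parse_line(line)

# The same four tokens yield three different bags.
spelling_bag = Counter(t.spelling for t in tokens)
pos_bag = Counter(t.pos for t in tokens)
lemma_bag = Counter(t.lemma for t in tokens)

print([t.lemma for t in tokens])
```

Counting by lemma would put 'louyd', 'loved', and 'loves' in the same pile, while counting by spelling keeps them apart.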

While the bag of words model deliberately ignores word order, you can restore a little of it by constructing 'n-grams' or minimal word sequences. MONK stores lemma bigrams like 'in the' or 'beauteous majesty'. It also stores part-of-speech trigrams, such as dt-j-n1, the syntactic pattern of 'the old man'. Linguists have found that such syntactic fragments provide useful evidence for many inquiries. For most practical purposes, MONK thus offers you five different word bags: spelling bags, lemma bags, POS bags, lemma bigram bags, and POS trigram bags. These bags offer the raw data for various statistical analyses.
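A short Python sketch shows how n-grams are built from a sequence. The sample lemmata and tags are invented for illustration, and the helper name is hypothetical. Only the tags dt, j, n1, and vvd come from the examples in this tutorial.

```python
from collections import Counter

def ngrams(items, n):
    # Slide a window of length n over the sequence.
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Invented sample: the same seven tokens as lemmata and as POS tags.
lemmas = ["the", "old", "man", "see", "the", "old", "dog"]
pos_tags = ["dt", "j", "n1", "vvd", "dt", "j", "n1"]

lemma_bigram_bag = Counter(ngrams(lemmas, 2))
pos_trigram_bag = Counter(ngrams(pos_tags, 3))

print(lemma_bigram_bag[("the", "old")])
print(pos_trigram_bag[("dt", "j", "n1")])
```

Each n-gram is then treated as a single item in its bag, so a little local word order survives the counting.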

Basic facts about common and rare words

In any statistical inquiry, phenomena from the real world are transformed into data points that are submitted to algorithms. From the perspective of the algorithm, it does not matter whether the data points are leukocytes, antigens, lemmata, or POS tags. On the other hand, algorithms make some assumptions about how data are distributed. Some algorithms work better with some forms of distribution than others.

A very casual look at any word frequency list shows a pattern that holds up with remarkable constancy across different texts in different languages. The pattern is very clearly described in this paragraph from the Wikipedia article on Zipf's law:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.

In the Shakespeare corpus, the 135 most common words account for almost two thirds of some 850,000 word occurrences. It works equally well in other languages: in Early Greek epic, the 135 most common words account for over half of the 231,000 word occurrences. If you look at the top 1,500 words, they account for not quite 90% of all word occurrences in Early Greek epic and a little more than 90% in Shakespeare.
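The arithmetic behind these figures is simple enough to sketch in Python. The three Brown Corpus counts are the ones quoted above. The function names and the toy bag are assumptions for illustration.

```python
from collections import Counter

def zipf_expected(top_count, rank):
    # Zipf's law: frequency is inversely proportional to rank.
    return top_count / rank

def coverage(bag, n):
    # Share of all word occurrences accounted for by the n most common words.
    total = sum(bag.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in bag.most_common(n)) / total

# Counts for the three most frequent Brown Corpus words, as quoted above.
brown_top = [("the", 69971), ("of", 36411), ("and", 28852)]
top_count = brown_top[0][1]
for rank, (word, observed) in enumerate(brown_top, start=1):
    predicted = zipf_expected(top_count, rank)
    print(f"{rank} {word}: observed {observed}, predicted {predicted:.0f}")

# Toy bag, invented for illustration only.
toy_bag = Counter({"the": 5, "of": 3, "whale": 1, "sea": 1})
print(coverage(toy_bag, 2))
```

The predictions are only approximate, as the quoted counts for 'of' and 'and' show, but the overall shape of a few very common words and a long tail of rare ones is what matters for the argument that follows.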

Some very important consequences for scholarly text analysis follow from a reflection on these very simple and quite universal facts about written language. A text of any length consists of a few words that are very common and a lot of words that are quite rare. Hapax legomena or words that occur only once make up a third of the vocabulary of Early Greek epic and almost 40% of the Shakespeare lexicon. If you want to use a "textile" metaphor, you might say that a text is like a spangled dress. The common words weave the basic fabric, and the rare words are the decorative spangles.
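Hapax legomena are easy to find once a text has been reduced to a bag. The following Python sketch uses an invented sentence and hypothetical function names. Note that the share is measured against the vocabulary (distinct words), not against all word occurrences.

```python
from collections import Counter

def hapax_legomena(bag):
    # Words that occur exactly once.
    return sorted(word for word, count in bag.items() if count == 1)

def hapax_share(bag):
    # Fraction of the vocabulary (distinct words) that occurs only once.
    if not bag:
        return 0.0
    return len(hapax_legomena(bag)) / len(bag)

# Invented sample sentence.
bag = Counter("the cat and the dog and the bird".split())

print(hapax_legomena(bag))
print(hapax_share(bag))
```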

It follows from this that a bag of words model for a text can attend to two different phenomena: the relative frequency of the common words and the presence of rare words. Amazon's recommendations for books you might like are based on algorithms that look for 'statistically improbable phrases' (SIP) or rare words that are shared between a book you have bought from them and the books they recommend.

For some forms of inquiry, very common words add little information. The Linguistics Department at the University of Glasgow maintains a popular list of 319 'stop words'. If you reduced a text by keeping only those words and mapping all the others to a placeholder, you would have a very hard time finding out what it is about. If, on the other hand, you threw away all the stop words and reduced the text to a word frequency list of what remains, you would have a pretty good idea of what it is about.
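Both reductions can be tried out with a few lines of Python. The stop word set below is a tiny invented subset, not the Glasgow list, and the function names, the placeholder character, and the sample sentence are likewise assumptions.

```python
from collections import Counter

# Tiny illustrative subset; NOT the 319-word Glasgow list.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "it"}

def keep_only_stop_words(tokens, placeholder="_"):
    # Keep stop words and map every other word to a placeholder.
    return [t if t in STOP_WORDS else placeholder for t in tokens]

def content_word_counts(tokens):
    # Throw away the stop words and count what remains.
    return Counter(t for t in tokens if t not in STOP_WORDS)

tokens = "the whale is the terror of the sea and the whale is white".split()

print(" ".join(keep_only_stop_words(tokens)))
print(content_word_counts(tokens).most_common())
```

The first line of output tells you almost nothing about the subject. The second gives it away at once.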

For many other inquiries, however, the so-called stop words are critical. Words like 'whilst' and 'by' provided the critical evidence for Mosteller and Wallace's famous analysis of differences in usage between Madison and Hamilton in the Federalist Papers. If you look for linguistic features that vary with genre, the odds are that differences in the relative frequency of very common words provide the most powerful evidence. When John Burrows analyzed differences in the way Jane Austen's characters talk, he discovered that most of the work of characterization was done by differences in the distribution of just thirty words.

This is a good example of the enabling power of the bag of words model. It is very easy to think of all the things that are brutally thrown away by it. On the other hand, it makes it much easier to attend to effects that arise from the frequency of low-level phenomena. It is very difficult to define these with any precision by reading, and it is a very tedious and error-prone business to count them by hand.

Text analytics

An 'analytic' is a procedure that returns some result on the basis of the data you feed it. The analytics used in MONK all work more or less by comparing two or more pre-defined word bags. When you define a 'work set' you choose the parameter for a word bag. When you choose a particular analytic you have an additional option of defining the tokens in your word bag as spellings, lemmata, or POS tags, and you may include or exclude certain kinds of words.

While the analytics differ in their formulas, they follow a basic model. You observe the distribution of tokens in one word bag. You pretend that this word bag was generated by randomly drawing words from some 'super word bag'. You now approach the second word bag with expectations about the distribution of its words on the assumption that it was drawn at random from the same super word bag. You measure the difference between that expectation and what you actually observe in the second bag. You take a special interest in those words where the difference between the expected and the observed falls outside an expected range of random variation.
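The basic model can be sketched in Python. This tutorial does not name a formula, so the sketch uses the log-likelihood ratio (G2), a measure commonly used in corpus linguistics for comparisons of this kind, as a stand-in. It estimates the 'super word bag' by pooling both bags. The function names and the sample counts are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, n1, n2):
    # a, b: counts of one word in bag 1 and bag 2.
    # n1, n2: total number of tokens in each bag.
    # The pooled rate stands in for the word's rate in the 'super word bag'.
    pooled_rate = (a + b) / (n1 + n2)
    g2 = 0.0
    for observed, expected in ((a, n1 * pooled_rate), (b, n2 * pooled_rate)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def compare_bags(bag1, bag2):
    # Score every word in either bag; the largest differences come first.
    n1, n2 = sum(bag1.values()), sum(bag2.values())
    scores = {w: log_likelihood(bag1[w], bag2[w], n1, n2)
              for w in set(bag1) | set(bag2)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts for two bags of 100 tokens each.
bag1 = Counter({"whilst": 20, "the": 80})
bag2 = Counter({"whilst": 5, "the": 95})

for word, score in compare_bags(bag1, bag2):
    print(word, round(score, 2))
```

A score of zero means the word behaves exactly as expected in both bags. For a single word, a score above about 3.84 is conventionally read as falling outside the range of random variation at the 5% level, but that threshold becomes less trustworthy when you test thousands of words at once.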

There are various refinements to these comparisons. If you are interested in the content of a text, narrowly construed, it makes sense to throw away stop words. If you are interested in stylistic properties, differences in the frequencies of common words are likely to be crucial evidence. From a statistical perspective, the high number of rare events is a distinctive property of language. You may want to ignore rare words, because there are so many of them and you cannot draw reliable inferences from differences of frequency. Sometimes a 'binary' approach is useful. There may not be a point to the fact that a given token appears once here and three times there. But it may be significant that it occurs at all both here and there.
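Two of these refinements, ignoring rare words and the 'binary' approach, can be sketched in Python. The threshold parameter, the function names, and the sample bags are assumptions for illustration and do not describe MONK's actual options.

```python
from collections import Counter

def drop_rare(bag, min_count=2):
    # Ignore words that occur fewer than min_count times (hypothetical threshold).
    return Counter({w: c for w, c in bag.items() if c >= min_count})

def binarize(bag):
    # Record only whether a word occurs at all, not how often.
    return set(bag)

# Invented sample bags.
here = Counter({"the": 40, "love": 3, "beauteous": 1})
there = Counter({"the": 55, "love": 1, "beauteous": 3, "majesty": 2})

print(drop_rare(here))
print(sorted(binarize(here) & binarize(there)))
```

In the binary view 'beauteous' simply counts as present in both bags, whether it appears once or three times.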



