Analytics: Naive Bayes

Naive Bayes belongs to a group of statistical techniques that are called 'supervised classification' as opposed to 'unsupervised classification.' In 'supervised classification' the algorithms are told about two or more classes to which texts have previously been assigned by some human(s) on whatever basis.

The examination by Mosteller and Wallace of disputed essays in the Federalist Papers is a famous example of this technique. The authorship of most of the essays is firmly established through external evidence, but with regard to a dozen essays it is unclear whether they were written by Hamilton or Madison. Mosteller and Wallace used this problem as a guinea pig to test the efficacy of Bayesian statistics. They started with the known facts about lexical and other verbal preferences by making a 'Madison bag' and a 'Hamilton bag' of word frequencies. Then they turned each disputed essay into an 'unknown bag' and asked whether and how it was more like Madison than Hamilton.
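The idea of a 'word bag' can be sketched in a few lines of Python. This is only an illustration: the function names word_bag and relative_frequency are made up for this example, and the three snippets of text are invented stand-ins, not quotations from the Federalist Papers.

```python
import re
from collections import Counter

def word_bag(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def relative_frequency(bag, word):
    """Occurrences of `word` per 1000 words in the bag."""
    total = sum(bag.values())
    return 1000 * bag[word] / total if total else 0.0

# Invented toy texts standing in for the real essays (assumption).
madison_bag = word_bag("whilst the people judge, whilst the states act")
hamilton_bag = word_bag("while the people judge upon the matter, while the states act")
unknown_bag = word_bag("whilst the union stands")

for name, bag in [("Madison", madison_bag), ("Hamilton", hamilton_bag)]:
    print(name, "uses 'whilst' at a rate of", relative_frequency(bag, "whilst"), "per 1000 words")
```

A bag ignores word order entirely; all that survives is how often each word occurs, which is what makes the bags of different authors directly comparable.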

This approach is like Dunning's in some ways and quite different in others. It is like Dunning's in that it involves the comparison of two word bags and the identification of particular words that discriminate sharply between the two bags, although it uses different ways of measuring and expressing the discriminating power of particular 'features'.

Naive Bayes is quite different from Dunning's in that it applies what it has 'learned' from the comparison of two (or more) word bags to the classification of unknown texts. Was this novel written by a man or a woman? If you use email, some spam filter along the line probably uses Bayesian or similar statistics to classify your email for you.

Naive Bayes is called 'naive' because it pretends that the variables it examines are independent of each other even when this is manifestly not the case. It is not so naive in its discovery that the false assumption of independence, which makes it much easier to compute multiple variables, in practice works well enough.
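What the independence assumption buys can be seen in a small sketch. The code below assumes a simple multinomial model with add-one smoothing, which is a common textbook form of Naive Bayes and not a reconstruction of the more elaborate models Mosteller and Wallace used. The function name log_likelihood and all word counts are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(unknown_bag, class_bag, vocabulary):
    """Score an unknown bag against one class bag.

    Because the words are treated as independent, the score is just a
    sum of per-word log probabilities; no joint statistics are needed.
    """
    total = sum(class_bag.values())
    score = 0.0
    for word, count in unknown_bag.items():
        # Add-one smoothing (an assumption of this sketch) keeps a word
        # the class never used from driving the probability to zero.
        p = (class_bag[word] + 1) / (total + len(vocabulary))
        score += count * math.log(p)
    return score

# Invented counts for three marker words (assumption, not real data).
madison_bag = Counter({"whilst": 8, "by": 12, "upon": 1})
hamilton_bag = Counter({"while": 7, "by": 5, "upon": 9})
unknown_bag = Counter({"whilst": 2, "by": 3})

vocabulary = set(madison_bag) | set(hamilton_bag) | set(unknown_bag)
madison_score = log_likelihood(unknown_bag, madison_bag, vocabulary)
hamilton_score = log_likelihood(unknown_bag, hamilton_bag, vocabulary)

print("more like Madison" if madison_score > hamilton_score else "more like Hamilton")
```

Without the independence assumption the classifier would need to estimate how every word co-occurs with every other word, which is far more than a few dozen essays can support.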

A Bayesian classification is a probability judgment, like just about everything else in this world, and it takes the form of "on the basis of the combined result of the discriminating features there is a 77% chance that this text is an X rather than a Y." There are several reasons for using a test of this kind. The most obvious is to put texts in some cubby-hole: "this essay is by Madison" or "this email is spam." You may also be interested in the features that distinguish one class of texts from another, although Dunning's may do this more easily, as long as you have only two groups of texts. Thirdly, if you are interested in phenomena that are likely to exist on a continuum, such as 'Gothic' or 'sentimental' novels, you may use the confidence values of Naive Bayes to arrange texts on a spectrum.
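A statement such as "a 77% chance" comes from normalizing the per-class scores so that they sum to 1. The following sketch assumes that each class already has a log score (for example a sum of per-word log probabilities) and that the classes are equally likely beforehand unless priors are supplied. The function name posterior, the class labels, the novel titles, and every number are hypothetical.

```python
import math

def posterior(log_scores, priors=None):
    """Turn per-class log scores into probabilities that sum to 1."""
    classes = list(log_scores)
    if priors is None:
        # Assumption: with no other knowledge, all classes are equally likely.
        priors = {c: 1 / len(classes) for c in classes}
    joint = {c: log_scores[c] + math.log(priors[c]) for c in classes}
    top = max(joint.values())  # subtract the maximum to avoid underflow
    weights = {c: math.exp(v - top) for c, v in joint.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

# A gap of about 1.21 in log score corresponds to roughly 77% against 23%.
p = posterior({"X": -100.0, "Y": -101.2083})
print(round(p["X"], 2))

# Hypothetical log scores for three novels under two genre labels.
novels = {
    "Novel A": {"gothic": -410.0, "sentimental": -412.0},
    "Novel B": {"gothic": -388.5, "sentimental": -388.0},
    "Novel C": {"gothic": -402.0, "sentimental": -402.3},
}

# Arrange the novels from most to least 'Gothic' by confidence value.
spectrum = sorted(novels, key=lambda t: posterior(novels[t])["gothic"], reverse=True)
print(spectrum)
```

Sorting by the confidence value rather than by the winning label is what turns a yes-or-no classifier into a tool for placing texts along a continuum.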

Misclassifications are of particular interest in this context. There are several statistical tests, including Naive Bayes, that misclassify Othello as a comedy. This is not a failure of the test, but striking statistical confirmation that, to borrow from the title of an utterly non-statistical essay, Othello is a Roman comedy turned nightmare.

Some cautions are in order, and they emerge clearly from another look at Mosteller and Wallace, whose results have been universally accepted. It is worth repeating that the real telltale indicators were little words like 'whilst' and 'by'. It was helpful that the results of the statistical analysis were very compatible with tentative conclusions that historians had arrived at by non-statistical evaluations of internal and external evidence.

As is often the case with good statistical work, the results provide different and firmer evidence for what is known or suspected on the basis of other inquiries. The French mathematician Laplace has a wonderful sentence about this in his "Philosophical Essay on Probabilities":

On voit, par cet Essai, que la théorie des probabilités n'est, au fond, que le bon sens réduit au calcul; elle fait apprécier avec exactitude ce que les esprits justes sentent par une sorte d'instinct, sans qu'ils puissent souvent s'en rendre compte.
You see from this essay that the theory of probability is common sense reduced to calculus; it makes you appreciate with exactitude what judicious minds have sensed by a kind of instinct without being able to give an account of it.

Mosteller and Wallace had very good 'training data' and quite small 'test data', a ratio of about 6:1. The training data consisted of dozens of essays each by Madison and Hamilton. The Federalist Papers belong to a genre of 18th century political writing very clearly defined in terms of topic and scope. The dozen essays in the test data were like the training data in just about every respect. So this was a very controlled experiment in which you could expect that variables would bear directly on the issue at hand.

If you are a literary scholar and you do not have a great passion for authorship attribution but may be interested in identifying properties of a genre, you will have a lot of difficult and time-consuming choices to make. You will rarely encounter a situation in which your training data and your test data are defined so clearly as they were for the Federalist Papers, where there was no doubt about the number, scope, or length of the data. You will be in a world of more variables and very fuzzy edges. Consequently, the interpretation of results will require considerable caution.


Sources:

Mosteller, Frederick and Wallace, David L. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. New York: Springer-Verlag, 1984.

Mosteller, Frederick and Wallace, David L. "Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Federalist Papers." Journal of the American Statistical Association 58 (1963): 275-309.



