stanford nlp (coursera) notes (6) – text classification and naive bayes

Text Classification and Naive Bayes

This time I’m going to talk about text classification. It has many applications:

  • spam detection;
  • authorship identification;
  • gender identification;
  • sentiment analysis;
  • detecting the subject of an article.

I’ll use machine learning jargon in this post, since text classification overlaps heavily with machine learning.

1. Text Classification

Definition: we have an input text (x) and a set of classes (C = \{c_1, c_2, \dots, c_n\}); the output is a predicted class (y \in C) produced by a classifier.

There are two main approaches to building a classifier:

1.1 Hand-coded rules

Use rules based on domain knowledge: combinations of words or other features. Accuracy can be high if the rules are carefully designed, and not much data is needed, but building and maintaining the rules can be expensive. A toy example is sketched below.
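
As a toy illustration (not from the course), a hand-coded rule for spam detection might look like the following; the keyword list and threshold are made up for the example.

```python
# A toy hand-coded rule for spam detection.
# The keyword list and threshold below are illustrative, not from any real system.
SPAM_KEYWORDS = {"free", "winner", "lottery", "click here"}

def is_spam(text: str, threshold: int = 2) -> bool:
    """Flag a message as spam if it contains enough suspicious keywords."""
    text = text.lower()
    hits = sum(1 for kw in SPAM_KEYWORDS if kw in text)
    return hits >= threshold

print(is_spam("You are a WINNER! Click here for your FREE lottery prize"))  # True
print(is_spam("Meeting moved to 3pm tomorrow"))                             # False
```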

1.2 Supervised learning algorithms

This includes many classifiers you may already know: Naive Bayes, SVM, Logistic Regression, k-NN, LDA/QDA, etc. These require a training set containing texts and their labels.

2. Naive Bayes

This is a very simple method and a good starting point. Before modeling, we first need to convert raw texts into a structured format. Bag-of-words is one option (though it loses word-order information): each text is represented as a sparse vector whose length is the number of distinct words in the corpus; see the sketch below.
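
Here is a minimal bag-of-words sketch in plain Python; the toy corpus is invented for illustration. Each text becomes a count vector indexed by the corpus vocabulary, and in realistic corpora most entries are zero, hence the sparsity.

```python
from collections import Counter

# Toy corpus; in practice the vocabulary is built from the whole training set.
corpus = ["the movie was great", "the movie was terrible", "great acting"]

# Build the vocabulary: one vector dimension per distinct word.
vocab = sorted({w for doc in corpus for w in doc.split()})

def bag_of_words(text):
    """Represent a text as a count vector over the corpus vocabulary."""
    counts = Counter(text.split())
    return [counts.get(w, 0) for w in vocab]

print(vocab)
print(bag_of_words("the movie was great great"))  # mostly zeros on a real vocabulary
```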

Then we want to calculate the probability of the class given the text content, (P(y|x)). Applying Bayes’ rule, the posterior is proportional to the likelihood times the prior:

[
P(y|x) = \frac{P(x|y)P(y)}{P(x)} \propto P(x|y)P(y).
]

The predicted class is the one that maximizes the posterior (MAP estimation):

[
\hat{y} = \arg\max_{y \in C} P(y|x) = \arg\max_{y \in C} P(x|y)P(y),
]

where

[
P(x|y) = P(x_1, x_2, \dots, x_n|y).
]

For the prior (P(y)), we can simply use relative frequencies in the training data; the likelihood (P(x_1, x_2, \dots, x_n|y)) is much harder to estimate directly. This is where the Naive Bayes independence assumption comes in: we assume the features are conditionally independent given the class:

[
P(x_1, x_2, \dots, x_n|y) = P(x_1|y)P(x_2|y)\dots P(x_n|y),
]

so the prediction of the Naive Bayes classifier is

[
\hat{y} = \arg\max_{y \in C} P(y)\prod^{n}_{i=1} P(x_i|y),
]

and we can estimate the probabilities by counting:

[
\hat{P}(y_j) = \frac{\mathrm{count}(C=c_j)}{\text{number of texts}}, \quad
\hat{P}(x_i|y_j) = \frac{\mathrm{count}(x_i, y_j)}{\sum_i \mathrm{count}(x_i, y_j)}.
]

In practice, we also apply smoothing to handle zero probabilities and unknown words, e.g. add-one (Laplace) smoothing with vocabulary size (V):

[
\hat{P}(x_i|y_j) = \frac{\mathrm{count}(x_i, y_j) + 1}{\sum_i \mathrm{count}(x_i, y_j) + V}.
]
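
Putting the pieces together, here is a minimal multinomial Naive Bayes sketch in Python with add-one smoothing. The tiny corpus, the function names `train`/`predict`, and the choice to skip out-of-vocabulary words at prediction time are illustrative assumptions; log probabilities are used to avoid numerical underflow when multiplying many small terms.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Estimate log P(c) and add-one-smoothed log P(w|c) from labeled texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)            # word_counts[c][w] = count(w, c)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    V = len(vocab)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing: a word unseen in class c gets probability 1 / (total + V).
        log_likelihood[c] = {w: math.log((counts[w] + 1) / (total + V)) for w in vocab}
    return log_prior, log_likelihood, vocab

def predict(text, log_prior, log_likelihood, vocab):
    """Return the MAP class: argmax_c  log P(c) + sum_i log P(x_i|c)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.split():
            if w in vocab:                        # words outside the vocabulary are skipped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

docs = ["great movie", "terrible movie", "great acting great plot"]
labels = ["pos", "neg", "pos"]
model = train(docs, labels)
print(predict("great plot", *model))              # -> 'pos'
```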

3. Relationship to Language Models

In the previous section we only used bag-of-words features for the Naive Bayes classifier, so we can actually regard it as a language model: each class corresponds to a unigram model, and classification picks the class whose unigram model (weighted by its prior) assigns the highest probability to the text.
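
To make the connection explicit (this merely restates the likelihood above with word tokens (w_1, w_2, \dots, w_n) as the features):

[
P(x|c) = \prod^{n}_{i=1} P(w_i|c),
]

which is exactly the probability that a unigram language model trained on class (c)’s texts assigns to the word sequence.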