We want to predict whether a review is negative or positive, based on the text of the review. We’ll use Naive Bayes as our classification algorithm. A Naive Bayes classifier works by figuring out how likely data attributes are to be associated with a certain class. A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes often performs surprisingly well compared to far more sophisticated classification methods.


What is Naive Bayes?

The Naive Bayes classifier is based on Bayes’ theorem, which is:

P(H | x) = P(H) P(x | H) / P(x)

  • P(H|x) is the posterior probability of the hypothesis (H, the target) given the predictor (x, an attribute).
  • P(H) is the prior probability of the hypothesis.
  • P(x|H) is the likelihood: the probability of the predictor given the hypothesis.
  • P(x) is the prior probability of the predictor.

Naive Bayes extends Bayes’ theorem to handle multiple pieces of evidence (features) by assuming that each piece of evidence is conditionally independent of the others, given the hypothesis.

P(H | x1, …, xn) = P(H) P(x1 | H) … P(xn | H) / P(x1, …, xn)

In most of these problems we compare the probability of H being true with the probability of H being false. The denominator is the same in both cases, so it doesn’t affect which is larger, and we can simply compare the numerators.

Example

Take a look at a simple example above to understand Naive Bayes parameters in the context of movie reviews.
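
To make these quantities concrete, here is a minimal sketch with made-up word counts (hypothetical numbers, not taken from the data set we load below) showing how the two numerators are computed and compared for the tiny review "awesome movie":

# Toy illustration of the Naive Bayes comparison -- all counts here are hypothetical
# Suppose the words "awesome" and "movie" occurred this many times in each class:
word_counts = {
    "positive": {"awesome": 3, "movie": 30},
    "negative": {"awesome": 1, "movie": 40},
}
total_words = {"positive": 100, "negative": 110}  # total words seen per class (made up)
prior = {"positive": 0.5, "negative": 0.5}        # P(H) for each class (made up)

review = ["awesome", "movie"]

# Numerator of Bayes' theorem for each class: P(H) * P(x1 | H) * ... * P(xn | H)
scores = {}
for label in ("positive", "negative"):
    score = prior[label]
    for word in review:
        score *= word_counts[label][word] / total_words[label]
    scores[label] = score

# The shared denominator is ignored; the larger numerator wins
print("prediction:", max(scores, key=scores.get))

With these made-up counts the positive numerator comes out larger, so this toy model would call the review positive.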

Loading the data

We’ll be working with a CSV file containing movie reviews. Each row contains the text of the review, as well as a number indicating whether the tone of the review is positive (1) or negative (-1).

We want to predict whether a review is negative or positive, based on the text alone. To do this, we’ll train an algorithm using the reviews and classifications in train.csv, and then make predictions on the reviews in test.csv. We’ll be able to calculate our error using the actual classifications in test.csv to see how good our predictions were.

import csv
with open("train.csv", 'r') as file:
    reviews = list(csv.reader(file))
    
print(reviews[0]) 
print(len(reviews))
['plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what\'s the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt', '-1']

1418

Training a Model

Obtaining P(H): the prior probability of the hypothesis (positive review)

To train the model we need two quantities from the training data: the prior probability of each class, P(H), and the likelihood of each word given a class, P(xi | H). We then multiply them out to predict the classification of a new review.

Let’s start by obtaining the prior probabilities as follows:

# Computing the prior (H=positive reviews) according to the Naive Bayes' equation

def get_H_count(score):
    # Count how many reviews in the data have the given classification (score)
    return len([r for r in reviews if r[1] == str(score)])

# We'll use these counts for smoothing when computing the prediction
positive_review_count = get_H_count(1)
negative_review_count = get_H_count(-1)

# These are the prior probabilities (we saw them in the formula as P(H))
prob_positive = positive_review_count / len(reviews)
prob_negative = negative_review_count / len(reviews)
print("P(H) or the prior is:", prob_positive)
P(H) or the prior is: 0.5007052186177715

Obtaining P(xi | H): the likelihood

Finding Word Counts

We’re trying to determine if we should classify a data row as negative or positive. The easiest way to generate features from text is to split the text up into words. Each word in a movie review will then be a feature that we can work with. To do this, we’ll split the reviews based on whitespace.

Afterwards, we’ll count up how many times each word occurs in the negative reviews, and how many times each word occurs in the positive reviews. Eventually, we’ll use the counts to compute the probability that a new review will belong to one class versus the other.

We will use Python’s collections.Counter class (collections.Counter([iterable-or-mapping])) to do the counting.

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value, including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
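
For example, a quick toy illustration of how Counter tallies words (just to show the behavior; it is not part of the pipeline below):

from collections import Counter

print(Counter("the movie was the best movie".split()))
# Counter({'the': 2, 'movie': 2, 'was': 1, 'best': 1})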

# Python class that lets us count how many times items occur in a list
from collections import Counter
import re

def get_text(reviews, score):
    # Join together the text in the reviews for a particular tone
    # Lowercase the text so that the algorithm doesn't see "Not" and "not" as different words, for example
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split text into words based on whitespace -- simple but effective
    words = re.split(r"\s+", text)
    # Count up the occurrence of each word
    return Counter(words)

negative_text = get_text(reviews, -1)
positive_text = get_text(reviews, 1)

# Generate word counts(WC) dictionary for negative tone
negative_WC_dict = count_text(negative_text)

# Generate word counts(WC) dictionary for positive tone
positive_WC_dict = count_text(positive_text)

Let’s look at a few examples:

# example
print("count of word 'awesome' in positive reviews", positive_WC_dict.get("awesome"))
print("count of word 'movie' in positive reviews", positive_WC_dict.get("movie"))

print("count of word 'awesome' in negative reviews", negative_WC_dict.get("awesome"))
print("count of word 'movie' in negative reviews", negative_WC_dict.get("movie"))
count of word 'awesome' in positive reviews 3
count of word 'movie' in positive reviews 304
count of word 'awesome' in negative reviews 1
count of word 'movie' in negative reviews 376

Obtaining P(xi | H): the likelihood, or the probability of a predictor (a word) given the hypothesis (positive review)

  • For every word in the text, we get the number of times that word occurred in the text.
  • We multiply that count by the probability of that word given the hypothesis, as shown below:

prediction = text_WC_dict.get(word) * (H_WC_dict.get(word) / sum(H_WC_dict.values()))

We add 1 to smooth the value. Smoothing ensures that we don’t multiply the prediction by 0 when a word didn’t occur in the training data for that class; we also add the class review count (H_count) to the denominator to keep the probabilities balanced.

After smoothing, the equation above becomes:

prediction = text_WC_dict.get(word) * ((H_WC_dict.get(word) + 1) / (sum(H_WC_dict.values()) + H_count))

# H = positive review or negative review
def make_class_prediction(text, H_WC_dict, H_prob, H_count):
    prediction = 1
    text_WC_dict = count_text(text)
    
    for word in text_WC_dict:       
        prediction *=  text_WC_dict.get(word,0) * ((H_WC_dict.get(word, 0) + 1) / (sum(H_WC_dict.values()) + H_count))

    # Now we multiply by the prior probability of the class, P(H)
    return prediction * H_prob

# Now we can generate probabilities for the classes our reviews belong to
# The probabilities themselves aren't very useful -- we make our classification decision based on which value is greater
def make_decision(text):
    
    # Compute the negative and positive probabilities
    negative_prediction = make_class_prediction(text, negative_WC_dict, prob_negative, negative_review_count)
    positive_prediction = make_class_prediction(text, positive_WC_dict, prob_positive, positive_review_count)

    # We assign a classification based on which probability is greater
    if negative_prediction > positive_prediction:
        return -1
    return 1

print("For this review: {0}".format(reviews[0][0]))
print("")
print("The predicted label is ", make_decision(reviews[0][0]))
print("The actual label is ", reviews[0][1])
For this review: plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt

The predicted label is  -1
The actual label is  -1

Making Predictions

Simple Examples

Let’s start by predicting the class of the hypothetical review shown in the image at the top, “Awesome movie”

make_decision("Awesome movie!")
1

make_decision("movie")
-1

So our classifier labels "Awesome movie!" as a positive review and the lone word "movie" as a negative review. This makes sense given the counts we looked up at the end of the "Finding Word Counts" subsection: "awesome" appears 3 times in positive reviews and only once in negative reviews, whereas "movie" appears over 300 times in both classes, but noticeably more often in negative reviews. Since the priors are both around 50%, the model is bound to classify the single word "movie" as negative. This teaches us two things:
  1. A bigger training set can help, since the classifier is currently making decisions on the basis of a word ("awesome") that appears only a handful of times.
  2. We might get better results by excluding words that appear very frequently in both the positive and negative sets, as sketched below.
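
As a rough illustration of the second point, here is a minimal sketch of one way to drop words that are very common in both classes before predicting. This is an assumption on my part rather than part of the original walkthrough, and the cutoff of 200 occurrences is arbitrary and would need tuning:

# Hypothetical filtering step: remove words that are very frequent in BOTH classes
# The 200-occurrence threshold is an arbitrary choice for illustration
common_words = {word for word in positive_WC_dict
                if positive_WC_dict[word] > 200 and negative_WC_dict.get(word, 0) > 200}

filtered_positive_WC_dict = Counter({w: c for w, c in positive_WC_dict.items() if w not in common_words})
filtered_negative_WC_dict = Counter({w: c for w, c in negative_WC_dict.items() if w not in common_words})

# These filtered dictionaries could then be passed to make_class_prediction
# in place of positive_WC_dict and negative_WC_dict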

On the Test Data Set

Now that we can make predictions, let’s classify the reviews in test.csv

with open("test.csv", 'r') as file:
    test = list(csv.reader(file))

predictions = [make_decision(r[0]) for r in test]

Error analysis on predictions

actual = [int(r[1]) for r in test]

from sklearn import metrics

# Generate the ROC curve using scikit-learn
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

# Measure the area under the curve
# The closer to 1 it is, the "better" the predictions
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))
AUC of the predictions: 0.680701754385965
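
Since make_decision returns hard labels rather than scores, it can also be useful to look at plain accuracy alongside the AUC. A quick check (not part of the original analysis) could look like this:

from sklearn import metrics

# Fraction of test reviews whose predicted label matches the actual label
print("Accuracy of the predictions: {0}".format(metrics.accuracy_score(actual, predictions)))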

Naive Bayes implementation in scikit-learn

There are a lot of extensions we could add to this algorithm to make it perform better. We could remove punctuation and other non-word characters. We could remove stopwords, or perform stemming or lemmatization.
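
For instance, a minimal cleaning function along those lines might look like the sketch below. The tiny stopword list is only a placeholder; a real list would be much longer:

import re

# A tiny, illustrative stopword list -- a real one would be far more complete
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "this"}

def clean_text(text):
    # Lowercase, keep only letters and whitespace, then drop stopwords
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("Plot : two teen couples go to a church party, drink and then drive."))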

We don’t want to have to code the entire algorithm out every time, though. An easier way to use Naive Bayes is the implementation in scikit-learn, a Python machine learning library that contains implementations of many common machine learning algorithms.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Generate counts from text using a vectorizer  
# We can choose from other available vectorizers, and set many different options
# This code performs our step of computing word counts
vectorizer = CountVectorizer(stop_words='english', max_df=.05)
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a Naive Bayes model to the training data
# This will train the model using the word counts we computed and the existing classifications in the training set
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])

# Now we can use the model to predict classifications for our test features
predictions = nb.predict(test_features)

# Compute the error
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomal naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
Multinomal naive bayes AUC: 0.635500515995872

scikit-learn let us recreate our >40 lines of code in roughly 8 lines. However, coding Naive Bayes from scratch was invaluable for me to understand the nuts and bolts of the algorithm!