
Stock Sentiment Analysis

Using a Naïve Bayes Classifier

Copy and paste text from a financial news article, e.g. one covering a particular stock or company's performance. This script will determine whether the text carries a more positive or more negative sentiment.


How does this work?
Dictionary
Calculating Probability
Determining Sentiment
Why does this not work?

How does this work?

This script makes a sentiment determination based on probabilities calculated from dictionary data. Below is a brief explanation of how the dictionary is created, how probabilities are calculated, and how sentiment is determined. A more scholarly case study on using the naïve Bayes model for sentiment analysis can be found here.


Dictionary

The dictionary is created from excerpts labeled positive or negative. Consider the example below.

The loyal dog defends and protects his owner. (Positive)
The aggressive dog bit the child. (Negative)

We now have 1 positive example, 1 negative example, and 11 unique words in the dictionary. For each word, we keep a count of how often it appears in a positive example and how often in a negative example.

Dog (1 Positive, 1 Negative) | Loyal (1 Positive, 0 Negative) | Aggressive (0 Positive, 1 Negative) | ...

We continue adding examples to the dictionary, keeping track of the number of positive examples, the number of negative examples, and a count of how many times each word appears in a positive or negative context.
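
As a rough illustration, a minimal Python sketch of this counting step might look like the following (the function and variable names are hypothetical, not taken from the script itself):

    from collections import defaultdict

    def build_dictionary(labeled_examples):
        # For each word, count how many positive and how many negative examples contain it.
        word_counts = defaultdict(lambda: {"positive": 0, "negative": 0})
        example_counts = {"positive": 0, "negative": 0}
        for text, label in labeled_examples:           # label is "positive" or "negative"
            example_counts[label] += 1
            for word in set(text.lower().split()):     # set() counts each word once per example
                word_counts[word][label] += 1
        return word_counts, example_counts

    examples = [
        ("The loyal dog defends and protects his owner", "positive"),
        ("The aggressive dog bit the child", "negative"),
    ]
    counts, totals = build_dictionary(examples)
    # counts["dog"]   -> {"positive": 1, "negative": 1}
    # counts["loyal"] -> {"positive": 1, "negative": 0}

Counting each word at most once per example (via set()) matches the "Number of Positive / Negative Examples Containing w" counts used in the probability formulas below.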


Calculating Probability

We want to compare the probability that an article is positive with the probability that it is negative. Given an article containing the words w1 w2 ... wn, and assuming the words occur independently of one another (the "naïve" assumption), we score each class as:

P( positive | article ) ∝ P( positive ) * P( w1 | positive ) * P( w2 | positive ) * ... * P( wn | positive )

P( negative | article ) ∝ P( negative ) * P( w1 | negative ) * P( w2 | negative ) * ... * P( wn | negative )

(By Bayes' theorem, each expression should also be divided by P( article ), but since that denominator is the same for both classes it can be dropped when comparing them.)

where

P( positive ) = Number of Positive Examples / Total Number of Examples

P( negative ) = Number of Negative Examples / Total Number of Examples

For each word w, we use Laplace smoothing to calculate P( w | positive ) and P( w | negative ):

P( w | positive ) = ( Number of Positive Examples Containing w + α ) / ( Number of Positive Examples + α * K )

P( w | negative ) = ( Number of Negative Examples Containing w + α ) / ( Number of Negative Examples + α * K )

α is a constant used for Laplace smoothing (such as 1, 10, 100, or 1000). It compensates for situations where word w is not in the dictionary, i.e. where Number of Positive Examples Containing w = 0, which would otherwise force the entire product to zero. In this script, α = 1.

K is the number of possible outcomes. Since there are only two possible results, positive or negative, K = 2. The remaining values are taken directly from the dictionary data.
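
Continuing the hypothetical sketch above, the priors and the Laplace-smoothed conditional probabilities could be computed like this (again, names are illustrative; α = 1 and K = 2 as described):

    ALPHA = 1   # Laplace smoothing constant (this script uses 1)
    K = 2       # two possible results: positive or negative

    def prior(label, example_counts):
        # P( label ) = number of examples with that label / total number of examples
        return example_counts[label] / sum(example_counts.values())

    def word_probability(word, label, word_counts, example_counts):
        # P( w | label ) with Laplace smoothing, as defined above
        containing = word_counts[word][label] if word in word_counts else 0
        return (containing + ALPHA) / (example_counts[label] + ALPHA * K)

Note that a word never seen in the dictionary still receives a small nonzero probability of ALPHA / (example_counts[label] + ALPHA * K), which is exactly what the smoothing is for.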

Now we are able to calculate all the necessary probabilities and multiply them together to find P( positive | article ) and P( negative | article ). The problem is that for articles with many words, the product of many small decimal probabilities rapidly approaches zero. This underflows the precision limits of IEEE floating-point numbers and makes comparing the positive and negative probabilities difficult.

In order to overcome this problem, we use log probabilities. Recall from the graph of the logarithm that log( x ) is negative for any x where 0 < x < 1. Therefore, the logarithm of any probability less than 100% is a negative number.

Also recall the Product Rule of logarithms: for any probabilities P1 and P2,

log( P1 * P2 ) = log( P1 ) + log( P2 )

Instead of multiplying extraordinarily tiny probabilities, we can simply add their logarithms together to make comparing them easier.
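
A tiny numeric illustration of why the log trick matters: multiplying a couple hundred small probabilities underflows to zero in double precision, while the equivalent sum of logarithms remains a perfectly usable number.

    import math

    probs = [0.01] * 200            # pretend the article has 200 words, each with probability 0.01

    product = 1.0
    for p in probs:
        product *= p                # 0.01 ** 200 = 1e-400, below what a double can represent
    print(product)                  # prints 0.0 -- the product underflowed

    log_sum = sum(math.log(p) for p in probs)
    print(log_sum)                  # about -921.03, still easy to compare with another class's score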


Determining Sentiment

We now have log( P( positive | article ) ) and log( P( negative | article ) ). As per the properties of logarithms mentioned above, whichever log probability is greater (less negative) corresponds to the more likely result. Finally, we are able to determine a positive or negative overall sentiment for a sample of text.
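
Putting the pieces together, a hypothetical classify function (building on the earlier sketches) sums the log of each term for both classes and picks the larger score:

    import math

    def classify(text, word_counts, example_counts):
        # Sum log( P( label ) ) + log( P( w | label ) ) for every word, for both labels.
        scores = {}
        for label in ("positive", "negative"):
            score = math.log(prior(label, example_counts))
            for word in set(text.lower().split()):
                score += math.log(word_probability(word, label, word_counts, example_counts))
            scores[label] = score
        return max(scores, key=scores.get)   # the greater (less negative) score wins

    print(classify("The dog protects the child", counts, totals))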


Why does this not work?

There are various pitfalls with using this method to conduct sentiment analysis:

  • It only cares about words and how often they appear. It is particularly vulnerable to the "sandwich" method of presenting information, where good news is followed by bad news and then more good news. Because the model cannot understand the actual meaning and intent of an article, it can produce inaccurate sentiment determinations.
  • The model does not understand negation, e.g. "not good" or "not bad". It is possible to modify the model to handle negation properly.
  • Stop words, i.e. words with no real sentiment, are counted like any other word. Words like "the", "it", and "that" provide little sentiment information and can lead to inaccurate determinations if they are considered. In this script, certain stop words are removed, as are any words under 3 letters long (see the filtering sketch after this list).
  • The same word can appear in the dictionary in multiple forms, such as review, reviews, reviewing, and reviewed. Stemming and lemmatization are two ways to mitigate this issue.
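
As an illustration of the stop-word and short-word filtering described above, here is a minimal sketch; the stop-word list shown is a tiny made-up sample, not the script's actual list:

    STOP_WORDS = {"the", "and", "his", "that", "it"}   # illustrative subset only

    def tokenize(text):
        # Lowercase, split on whitespace, then drop stop words and words under 3 letters.
        return [word for word in text.lower().split()
                if len(word) >= 3 and word not in STOP_WORDS]

    print(tokenize("The aggressive dog bit the child"))
    # ['aggressive', 'dog', 'bit', 'child']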

A good article on mitigating these issues can be found here. Despite these issues, there are benefits to using the naïve Bayes classifier for sentiment analysis. It is quite fast compared to more complex machine learning models, particularly if the probabilities for each word in the dictionary are pre-calculated. This is why it is commonly used to determine whether or not an email is spam.


© Christian Carpenter