# Stock Sentiment Analysis Using a Naïve Bayes Classifier

Copy and paste text from a financial news article, e.g. one covering a particular stock or company's performance. This script will determine whether the text carries a more positive or a more negative sentiment.
## How does this work?

This script makes a sentiment determination based on probabilities calculated from dictionary data. Below is a brief explanation covering how the dictionary is created, how probabilities are calculated, and how sentiment is determined. A more scholarly case study on using the naïve Bayes model for sentiment analysis can be found here.

### Dictionary

The dictionary is created from excerpts labeled positive or negative. Consider the example below.
The loyal dog defends and protects his owner. (Positive)

The aggressive dog bites and attacks his owner. (Negative)

We now have 1 positive example, 1 negative example, and 11 unique words in the dictionary. For each word, we keep a count of how often the word appeared in a positive example and in a negative example:

Dog (1 Positive, 1 Negative) | Loyal (1 Positive, 0 Negative) | Aggressive (0 Positive, 1 Negative) | ...

We continue adding examples to the dictionary, keeping track of the number of positive examples, the number of negative examples, and a count of how many times each word appeared in a positive / negative context.

### Calculating Probability

We want to compare the probability that an article is positive with the probability that it is negative. That is, given a query of words w1 w2 ... wn, we want to find:

P( positive | article ) = P( positive ) * P( w1 | positive ) * P( w2 | positive ) * ... * P( wn | positive )

P( negative | article ) = P( negative ) * P( w1 | negative ) * P( w2 | negative ) * ... * P( wn | negative )

where

P( positive ) = Number of Positive Examples / Total Number of Examples

P( negative ) = Number of Negative Examples / Total Number of Examples

For each word w, we use Laplace smoothing to calculate P( w | positive ) and P( w | negative ):

P( w | positive ) = ( Number of Positive Examples Containing w + α ) / ( Number of Positive Examples + α * K )

P( w | negative ) = ( Number of Negative Examples Containing w + α ) / ( Number of Negative Examples + α * K )

α is a constant used for Laplace smoothing (such as 1, 10, 100, or 1000). It compensates for situations where word w is not in the dictionary, i.e. where the number of positive examples containing w is 0, which would otherwise force the entire product to zero. In this script, a value of 1 is used for α. K is the number of possible outcomes; since there are only two, positive or negative, K = 2. The rest of the values are simply taken from the dictionary data. For example, with the dictionary above and α = 1: P( loyal | positive ) = ( 1 + 1 ) / ( 1 + 1 * 2 ) = 2/3, while a word never seen in a positive example gets ( 0 + 1 ) / 3 = 1/3 instead of 0.

Now we are able to calculate all the necessary probabilities and multiply them together to find P( positive | article ) and P( negative | article ). The problem is that for articles with many words, the product quickly approaches zero as we multiply the decimal probabilities together. This tests the limits of IEEE 754 floating-point numbers and makes comparing the positive and negative probabilities difficult. To overcome this problem, we use log probabilities. The logarithm of any x with 0 < x < 1 is negative, so the logarithm of any probability less than 100% is negative. Also recall the product rule of logarithms, where P1 and P2 are probabilities:

log( P1 * P2 ) = log( P1 ) + log( P2 )

Instead of multiplying extraordinarily tiny probabilities, we can simply add their logarithms together, which makes comparing them easier.

### Determining Sentiment

We should now have log( P( positive | article ) ) and log( P( negative | article ) ). Per the properties of logarithms mentioned above, we choose whichever log probability is greater (less negative) as the more likely result. Finally, we are able to assign an overall positive or negative sentiment to a sample of text. The whole pipeline is sketched in the code below.
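To make the walkthrough concrete, here is a minimal sketch of the dictionary-building step in Python. The original script's source isn't shown on this page, so the class and method names (`SentimentDictionary`, `add_example`) are illustrative assumptions rather than the author's actual code:

```python
from collections import defaultdict

class SentimentDictionary:
    """Tracks how many positive/negative examples each word appears in."""

    def __init__(self):
        self.positive_examples = 0
        self.negative_examples = 0
        # word -> [positive examples containing it, negative examples containing it]
        self.word_counts = defaultdict(lambda: [0, 0])

    def add_example(self, text, positive):
        """Record one labeled excerpt in the dictionary."""
        if positive:
            self.positive_examples += 1
        else:
            self.negative_examples += 1
        # Count each unique word once per example, matching the
        # "Number of Examples Containing w" quantity defined above.
        words = {w.strip(".,!?").lower() for w in text.split()}
        for word in words:
            self.word_counts[word][0 if positive else 1] += 1

d = SentimentDictionary()
d.add_example("The loyal dog defends and protects his owner.", positive=True)
d.add_example("The aggressive dog bites and attacks his owner.", positive=False)
```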
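Scoring an article then follows the smoothed log-probability calculation described above, with α = 1 and K = 2 as in the script. This continues the previous sketch (it uses the `SentimentDictionary` instance `d` built there), and the `classify` function is likewise an assumed name, not the author's implementation:

```python
import math

ALPHA = 1  # Laplace smoothing constant
K = 2      # two possible outcomes: positive or negative

def classify(d, article):
    """Return 'positive' or 'negative' for an article, given a SentimentDictionary d."""
    total = d.positive_examples + d.negative_examples
    # Start with the log priors: log P(positive) and log P(negative).
    log_pos = math.log(d.positive_examples / total)
    log_neg = math.log(d.negative_examples / total)
    for word in article.split():
        word = word.strip(".,!?").lower()
        pos_count, neg_count = d.word_counts.get(word, (0, 0))
        # Add log likelihoods instead of multiplying raw probabilities,
        # so long articles don't underflow to zero.
        log_pos += math.log((pos_count + ALPHA) / (d.positive_examples + ALPHA * K))
        log_neg += math.log((neg_count + ALPHA) / (d.negative_examples + ALPHA * K))
    # The greater (less negative) log probability wins.
    return "positive" if log_pos > log_neg else "negative"

print(classify(d, "The loyal dog protects his owner"))  # positive
```

Note that unseen words still contribute a nonzero smoothed likelihood to both classes, which is exactly what the Laplace smoothing term guarantees.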
## Why does this not work?

There are various pitfalls to using this method for sentiment analysis, chief among them the "naïve" independence assumption: words are treated as if they occur independently of one another, so word order, negation, and context are ignored. A good article on mitigating these issues can be found here.

Despite these issues, there are benefits to using the naïve Bayes classifier for sentiment analysis. It is quite fast compared to more complex machine learning models, particularly if you pre-calculate the probabilities for each word in the dictionary. This is why it is commonly used to determine whether or not an email is spam.

© Christian Carpenter