Sunday, December 27, 2020

Training a Stochastic Gradient Descent Classifier on Labeled Tweets

I had just looked at the relationship between parts of speech and accuracy, and next I wanted to train a classifier to be used specifically on tweets.


This post is a write-up of my decisions and the process I used to train a classifier on the labeled tweet data. I used this dataset from Kaggle of tweets labeled by sentiment as my source.


The first part of building a model is to explore the data. 


Screenshot of a section of the raw data from Kaggle.


This contained more information than I needed, so first I had to clean it up into a new file.


I only care about the body of the tweet and the Positive or Negative sentiment, so my first step was to write a script to simplify the data into a file with only the tweet text and the sentiment. You can see the method I wrote to simplify the data here.
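
For illustration, a minimal sketch of that simplification step might look like the code below. It assumes the raw file is a CSV whose first column is the sentiment label and whose last column is the tweet text; the column positions and file names are assumptions for this sketch, not copied from my actual script.

    import csv

    def simplify_tweets(in_path="raw_tweets.csv", out_path="simplified_tweets.csv"):
        # Keep only the tweet body and its sentiment label, dropping every other column.
        with open(in_path, newline="", encoding="latin-1") as raw, \
             open(out_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            for row in csv.reader(raw):
                sentiment, text = row[0], row[-1]   # assumed column positions
                writer.writerow([text, sentiment])

    simplify_tweets()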


After running that script, the data looked like this.


First, I tried using a Linear Support Vector Machine since I already knew how to use it, but after letting it run for a few hours I got this error. 



Internally, the LinearSVC had converted a vector of Booleans into a vector of floats. That unfortunately would take much more memory than I had access to on my laptop. This led me to look for a classifier that I could train in batches rather than all in a single go. 



I ended up settling on the Stochastic Gradient Descent Classifier (SGDClassifier) from the sklearn module.

I chose this because it is a common algorithm for text classification and it has a built-in partial_fit() method. This would let me train in batches so I did not have to load the entire training data into RAM. 


The first step was to convert a string of text into a very long boolean vector. This, along with the target, was then passed into the partial_fit() method as a pair. 

The code I wrote to convert a (string, classification) pair into a long boolean input vector and a boolean classification. 
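
The screenshot above is my actual method; the snippet below is only a simplified stand-in for the same idea, where word_features is the list of the N most frequent words and the "Positive" label string is an assumption.

    def vectorize(tweet_text, classification, word_features):
        # One boolean per feature word: True if that word appears in the tweet.
        tweet_words = set(tweet_text.lower().split())
        input_vector = [word in tweet_words for word in word_features]
        # Boolean target: True for a positive tweet, False for a negative one.
        target = (classification == "Positive")   # assumed label string
        return input_vector, target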


After writing this I let it run; it took about an hour to build a boolean vector of length 5,000 for each tweet and about two hours for vectors of length 10,000. 


Interestingly, since I was training the classifier in batches, after each training session I could compute the accuracy on a subset of the yet-unseen data. I wrote a script to write out ten accuracy scores to a file after each training session. I did this using both 5,000 words and 10,000 words as features. The results are shown in the graph below. 
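
A rough sketch of that batch-training loop, using scikit-learn's SGDClassifier and partial_fit(). The batch size, file name, label string, and the idea of holding out the first chunk for scoring are assumptions for this sketch; word_features is the feature list built earlier.

    import csv
    from sklearn.linear_model import SGDClassifier

    def load_batches(path, word_features, batch_size=100_000):
        # Yield (X, y) chunks so the whole file never has to sit in RAM at once.
        X, y = [], []
        with open(path, newline="", encoding="utf-8") as f:
            for text, sentiment in csv.reader(f):
                words = set(text.lower().split())
                X.append([w in words for w in word_features])
                y.append(sentiment == "Positive")     # assumed label string
                if len(X) == batch_size:
                    yield X, y
                    X, y = [], []
        if X:
            yield X, y

    classifier = SGDClassifier()
    accuracy_log = []
    batches = load_batches("simplified_tweets.csv", word_features)
    heldout_X, heldout_y = next(batches)              # keep one unseen chunk for scoring
    for batch_X, batch_y in batches:
        classifier.partial_fit(batch_X, batch_y, classes=[False, True])
        accuracy_log.append(classifier.score(heldout_X, heldout_y))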

It looks like the vast majority of the learning happens early on and the accuracy plateaus after about 400k tweets. This logarithmic learning curve is consistent with what I have seen elsewhere and with the other tests I did earlier in this series. 


After I created this data I uploaded it to a local SQL server on my laptop to run some basic queries. 



I wanted to compare the accuracy when the training size is larger. 



After training the classifier on a million tweets, it got up to about 82% accuracy. 

On average, after training on almost all of the data, this classifier was about 82% accurate. This is about as good as you can expect with sentiment analysis, since the best algorithms are only 80-85% accurate.


There is not very much benefit in those last 5,000 features, so when I roll this out for use on live tweets I might choose to use a smaller feature size. 


Later on, I might want to manually prune word_features to remove words that do not communicate meaning. 

This is an excerpt of the word_features used in the Num_features=10,000 run. Many of these words clearly do not communicate sentiment to a human but were still treated as features by the classifiers. I doubt the word ‘as’ tells you anything about sentiment, so ignoring it would probably improve the classifier. For this post I did not remove any words from word_features. Later on it might make more sense to build a custom list of stop words for Twitter. 


Time it took for each partial_fit() when Num_features is 10,000 and the training size is 100,000.



The accuracy rates from training on 95% of the data using Num_features=10,000. These are respectable accuracy scores.  


Next, I did some spot checks to make sure that sentences that were obviously negative would be classified as negative and vice-versa. 





This is clearly not a positive sentiment, so something must be wrong. I did a bunch more tests and I kept getting ‘Positive’ on every sentence I tried.  


So, to see whether this was just anecdotal and I was getting unlucky, I wrote a script to query Twitter and scrape tweets written in English containing the word “hate”. If my algorithm was worth its salt, most of these tweets would be labeled as negative. 
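
The scraping code itself is not shown here; the sketch below just shows the shape of the check, with fetch_recent_tweets() and predict_sentiment() as hypothetical stand-ins for the Twitter client and the trained classifier wrapper.

    from collections import Counter

    def hate_tweet_check(predict_sentiment, fetch_recent_tweets):
        # If the classifier is any good, tweets containing "hate" should skew heavily negative.
        labels = Counter(
            predict_sentiment(tweet)   # expected to return 'Positive' or 'Negative'
            for tweet in fetch_recent_tweets(query="hate", lang="en", limit=3000)
        )
        print(labels.most_common())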




I let this run for about 3,000 tweets. Much to my chagrin every single tweet was labeled as positive. 


So now I had a large amount of debugging to do. First, I wanted to check that the process of converting a tweet into a vector was working properly. To this end I wrote an un_vectorize() method that takes a vector and spits out its ‘bag of words’.
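
The idea behind un_vectorize() is simple enough to show in a few lines (this is a simplified version, not the exact code from my repo):

    def un_vectorize(vector, word_features):
        # Map every True position back to its feature word to recover the 'bag of words'.
        return [word for word, present in zip(word_features, vector) if present]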




I ended up saving the word_features, along with the accuracy scores, inside the VoteClassifier object.






Source


Eventually I figured out that the problem was where I wrote:


    if classification:
        return 'Positive'
    else:
        return 'Negative'


The variable classification was a string by the time it reached this section of code. In Python, any non-empty string is truthy, so this check always took the ‘Positive’ branch. After about 4 hours of debugging, I fixed it by rewriting a single line. 
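
A small illustration of the bug and the kind of one-line fix; the exact label values in my data are not shown in this post, so treat the comparison below as an assumption.

    classification = "Negative"

    # The bug: any non-empty string is truthy, so this always prints 'Positive'.
    print("Positive" if classification else "Negative")

    # The fix: compare against the actual label instead of relying on truthiness.
    print("Positive" if classification == "Positive" else "Negative")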


 



I did some spot checks on sentences I wrote and it all looked fine. 


A vector that contains only False values means the tweet contained only words that my method has never seen before. In that case there is no good way to classify it, so I wrote a clause to return ‘Unsure no known features’ rather than a meaningless classification. 
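
A sketch of that guard clause, with the surrounding classify method simplified:

    def classify(vector, classifier):
        # If no known feature word appears, any prediction would be meaningless.
        if not any(vector):
            return "Unsure no known features"
        return "Positive" if classifier.predict([vector])[0] else "Negative"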


Next, I ran some tests on tweets containing “hate” and “love”. Almost all of the tweets with the word “hate” were classified as negative and almost all of those containing “love” were classified as positive. This means the classifier passed the smell test on live, raw tweets. 


Looking at a live feed of tweets with the word “love” I got this gem:

“I learned this morning that my parents unconditional love expires on New Years Eve” was mislabeled as ‘Positive’ sentiment. 


After finishing the debugging and training of the SGDClassifiers, I now had several different VoteClassifiers pickled on my laptop that were, at least at first glance, ready to be used on real-world Twitter data. 



Tuesday, December 22, 2020

What is the relationship between Part of Speech and Accuracy?

 

 

After removing stop words in the last post, I was curious about which parts of speech impacted accuracy the most. For example, if only verbs were treated as features, would the accuracy be better or worse than if my model considered every word? I tested several different parts of speech and, when there were enough unique features, took some screenshots of the results below. 


The first section covers the decisions I made in training the classifiers. The second section covers some theoretical considerations about the limits of this type of model.


There are a couple of ways of parsing by part of speech. I chose to use the nltk.pos_tag universal tagset since it makes the code more readable and was specific enough for my aims.
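
For reference, this is what the universal tagset looks like on a short sentence; this is standard NLTK usage rather than code from my repo.

    import nltk
    # one-time downloads: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
    # nltk.download('universal_tagset')

    tokens = nltk.word_tokenize("The movie was very good.")
    print(nltk.pos_tag(tokens, tagset="universal"))
    # roughly: [('The', 'DET'), ('movie', 'NOUN'), ('was', 'VERB'),
    #           ('very', 'ADV'), ('good', 'ADJ'), ('.', '.')]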


I chose to test nouns, verbs, adverbs, adjectives, determiners, auxiliaries, and punctuation. I then compared the results to not limiting by part of speech at all. For each model, I set num_features=1000 since it made the training faster and still gave a reasonable accuracy rate. 


Loop to generate a list of AlgoParams to compare accuracy.
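
The screenshot shows my actual loop; the sketch below is only the general shape. The AlgoParams constructor arguments and the tag strings follow how I describe the class and the parts of speech in these posts, so treat them as assumptions rather than the exact names from the repo.

    # None means "do not limit by part of speech" and serves as the baseline.
    parts_of_speech_to_test = ["NOUN", "VERB", "ADV", "ADJ", "DET", "AUX", "PUNCT", None]

    params_to_test = []
    for tag in parts_of_speech_to_test:
        # One AlgoParams per part of speech, with everything else held constant.
        params_to_test.append(
            AlgoParams(positive_reviews, negative_reviews,
                       remove_stop_words=False, num_features=1000, part_of_speech=tag)
        )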


The way these algorithms determine something as fuzzy as sentiment from text is to first convert the text into a list of features. This is kind of a self-referential definition, but the way I think of features is as words that convey meaning about sentiment.


This is clearest to understand when looking at Naive Bayes. In that algorithm each word is treated as a feature, and the weight of each feature corresponds to how frequently it occurs in training examples labeled as positive or negative. This means that words that occur many times in examples labeled negative get a high negative weight assigned to them. If such a word then occurs in an unseen example, the classifier is more likely to call that example negative.


I used the 'bag of words' approach for simplicity. This approach ignores the ordering of words. It reduces every possible sentence to a series of 1,000 booleans representing whether each feature is present in a review. In practice this means that each sentence is converted into a 1,000-long vector of almost all False with a few True entries. 


My assemble_word_features() method treats the 1,000 most frequent words in all_words as features. All of the reviews were then converted into feature sets based on those most frequent words. 
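
A simplified sketch of what assemble_word_features() and the feature-set conversion do (my actual methods also handle the stop-word and part-of-speech options):

    import nltk

    def assemble_word_features(all_words, num_features=1000):
        # Rank every word by frequency and keep the top num_features as the feature list.
        freq = nltk.FreqDist(w.lower() for w in all_words)
        return [word for word, count in freq.most_common(num_features)]

    def find_features(document_words, word_features):
        # A review becomes a dict of feature -> bool, the format NLTK classifiers expect.
        words = set(w.lower() for w in document_words)
        return {word: (word in words) for word in word_features}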


When I tried to treat only punctuation (“PUNCT”) as features, there were not 1,000 unique words labeled as punctuation. Since I had already established that there is a logarithmic relationship between the size of word_features and accuracy, I decided that with so few unique features the accuracy would be low. Therefore, when there were not 1,000 unique words for a part of speech, I did not train a classifier for it. 


The way to interpret these results is to compare the accuracy across the different parts of speech. I expected not limiting by part of speech to be more accurate than any single part of speech. The simple explanation is that if you feed less total data into the classifiers, they tend to be less accurate, at least on the testing data. 


Because of how the binary classification algorithms are set up, each classifier must be between 50%-100% accurate. This makes intuitive sense because if a classifier was 20% accurate, you could just swap the labels causing it to become 80% accurate. 


The final accuracy score is both a function of how good the classifiers are at finding the relationship between the chosen features and sentiment, and of how much of a relationship there actually is between the chosen features and sentiment. There is an upper bound below 100% accuracy since, in natural language, sentiment is not always clearly positive or negative but is sometimes ambiguous. 


Two humans can look at the sentence:


“I liked the acting but did not like the cinematography”


In good faith, they can disagree about whether the sentiment is positive or negative. If we were to find the author of that review and ask, “Did you have positive or negative sentiment when you wrote this?”, first, they would look at you strangely since that is quite an odd question, and second, it is not even clear there is a correct answer they could give. Because some reviews are ambiguous, no algorithm could be 100% accurate. 


Accuracy is also limited by the actual relationship between features and sentiment. If there is no relationship (e.g., the sentiment labels are completely random), accuracy should always be a coin flip. I tested this by randomly labeling reviews as positive or negative and looking at the accuracy of the classifier. 


Code to randomly label documents 
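
The random labeling itself is essentially a one-liner with random.choice; a sketch of the idea, assuming the documents are (words, label) pairs:

    import random

    def randomly_label(documents):
        # Throw away the real labels so any measured accuracy reflects pure chance.
        return [(review_words, random.choice(["Positive", "Negative"]))
                for review_words, _real_label in documents]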




You can also see that the accuracy here is always close to 50%. I would expect that if I ran this a large number of times, the scores would form a normal distribution around 50%. See the test results here. 


There are also context-specific biases to consider. For example, if my training data contains several glowing reviews of a Marvel movie, the classifiers could learn to associate “Marvel” with positive sentiment even though that relationship might not exist in the broader world. If I then used these classifiers to look at the sentiment of tweets about a later Marvel movie that happened to be a flop, the model would show an artificially high rate of positive sentiment. 


In that case it would be better to limit word_features to words that convey sentiment in the most universal contexts rather than trying to apply context-specific sentiment learned from movie reviews. This might mean only looking at adjectives, or only looking at adverbs. 


There are a variety of features that I would expect to have little or no predictive power. I included them to verify that my intuition matched the output. For example, I did not expect determiners (words like “an”, “this”, and “those”) to communicate anything about sentiment, so if I treated only determiners as features I would expect the accuracy to be close to 50%.



Determiners do not have any sentiment prediction power. This is consistent with what I expected. 


Nouns are better at determining sentiment than the other parts of speech. This was surprising since I typically think of nouns as value-neutral: e.g., “movie” does not convey sentiment, but “good movie” and “bad movie” do. 


It surprised me that adjectives are less predictive than nouns.  



Accuracy when not limiting by part of speech. This is the standard against which the accuracy of the other parts of speech is compared. 


In conclusion, limiting by part of speech was strictly less accurate than not limiting by part of speech. This was the case for my training and testing data, but it is unclear whether it would be better or worse for looking at Twitter sentiment. There are reasons to think it would be better to limit to only adjectives when looking at Twitter, even though that has a lower accuracy rate on the training data. Without a prelabeled set of (tweet, sentiment) pairs on which to compare the different classifiers, it is difficult to answer the question statistically. 


Fortunately, I found a dataset on Kaggle that would be much better to use as a training set for looking at sentiment on Twitter. I will be writing up that process and my decisions in training a classifier on that data in the next post.  


Monday, December 14, 2020

Algorithm Parameter Evaluation: Stop Words and Number of Features

When I was working through this project, there were a few questions that I wanted to investigate further. I used the labeled short reviews that were provided in this lecture series. See the labeled dataset here.

What is the impact of removing “stop words” on accuracy? 


What is the impact of the number of words as features on accuracy?


Before answering these questions I needed to write some new code.


I created a data class called AlgoParams that held the positive and negative examples, a Boolean for whether to remove stop words, how many words to treat as features, and what parts of speech to consider. I don’t filter by parts of speech in this post, but I do later on. This made it easier to tease out the relationship between the different parameters and accuracy.
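
A sketch of what AlgoParams holds; the field names are my reconstruction for this post rather than the exact names in the repo.

    from dataclasses import dataclass

    @dataclass
    class AlgoParams:
        positive_examples: list
        negative_examples: list
        remove_stop_words: bool
        num_features: int
        part_of_speech: str = None   # None means "do not filter by part of speech"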


I refactored the dataLabeling.py module to work with AlgoParam objects. I should have noticed this when I refactored it, but I had hard-coded the number of features at 3,000. This caused me a headache since the first few test runs were not actually testing any differences. I eventually noticed this, and to see if there was anything else I was missing, I tested some AlgoParam objects that were intended to have terrible accuracy.


Lastly, I wrote a method to compare the accuracy of the different algorithms, which I called writeAlgoEvaluation(param, classifiers, FeatureSets). It wrote out the details of an AlgoParam and the accuracy of the different types of classifiers. For some of the runs I made it write in a more human-friendly way, while for others I wrote it out more like a .csv file. Look at all of the test results here.




When I removed stop words, there was a negligible effect on accuracy. This is consistent with intuition and theory. 


When N=1, the accuracy is a coin flip. This is what I expected since knowing that a single word is present in a review is not enough to tell if that review is positive or negative. If I tell you that a movie review contains the word “movie” do you really know anything about the review?


These were the parameters used by the lecturer and they were about 76% accurate. All things considered this is a respectable accuracy. 


Next, I looked at the relationship between the number of words treated as features (N) and accuracy. I tested values of N from 100 to 5,000 in steps of 100, with both 90-10 and 80-20 train/test splits. This took about 6 hours on my laptop and gave me 600 different accuracy scores. 


Sample output of effect of number of words as features (N) on accuracy in a .csv format. 


After it finished running I created a scatter plot in Google Sheets of the results. You can download the results here.

The accuracy plateaued at about 75%. Ironically, this was the same as the accuracy I would have gotten if I just used the same parameters as the lecturer. 


I did not expect this, but the variance in accuracy tended to increase with N. 


My explanation for why variance tends to increase with N is that when N is larger, less frequent words are treated as features, and those features contain more random noise.


Imagine the word “refreshing” occurs 500 times in positive reviews and 100 times in negative reviews. Then imagine the word “Spanish” occurs 5 times in positive reviews and 1 time in negative reviews. If I tell you a review contains “refreshing”, that is stronger evidence that the review is positive than if I tell you a review contains “Spanish.”


When N is small, it captures more words like “refreshing”; when N is large, it captures more words like “Spanish”. It is reasonable to expect that when wordFreq=600 the impact of randomness is smaller than when wordFreq=6, so it makes sense that when N is larger the feature set contains more relationships that are just random noise.


There was a small difference in variance between the 80-20 and 90-10 splits.


The average variance of the 80-20 split was 2.01 with a standard deviation of 1.64. The average variance of the 90-10 split was 1.83 with a standard deviation of 1.30. The difference is small and mostly disappears if you remove the outlier in the 90-10 split at N=2900. Overall, the difference is negligible, as both follow the same general trend. 


After a certain point, around N=1500, there are only negligible gains in accuracy but a clear increase in variance. For most purposes, an N value of about 1,500 is about as good as you can expect given these classifiers and this dataset. 


Wednesday, December 9, 2020

Learning NLTK in Python: Introduction

 Introduction


In late October 2020, I stumbled across a series of introductory lectures on Natural Language Processing using the NLTK library for Python. Since I was just finishing a QuickBooks data-cleaning project at the time, I saw this as an opportunity to learn more Python and get at least an introduction to extracting insights from large amounts of raw text. Here is a link to the whole project on GitHub.


In this series of blog posts, I am going to write up the interesting things I learned and explain the process I went through to create the final project: querying Twitter to analyze sentiment. When I published this post I was on lecture 18 of 21 and had yet to finish the entire project, but after finishing my DetermineIdealAlgoParamas.py program to test which parameters are best to consider, I figured now would be a good time to write up the process and what I have learned so far. 


Initially, I decided to work through the lectures by creating a new file for every lecture and duplicating the code from the previous lecture into the new file. As I got further into the series it became clear that a large portion of the scripts could be rewritten as methods and then referenced in later lectures. There were also some things that piqued my curiosity that I wanted to experiment with myself, beyond what was covered in the lectures. So I divided my repository into subfolders:


  •  “LectureNotes” for the code directly from lecture

  • “OwnPrograms” where I would keep the things I was writing on my own

    • This turned out to be used for debugging, trying different parameters to train the classifiers, and storing pickled objects.

  • “custom_NLTK_utils” where I was keeping the common methods. 

    • In addition, when I was rewriting the scripts from lecture into their own methods, there were a few things that I was unclear on.

    • This gave me a chance to go through them and really try and parse what the methods were doing.

    • In particular, after taking some time off from this project to focus on my school work, I came back, looked at dataLabeling.find_features(document, word_features), and did not have the slightest idea what the method did. I had to go back and rewatch that lecture to see what it was doing. 

 

New Technical Terms 


This was the first Machine Learning (ML) project I had ever done, so I had to learn a whole set of new terms. Before, I had kind of thought of ML as a magical black-box prediction machine where data was fed in one side and predictions were spat out the other. What the data and predictions looked like was rather hazy. 


I now know that the project I was working on was a supervised learning Boolean classification problem for text. In layman's terms, I had a bunch of movie reviews that were labeled as “Positive” or “Negative”.


The goal then was to find a way to convert the large (n ≈ 10,000) labeled data set into a program that would let me call SentimentDeterminer.predictSentiment(“It was boring and unoriginal”) and the program would do its best to classify it as a negative review. 


In the abstract, I had a large number of (input, output) pairs, and the goal was to tease out why some inputs corresponded to some outputs.


For example, I would want to write a program to look at “It was boring and unoriginal”, figure out that the relevant terms were "boring" and "unoriginal" and use that to classify the review as negative.



Algo.train(TrainingSet)


The most striking thing about doing this entire project was how trivial the code for training the classification algorithm actually was.


Here is an excerpt from one of the earlier lectures:



This struck me as absolutely ridiculous. The syntax to train an algorithm was a single line of code. 


The entire script, from getting the text out of nltk.corpus to training a simple Naive Bayes classifier, was only about 40 lines of Python. After this lecture I felt like writing machine learning programs was actually within my grasp.
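
To give a sense of that scale, here is a condensed sketch of that kind of script using the NLTK movie review corpus. It follows the shape of the lecture code rather than reproducing it exactly.

    import random
    import nltk
    from nltk.corpus import movie_reviews
    # one-time download: nltk.download('movie_reviews')

    # Build (word_list, label) pairs for every review in the corpus.
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    # Treat the 3,000 most frequent words as features.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(3000)]

    def find_features(document_words):
        words = set(w.lower() for w in document_words)
        return {w: (w in words) for w in word_features}

    # Convert reviews to feature sets and split into train/test.
    feature_sets = [(find_features(words), label) for words, label in documents]
    training_set, testing_set = feature_sets[:1900], feature_sets[1900:]

    # The single line that trains the classifier.
    classifier = nltk.NaiveBayesClassifier.train(training_set)

    print("Accuracy:", nltk.classify.accuracy(classifier, testing_set))
    classifier.show_most_informative_features(15)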


Pickling 


Python has a library called pickle that lets you store Python objects as byte files that can be loaded later. It takes a non-trivial amount of computational work (and therefore time) to train the algorithms, so this was a way to save time.


I learned that it is often faster to run the computations to train the algorithm once, pickle the trained classifier objects, and just load them in later. That saves a lot of time (not to mention code complexity) since it replaces the entire training process with simply reading from a file.


Here is the code to read in a list of classifiers I trained earlier with an abbreviated path name.
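
A sketch of that load step; the folder and file name below are placeholders standing in for the abbreviated path.

    import pickle

    # Load previously trained classifiers instead of retraining them from scratch.
    with open("pickled_classifiers/classifier_list.pickle", "rb") as f:
        classifier_list = pickle.load(f)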


This was much faster and I expect to use it as part of the program to query Twitter. 

