Introduction
In late October 2020, I stumbled across a series of introductory lectures on Natural Language Processing using the NLTK library for Python. Since at the time I was just finishing a QuickBooks data cleaning project, I saw this as an opportunity to learn more Python and get an introduction to extracting insights from large amounts of raw text. Here is a link to the whole project on Github.
In this series of blog posts, I am going to write up the interesting things I learned and explain the process I went through to create a final project: querying Twitter to analyze sentiment. When I published this post I was on lecture 18/21 and had yet to finish the entire project, but after finishing my DetermineIdealAlgoParamas.py program, which tests which parameter values work best, I figured now would be a good time to write up the process and what I have learned so far.
Initially, I decided to work through the lectures by creating a new file for each one and duplicating the code from the previous lecture into the new file. As I got further into the series, it became clear that a large portion of the scripts could be rewritten as methods and then referenced in later lectures. There were also some things that piqued my curiosity and that I wanted to experiment with myself, beyond what the lectures covered. So I divided my repository into subfolders:
“LectureNotes” for the code taken directly from the lectures
“OwnPrograms” for the things I was writing on my own
This folder ended up holding debugging scripts, experiments with different classifier training parameters, and pickled objects.
“custom_NLTK_utils” for the common methods.
In addition, when I was rewriting the scripts from lecture into their own methods, there were a few things I was unclear on.
This gave me a chance to go through them and really try to parse what the methods were doing.
In particular, after taking some time off from this project to focus on my schoolwork, I came back, looked at dataLabeling.find_features(document, word_features), and did not have the slightest idea what the method did. I had to go back and rewatch that lecture to see what it was doing.
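For anyone else who gets lost there: the method boils down to checking, for each word in a master list of candidate features, whether it appears in a given document. Here is a sketch reconstructed from the lecture (the names match the call above; the body is my paraphrase):

```python
def find_features(document, word_features):
    """For each candidate word, record whether it appears in the document.

    The result is a feature dictionary that NLTK classifiers can train on.
    """
    words = set(document)
    return {w: (w in words) for w in word_features}

# Example: which candidate words show up in this tokenized review?
print(find_features(["it", "was", "boring", "and", "unoriginal"],
                    ["boring", "great", "unoriginal"]))
# {'boring': True, 'great': False, 'unoriginal': True}
```

Converting the document to a set first means each membership check is fast, no matter how long the review is.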
New Technical Terms
This was the first Machine Learning (ML) project I had ever done, so I had to learn a whole set of new terms. Before, I had thought of ML as a magical black-box prediction machine where data was fed in one side and predictions were spat out the other. What the data and predictions actually looked like was rather hazy.
I now know that the project I was working on was a supervised binary text classification problem. In layman's terms, I had a bunch of movie reviews that were each labeled as “Positive” or “Negative”.
The goal was to use the large (n ≈ 10,000) labeled data set to build a program that would let me call SentimentDeterminer.predictSentiment(“It was boring and unoriginal”) and have it do its best to classify that text as a negative review.
In the abstract, I had a large number of (input, output) pairs, and the goal was to tease out why certain inputs corresponded to certain outputs.
For example, I would want the program to look at “It was boring and unoriginal”, figure out that the relevant terms were "boring" and "unoriginal", and use that to classify the review as negative.
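Concretely, the labeled pairs look something like this (the reviews here are made up for illustration; the real data came from the NLTK movie review corpus):

```python
# Illustrative (input, output) pairs: raw review text paired with its label
labeled_reviews = [
    ("A gripping story with a great cast", "Positive"),
    ("It was boring and unoriginal", "Negative"),
    ("Two hours of my life I will never get back", "Negative"),
]
```

Everything that follows, feature extraction and training alike, starts from a list shaped like this.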
Algo.train(TrainingSet)
The most striking thing about this entire project was how trivial the code for training the classification algorithm actually was.
Here is an excerpt from one of the earlier lectures:
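Roughly, the pattern looked like this (I have substituted toy feature sets for the movie_reviews data so the sketch is self-contained and runnable):

```python
import nltk

# Toy stand-ins for the movie-review feature sets built in the lecture:
# each training example is a (feature_dict, label) pair
training_set = [
    ({"boring": True, "great": False}, "neg"),
    ({"unoriginal": True, "great": False}, "neg"),
    ({"boring": False, "great": True}, "pos"),
    ({"great": True, "unoriginal": False}, "pos"),
]

# The single line that trains the classifier
classifier = nltk.NaiveBayesClassifier.train(training_set)

print(classifier.classify({"boring": True, "great": False}))  # 'neg'
```

With the real corpus, everything before the train() call is corpus loading and feature extraction; the training itself is still that one line.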
This struck me as absolutely ridiculous. The syntax to train an algorithm was a single line of code.
The entire script, from loading the text from nltk.corpus to training a simple Naive Bayes classifier, was only about 40 lines of Python. After this lecture, I felt like writing machine learning programs was actually within my grasp.
Pickling
Python has a library called pickle that lets you store Python objects as byte files that can be read back later. Training algorithms takes a non-trivial amount of computational work (and therefore time), so this was a way to save time.
I learned that it is often faster to run the training computations once, pickle the trained classifier objects, and just load them in later. That saves a lot of time (not to mention code complexity), since it replaces the entire training process with simply reading from a file.
Here is the code to read in a list of classifiers I trained earlier with an abbreviated path name.
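The idea is a short loop over the saved files. The filenames and objects below are hypothetical stand-ins (the real files hold trained NLTK classifier objects, and I have pickled dummy objects first so the sketch runs on its own):

```python
import pickle

# Hypothetical setup: stand-ins for classifiers pickled during an
# earlier training run
for name in ["NaiveBayes", "MNB", "BernoulliNB"]:
    with open(name + ".pickle", "wb") as f:
        pickle.dump({"classifier": name}, f)

# Reading the trained classifiers back in: no retraining required
trained_classifiers = []
for name in ["NaiveBayes", "MNB", "BernoulliNB"]:
    with open(name + ".pickle", "rb") as f:
        trained_classifiers.append(pickle.load(f))

print(len(trained_classifiers))  # 3
```

Loading a handful of pickles takes a fraction of a second, versus minutes of retraining on every run.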
This was much faster, and I expect to use it as part of the program that queries Twitter.