To get a better understanding of why the classifier makes a given classification, there first needs to be a more in-depth understanding of the data encoding process. Because actually determining the sample space is not trivial, I just estimated what I think are some reasonable upper bounds.
A tweet begins as a member of the set of all possible tweets. Without going into much detail about sample spaces, it is safe to assume that this set is very large. It contains information about word order and about whether any words are duplicated. The next step in encoding is to convert the tweet into a list of unique words. When that conversion takes place, the information about word order and duplicates is lost. In theory this means that multiple tweets will be mapped onto a single list of unique words. In practice, this is unlikely to matter, since the sample space at this point is still astronomically large. In addition, even if more than one tweet maps onto the same list of unique words, those tweets would likely have the same sentiment to human readers.
Once it has the list of unique words, the find_features() method walks through word_features to create a boolean vector, where each entry records whether the corresponding word in word_features is present in the tweet.
Here I think a simplified example will make this clearer.
Let word_features be:
['great', 'today', 'sad', 'happy', 'I', 'exciting']
Let the list of unique words be:
['sad', 'I', 'today', 'headache', 'woke', 'up']
This would convert into the boolean vector:
[False, True, True, False, True, False]
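Putting the example above into code, here is a minimal sketch of what a find_features() style encoder could look like. The real method's internals are not shown in this post, so treat this as an illustration rather than the actual implementation.

# Sketch of the encoding step: word_features is assumed to be a fixed,
# ordered list of feature words (6,000 of them in the real model).
def find_features(tweet_words, word_features):
    unique_words = set(tweet_words)   # word order and duplicates are discarded here
    return [word in unique_words for word in word_features]

word_features = ['great', 'today', 'sad', 'happy', 'I', 'exciting']
tweet_words = ['sad', 'I', 'today', 'headache', 'woke', 'up']
print(find_features(tweet_words, word_features))
# [False, True, True, False, True, False]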
This is a simplified example where word_features only has 6 elements, which corresponds to a sample space of 2^6, or 64, possible vectors. That would mean every possible tweet gets encoded into one of only 64 unique boolean vectors, which is clearly far too small to capture the real differences between tweets. In the model I trained there are 6000 elements in word_features, so the total sample space is 2^6000, a number with 1807 digits. That set is still very large. Even so, whenever a word occurs that is not one of the 6000 word_features, that information is lost.
Interestingly, the set of all possible tweets is so large that even after you remove all information about word order and duplicates, then remove all but the presence or absence of 6000 words, the sample space is still massive.
I expect that in nearly all cases there will not be more than 50 words that are present in both the list of unique words and word_features. So the realistic sample space can be thought of as the number of ways to choose at most 50 of the 6000 feature words, which works out to roughly 2^400.
But in practice it is good enough to know that the sample space is so large that it is very unlikely that multiple tweets with different sentiment will be mapped onto the same vector.
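As a rough sanity check on these numbers, here is the kind of back-of-the-envelope arithmetic involved. This snippet is my own illustration, not code from the project.

# Back-of-the-envelope check of the sample-space estimates above.
from math import comb, log2

print(len(str(2 ** 6000)))                           # 1807 digits, matching the figure above
reachable = sum(comb(6000, k) for k in range(51))    # at most 50 feature words present
print(round(log2(reachable)))                        # about 413, the same ballpark as the ~2^400 estimate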
So while the sample space of tweets is still massive, a large amount of information is lost in the encoding process. Since the algorithm is only trained on the encoded information, it can only explain based on that information. This means it can only explain its predictions based on the presence of a small subset of words in the original tweet.
The knowledge that a classifier only looked at a few words is not very informative to a human reader, so I chose to reach into the internals of the SGDClassifier and look at the weights assigned to each word in word_features.
These weights come from the SGDClassifier.coef_ attribute. The numbers correspond to the weights assigned to each dimension of the boolean vector. In binary classification you can interpret them as the positive or negative significance of each term.
An easy way to interpret the weights: a large positive weight is a strong predictor of positive sentiment, a large negative weight is a strong predictor of negative sentiment, and a weight with a small absolute value is not a strong predictor either way.
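To make this concrete, here is a small self-contained sketch of pulling the per-word weights out of coef_. The tiny training set is made up purely for illustration; the real model was trained on full tweets and 6000 feature words.

# Toy example: fit an SGDClassifier on a few hand-made boolean vectors and
# read the per-word weights back out of coef_.
import numpy as np
from sklearn.linear_model import SGDClassifier

word_features = ['great', 'today', 'sad', 'happy', 'I', 'exciting']

X = np.array([                      # each row is an encoded "tweet"
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
])
y = np.array([1, 0, 1, 0])          # 0 = negative sentiment, 1 = positive sentiment

clf = SGDClassifier(random_state=0).fit(X, y)

# coef_ has shape (1, n_features) for binary classification.
for word, weight in zip(word_features, clf.coef_[0]):
    print(f'{word:10s} {weight:+.2f}')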
Once I had this information I could write some methods to demystify the classification process. I decided to write these methods inside of my VoteClassifier class.
First, I wrote a method to pair each word with the average weight assigned to it by the classifiers.
Next, I wrote a method that simulates the encoding process, and with it the information loss. It returns the list of words that were the basis for the classification.
I then wrote a method to pull out the weights assigned to each of those words. A human can interpret this as the list of words that had the most impact on the classification.
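Pulling those three pieces together, here is a rough sketch of what these helper methods could look like inside the VoteClassifier. The method names and internals are my own guesses for illustration; only the general idea is described in this post, and the existing voting and classify methods are omitted.

import numpy as np

class VoteClassifier:
    def __init__(self, word_features, *classifiers):
        self.word_features = word_features
        self.classifiers = classifiers      # fitted linear classifiers

    def average_word_weights(self):
        # Pair each feature word with its weight averaged across the classifiers.
        avg = np.mean([clf.coef_[0] for clf in self.classifiers], axis=0)
        return dict(zip(self.word_features, avg))

    def words_used(self, tweet):
        # Simulate the encoding step to recover the words the classifiers actually saw.
        unique_words = set(tweet.lower().split())
        return [w for w in self.word_features if w.lower() in unique_words]

    def weights_for(self, tweet):
        # Weights of only the words that contributed to this classification.
        avg = self.average_word_weights()
        return {w: avg[w] for w in self.words_used(tweet)}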
A human can look at the weights and see that the word "sad" is many times more clearly negative than a filler word like "so" or "was": the -4.2 weight of "sad" is much larger in magnitude than the -0.19 weight of "was".
The weights assigned to these words make intuitive sense to a human reader, which is a good sign. Later on there are some weights that make much less intuitive sense, and I will come back to those.
Next, I wanted to see the average score given to a tweet. The classifiers work internally with a floating point number and only return a boolean, so I could reach in and look at that float directly. Because I encoded negative sentiment as 0 and positive sentiment as 1, the sign of the score indicates the predicted class, scores with a large absolute value mean high confidence in the classification, and scores with a small absolute value mean lower confidence. Scores near 0 are particularly ambiguous.
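Continuing the toy SGDClassifier example from above, this is roughly how that internal float can be read out; decision_function() is the standard scikit-learn way to get the signed score behind a prediction.

# The sign of the score picks the class; its magnitude reflects confidence.
encoded = np.array([[0, 1, 1, 0, 1, 0]])      # the "sad" tweet from the earlier example
score = clf.decision_function(encoded)[0]
print(f'score: {score:+.2f}')                 # negative -> negative sentiment, positive -> positive sentiment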
Finally, I stitched all of the methods together in an explain_choice() method. It explains in plain English which features had the most significance. To do that I needed to sort by the absolute value of the feature weights.
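As a sketch of how that stitching could look, continuing the VoteClassifier sketch from above (the wording and details here are my own guesses, not the actual implementation):

    def explain_choice(self, tweet):
        # Rank the contributing words by the absolute value of their weights.
        ranked = sorted(self.weights_for(tweet).items(),
                        key=lambda kv: abs(kv[1]), reverse=True)
        lines = [f'The classification was based on {len(ranked)} recognized words:']
        for word, weight in ranked:
            direction = 'positive' if weight > 0 else 'negative'
            lines.append(f'  "{word}" pushed the prediction in the {direction} '
                         f'direction (weight {weight:+.2f})')
        return '\n'.join(lines)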
Here are some sample runs of the VoteClassifier.explain_choice() method.
After looking at the preliminary results of the explain_choice method I think that the next major way to improve the classification algorithm is to create a list of words to be excluded from word_features.
Earlier in this series I tested limiting word_features by part of speech and by removing stop words. Removing stop words did not seem to have any impact on accuracy; it just reduced the total time it took to train the classifiers. Since none of my training runs took longer than a few hours, it did not seem worthwhile to make the code more complex just to save a little time on training.
Looking at it now, it seems like the benefit of a stop-word list is not its impact on accuracy but the removal of spurious relationships. Done well, this would not change accuracy, but it certainly would make the explanations more intuitive for humans.
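As a sketch of what that could look like, assuming NLTK's English stop-word list (this is not part of the current code):

# Filter stop words and punctuation out of word_features before training.
# Requires nltk.download('stopwords') the first time it is run.
import string
from nltk.corpus import stopwords

excluded = set(stopwords.words('english')) | set(string.punctuation)
word_features = [w for w in word_features if w.lower() not in excluded]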
It might also make sense to lemmatize the words before converting them to vectors. Lemmatizing is a process that would convert the words "cry", "cried", and "crying" into a single word before encoding. This would mean that word associations are not limited by word tense.
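For example, NLTK's WordNet lemmatizer can do this (it needs nltk.download('wordnet'), and treating every word as a verb here is a simplification for illustration):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos='v') for w in ['cry', 'cried', 'crying']])
# ['cry', 'cry', 'cry']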
It makes no sense to a human that a comma should have a weight of +0.49.
The word "not" has a weight of -0.81, and I am not sure what to make of that. On one hand, based on the weight, the presence of "not" is a predictor of negative sentiment, purely because of how frequently it appears in negative rather than positive examples. On the other hand, "not" only makes sense when it refers to something: "not good" and "not bad" have opposite sentiments. This makes it unclear whether it is best to keep "not" in word_features. Keeping it will likely make the classifier more accurate, but it will make the explanations less credible to a human.