Tuesday, January 19, 2021

Economics Capstone Project: Data Mining the Ethereum Blockchain.

What is the relationship between own-price elasticity and the characteristics of an Ethereum mining firm?

I just enrolled in my Economics Capstone class at Eastern Washington University. Since this is my last quarter before graduation, this class ought to be the summation of what I have learned. The purpose of the class is to write an empirical economics paper for an academic journal. I figure that while I am doing that, I can write up the process of gathering and analyzing the data for this blog.


In this post I am going to talk about the amount of data that is freely available in this space and the problem that I am going to investigate in my capstone class.


For context, a few months ago my brother and I built an Ethereum mining computer. While researching the business feasibility of that project, a few things piqued my interest.


There are some technical terms to understand in order to get the context of the problem I am investigating.


  • Blockchain: a decentralized, distributed ledger. The entire Blockchain is public and exists on every Ethereum node. 

  • Ethereum: a blockchain platform whose cryptocurrency, ether, is denoted ETH

  • Wallet: the address on the blockchain where ether is held. A 40-character string of case-sensitive letters and numbers

    • 4368d11f47764B3912127B70e8647Dd031955A7C is a wallet address

  • Hash: The base unit of computational work for maintaining the blockchain. 

  • Mining: computational work to maintain the blockchain. You can think of this as computers getting paid in ETH to solve hard math problems. Mining power is measured in megahashes per second (MH/s)

  • Mining Firm: an entity that mines Ethereum, anywhere from one person in their basement to a multimillion-dollar corporation

  • Mining Pool: A platform that allows mining firms to pool their hashing power to reduce the variance in income. 


If you want a better understanding, read this.


The Blockchain is public. This means that anyone can see the details of any transaction. For this project, what interests me is the amount of ether, the date, the from address, and the to address.


The majority of all Ethereum mining takes place in mining pools, and the wallet addresses of those pools are publicly available. Those pools periodically send ETH to the miners' accounts. Ethermine, one of the larger Ethereum mining pools, is responsible for about 20% of global Ethereum mining. So if I can model that pool, I can get a good grasp on 20% of the entire global market.


For example, I went to Ethermine.org, and scrolled down and clicked on a miner at random. 



Here is the miner I clicked on. I have no idea who this person is. I just know that they are currently a miner at ethermine.org.


There is a bunch of information here so let's just look at the “Payouts” tab. 



This shows the dates on which they were paid and the amount of ether in each payout.


It is also possible to see their wallet in more detail using a Block Explorer. Check it out for yourself here.


You can then use a platform like Etherscan.io and see that the first transaction on their wallet was on July 20, 2018. This lets you infer when they started mining Ethereum. 
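As a sketch of how this kind of data can be pulled programmatically: Etherscan exposes a public API with a transaction-list endpoint. The snippet below is a minimal, illustrative example only. The wallet is the example address from the glossary above, the API key is a placeholder, and the exact response fields should be double-checked against Etherscan's documentation.

import requests
from datetime import datetime, timezone

# A minimal sketch, not production code. The address is the example wallet
# from the glossary above; the API key is a placeholder.
ADDRESS = "0x4368d11f47764B3912127B70e8647Dd031955A7C"
API_KEY = "YourEtherscanApiKeyHere"

resp = requests.get(
    "https://api.etherscan.io/api",
    params={
        "module": "account",
        "action": "txlist",
        "address": ADDRESS,
        "startblock": 0,
        "endblock": 99999999,
        "sort": "asc",
        "apikey": API_KEY,
    },
    timeout=30,
)
transactions = resp.json().get("result", [])

# The fields I care about: date, from address, to address, and amount of ether.
for tx in transactions[:5]:
    date = datetime.fromtimestamp(int(tx["timeStamp"]), tz=timezone.utc).date()
    ether = int(tx["value"]) / 1e18  # values are reported in wei
    print(date, tx["from"], "->", tx["to"], f"{ether:.4f} ETH")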


It is absolutely shocking that this kind of data is free. There is a massive amount of free raw data, so this seems like a doable, exciting project.


To get a grasp of how utterly ridiculous it is that this kind of information is publicly available, let's look at what comparable data would look like in another context.


Imagine that Amanda is a budding entrepreneur looking at opening her own hair salon. In this hypothetical world, all haircuts are identical and the market price of a haircut varies widely. She can look up, for free, detailed analytics on the number of haircuts done by every single other hair salon since the invention of hair. She can see that when the price of haircuts is high, firms start producing more haircuts, and when the price is low, they produce fewer. This is easy to understand at a theoretical level, but the important thing is the speed with which firms ramp production up or down. If she can be substantially faster than her competition, she would be in a good position to be very profitable.


That information is essential to knowing whether it would be prudent to even go into the hair salon business. It also informs the decision of whether or not to ramp up production, and if so, by how much.


There is more complexity in the Ethereum mining market than in the simplified example above. I plan on explaining more of that complexity in a later post.


So the problem I will be investigating is: what is the own-price elasticity of different categories of Ethereum mining firms? Are the large firms faster and more responsive to price changes, or are the small firms?
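As a very rough sketch of the kind of estimation this question implies (not the actual specification, which I will develop over the quarter), the snippet below runs a log-log regression of a firm's hashrate on the ETH price; the slope can be read as an own-price elasticity. The data here is simulated placeholder data, not real blockchain data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data: daily ETH price and one firm's daily hashrate (MH/s).
# The real data will come from pool payout records on the blockchain.
rng = np.random.default_rng(0)
price = (400 + rng.normal(0, 20, 200).cumsum()).clip(100)
hashrate = 300 * (price / price.mean()) ** 0.4 * rng.lognormal(0, 0.05, 200)
df = pd.DataFrame({"price": price, "hashrate": hashrate})

# In a log-log specification, the coefficient on log(price) is the own-price
# elasticity of the firm's supplied hashing power (about 0.4 by construction here).
model = smf.ols("np.log(hashrate) ~ np.log(price)", data=df).fit()
print(model.params["np.log(price)"])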









Tuesday, January 12, 2021

Lemmatizing and Stopwords to Reduce Dimensions

 Improving the encoding of tweets. Stop words and lemmatizing. 


Earlier, I ran into the problem that weights were assigned to some words that, to a human, make no sense. What stood out were the weights on commas and on the word "not".


In this post, I describe the reasoning and tools I used to better choose the contents of word_features. The goal was to make the explanations more credible while keeping an accuracy rate above 80%. In the end, I failed to train a more human-interpretable model, at least on my laptop.


First, I wanted a picture of how many weights were significant and how many were trivial.


I did this by counting the number of words above a certain arbitrary weight cutoff. 





This returns 1090 words, or about 18% of the dataset.




This returned 3029 elements, or about 50% of the dataset. The presence or absence of about half of the words had a very small impact on classification, which means I could remove these elements from word_features without a substantial loss of accuracy.
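The counting itself was only a few lines of code. A minimal sketch of the idea, assuming the trained SGDClassifier and the word_features list from earlier posts; the exact cutoff values here are just illustrative:

import numpy as np

def count_weight_buckets(classifier, word_features, strong=0.5, weak=0.1):
    """Count strongly weighted vs. negligible feature words.

    Assumes `classifier` is the trained sklearn SGDClassifier from earlier
    posts and `word_features` is the list of ~6000 feature words. The 0.5
    and 0.1 cutoffs are arbitrary, illustrative choices.
    """
    weights = classifier.coef_[0]  # one weight per entry in word_features
    significant = int(np.sum(np.abs(weights) >= strong))
    negligible = int(np.sum(np.abs(weights) <= weak))
    print(f"{significant} significant ({significant / len(word_features):.0%})")
    print(f"{negligible} negligible ({negligible / len(word_features):.0%})")
    return significant, negligible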



A sample of the words with negligible impact on sentiment. This matches intuitive sense. 



Here are some of the words with weights >.5. The weights assigned to these words make intuitive sense.


For code cleanliness, I decided to create a CustomLemmatizer class to store all the methods related to stopwords and lemmatizing. I chose to do it this way because it let me encapsulate the lemmatizing in its own module, since the only point of interaction was calling determine_lemmas() on a string. Once I had debugged and was satisfied with the lemmatizing module, I could safely forget the details of the implementation and focus on other parts of the code.


I chose to use WordNetLemmatizer, a lemmatizer that comes packaged with NLTK.
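The class ended up looking roughly like the sketch below. The details are simplified and the exact stopword handling differs a bit from my real module, but determine_lemmas() is the single entry point described above.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK corpora: nltk.download("punkt"), nltk.download("stopwords"),
# and nltk.download("wordnet").

class CustomLemmatizer:
    """Encapsulates the stopword filtering and lemmatizing steps."""

    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words("english")) | set(string.punctuation)

    def determine_lemmas(self, text):
        """Tokenize a string and return lemmas for the non-stopword tokens."""
        tokens = nltk.word_tokenize(text.lower())
        return [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words
        ]

# Example:
# CustomLemmatizer().determine_lemmas("The dogs were barking loudly today.")
# -> ['dog', 'barking', 'loudly', 'today']   (default noun-based lemmatization)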



At this point I tried to make it work with a Naive Bayes classifier, but I kept getting memory errors. It seemed like converting every document into a dictionary was taking up too much space.
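For context, NLTK's NaiveBayesClassifier expects each document as a dictionary of features, so with roughly 6000 word_features every tweet turns into a 6000-key dictionary of booleans. A sketch of that representation is below; the helper name and the commented training lines are illustrative, since they assume my labeled tweet data.

# Roughly the representation that was eating memory: every document becomes a
# dictionary with one boolean entry per word in word_features (~6000 keys each),
# which is what NLTK's NaiveBayesClassifier expects as input.
def find_features_dict(document_words, word_features):
    return {word: (word in document_words) for word in word_features}

# Assuming labeled_tweets is my list of (tokenized_tweet, label) pairs:
# training_set = [(find_features_dict(set(words), word_features), label)
#                 for words, label in labeled_tweets]
# classifier = nltk.NaiveBayesClassifier.train(training_set)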


After fiddling with the implementation for a bit to try to reduce the size of the data representation, I kept getting an error saying I had run out of memory. It would take my laptop, running at full bore, about 2 hours to hit that error, which made for a very slow debugging process.


So I decided to step back and try to implement the training on a distributed computing system. I wrote the script to limit word_features to a much more manageable number, but I could no longer train the classifier I was aiming for on my laptop. I could have just used the reduced word_features with the Stochastic Gradient Descent classifier from earlier, but I was trying to make the algorithm more intuitive to a human, and SGD is an opaque model.


The next step, then, is to figure out how to train Naive Bayes in the cloud. I am currently taking a Big Data Analytics class at Eastern Washington University, and as part of the curriculum we are learning how to use Google Colab. I expect to learn that tool and then come back to this problem and train the classifier in the cloud.






Sunday, January 3, 2021

Encoding Tweets and Human Interpretability


A common problem in machine learning is not just to predict an outcome, but also to explain why the model predicted that outcome. This is broadly understood as the human interpretability problem. Being able to explain a prediction makes it easier to convince people to act on it and reduces the risk of relying on reasoning that would not pass muster with a subject-matter expert.


My goal for this post is to explain more deeply the tweet encoding process and write some methods that will explain to a human reader why a specific prediction was made. 


The end goal was to have an output that looked similar to this:


VoteClassifier.explain_choice(“I was very sad today. I woke up and had a headache”).


-> The Classification was “negative” and the score was -3.2 

-> This classification is based on considering these words: ('sad', 'I', 'today', 'headache', 'woke', 'up')


-> The word with the most negative weight at -3.5 was ‘sad’

-> The word with the most positive weight at .5 was ‘today’




Diagram of the information encoding process.

To get a better understanding of why the classifier makes a given classification, there first needs to be a more in-depth understanding of the data encoding process. Because actually determining the sample space is not trivial, I just estimated what I think are some reasonable upper bounds.


A tweet begins as a member of the set of all possible tweets. Without going into much detail about sample spaces, it is safe to assume that this set is very large. That set contains information about word order and about whether any words are duplicated. The next step in encoding is to convert the tweet into a list of unique words. When that conversion takes place, information about word order and duplicates is lost. In theory this means that multiple tweets can be mapped onto a single list of unique words. In practice, this is unlikely to be significant since the sample space at this point is still astronomically large. In addition, even if more than one tweet is mapped onto the same list of unique words, those tweets would likely have the same sentiment to human readers.


Once it has a list of unique words, the find_features() method walks through word_features to create a boolean vector, where each entry indicates whether the corresponding feature word is present in the tweet.


Here, I think a simplified example will make it clearer.


Let word_features:

 ['great', 'today', 'sad', 'happy', 'I', 'exciting']


Let the list of unique words be:


 ['sad', 'I', 'today', 'headache', 'woke', 'up']


This would convert into a boolean vector:


[False, True, True, False, True, False]
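A minimal sketch of find_features() that reproduces this example; the real method lives in the encoding code from the earlier posts.

def find_features(unique_words, word_features):
    """Encode a tweet's unique words as a boolean vector over word_features."""
    word_set = set(unique_words)
    return [word in word_set for word in word_features]

word_features = ['great', 'today', 'sad', 'happy', 'I', 'exciting']
unique_words = ['sad', 'I', 'today', 'headache', 'woke', 'up']
print(find_features(unique_words, word_features))
# -> [False, True, True, False, True, False]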


This is a simplified example where word_features has only 6 elements. That corresponds to a sample space of 2^6, or 64, meaning every possible tweet would be encoded into only 64 unique boolean vectors. Clearly that is much too small to capture the real differences between tweets. In the model I trained, there are 6000 elements in word_features, so the total sample space is 2^6000, a number with 1807 digits. This set is still very massive. Even so, whenever a word occurs that is not an element of the 6000 word_features, that information is lost.


Interestingly, the set of all possible tweets is so large that even after you remove all information about word order and duplicates, then remove all but the presence or absence of 6000 words, the sample space is still massive. 


I expect that in nearly all cases there will not be more than 50 words present in both the list of unique words and word_features. So the reachable sample space can be thought of as roughly the number of ways to choose at most 50 of the 6000 features, which is about 2^400.
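A quick back-of-the-envelope check of that estimate:

import math

# Number of boolean vectors with at most 50 of the 6000 features set to True.
total = sum(math.comb(6000, k) for k in range(51))
print(total.bit_length() - 1)  # roughly 413, so the space is on the order of 2^400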


But in practice it is good enough to know that the sample space is so large that it is very unlikely that multiple tweets with different sentiment will be mapped onto the same vector.


So while the sample space of tweets is still massive, a large amount of information is lost in the encoding process. Since the algorithm is only trained on the encoded information, it can only explain based on that information. This means it can only explain its predictions based on the presence of a small subset of words in the original tweet.


The knowledge that a classifier only looked at a few words is not very informative to a human reader, so I chose to reach into the internals of the SGDClassifier and look at the weights assigned to each word in word_features.


These weights come from the SGDClassifier.coef_ attribute. The numbers correspond to the weights assigned to each dimension of the sample vector. In binary classification you can interpret these weights as the positive or negative significance of each term.


An easy way to interpret the weights: a large positive weight is a strong predictor of positive sentiment; a large negative weight is a strong predictor of negative sentiment; a weight with a small absolute value is not a strong predictor.


Once I had this information, I could write some methods to demystify the classification process. I decided to write these methods inside my VoteClassifier class.


First, I wrote a method to pair each word with the average weight assigned to it by the classifiers.
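The post originally showed this code as a screenshot; a rough sketch of the idea is below. It assumes each classifier in the ensemble is a linear model exposing a coef_ attribute and that the VoteClassifier stores word_features and its list of classifiers as attributes.

import numpy as np

# Sketch of a method on VoteClassifier.
def word_weights(self):
    """Pair each word in word_features with its average weight across the classifiers."""
    # Assumes self.classifiers are linear sklearn models (e.g. SGDClassifier)
    # and self.word_features is the shared list of feature words.
    stacked = np.vstack([clf.coef_[0] for clf in self.classifiers])
    return dict(zip(self.word_features, stacked.mean(axis=0)))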



Next, I wrote a method that simulates the encoding process, and therefore the information lost, and returns the list of words that were the basis for the classification.

 




I then wrote a method to report the weight assigned to each word in a tweet. A human can interpret this as the list of words that had the most impact on the classification.


A human can look at the weights and see that the word "sad" is many times more clearly negative than the word "was": the -4.2 weight of "sad" is much larger in magnitude than the -.19 weight of "was".


The weights assigned to these words make intuitive sense to a human reader. That is a good sign. Later on, there are some weights that make much less sense.


Next, I wanted to see the average score given to a tweet. Since the classifiers work internally with a floating point number and only return a boolean, I could reach in and look at that float directly. Because I encoded negative sentiment as 0 and positive sentiment as 1, scores with a large absolute value mean high confidence in the classification, and scores with a small absolute value mean lower confidence. Scores near zero are particularly ambiguous.



Finally, I stitched all of the methods together in an explain_choice() method. It explains in plain English which features had the most significance. To do that, I needed to sort by the absolute value of the feature weights.
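A rough sketch of what explain_choice() does, assuming the word_weights() helper sketched above, NLTK's tokenizer, and classifiers that expose decision_function(); the printed wording is simplified from the actual output.

import nltk

# Sketch of a method on VoteClassifier.
def explain_choice(self, tweet):
    """Explain a classification in plain English using the averaged per-word weights."""
    weights = self.word_weights()                      # sketched above
    unique_words = set(nltk.word_tokenize(tweet))
    considered = [w for w in unique_words if w in weights]

    # Average the raw decision scores across the ensemble.
    vector = [w in unique_words for w in self.word_features]
    score = sum(clf.decision_function([vector])[0]
                for clf in self.classifiers) / len(self.classifiers)
    label = "positive" if score > 0 else "negative"

    print(f"The classification was '{label}' and the score was {score:.2f}")
    print(f"This classification is based on considering these words: {considered}")
    for word in sorted(considered, key=lambda w: abs(weights[w]), reverse=True)[:3]:
        print(f"  '{word}' has weight {weights[word]:.2f}")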



Here are some sample runs of the VoteClassifier.explain_choice() method.





After looking at the preliminary results of the explain_choice() method, I think the next major way to improve the classification algorithm is to create a list of words to exclude from word_features.


Earlier in this series I tested limiting by parts of speech and removing stopwords. Removing stopwords did not seem to have any impact on accuracy; it just reduced the total time it took to train the classifiers. Since I was not doing any training that took longer than a few hours, it did not seem worthwhile to make the code more complex to save a little training time.


Looking at it now, it seems like the benefit of a stopword list is not its impact on accuracy but the removal of spurious relationships. Done well, this would not affect accuracy, but it certainly would make the explanations more intuitive to humans.


It might make sense to also lemmatize the words before converting them to vectors. Lemmatizing is a process that converts words like "cry", "cried", and "crying" into a single word before encoding. This would mean that word associations are not limited by word tense.
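For example, NLTK's WordNetLemmatizer collapses those forms into one lemma when told to treat them as verbs:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")
print([lemmatizer.lemmatize(word, pos="v") for word in ["cry", "cried", "crying"]])
# -> ['cry', 'cry', 'cry']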


It makes no sense to a human that a comma should have a weight of +.49. 


The word "not" has a weight of -.81, and I am not sure what to make of that. On one hand, based on the weight, the presence of "not" is a predictor of negative sentiment, purely because of how often it appears in negative versus positive examples. On the other hand, "not" only makes sense when it refers to something: "not good" and "not bad" have opposite sentiments. This makes it unclear whether it is best to keep "not" in word_features. Keeping it will likely make the classifier more accurate, but will make its explanations less credible to a human.



Data Viz and Analysis of the Numerai Leaderboard

There is a protocol built on Ethereum called Numerai. They explain it in more detail on their website but in essence it is a way for anyone...