BERT caught COVID from Twitter

I have spent ~10 hours a day for the past four months working on my MPH thesis and, after so much work, I am happy to report that I am nearing the finish line! I've gotten to the point where I am writing up the final results, and I thought it would be helpful to break down some of my reflections on the process in not-so-academic language. My department was very open-ended about what the subject of the thesis could be, as well as the methods one could use, which was nice but also made starting a bit harder.

My professional interest is in global health and neglected tropical diseases, but public data on those subjects is sparse and I wanted to focus on something different before going into a career that concerns those things full-time. I also wanted to incorporate some programming into the process, since I hadn't gotten to bring those skills into my MPH much leading up to the thesis. After a lot of reading through articles and thinking about potential data sources, I landed on the idea of using machine learning methods to detect disease self-reports on social media, an idea that has been applied in a few cases before. My work builds on those in two ways: I use a newer type of machine learning model - Bidirectional Encoder Representations from Transformers (BERT) - to classify the social media messages, and I use the classification task as a means of identifying users with an exposure rather than as a means of disease surveillance. Roughly, my plan was to gather tweets from early in the pandemic (May to July 2020) that contained certain keywords, use a classification model to find which of those tweets contained a self-report of COVID symptoms, match those "exposed" users with suitably counterfactual "unexposed" users who hadn't posted a self-reporting tweet, and then use another classification model to find long-COVID symptom reports during the follow-up period (the 9 months after each "exposed" user's self-reporting tweet). The goal at the end of the process is to characterize and estimate the relative rate of long-COVID symptoms in the cohort.
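
To make the shape of that pipeline a bit more concrete, here is a rough Python skeleton of the steps. None of this is my actual thesis code; every function here is a named placeholder standing in for the real step it describes.

```python
# Rough skeleton of the cohort-building pipeline; each function is a
# placeholder for the real step, not working thesis code.

def collect_keyword_tweets():
    """Gather May-July 2020 tweets matching COVID-symptom keywords."""
    return []  # placeholder

def is_self_report(tweet):
    """Run the fine-tuned BERT classifier: does this tweet self-report symptoms?"""
    return False  # placeholder

def match_control(exposed_user):
    """Find a suitably counterfactual user with no self-reporting tweet."""
    return None  # placeholder

def reports_long_covid(user):
    """Classify the user's tweets over the 9-month follow-up window
    (anchored to the matched exposed user's self-report date)."""
    return False  # placeholder

exposed = [t["user_id"] for t in collect_keyword_tweets() if is_self_report(t)]
cohort = [(user, match_control(user)) for user in exposed]

# Compare how often long-COVID symptoms show up in each arm of the cohort;
# their ratio (per person-time) is the relative rate being estimated.
exposed_cases = sum(reports_long_covid(user) for user, _ in cohort)
control_cases = sum(reports_long_covid(control) for _, control in cohort)
```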

BERT without Ernie

I made use of the relatively new BERT model. This type of language model relies on pre-training with a huge corpus of text to, more or less, provide a contextual background for future prediction or classification tasks. Pre-trained models can then be fine-tuned on more niche data or labelled text. They've become the de facto NLP tool over the past couple of years, and for good reason: thanks to that pre-training stage, they achieve impressive accuracy on classification and inference tasks even when fine-tuned on a relatively small set of examples (my COVID self-report classifier, for example, was fine-tuned on about 1,700 tweets and was ~90% accurate on the validation set). I relied on COVID-Twitter-BERT, which is itself built on the BERT-LARGE model, so thanks are due to the Digital Epidemiology Lab. From what I have seen in the public health literature, BERT-based models have been used widely in electronic medical record analysis and in some genomic studies, but they have a lot of room to grow in processing text provided by patients, or the public in general, to derive insights about attitudes, knowledge, or experiences. Infoveillance - I really don't like that term - has lost steam as an avenue of research since Google Flu Trends proved to be less than useful on its own, but having a way to determine the number of people reporting disease symptoms would be very powerful.
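
If you're curious what fine-tuning looks like in practice, here is a minimal sketch using the Hugging Face transformers library with TensorFlow. It is not my thesis code: the model ID, hyperparameters, and example tweets are illustrative, and the exact API details shift a bit between transformers versions.

```python
# Minimal sketch: fine-tuning COVID-Twitter-BERT as a binary
# "self-report vs. not" classifier. Hyperparameters are illustrative.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

MODEL_ID = "digitalepidemiologylab/covid-twitter-bert-v2"  # pre-trained on COVID tweets

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# texts: labelled tweets; labels: 1 = self-report of symptoms, 0 = not
texts = ["i tested positive and lost my sense of smell", "covid cases are rising in my city"]
labels = [1, 0]

encodings = tokenizer(texts, truncation=True, padding=True, max_length=96, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(16)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # typical BERT fine-tuning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=3)
```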

I hate Twitter and Twitter hates me

I hate Twitter because of their API rate limits and Twitter hates me because of their API rate limits.

Joking aside, Twitter seriously needs to fix their API rate limits. Their documentation is alright overall, but it lists the wrong limits for multiple endpoints. On top of that, it urges users to rely on the x-rate-limit-limit and x-rate-limit-remaining HTTP headers, which are irrelevant for certain endpoints. This meant I had to do a bit of trial-and-error searching for the true limit on most of the requests I was making, resulting in a lot of "too many requests" (HTTP 429) errors. Sorry to their servers, although I'm sure they can handle it.
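
The practical upshot was wrapping every call in something defensive like the sketch below. This isn't my exact code, and the bearer token is obviously a placeholder, but the pattern is just: make the request, and if the API says 429, sleep until the window resets (or a full 15 minutes if the reset header isn't there).

```python
# Defensive rate-limit handling for Twitter API calls, since the documented
# limits and the x-rate-limit-* headers can't always be trusted.
import time
import requests

BEARER_TOKEN = "..."  # placeholder
HEADERS = {"Authorization": f"Bearer {BEARER_TOKEN}"}

def get_with_backoff(url, params=None, max_retries=5):
    """GET a Twitter API endpoint, sleeping whenever we hit HTTP 429."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=HEADERS, params=params)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # x-rate-limit-reset gives the epoch second the window resets,
        # when the endpoint bothers to send it; otherwise wait out a
        # full 15-minute window.
        reset = resp.headers.get("x-rate-limit-reset")
        wait = max(int(reset) - time.time(), 0) + 5 if reset else 15 * 60
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```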

Also on the subject of rate limits, some are prohibitively low even with the elevated "academic" access I had. I had to get a full list of all accounts that "exposed" users were following in order to match them with an appropriate "unexposed" user (I tried to follow the method used here), and that ended up being the longest step of any that used the Twitter API because I could only make 15 requests per 15-minute period, with each response returning a maximum of 1,000 accounts. This meant it took multiple minutes to match each COVID self-reporting user to a non-self-reporting control. I would imagine Twitter keeps the limit that unreasonably low to deter people from reconstructing networks of users, which is probably valuable information Twitter keeps for itself.
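
For a sense of why this was so slow, here is a sketch of paging through the v2 following endpoint for a single user, reusing the hypothetical get_with_backoff() helper from above. With up to 1,000 accounts per page and roughly 15 requests per 15 minutes, anyone following a few thousand accounts already eats several minutes of wall-clock time.

```python
# Sketch: pull the full "following" list for one user via the v2
# /users/:id/following endpoint, pacing requests to stay under the
# ~15 requests per 15-minute limit. Reuses get_with_backoff() above.
import time

def get_following(user_id):
    url = f"https://api.twitter.com/2/users/{user_id}/following"
    params = {"max_results": 1000}  # the per-page maximum
    following = []
    while True:
        page = get_with_backoff(url, params=params)
        following.extend(user["id"] for user in page.get("data", []))
        token = page.get("meta", {}).get("next_token")
        if not token:
            return following
        params["pagination_token"] = token
        time.sleep(60)  # roughly one request per minute keeps us in the window
```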

Google's gift horse

One of my biggest concerns when embarking on this project was the computing power and time it would take to complete millions of classification tasks with a BERT. Thankfully, I was able to take advantage of Google's TPU Research Cloud program, which gave me a full month of free time on their fancy Tensor Processing Units, chips purpose-built to do the kinds of matrix math that undergird machine learning, and to do it fast. Each training/validation epoch of almost 2,000 tweets took about 3 minutes, and classifying >4 million tweets took a crisp 8 hours. I'm not sure this project would have been feasible without access to the program, since I don't have the resources to buy, or rent time on, a CUDA-compatible GPU.
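
Pointing TensorFlow at a TPU node is mostly boilerplate; a sketch of the connection step is below. The node name is a placeholder, the stand-in model is just there to make the snippet self-contained, and the exact API has moved around a bit between TensorFlow 2.x versions.

```python
# Sketch: connecting TensorFlow 2.x to a Cloud TPU node. "my-tpu-node"
# is a placeholder for the node created under the TPU Research Cloud project.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-node")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU cores available:", strategy.num_replicas_in_sync)

# Anything that creates model variables has to happen inside the strategy
# scope so the weights are placed on the TPU cores.
with strategy.scope():
    # Stand-in model; in the thesis this spot would hold the BERT classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```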

As grateful as I was for the opportunity to use this kind of cutting-edge hardware, it came with a bit of a learning curve. I have deployed software to serverless environments before, but this was my first time using cloud services in an integrated way: compute instance, storage, and TPU node all working together. In my inexperience, I also managed to set up the TPU node in such a way that it was listed as part of the research cloud program, yet my account was still being charged for its running time. It was an unpleasant shock to see the charges building up (about $20 per hour of run-time), but a quick email to the dedicated research cloud program support staff fixed that and reversed the unexpected costs.

As a side note: if I were to implement this project as a continuous, long-term thing, I think I would spring for a TensorFlow-friendly GPU and a dedicated physical server. $20 per hour of TPU time adds up quickly and, as others have found, what is gained in speed of computation doesn't necessarily outweigh the costs.

A little self-reflection

Here at the end of this ramble, I want to close with a bit of mindfulness. All of the above are the thoughts about my thesis that have been at the front of my mind, but going through the process of building a complex machine learning model and using it on real-world data has gotten me thinking about other pandemic-related machine learning applications. I feel like I've seen lots of stories about AI and machine learning during the pandemic, and the overwhelming impression is that they have been pretty disappointing. That last link concerns AI used in a clinical environment, but similar tools haven't been useful in disease forecasting or surveillance in the past either, and for similar reasons. Even in reading through the scientific literature for my thesis, I noticed a tendency for researchers to conveniently ignore, or at least fail to engage with, the problems inherent in using large social media datasets to train machine learning models for disease recognition. I have tried not to fall into the same trap in my thesis, but the fundamental problem with machine learning is that the biases of the data get baked into the model.

In any case, thanks for reading a bit about my thesis! It's been a wild ride, but I have learned a lot and am really proud of what I have been able to accomplish in a relatively short amount of time. If you have any questions about the process or the technical aspects, or would like to read a rough draft, feel free to reach out to me over email at cummings dot t287 at gmail dot com.