COVID-19 vaccine tweet sentiment analysis with fastai - part 1
This is part one of a two-part NLP series where we carry out sentiment analysis on COVID-19 vaccine tweets. In this part we follow the ULMFiT approach with fastai to create a Twitter language model, then use this to fine-tune a tweet sentiment classification model.
- Introduction
- Transfer learning in NLP - the ULMFiT approach
- Data preparation
- Training a language model
- Training a sentiment classifier
- Classifying unlabelled tweets
- Conclusion
Introduction
In this post we will create a model to perform sentiment analysis on tweets about COVID-19 vaccines using the fastai library. I will provide a brief overview of the process here, but a much more in-depth explanation of NLP with fastai can be found in lesson 8 of the fastai course. In part 2 we will use the model for analysis, looking at changes in tweet sentiment over time and how that relates to the progress of vaccination in different countries.
Transfer learning in NLP - the ULMFiT approach
We will be making use of transfer learning to help us create a model to analyse tweet sentiment. The idea behind transfer learning is that neural networks learn features that generalise to new problems, particularly in the early layers of the network. In computer vision, for example, we can take a model that was trained on the ImageNet dataset to recognise general features of images such as circles, then apply that to a smaller dataset and fine-tune the model for a more specific task (e.g. classifying images as cats or dogs). This technique allows us to train neural networks much faster and with far less data than we would otherwise need, as the sketch below illustrates.
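As a concrete sketch of the vision case (illustrative only, not part of this post's pipeline; it uses fastai's standard pets dataset, and vision_learner was called cnn_learner in older fastai versions):
from fastai.vision.all import *

# Classic fastai transfer learning: fine-tune an ImageNet model on the Oxford-IIIT Pets images
path = untar_data(URLs.PETS)/'images'

def is_cat(fname):
    return fname[0].isupper()  # in this dataset, cat images have capitalised filenames

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=error_rate)  # starts from ImageNet-pre-trained weights
learn.fine_tune(1)  # one epoch of fine-tuning adapts the pre-trained features to cats vs dogs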
In 2018 a paper introduced a transfer learning technique for NLP called 'Universal Language Model Fine-Tuning' (ULMFiT). The approach is as follows:
- Train a language model to predict the next word in a sentence. This step is already done for us; with fastai we can download a model that has been pre-trained for this task on millions of Wikipedia articles. A good language model already knows a lot about how language works in general - for instance, given the sentence 'Tokyo is the capital of', the model might predict 'Japan' as the next word. In this case the model understands that Tokyo is closely related to Japan and that 'capital' refers to 'city' here instead of 'upper-case' or 'money'.
- Fine-tune the language model to a more specific task. The pre-trained language model is good at understanding Wikipedia English, but Twitter English is a bit different. We can take the information the Wikipedia model has learned and apply that to a Twitter dataset to get a Twitter language model that is good at predicting the next word in a tweet.
- Fine-tune a classification model to identify sentiment using the pre-trained language model. The idea here is that since our language model already knows a lot about Twitter English, it's not a huge leap from there to train a classifier that understands that 'love' refers to positive sentiment and 'hate' refers to negative sentiment. If we tried to train a classifier without using a pre-trained model it would have to learn the whole language from scratch first, which would be very difficult and time consuming.
This notebook will walk through steps 2 and 3 with fastai. We will then apply the model to unlabelled COVID-19 vaccine tweets and save the results for analysis in part 2.
We will need a GPU to train our models in a reasonable time with fastai, but fortunately for us Google Colab provides us with access to one for free! To use it, select 'Runtime' from the menu at the top of the notebook, then 'Change runtime type', and ensure your hardware accelerator is set to 'GPU' before continuing!
Data preparation
This is a write-up of a submission I made for several Kaggle tasks. The tasks are still open and accepting new entries at the time of writing if you want to enter as well! On Kaggle the data is readily available when using their notebook servers; however, we are using Google Colab today, so we will need to use the Kaggle API to download the data.
Getting the data from Kaggle
The first step is to create an API token. To do this, the steps are as follows:
- Go to 'Account' on Kaggle and scroll down to the 'API' section.
- Expire all current API tokens by clicking 'Expire API Token'.
- Click 'Create New API Token', which will automatically download a file called kaggle.json.
- Upload the kaggle.json file using the file uploader widget below.
# See https://neptune.ai/blog/google-colab-dealing-with-files for more tips on working with files in Colab
from google.colab import files
uploaded = files.upload()
Next, we need to install the Kaggle API.
!pip uninstall -q -y kaggle
!pip install -q --upgrade pip
!pip install -q --upgrade kaggle
The API docs tell us that we need to ensure kaggle.json is in the location ~/.kaggle/kaggle.json, so let's make the directory and move the file.
# https://www.machinelearningmindset.com/kaggle-dataset-in-google-colab/
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# Check the file in its new directory
!ls /root/.kaggle/
# Check the file permissions
!ls -l ~/.kaggle/kaggle.json
# Change the file permissions so only the owner can read and write the key
# (the Kaggle CLI warns if the token is readable by other users)
# chmod 600 file -> owner can read and write
# chmod 700 file -> owner can read, write and execute
!chmod 600 ~/.kaggle/kaggle.json
Now we can download the data using !kaggle datasets download -d username-of-dataset-creator/name-of-dataset.
# We will be using two datasets for this part, as well as a third dataset for part 2
# To save time in part 2 I'm going to download them all now and save locally
!kaggle datasets download -d gpreda/all-covid19-vaccines-tweets
!kaggle datasets download -d maxjon/complete-tweet-sentiment-extraction-data
!kaggle datasets download -d gpreda/covid-world-vaccination-progress
The files will be downloaded in .zip format, so let's unzip them.
# To unzip you can use the following:
#!mkdir folder_name
#!unzip anyfile.zip -d folder_name
# Or unzip all
!unzip -q \*.zip
! [ -e /content ] && pip install -Uqq fastai # upgrade fastai on colab
import fastai; fastai.__version__
Let's import fastai's text module and take a look at our data. Since fastai is designed to be used with import *, useful libraries like pandas and numpy will also be imported at the same time!
from fastai.text.all import *
vax_tweets = pd.read_csv('vaccination_all_tweets.csv')
vax_tweets[['date', 'text', 'hashtags', 'user_followers']].head()
We could use the text column of this dataset to train a Twitter language model, but since our end goal is sentiment analysis we will need to find another dataset that also contains sentiment labels to train our classifier. Let's use 'Complete Tweet Sentiment Extraction Data', which contains 40,000 tweets labelled as negative, neutral or positive sentiment. For more accurate results you could use the 'sentiment140' dataset instead, which contains 1.6 million tweets labelled as either positive or negative.
tweets = pd.read_csv('tweet_dataset.csv')
tweets[['old_text', 'new_sentiment']].head()
For our language model, the only input we need is the tweet text. As we will see in a moment, fastai can handle text preprocessing and tokenization for us, but it might be a good idea to remove things like Twitter handles, URLs, hashtags and emojis first. You could experiment with leaving these in for your own models and see how it affects the results. There are also some rows with blank tweets, which need to be removed.
We ideally want the language model to learn not just about tweet language, but more specifically about vaccine tweet language, so we can use text from both datasets as input for the language model. For the classification model, however, we need to remove all rows with missing sentiment labels.
def de_emojify(input_string):
    # Note: this drops *all* non-ASCII characters (accented letters too), not just emojis
    return input_string.encode('ascii', 'ignore').decode('ascii')

# Code via https://www.kaggle.com/pawanbhandarkar/generate-smarter-word-clouds-with-log-likelihood
def tweet_proc(df, text_col='text'):
    df['orig_text'] = df[text_col]
    # Remove Twitter handles
    df[text_col] = df[text_col].apply(lambda x: re.sub(r'@[^\s]+', '', x))
    # Remove URLs
    df[text_col] = df[text_col].apply(lambda x: re.sub(r'http\S+', '', x))
    # Remove emojis
    df[text_col] = df[text_col].apply(de_emojify)
    # Remove hashtags
    df[text_col] = df[text_col].apply(lambda x: re.sub(r'\B#\S+', '', x))
    # Drop rows left with an empty tweet
    return df[df[text_col] != '']
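As a quick sanity check, here is roughly what the cleaning does to a made-up example tweet (not from the real dataset); the handle, URL, emoji and hashtag are all stripped, leaving only harmless extra whitespace:
# A made-up example tweet to check the cleaning
demo = pd.DataFrame({'text': ["@WHO Great news! 💉 #CovidVaccine https://t.co/xyz"]})
tweet_proc(demo)['text'].iloc[0]
# -> ' Great news!   ' (approximately; leftover spaces are harmless for the tokenizer)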
# Clean the text data and combine the dfs
tweets = tweets[['old_text', 'new_sentiment']].rename(columns={'old_text':'text', 'new_sentiment':'sentiment'})
vax_tweets['sentiment'] = np.nan
tweets = tweet_proc(tweets)
vax_tweets = tweet_proc(vax_tweets)
df_lm = pd.concat([tweets[['text', 'sentiment']], vax_tweets[['text', 'sentiment']]])  # DataFrame.append is deprecated in newer pandas
df_clas = df_lm.dropna(subset=['sentiment'])
print(len(df_lm), len(df_clas))
df_clas.head()
Training a language model
To train our language model we can use self-supervised learning: we just need to give the model some text as an independent variable, and fastai will automatically preprocess it and create a dependent variable for us. We can do this in one line of code using the TextDataLoaders class, which converts our input data into a DataLoaders object (a pair of training and validation DataLoaders) that can be passed to a fastai Learner.
dls_lm = TextDataLoaders.from_df(df_lm, text_col='text', is_lm=True, valid_pct=0.1)
Here we told fastai that we are working with text data contained in the text column of a pandas DataFrame called df_lm. We set is_lm=True since we want to train a language model, so fastai needs to label the input data for us. Finally, we told fastai to hold out a random 10% of our data as a validation set using valid_pct=0.1.
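Incidentally, TextDataLoaders.from_df is a shortcut over the lower-level DataBlock API that we will use later for the classifier. A rough equivalent of the line above looks like this (a sketch, with the batch size left at its default):
# Sketch: the same language-model DataLoaders built with the DataBlock API
dls_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),  # tokenize and numericalize the text column
    get_x=ColReader('text'),
    splitter=RandomSplitter(valid_pct=0.1)
).dataloaders(df_lm)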
Let's take a look at the first two rows of the DataLoaders using show_batch.
dls_lm.show_batch(max_n=2)
We have a new column, text_, which is text offset by one word. This is the dependent variable fastai created for us. By default fastai uses word tokenization, which splits the text on spaces and punctuation marks and breaks up words like "can't" into two separate tokens. fastai also has some special tokens starting with 'xx' that are designed to make things easier for the model; for example, xxmaj indicates that the next word begins with a capital letter, and xxunk represents an unknown word that doesn't appear in the vocabulary very often. You could experiment with subword tokenization instead, which splits the text on commonly occurring groups of letters instead of spaces. This might help if you wanted to leave hashtags in, since they often contain multiple words joined together with no spaces, e.g. #CovidVaccine. The fastai tokenization process is explained in much more detail here for those interested.
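To see the tokenizer and its special tokens in action, you can run it on a sentence by hand (the example text is made up, and the exact output may vary slightly between fastai versions):
# fastai's default word tokenizer with its default rules applied
tkn = Tokenizer(WordTokenizer())
tkn("I can't wait for my vaccine!")
# -> something like: ['xxbos', 'xxmaj', 'i', 'ca', "n't", 'wait', 'for', 'my', 'vaccine', '!']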
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
Here we passed language_model_learner our DataLoaders, dls_lm, and the pre-trained RNN architecture, AWD_LSTM, which is built into fastai. drop_mult is a multiplier applied to all the dropouts in the AWD_LSTM model to reduce overfitting; for example, by default fastai's AWD_LSTM applies EmbeddingDropout with 10% probability (at the time of writing), and with drop_mult=0.3 we told fastai to reduce that to 3%. The metrics we want to track are perplexity, which is the exponential of the loss (in this case cross entropy loss), and accuracy, which tells us how often our model predicts the next word correctly. We also call to_fp16() to train in half precision, which uses less memory and speeds up training.
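Since perplexity is just the exponential of the cross-entropy loss, it is easy to compute by hand (the loss value here is made up for illustration):
import torch

loss = torch.tensor(4.0)  # hypothetical cross-entropy validation loss
torch.exp(loss)           # perplexity ≈ 54.6; lower is better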
We can find a good learning rate for training using lr_find and use that to fit our model.
learn.lr_find()
When we created our Learner, the embeddings from the pre-trained AWD_LSTM model were merged with random embeddings added for words that weren't in the vocabulary, and the pre-trained layers were automatically frozen for us. Using fit_one_cycle with our Learner will therefore train only the new random embeddings (i.e. words that are in our Twitter vocab but not the Wikipedia vocab) in the last layer of the neural network.
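If you want to confirm which parts of the model are currently trainable before fitting, learn.summary() is a quick optional check:
# Lists each layer, its output shape and whether its parameters are trainable
learn.summary()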
learn.fit_one_cycle(1, 3e-2)
After one epoch our language model is predicting the next word in a tweet correctly around 25% of the time - not too bad! We can unfreeze the entire model, find a more suitable learning rate and train for a few more epochs to improve the accuracy further.
learn.unfreeze()
learn.lr_find()
learn.fit_one_cycle(4, 1e-3)
After a bit more training we can predict the next word in a tweet around 29% of the time. Let's test the model out by using it to write some random tweets (in this case it will generate some text following 'I love').
TEXT = "I love"
N_WORDS = 30
N_SENTENCES = 2
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
Some interesting results there! Let's save the model encoder so we can use it to fine-tune our classifier. The encoder is all of the model except for the final layer, which converts activations to probabilities of picking each token in the vocabulary. We want to keep the knowledge the model has learned about tweet language but we won't be using our classifier to predict the next word in a sentence, so we won't need the final layer any more.
learn.save_encoder('finetuned_lm')
Training a sentiment classifier
Now we can build the DataLoaders for our classifier, this time using fastai's more flexible DataBlock API.
dls_clas = DataBlock(
blocks = (TextBlock.from_df('text', seq_len=dls_lm.seq_len, vocab=dls_lm.vocab),
CategoryBlock),
get_x=ColReader('text'),
get_y=ColReader('sentiment'),
splitter=RandomSplitter()
).dataloaders(df_clas, bs=64)
To use the DataBlock API, fastai needs the following:
- blocks:
  - TextBlock: our x variable will be text contained in a pandas DataFrame. We want to use the same sequence length and vocab as the language model DataLoaders so we can make use of our pre-trained model.
  - CategoryBlock: our y variable will be a single-label category (negative, neutral or positive sentiment).
- get_x, get_y: get the data for the model by reading the text and sentiment columns from the DataFrame.
- splitter: we will use RandomSplitter() to randomly split the data into a training set (80% by default) and a validation set (20%).
- dataloaders: builds the DataLoaders using the DataBlock template we just defined, the df_clas DataFrame and a batch size of 64.
We can call show_batch as before; this time the dependent variable is the sentiment label.
dls_clas.show_batch(max_n=2)
Initialising the Learner is similar to before, but in this case we want a text_classifier_learner.
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
Finally, we want to load the encoder from the language model we trained earlier, so our classifier uses pre-trained weights.
learn = learn.load_encoder('finetuned_lm')
learn.fit_one_cycle(1, 3e-2)
That first cycle trains only the classifier head, since the encoder is still frozen. Following the ULMFiT recipe of gradual unfreezing with discriminative learning rates - the slice(...) syntax gives earlier layers smaller learning rates, and the 2.6**4 divisor comes from the paper's suggestion of shrinking the rate by a factor of 2.6 per layer - we now unfreeze the last two layer groups:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
Then unfreeze the last three layer groups:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
Finally, let's unfreeze the entire model and train a bit more:
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-3/(2.6**4),1e-3))
learn.save('classifier')
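learn.save stores the weights and optimiser state so training can be resumed later. As an optional extra (not in the original notebook), learn.export pickles the entire Learner, including the preprocessing, for standalone inference:
# Optional: export the whole Learner for inference elsewhere (hypothetical filename)
learn.export('classifier_export.pkl')
# later: learn_inf = load_learner('classifier_export.pkl')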
Our model correctly predicts sentiment around 77% of the time. We could perhaps do better with a larger dataset as mentioned earlier, or different model hyperparameters. It might be worth experimenting with this yourself to see if you can improve the accuracy.
We can quickly sense-check the model by calling predict, which returns the predicted sentiment, the index of the prediction and the predicted probabilities for negative, neutral and positive sentiment.
learn.predict("I love")
learn.predict("I hate")
Classifying unlabelled tweets
Now we can use the classifier to label the vaccine tweets. First we create a test DataLoader from the unlabelled tweet text, which applies the same preprocessing that was used for training:
pred_dl = dls_clas.test_dl(vax_tweets['text'])
We can then make predictions using get_preds:
preds = learn.get_preds(dl=pred_dl)
Finally, we can save the results for analysis later.
vax_tweets['sentiment'] = preds[0].argmax(dim=-1)
vax_tweets['sentiment'] = vax_tweets['sentiment'].map({0:'negative', 1:'neutral', 2:'positive'})
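A side note on the mapping: the 0/1/2 indices follow the category vocab that fastai builds for dls_clas, which is sorted alphabetically by default - negative, neutral, positive - matching the map above. If you also want to keep the model's confidence for each prediction, a small sketch (sent_conf is a hypothetical extra column, not part of the original notebook):
# Optional: store the probability of the predicted class as a confidence score
vax_tweets['sent_conf'] = preds[0].max(dim=-1).values.numpy()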
# Convert dates
vax_tweets['date'] = pd.to_datetime(vax_tweets['date'], errors='coerce').dt.date
# Save to csv
vax_tweets.to_csv('vax_tweets_inc_sentiment.csv')
Conclusion
fastai makes NLP really easy, and we were able to get quite good results with a limited dataset and not a lot of training time by using the ULMFiT approach. To summarise, the steps are:
- Fine-tune a language model to predict the next word in a tweet, using a model pre-trained on Wikipedia.
- Fine-tune a classification model to predict tweet sentiment using the pre-trained language model.
- Apply the classifier to unlabelled tweets to analyse sentiment.
In part 2 we will use our new model for analysis, investigating the overall sentiment of each vaccine, how sentiment changes over time and the relationship between sentiment and vaccination progress in different countries.
I hope you found this useful, and thanks very much to Gabriel Preda for providing the data!