COVID-19 vaccine tweet sentiment analysis with fastai - part 2
This is part two of a two-part NLP series where we carry out sentiment analysis on COVID-19 vaccine tweets. In this part, we visualise changes in tweet sentiment over time for each vaccine, investigate the relationship between sentiment and vaccination progress in different countries and look at the most common words in positive, neutral and negative tweets.
- Analysing overall sentiment
- Timeline analysis for each vaccine
- Further analysis using 'smarter' word clouds
- Conclusion
In part 1, we trained a sentiment classification model and used it to predict the sentiment of tweets about COVID-19 vaccines. Our focus in this part will be to analyse the results from our model.
First, let's load in the data from part 1 and plot the frequency of each sentiment.
import numpy as np
import pandas as pd
import plotly.express as px

vax_tweets = pd.read_csv('https://raw.githubusercontent.com/twhelan22/blog/master/data/vax_tweets_inc_sentiment.csv', index_col=0, parse_dates=['date'])
# Plot sentiment value counts
vax_tweets['sentiment'].value_counts(normalize=True).plot.bar(title='COVID-19 vaccine tweet sentiment');
We can see that the predominant sentiment is neutral, with more positive tweets than negative. It's encouraging that negative sentiment isn't higher! We can also visualise how sentiment changes over time:
# Get counts of number of tweets by sentiment for each date
timeline = vax_tweets.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index().dropna()
# Plot results
fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']},
title='Timeline showing sentiment of tweets about COVID-19 vaccines')
fig.show()
There was a big spike in the number of tweets on March 1st 2021, so let's investigate further. A lot of the tweets appear to be from users in India:
spike = vax_tweets[vax_tweets['date'].astype(str)=='2021-03-01']
spike['user_location'].value_counts(ascending=False).head(10)
# Look at tweets from the users with the most followers
spike = spike.sort_values('user_followers', ascending=False)
spike['orig_text'].head()
It looks like Indian Prime Minister Narendra Modi received the first dose of the Indian-developed Covaxin on March 1st. No wonder there were lots of tweets! To dig deeper, let's plot timelines for each vaccine individually.
all_vax = ['covaxin', 'sinopharm', 'sinovac', 'moderna', 'pfizer', 'biontech', 'oxford', 'astrazeneca', 'sputnik']
# Function to filter the data to a single vaccine
# Note: a lot of the tweets seem to contain hashtags for multiple vaccines even though
# they are specifically referring to one vaccine - not very helpful! (We quantify this below.)
def filtered_df(df, vax):
    df = df.dropna()
    # Keep tweets that mention any of the requested vaccine names...
    df_filt = pd.concat([df[df['orig_text'].str.lower().str.contains(o)] for o in vax])
    # ...then drop any that also mention another vaccine
    other_vax = list(set(all_vax) - set(vax))
    for o in other_vax:
        df_filt = df_filt[~df_filt['orig_text'].str.lower().str.contains(o)]
    df_filt = df_filt.drop_duplicates()
    return df_filt
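Since many tweets tag several vaccines at once, here is a quick sanity check of how common that is (a rough sketch, using the orig_text column from part 1):
# Share of tweets whose text mentions two or more different vaccine names
text_lower = vax_tweets['orig_text'].astype(str).str.lower()
n_mentions = sum(text_lower.str.contains(v).astype(int) for v in all_vax)
print(f"{(n_mentions > 1).mean():.1%} of tweets mention more than one vaccine")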
# Function to plot the timeline
def plot_timeline(df, title):
    title_str = 'Timeline showing sentiment of tweets about the ' + title + ' vaccine'
    timeline = df.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()
    fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']}, title=title_str)
    fig.show()
covaxin = filtered_df(vax_tweets, ['covaxin'])
plot_timeline(covaxin, title='Covaxin')
# Function to filter the data to a single date and print tweets from users with the most followers
def date_filter(df, date):
    return df[df['date'].astype(str) == date].sort_values('user_followers', ascending=False)[['date', 'orig_text']]

def date_printer(df, dates, num=10):
    for date in dates:
        display(date_filter(df, date).head(num))
date_printer(covaxin, ['2021-03-01', '2021-03-03'])
Modi wasn't the only person to make news on March 1st; India's External Affairs Minister and a 100-year-old Hyderabad resident also received their first dose of Covaxin. On March 3rd, phase 3 trial results for Covaxin were published, showing 81% efficacy. It makes sense for there to be a spike in the number of neutral and positive tweets about Covaxin on those dates!
sinovac = filtered_df(vax_tweets, ['sinovac'])
plot_timeline(sinovac, title='Sinovac')
Some notable dates:
date_printer(sinovac, ['2021-02-22', '2021-02-28', '2021-03-01', '2021-03-03', '2021-03-08'], 3)
These tweets are about countries starting their vaccination programme or receiving a new shipment of vaccines. Let's use the 'COVID-19 World Vaccination Progress' dataset to plot daily vaccinations for the mentioned countries:
vax_progress = pd.read_csv('https://raw.githubusercontent.com/twhelan22/blog/master/data/country_vaccinations.csv', index_col=0, parse_dates=['date'])
countries = ['Brazil', 'Thailand', 'Hong Kong', 'Colombia', 'Mexico', 'Philippines', 'Indonesia']
fig = px.line(vax_progress[vax_progress['country'].isin(countries)], x='date', y='daily_vaccinations_per_million', color='country',
title='Daily vaccinations per million (all vaccines) in selected countries')
fig.show()
We can see that daily vaccinations per million increased significantly in Colombia and Mexico after they received new shipments of vaccines. Daily vaccinations are also increasing rapidly in Hong Kong after Carrie Lam received the vaccine on February 22nd; however, progress has been slower in Thailand and the Philippines so far.
sinopharm = filtered_df(vax_tweets, ['sinopharm'])
plot_timeline(sinopharm, title='Sinopharm')
As with Sinovac, most of the Sinopharm tweets appear to be positive news regarding countries receiving a shipment of the vaccine:
date_printer(sinopharm, ['2021-02-18', '2021-02-24', '2021-03-02'], 3)
countries = ['Senegal', 'Nepal', 'Hungary', 'Bolivia', 'Lebanon']
fig = px.line(vax_progress[vax_progress['country'].isin(countries)], x='date', y='daily_vaccinations_per_million', color='country',
title='Daily vaccinations per million (all vaccines) in selected countries')
fig.show()
We can see that Hungary ramped up their vaccination programme after the news on February 18th that they would become the first EU country to start administering Sinopharm. In addition, Senegal started vaccinating shortly after positive tweets confirmed that they had received a shipment of Sinopharm vaccines. Unfortunately there is no data for Iraq, but they also started their programme just hours after receiving a donation of vaccines from China.
moderna = filtered_df(vax_tweets, ['moderna'])
plot_timeline(moderna, title='Moderna')
Some notable dates:
date_printer(moderna, ['2021-02-17', '2021-03-05', '2021-03-11'], 3)
On March 2nd, Dolly Parton received her dose of the vaccine she helped fund, which explains the initial increase in positive tweets prior to the news about Moderna's collaboration with IBM. By looking at the vaccination progress data, we can see that the median daily vaccinations per million in EU countries started to pull further ahead of the rest of the world after the news that they would purchase up to 300m extra Moderna vaccines:
countries = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy',
'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal',
'Romania', 'Slovakia', 'Slovenia', 'Spain','Sweden']
eu = vax_progress[vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
eu['region'] = 'EU'
row = vax_progress[~vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
row['region'] = 'Rest of world'
fig = px.line(pd.concat([eu, row]), x='date', y='daily_vaccinations_per_million', color='region',
              title='Median daily vaccinations per million (all vaccines) in EU countries vs the rest of the world')
fig.add_vline(x='2021-02-17', line_width=3, line_dash='dash', line_color='#00cc96')
fig.add_annotation(x='2021-02-17', y=2120,
text="EU makes a deal to purchase up to 300m extra Moderna vaccines",
showarrow=True,
arrowhead=5, ax=-220, ay=-30)
fig.show()
sputnikv = filtered_df(vax_tweets, ['sputnik'])
plot_timeline(sputnikv, title='Sputnik V')
Some notable dates:
date_printer(sputnikv, ['2021-03-04', '2021-03-05', '2021-03-10', '2021-03-11', '2021-03-15'], 3)
We can see spikes in positive sentiment after various countries agreed to produce the Sputnik V vaccine, and on March 11th after ABC News reported that it was the safest vaccine.
pfizer = filtered_df(vax_tweets, ['pfizer', 'biontech'])
plot_timeline(pfizer, title='Pfizer/BioNTech')
There is a lot to unpack here, so to make things easier let's annotate some of the key dates:
timeline = pfizer.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()
fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']},
title='Timeline showing sentiment of tweets about the Pfizer/BioNTech vaccine')
fig.add_annotation(x='2020-12-14', y=timeline[(timeline['date']=='2020-12-14')&(timeline['sentiment']=='positive')]['tweets'].values[0],
text="USA and UK start vaccinating",
showarrow=True,
arrowhead=3, ax=55, ay=-210)
fig.add_annotation(x='2020-12-22', y=timeline[(timeline['date']=='2020-12-22')&(timeline['sentiment']=='positive')]['tweets'].values[0],
text="Joe Biden receives first dose",
arrowhead=3, ax=10, ay=-100)
fig.add_annotation(x='2021-01-08', y=timeline[(timeline['date']=='2021-01-08')&(timeline['sentiment']=='positive')]['tweets'].values[0],
text="Vaccine shown to resist new variant",
showarrow=True, align='left',
arrowhead=3, ax=0, ay=-45)
fig.add_annotation(x='2021-01-16', y=timeline[(timeline['date']=='2021-01-16')&(timeline['sentiment']=='negative')]['tweets'].values[0],
text="23 elderly Norwegians die after vaccine dose",
showarrow=True, align='left',
arrowhead=3, ax=15, ay=-180)
fig.add_annotation(x='2021-02-19', y=timeline[(timeline['date']=='2021-02-19')&(timeline['sentiment']=='positive')]['tweets'].values[0],
text="Israeli study shows 85% efficacy after one dose",
showarrow=True, align='left',
arrowhead=3, ax=-30, ay=-180)
fig.add_annotation(x='2021-02-25', y=timeline[(timeline['date']=='2021-02-25')&(timeline['sentiment']=='positive')]['tweets'].values[0],
text="Israeli study shows 94% efficacy after two doses",
showarrow=True, align='left',
arrowhead=3, ax=-20, ay=-130)
fig.show()
oxford = filtered_df(vax_tweets, ['oxford', 'astrazeneca'])
plot_timeline(oxford, title='Oxford/AstraZeneca')
Interestingly, there are small positive spikes on February 19th and March 6th, with people tweeting after receiving the vaccine:
date_printer(oxford, ['2021-02-19', '2021-03-06'], 5)
However, negative sentiment has been increasing since numerous countries suspended the use of the vaccine over safety concerns. We can see that vaccination progress in these countries has slowed significantly over the past few days as a result:
# At the time of writing, these countries have completely suspended the use of the vaccine
# Note that several other countries continued mostly as normal but suspended the use of one batch of Oxford/AstraZeneca vaccines
countries = ['Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Ireland', 'Denmark', 'Norway', 'Bulgaria', 'Iceland', 'Thailand']
ox_prog = vax_progress[vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
ox_prog['Use of Oxford/AstraZeneca'] = 'Suspended'
# Compare against countries that use Oxford/AstraZeneca but have not suspended it
other_prog = vax_progress[vax_progress['vaccines'].str.contains('Oxford/AstraZeneca', na=False)]
other_prog = other_prog[~other_prog['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
other_prog['Use of Oxford/AstraZeneca'] = 'Ongoing'
fig = px.line(pd.concat([ox_prog, other_prog]), x='date', y='daily_vaccinations_per_million', color='Use of Oxford/AstraZeneca',
              title="Median daily vaccinations per million (all vaccines) in countries that have completely suspended the use of the"
                    "<br>Oxford/AstraZeneca vaccine vs countries that continue to use it")
fig.add_vrect(x0="2021-03-11", x1="2021-03-15",
annotation_text="vaccine<br>suspended", annotation_position="bottom right",
fillcolor="limegreen", opacity=0.25, line_width=0)
fig.show()
Overall sentiment towards the Oxford/AstraZeneca vaccine is therefore significantly more negative than average:
# Get z-scores of sentiment for each vaccine
vax_names = {'Covaxin': covaxin, 'Sinovac': sinovac, 'Sinopharm': sinopharm,
             'Moderna': moderna, 'Oxford/AstraZeneca': oxford, 'Pfizer/BioNTech': pfizer}
rows = []
for k, v in vax_names.items():
    senti = v['sentiment'].value_counts(normalize=True)
    senti['vaccine'] = k
    rows.append(senti)
sentiment_zscores = pd.DataFrame(rows)
for col in ['negative', 'neutral', 'positive']:
    sentiment_zscores[col + '_zscore'] = (sentiment_zscores[col] - sentiment_zscores[col].mean()) / sentiment_zscores[col].std(ddof=0)
sentiment_zscores.set_index('vaccine', inplace=True)
# Plot the results
ax = sentiment_zscores.sort_values('negative_zscore')['negative_zscore'].plot.barh(title='Z scores of negative sentiment')
ax.set_ylabel('Vaccine')
ax.set_xlabel('Z score');
Further analysis using 'smarter' word clouds
The final thing we will do is to generate word clouds to see which words are indicative of each sentiment. The code below is from this notebook, which contains a more detailed explanation of the methodology used to generate 'smarter' word clouds. Please go and upvote the original notebook if you find this part useful!
!pip install -q wordninja
!pip install -q pyspellchecker
from wordcloud import WordCloud, ImageColorGenerator
import wordninja
from spellchecker import SpellChecker
from collections import Counter
import matplotlib.pyplot as plt
import re
import math
import random
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words.add("amp")
# FUNCTIONS REQUIRED
def flatten_list(l):
    return [x for y in l for x in y]

def is_acceptable(word: str):
    return word not in stop_words and len(word) > 2

# Color coding our wordclouds
def red_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl(0, 100%, {random.randint(25, 75)}%)"

def green_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl({random.randint(90, 150)}, 100%, 30%)"

def yellow_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl(42, 100%, {random.randint(25, 50)}%)"
# Reusable function to generate word clouds
def generate_word_clouds(neg_doc, neu_doc, pos_doc):
    # Display the generated image:
    fig, axes = plt.subplots(1, 3, figsize=(20, 10))
    wordcloud_neg = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neg_doc))
    axes[0].imshow(wordcloud_neg.recolor(color_func=red_color_func, random_state=3), interpolation='bilinear')
    axes[0].set_title("Negative Words")
    axes[0].axis("off")
    wordcloud_neu = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neu_doc))
    axes[1].imshow(wordcloud_neu.recolor(color_func=yellow_color_func, random_state=3), interpolation='bilinear')
    axes[1].set_title("Neutral Words")
    axes[1].axis("off")
    wordcloud_pos = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(pos_doc))
    axes[2].imshow(wordcloud_pos.recolor(color_func=green_color_func, random_state=3), interpolation='bilinear')
    axes[2].set_title("Positive Words")
    axes[2].axis("off")
    plt.tight_layout()
    plt.show()
def get_top_percent_words(doc, percent):
    # Returns a list of the "top-n" most frequent words in a list
    top_n = int(percent * len(set(doc)))
    counter = Counter(doc).most_common(top_n)
    top_n_words = [x[0] for x in counter]
    return top_n_words
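# Quick check with toy data (hypothetical words): with five unique words,
# percent=0.4 keeps the two most frequent, e.g.
# get_top_percent_words(['jab', 'jab', 'jab', 'dose', 'dose', 'arm', 'sore', 'ouch'], 0.4)
# -> ['jab', 'dose']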
def clean_document(doc):
    spell = SpellChecker()
    lemmatizer = WordNetLemmatizer()
    # Lemmatize words (needed for calculating frequencies correctly)
    doc = [lemmatizer.lemmatize(x) for x in doc]
    # Get the top 10% of all words. This may include "misspelled" words
    top_n_words = get_top_percent_words(doc, 0.1)
    # Get a list of misspelled words
    misspelled = spell.unknown(doc)
    # Accept the correctly spelled words and top_n words
    clean_words = [x for x in doc if x not in misspelled or x in top_n_words]
    # Try to split the misspelled words to generate good words (ex. "lifeisstrange" -> ["life", "is", "strange"])
    words_to_split = [x for x in doc if x in misspelled and x not in top_n_words]
    split_words = flatten_list([wordninja.split(x) for x in words_to_split])
    # Some splits may be nonsensical, so reject them ("llouis" -> ['ll', 'ou', 'is'])
    clean_words.extend(spell.known(split_words))
    return clean_words
def get_log_likelihood(doc1, doc2):
    doc1_counts = Counter(doc1)
    doc1_freq = {x: doc1_counts[x] / len(doc1) for x in doc1_counts}
    doc2_counts = Counter(doc2)
    doc2_freq = {x: doc2_counts[x] / len(doc2) for x in doc2_counts}
    doc_ratios = {
        # 1 is added to each frequency to smooth the ratio
        x: math.log((doc1_freq[x] + 1) / (doc2_freq[x] + 1))
        for x in doc1_freq if x in doc2_freq
    }
    top_ratios = Counter(doc_ratios).most_common()
    top_percent = int(0.1 * len(top_ratios))
    return top_ratios[:top_percent]
# Function to generate a document based on likelihood values for words
def get_scaled_list(log_list):
    counts = [int(x[1] * 100000) for x in log_list]
    words = [x[0] for x in log_list]
    cloud = []
    for i, word in enumerate(words):
        cloud.extend([word] * counts[i])
    # Shuffle to make it more "real"
    random.shuffle(cloud)
    return cloud
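To see what this produces, here's a quick illustration with made-up log-likelihood values (the words and scores below are hypothetical):
# Each word is repeated in proportion to its score, e.g. int(0.002 * 100000) = 200
Counter(get_scaled_list([('jab', 0.002), ('dose', 0.001)]))
# Counter({'jab': 200, 'dose': 100})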
# Convert string to a list of words
vax_tweets['words'] = vax_tweets.text.astype(str).apply(lambda x: re.findall(r'\w+', x))
def get_smart_clouds(df):
    neg_doc = flatten_list(df[df['sentiment'] == 'negative']['words'])
    neg_doc = [x for x in neg_doc if is_acceptable(x)]
    pos_doc = flatten_list(df[df['sentiment'] == 'positive']['words'])
    pos_doc = [x for x in pos_doc if is_acceptable(x)]
    neu_doc = flatten_list(df[df['sentiment'] == 'neutral']['words'])
    neu_doc = [x for x in neu_doc if is_acceptable(x)]
    # Clean all the documents
    neg_doc_clean = clean_document(neg_doc)
    neu_doc_clean = clean_document(neu_doc)
    pos_doc_clean = clean_document(pos_doc)
    # Combine classes B and C to compare against A (ex. "positive" vs "non-positive")
    top_neg_words = get_log_likelihood(neg_doc_clean, flatten_list([pos_doc_clean, neu_doc_clean]))
    top_neu_words = get_log_likelihood(neu_doc_clean, flatten_list([pos_doc_clean, neg_doc_clean]))
    top_pos_words = get_log_likelihood(pos_doc_clean, flatten_list([neu_doc_clean, neg_doc_clean]))
    # Generate a synthetic corpus using our log-likelihood values
    neg_doc_final = get_scaled_list(top_neg_words)
    neu_doc_final = get_scaled_list(top_neu_words)
    pos_doc_final = get_scaled_list(top_pos_words)
    # Visualise our synthetic corpus
    generate_word_clouds(neg_doc_final, neu_doc_final, pos_doc_final)
get_smart_clouds(vax_tweets)
This looks pretty good! The positive tweets appear to be from people who have just received their first vaccine or are grateful for the job scientists and healthcare workers are doing, whereas the negative tweets seem to be from people who have suffered adverse reactions to the vaccine. The neutral tweets read more like news, which could explain why neutral is the most prevalent sentiment; in fact, the vast majority of tweets contain URLs:
vax_tweets['has_url'] = np.where(vax_tweets['orig_text'].str.contains('http'), 'yes', 'no')
vax_tweets['has_url'].value_counts(normalize=True).plot.bar(title='Does the tweet contain a url?');
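To dig into that link, we can also cross-tabulate sentiment against the has_url flag we just created (a quick sketch using pandas' crosstab):
# Proportion of each sentiment among tweets with and without a URL
pd.crosstab(vax_tweets['has_url'], vax_tweets['sentiment'], normalize='index')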
Interestingly, Canada shows up in the negative word cloud, as well as a couple of Canadian cities. Looking at a 'naive' word cloud for tweets containing 'Canada' shows us that this appears to be a political/economic issue:
def get_cloud(df, string, c_func):
    string_l = string.lower()
    df[string_l] = np.where(df['text'].str.lower().str.contains(string_l), 1, 0)
    cloud_df = df.copy()[df[string_l] == 1]
    doc = flatten_list(cloud_df['words'])
    doc = [x for x in doc if is_acceptable(x)]
    doc = clean_document(doc)
    fig, axes = plt.subplots(figsize=(9, 5))
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(doc))
    axes.imshow(wordcloud.recolor(color_func=c_func, random_state=3), interpolation='bilinear')
    axes.set_title("Naive word cloud for tweets containing '%s'" % string)
    axes.axis("off")
    plt.show()
get_cloud(vax_tweets, 'Canada', red_color_func)
At the time of writing, Canada's vaccination progress has been slower than that of other developed nations, and people are predicting that this might have an impact on Canada's economic recovery:
countries = ['Canada', 'United Kingdom', 'United States', 'Chile', 'Singapore', 'Israel', 'Australia']
selected = vax_progress[vax_progress['country'].isin(countries)]
eu['country'] = 'EU median'
fig = px.line(pd.concat([selected, eu]), x='date', y='daily_vaccinations_per_million', color='country',
              title='Daily vaccinations per million (all vaccines) in Canada vs selected other developed nations')
fig.show()
Conclusion
We were able to gain some interesting insights here, so hopefully you found this useful! That said, there is still a lot left to explore, especially since vaccinations are ongoing and the dataset is still being updated at the time of writing (thanks once again to Gabriel Preda for providing the data).
If you made it this far, I encourage you to give this task a go yourself and see what you can find out! A couple of suggestions:
- Try to improve the accuracy of the fastai models we created in part 1.
- Instead of looking at each vaccine individually, investigate each vaccination scheme (most countries are using more than one vaccine).
- Dig deeper into the sentiment in a specific country and how that relates to vaccination progress (a rough starting point is sketched after this list). You could even analyse a large dataset of all COVID-19 tweets, not just vaccine-specific ones!
- Investigate adverse reactions to the vaccines and how they are reflected in tweet sentiment. For instance, is blood clotting really a concern for patients who have received the Oxford/AstraZeneca vaccine?
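For the country-specific suggestion, a rough starting point might look like this (a sketch: user_location is free text, so a simple substring match will miss users who don't name their country):
# Filter tweets to users whose profile location mentions Canada
canada = vax_tweets[vax_tweets['user_location'].astype(str).str.contains('canada', case=False)]
canada['sentiment'].value_counts(normalize=True)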
Thanks for reading!
1. Cover image via https://www.mamamia.com.au/covid-19-vaccine-latest-update/