Analysing overall sentiment

In part 1, we trained a sentiment classification model and used it to predict the sentiment of tweets about COVID-19 vaccines. Our focus in this part will be to analyse the results from our model.

Note: As mentioned in part 1, this is a write-up of a submission I made for several Kaggle tasks, which are still open and accepting new entries at the time of writing if you want to give them a go yourself! See the conclusion for some ideas.

First, let's load in the data from part 1 and plot the frequency of each sentiment.

import numpy as np
import pandas as pd
import plotly.express as px

vax_tweets = pd.read_csv('https://raw.githubusercontent.com/twhelan22/blog/master/data/vax_tweets_inc_sentiment.csv', index_col=0, parse_dates=['date'])

# Plot sentiment value counts
vax_tweets['sentiment'].value_counts(normalize=True).plot.bar(title='COVID-19 vaccine tweet sentiment');

We can see that the predominant sentiment is neutral, with more positive tweets than negative. It's encouraging that negative sentiment isn't higher! We can also visualise how sentiment changes over time:

# Get counts of number of tweets by sentiment for each date
timeline = vax_tweets.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index().dropna()

# Plot results
fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']},
             title='Timeline showing sentiment of tweets about COVID-19 vaccines')
fig.show()

There was a big spike in the number of tweets on March 1st 2021, so let's investigate further. A lot of the tweets appear to be from users in India:
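If you'd rather locate the spike programmatically than read it off the chart, a quick check on the timeline frame built above confirms the date:

# Date with the highest total tweet count across all sentiments
timeline.groupby('date')['tweets'].sum().idxmax()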

spike = vax_tweets[vax_tweets['date'].astype(str)=='2021-03-01']
spike['user_location'].value_counts(ascending=False).head(10)
India               258
New Delhi, India    138
patna                52
Mumbai, India        48
New Delhi            46
Bengaluru, India     32
Mumbai               28
Delhi                26
Hyderabad, India     24
Pune, India          22
Name: user_location, dtype: int64
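Locations are free text, so 'India', 'New Delhi, India' and 'patna' all count separately. For a rough overall share, a crude substring check works, though it will miss Indian cities entered without the word 'India':

# Rough share of spike tweets whose location mentions India (approximate)
spike['user_location'].astype(str).str.lower().str.contains('india').mean()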

spike = spike.sort_values('user_location', ascending=False)
spike['orig_text'].head()
18084    Before magreact, do the research how the vacci...
17555    I find this Photo by @cpimspeak\nTo be offensi...
15285    🇮🇳 PM Shri @narendramodi took his first dose o...
16532    Got call at 9 am from health department and mo...
16901    #mRNAvaccine #PfizerBionTech\n#Moderna #Katali...
Name: orig_text, dtype: object

It looks like Indian Prime Minister Narendra Modi received the first dose of the Indian-developed Covaxin on 1st March. No wonder there were lots of tweets! To dig deeper, let's plot timelines for each vaccine individually.

Timeline analysis for each vaccine

Covaxin

all_vax = ['covaxin', 'sinopharm', 'sinovac', 'moderna', 'pfizer', 'biontech', 'oxford', 'astrazeneca', 'sputnik']

# Function to filter the data to a single vaccine
# Note: a lot of the tweets seem to contain hashtags for multiple vaccines even though they are specifically referring to one vaccine - not very helpful!
def filtered_df(df, vax):
    df = df.dropna()
    # Keep tweets whose text mentions any of the requested vaccine names
    df_filt = pd.concat([df[df['orig_text'].str.lower().str.contains(o)] for o in vax])
    # Drop tweets that also mention any other vaccine
    other_vax = list(set(all_vax) - set(vax))
    for o in other_vax:
        df_filt = df_filt[~df_filt['orig_text'].str.lower().str.contains(o)]
    return df_filt.drop_duplicates()
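To get a feel for how much overlap the filter has to deal with, a quick substring count of tweets mentioning more than one vaccine name gives a rough measure (it treats Pfizer and BioNTech as separate names, so it's only indicative):

# Share of tweets whose text mentions two or more vaccine names
text_lower = vax_tweets['orig_text'].astype(str).str.lower()
mentions = text_lower.apply(lambda t: sum(name in t for name in all_vax))
(mentions > 1).mean()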

# Function to plot the timeline
def plot_timeline(df, title):
    title_str = 'Timeline showing sentiment of tweets about the '+title+' vaccine'
    timeline = df.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()
    fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']}, title=title_str)
    fig.show()
    
covaxin = filtered_df(vax_tweets, ['covaxin'])
plot_timeline(covaxin, title='Covaxin')

# Function to filter the data to a single date and print tweets from users with the most followers
def date_filter(df, date):
    return df[df['date'].astype(str)==date].sort_values('user_followers', ascending=False)[['date', 'orig_text']]

from IPython.display import display

def date_printer(df, dates, num=10):
    for date in dates:
        display(date_filter(df, date).head(num))

date_printer(covaxin, ['2021-03-01', '2021-03-03'])
date orig_text
18936 2021-03-01 "Felt secure, will travel safely" EAM @DrSJais...
17463 2021-03-01 #Watch | PM @NarendraModi was administered the...
13382 2021-03-01 @nistula Sources in the govt say PM #NarendraM...
13107 2021-03-01 PM #NarendraModi took the first shot of #COVAX...
18912 2021-03-01 There are two #CovidVaccines that are being us...
18960 2021-03-01 External Affairs Minister Jaishankar receives ...
18750 2021-03-01 A 100-year-old resident of #Hyderabad, Jaidev ...
18700 2021-03-01 #PMModi took the first does of #Covid19 vaccin...
18666 2021-03-01 #PMModi took the first dose of #Covaxin today....
18803 2021-03-01 #PMModi flagged off the second phase of #Covid...
date orig_text
20792 2021-03-03 #Covaxin 81% Effective, Works Against UK Varia...
20403 2021-03-03 “The numbers are extremely promising at this s...
20388 2021-03-03 “The data is quite encouraging”: Dr Rachna Kuc...
20696 2021-03-03 #Covaxin's Phase 3 Trial Results Out! #Covid19...
20411 2021-03-03 For those like me who were concerned that #Cov...
20563 2021-03-03 #Covaxin demonstrates the prowess of Atmanirbh...
20349 2021-03-03 India's vaccine maker Bharat Biotech said Wed ...
20850 2021-03-03 Bharat Biotech announces phase 3 results of Co...
20380 2021-03-03 #Covaxin is one of the two vaccines that have ...
20671 2021-03-03 .@BharatBiotech announces the phase 3 results ...

Modi wasn't the only person to make news on March 1st; India's External Affairs Minister and a 100-year-old Hyderabad resident also received their first dose of Covaxin. On March 3rd, phase 3 trial results for Covaxin were published, showing 81% efficacy. It makes sense for there to be a spike in the number of neutral and positive tweets about Covaxin on those dates!

Sinovac

sinovac = filtered_df(vax_tweets, ['sinovac'])
plot_timeline(sinovac, title='Sinovac')

Some notable dates:

date_printer(sinovac, ['2021-02-22', '2021-02-28', '2021-03-01', '2021-03-03', '2021-03-08'], 3)
date orig_text
11715 2021-02-22 Thai PM Prayut Chan-o-cha possibly among first...
11757 2021-02-22 Carrie Lam, Chief Executive of #HongKong SAR, ...
11765 2021-02-22 The #Philippines has officially approved the e...
date orig_text
16270 2021-02-28 #Thai deputy PM and ministers are part of the ...
16253 2021-02-28 China has provided Mexico with 1 million doses...
16254 2021-02-28 Second batch of #Sinovac vaccines produced by ...
date orig_text
16806 2021-03-01 #Philippine General Hospital (PGH) Director Dr...
16779 2021-03-01 The #Philippines kicked off vaccination drive ...
16818 2021-03-01 A batch of #Sinovac #vaccine donated by China ...
date orig_text
19152 2021-03-03 Brazilian soccer legend #Pele on Tuesday recei...
19162 2021-03-03 In pics: Raw materials for China's #Sinovac #C...
19175 2021-03-03 It is extremely unlikely that the death of a 6...
date orig_text
23448 2021-03-08 The second batch of China's #Sinovac COVID-19 ...
23834 2021-03-08 China's #Sinovac #covid19 #vaccines show an 80...
23836 2021-03-08 #Sinovac’s #vaccine shows an 80-90% efficacy r...

These tweets are about countries starting their vaccination programmes or receiving new shipments of vaccines. Let's use the 'COVID-19 World Vaccination Progress' dataset to plot daily vaccinations for the countries mentioned:

vax_progress = pd.read_csv('https://raw.githubusercontent.com/twhelan22/blog/master/data/country_vaccinations.csv', index_col=0, parse_dates=['date'])
countries = ['Brazil', 'Thailand', 'Hong Kong', 'Colombia', 'Mexico', 'Philippines', 'Indonesia']
fig = px.line(vax_progress[vax_progress['country'].isin(countries)], x='date', y='daily_vaccinations_per_million', color='country',
             title='Daily vaccinations per million (all vaccines) in selected countries')
fig.show()

We can see that daily vaccinations per million increased significantly in Colombia and Mexico after they received new shipments of vaccines. Daily vaccinations are also increasing rapidly in Hong Kong after Carrie Lam received the vaccine on February 22nd; however, progress has been slower in Thailand and the Philippines so far.
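If you want to quantify one of those jumps rather than read it off the chart, a simple before/after comparison around the relevant date works; here for Mexico, using the shipment news date suggested by the tweets above:

# Mean daily vaccinations per million before vs after the shipment news
mex = vax_progress[vax_progress['country'] == 'Mexico']
mex.groupby(mex['date'] > '2021-02-28')['daily_vaccinations_per_million'].mean()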

Sinopharm

sinopharm = filtered_df(vax_tweets, ['sinopharm'])
plot_timeline(sinopharm, title='Sinopharm')

As with Sinovac, most of the Sinopharm tweets appear to be positive news regarding countries receiving a shipment of the vaccine:

date_printer(sinopharm, ['2021-02-18', '2021-02-24', '2021-03-02'], 3)
date orig_text
9905 2021-02-18 #Senegal received its #COVID19 vaccines purcha...
10391 2021-02-18 Nepal has granted approval to China’s #Sinopha...
10380 2021-02-18 With the #Sinopharm #vaccine, Hungarians will ...
date orig_text
12655 2021-02-24 #Sinopharm's second COVID-19 vaccine has a 72....
13958 2021-02-24 The first batch of China's #Sinopharm vaccine ...
12680 2021-02-24 #Senegal on Tuesday officially began the first...
date orig_text
17972 2021-03-02 The first batch of #Sinopharm #COVID19 vaccine...
17977 2021-03-02 China will provide 50,000 inactivated #Sinopha...
17945 2021-03-02 #Iraq received its first 50,000 doses of the #...

countries = ['Senegal', 'Nepal', 'Hungary', 'Bolivia', 'Lebanon']
fig = px.line(vax_progress[vax_progress['country'].isin(countries)], x='date', y='daily_vaccinations_per_million', color='country',
             title='Daily vaccinations per million (all vaccines) in selected countries')
fig.show()

We can see that Hungary ramped up their vaccination programme after the news on February 18th that they would become the first EU country to start administering Sinopharm. In addition, Senegal started vaccinating shortly after positive tweets confirmed that they had received a shipment of Sinopharm vaccines. Unfortunately there is no data for Iraq, but they also started their programme just hours after receiving a donation of vaccines from China.
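The missing data is easy to confirm against the progress dataset:

# Check whether Iraq appears in the vaccination progress data
'Iraq' in set(vax_progress['country'])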

Moderna

moderna = filtered_df(vax_tweets, ['moderna'])
plot_timeline(moderna, title='Moderna')

Some notable dates:

date_printer(moderna, ['2021-02-17', '2021-03-05', '2021-03-11'], 3)
date orig_text
9458 2021-02-17 #UPDATE The European Union has bought up to 30...
9464 2021-02-17 #Covid19: EU approves contract for 300 million...
9471 2021-02-17 💉🇪🇺 The European Commission said on Wednesday ...
date orig_text
22413 2021-03-05 #Japan’s Takeda Pharmaceutical Co asks regulat...
22422 2021-03-05 #Moderna To Collaborate With IBM On #COVID19Va...
21191 2021-03-05 Moderna COVID-19 Vaccine Recipients Experience...
date orig_text
27203 2021-03-11 @mpetrillo59 It was the #Moderna.
27207 2021-03-11 I got the #CovidVaccine today.\nI received the...
27370 2021-03-11 🇺🇸Utah mother, 39, with NO known health issues...

On March 2nd Dolly Parton received her dose of the vaccine she helped fund, which explains the initial increase in positive tweets prior to the news about Moderna's collaboration with IBM. Looking at the vaccination progress data, we can see that the median daily vaccinations per million across EU countries started to pull further ahead of the rest of the world after the news that the EU would purchase up to 300m extra doses of the Moderna vaccine:

countries = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark', 
             'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 
             'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal', 
             'Romania', 'Slovakia', 'Slovenia', 'Spain','Sweden']
eu = vax_progress[vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
eu['region'] = 'EU'
row = vax_progress[~vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
row['region'] = 'Rest of world'
fig = px.line(pd.concat([eu, row]), x='date', y='daily_vaccinations_per_million', color='region',
             title='Median daily vaccinations per million (all vaccines) in EU countries vs the rest of the world')

fig.add_vline(x='2021-02-17', line_width=3, line_dash='dash', line_color='#00cc96')

fig.add_annotation(x='2021-02-17', y=2120,
            text="EU makes a deal to purchase up to 300m extra Moderna vaccines",
            showarrow=True,
            arrowhead=5, ax=-220, ay=-30)

fig.show()

Sputnik V

sputnikv = filtered_df(vax_tweets, ['sputnik'])
plot_timeline(sputnikv, title='Sputnik V')

Some notable dates:

date_printer(sputnikv, ['2021-03-04', '2021-03-05', '2021-03-10', '2021-03-11', '2021-03-15'], 3)
date orig_text
21932 2021-03-04 European Union drug regulator on Thursday star...
22012 2021-03-04 Sputnik V could be India's third #Covid19 vacc...
21991 2021-03-04 Sputnik V Could Be India’s 3rd COVID Vaccine: ...
date orig_text
22850 2021-03-05 [Coronavirus] EU's medicines agency @EMA_News ...
22865 2021-03-05 #SputnikV is now the world's second most popul...
22876 2021-03-05 Twitter officially Verified #SputnikV account....
date orig_text
30610 2021-03-10 Iran and Russia will start to jointly produce ...
30745 2021-03-10 #Russia has signed a deal to produce its #Sput...
30748 2021-03-10 #SputnikV has not yet been approved for use in...
date orig_text
30512 2021-03-11 Best #SputnikV4Victory photos will be publishe...
30513 2021-03-11 #SputnikV, approved by 50 countries, brings vi...
30494 2021-03-11 Anti-#covid19 update:\n\n🇰🇪Kenya, 🇲🇦Morocco, 🇯...
date orig_text
30153 2021-03-15 #NewsAlert | #SputnikV production agreements r...
30088 2021-03-15 The developers of the #SputnikV #coronavirus #...
30044 2021-03-15 @Malinka1102 Salam, here is your unroll: #Russ...

We can see spikes in positive sentiment after various countries agreed to produce the Sputnik V vaccine, and on March 11th after ABC News reported that it was the safest vaccine.

Pfizer/BioNTech

pfizer = filtered_df(vax_tweets, ['pfizer', 'biontech'])
plot_timeline(pfizer, title='Pfizer/BioNTech')

There is a lot to unpack here, so to make things easier let's annotate some of the key dates:

timeline = pfizer.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()

fig = px.line(timeline, x='date', y='tweets', color='sentiment', category_orders={'sentiment': ['neutral', 'negative', 'positive']},
              title='Timeline showing sentiment of tweets about the Pfizer/BioNTech vaccine')

fig.add_annotation(x='2020-12-14', y=timeline[(timeline['date']=='2020-12-14')&(timeline['sentiment']=='positive')]['tweets'].values[0],
            text="USA and UK start vaccinating",
            showarrow=True,
            arrowhead=3, ax=55, ay=-210)

fig.add_annotation(x='2020-12-22', y=timeline[(timeline['date']=='2020-12-22')&(timeline['sentiment']=='positive')]['tweets'].values[0],
            text="Joe Biden receives first dose",
            arrowhead=3, ax=10, ay=-100)

fig.add_annotation(x='2021-01-08', y=timeline[(timeline['date']=='2021-01-08')&(timeline['sentiment']=='positive')]['tweets'].values[0],
            text="Vaccine shown to resist new variant",
            showarrow=True, align='left',
            arrowhead=3, ax=0, ay=-45)

fig.add_annotation(x='2021-01-16', y=timeline[(timeline['date']=='2021-01-16')&(timeline['sentiment']=='negative')]['tweets'].values[0],
            text="23 elderly Norwegians die after vaccine dose",
            showarrow=True, align='left',
            arrowhead=3, ax=15, ay=-180)

fig.add_annotation(x='2021-02-19', y=timeline[(timeline['date']=='2021-02-19')&(timeline['sentiment']=='positive')]['tweets'].values[0],
            text="Israeli study shows 85% efficacy after one dose",
            showarrow=True, align='left',
            arrowhead=3, ax=-30, ay=-180)

fig.add_annotation(x='2021-02-25', y=timeline[(timeline['date']=='2021-02-25')&(timeline['sentiment']=='positive')]['tweets'].values[0],
            text="Israeli study shows 94% efficacy after two doses",
            showarrow=True, align='left',
            arrowhead=3, ax=-20, ay=-130)

fig.show()
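The six add_annotation calls are near-identical, so for longer timelines it may be worth wrapping the lookup in a small helper; a minimal sketch (annotate_event is not part of the original code):

# Hypothetical helper: annotate the tweet count for a given date and sentiment
def annotate_event(fig, timeline, date, sentiment, text, ax=0, ay=-60):
    y = timeline[(timeline['date']==date) & (timeline['sentiment']==sentiment)]['tweets'].values[0]
    fig.add_annotation(x=date, y=y, text=text, showarrow=True, arrowhead=3, ax=ax, ay=ay)

annotate_event(fig, timeline, '2020-12-14', 'positive', 'USA and UK start vaccinating', ax=55, ay=-210)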

Oxford/AstraZeneca

oxford = filtered_df(vax_tweets, ['oxford', 'astrazeneca'])
plot_timeline(oxford, title='Oxford/AstraZeneca')

Interestingly, there are small positive spikes on February 19th and March 6th, with people tweeting after receiving the vaccine:

date_printer(oxford, ['2021-02-19', '2021-03-06'], 5)
date orig_text
10616 2021-02-19 Had my 1st dose of the vaccine. Very impressed...
11107 2021-02-19 @nicolab03 Hurrah! Had mine today too. #Oxford...
11108 2021-02-19 Blimey I feel crap. \n\nBut it’s totally worth...
10617 2021-02-19 #vaccine study #nurses #volunteers\n#oxfordast...
11096 2021-02-19 Our latest paper on doses of the #oxfordastraz...
date orig_text
23213 2021-03-06 “The #OxfordAstraZeneca #CovidVaccine develop...
23216 2021-03-06 Drive through #Whitstable for #NewHusband #Oxf...
22480 2021-03-06 @BWildeMTL Hope it gets sorted soon Brian. I h...
23211 2021-03-06 Update... If you look closely, you'll see wher...
22450 2021-03-06 EU seeks to access AstraZeneca vaccines produc...

However, negative sentiment has been increasing since numerous countries suspended the use of the vaccine over safety concerns. We can see that vaccination progress in those countries has slowed significantly over the past few days as a result:

# At the time of writing, these countries have completely suspended the use of the vaccine
# Note that several other countries continued mostly as normal but suspended the use of one batch of Oxford/AstraZeneca vaccines
countries = ['Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Ireland', 'Denmark', 'Norway', 'Bulgaria', 'Iceland', 'Thailand']
ox_prog = vax_progress[vax_progress['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
ox_prog['Use of Oxford/AstraZeneca'] = 'Suspended'
other_prog = vax_progress[vax_progress['vaccines'].str.contains('Oxford/AstraZeneca')]
other_prog = other_prog[~other_prog['country'].isin(countries)].groupby('date')['daily_vaccinations_per_million'].median().reset_index()
other_prog['Use of Oxford/AstraZeneca'] = 'Ongoing'
fig = px.line(pd.concat([ox_prog, other_prog]), x='date', y='daily_vaccinations_per_million', color='Use of Oxford/AstraZeneca',
             title="Median daily vaccinations per million (all vaccines) in countries that have completely suspended the use of the\
              <br>Oxford/AstraZeneca vaccine vs countries that continue to use it")
fig.add_vrect(x0="2021-03-11", x1="2021-03-15", 
              annotation_text="vaccine<br>suspended", annotation_position="bottom right",
              fillcolor="limegreen", opacity=0.25, line_width=0)
fig.show()

Overall sentiment towards the Oxford/AstraZeneca vaccine is therefore significantly more negative than average:

# Get z scores of sentiment for each vaccine
vax_names = {'Covaxin': covaxin, 'Sinovac': sinovac, 'Sinopharm': sinopharm,
            'Moderna': moderna, 'Oxford/AstraZeneca': oxford, 'Pfizer/BioNTech': pfizer}
rows = []
for k, v in vax_names.items():
    senti = v['sentiment'].value_counts(normalize=True)
    senti['vaccine'] = k
    rows.append(senti)
sentiment_zscores = pd.DataFrame(rows)
for col in ['negative', 'neutral', 'positive']:
    sentiment_zscores[col+'_zscore'] = (sentiment_zscores[col] - sentiment_zscores[col].mean())/sentiment_zscores[col].std(ddof=0)
sentiment_zscores.set_index('vaccine', inplace=True)

# Plot the results
ax = sentiment_zscores.sort_values('negative_zscore')['negative_zscore'].plot.barh(title='Z scores of negative sentiment')
ax.set_ylabel('Vaccine')
ax.set_xlabel('Z score');
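Note that Sputnik V isn't included in that comparison, even though we filtered tweets for it earlier. If you want it in the chart, add it to the dictionary before building the z-scores (this will shift the mean and standard deviation the z-scores are based on):

# Optionally include Sputnik V in the z-score comparison
vax_names['Sputnik V'] = sputnikv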

Further analysis using 'smarter' word clouds

The final thing we will do is to generate word clouds to see which words are indicative of each sentiment. The code below is from this notebook, which contains a more detailed explanation of the methodology used to generate 'smarter' word clouds. Please go and upvote the original notebook if you find this part useful!

!pip install -q wordninja
!pip install -q pyspellchecker
from wordcloud import WordCloud, ImageColorGenerator
import wordninja
from spellchecker import SpellChecker
from collections import Counter
import matplotlib.pyplot as plt
import re
import math
import random
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))  
stop_words.add("amp")
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

# FUNCTIONS REQUIRED
def flatten_list(l):
    return [x for y in l for x in y]

def is_acceptable(word: str):
    return word not in stop_words and len(word) > 2

# Color coding our wordclouds 
def red_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(0, 100%, {random.randint(25, 75)}%)" 

def green_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl({random.randint(90, 150)}, 100%, 30%)" 

def yellow_color_func(word, font_size, position, orientation, random_state=None,**kwargs):
    return f"hsl(42, 100%, {random.randint(25, 50)}%)" 

# Reusable function to generate word clouds 
def generate_word_clouds(neg_doc, neu_doc, pos_doc):
    # Display the generated image:
    fig, axes = plt.subplots(1,3, figsize=(20,10))
    
    wordcloud_neg = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neg_doc))
    axes[0].imshow(wordcloud_neg.recolor(color_func=red_color_func, random_state=3), interpolation='bilinear')
    axes[0].set_title("Negative Words")
    axes[0].axis("off")

    wordcloud_neu = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neu_doc))
    axes[1].imshow(wordcloud_neu.recolor(color_func=yellow_color_func, random_state=3), interpolation='bilinear')
    axes[1].set_title("Neutral Words")
    axes[1].axis("off")

    wordcloud_pos = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(pos_doc))
    axes[2].imshow(wordcloud_pos.recolor(color_func=green_color_func, random_state=3), interpolation='bilinear')
    axes[2].set_title("Positive Words")
    axes[2].axis("off")

    plt.tight_layout()
    plt.show();

def get_top_percent_words(doc, percent):
    # Returns a list of "top-n" most frequent words in a list 
    top_n = int(percent * len(set(doc)))
    counter = Counter(doc).most_common(top_n)
    top_n_words = [x[0] for x in counter]
    
    return top_n_words
    
def clean_document(doc):
    spell = SpellChecker()
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize words (needed for calculating frequencies correctly )
    doc = [lemmatizer.lemmatize(x) for x in doc]
    
    # Get the top 10% of all words. This may include "misspelled" words 
    top_n_words = get_top_percent_words(doc, 0.1)

    # Get a list of misspelled words 
    misspelled = spell.unknown(doc)
    
    # Accept the correctly spelled words and top_n words 
    clean_words = [x for x in doc if x not in misspelled or x in top_n_words]
    
    # Try to split the misspelled words to generate good words (ex. "lifeisstrange" -> ["life", "is", "strange"])
    words_to_split = [x for x in doc if x in misspelled and x not in top_n_words]
    split_words = flatten_list([wordninja.split(x) for x in words_to_split])
    
    # Some splits may be nonsensical, so reject them ("llouis" -> ['ll', 'ou', "is"])
    clean_words.extend(spell.known(split_words))
    
    return clean_words
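As a quick spot check of what clean_document does, the 'lifeisstrange' example from the comments should come back split into real words (the exact result depends on the spellchecker's dictionary, and spell.known returns a set, so order may vary):

# Spot check: a mashed-together hashtag gets split into dictionary words
clean_document(['vaccine', 'lifeisstrange'])
# expect roughly: ['vaccine', 'life', 'is', 'strange']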

def get_log_likelihood(doc1, doc2):    
    doc1_counts = Counter(doc1)
    doc1_freq = {
        x: doc1_counts[x]/len(doc1)
        for x in doc1_counts
    }
    
    doc2_counts = Counter(doc2)
    doc2_freq = {
        x: doc2_counts[x]/len(doc2)
        for x in doc2_counts
    }
    
    doc_ratios = {
        # Add 1 to both frequencies to smooth the ratio, so that very rare
        # words don't dominate the rankings
        x: math.log((doc1_freq[x] + 1)/(doc2_freq[x] + 1))
        for x in doc1_freq if x in doc2_freq
    }
    
    top_ratios = Counter(doc_ratios).most_common()
    top_percent = int(0.1 * len(top_ratios))
    return top_ratios[:top_percent]

# Function to generate a document based on likelihood values for words 
def get_scaled_list(log_list):
    counts = [int(x[1]*100000) for x in log_list]
    words = [x[0] for x in log_list]
    cloud = []
    for i, word in enumerate(words):
        cloud.extend([word]*counts[i])
    # Shuffle to make it more "real"
    random.shuffle(cloud)
    return cloud

# Convert string to a list of words
vax_tweets['words'] = vax_tweets.text.astype(str).apply(lambda x: re.findall(r'\w+', x))

def get_smart_clouds(df):

    neg_doc = flatten_list(df[df['sentiment']=='negative']['words'])
    neg_doc = [x for x in neg_doc if is_acceptable(x)]

    pos_doc = flatten_list(df[df['sentiment']=='positive']['words'])
    pos_doc = [x for x in pos_doc if is_acceptable(x)]

    neu_doc = flatten_list(df[df['sentiment']=='neutral']['words'])
    neu_doc = [x for x in neu_doc if is_acceptable(x)]

    # Clean all the documents
    neg_doc_clean = clean_document(neg_doc)
    neu_doc_clean = clean_document(neu_doc)
    pos_doc_clean = clean_document(pos_doc)

    # Combine classes B and C to compare against A (ex. "positive" vs "non-positive")
    top_neg_words = get_log_likelihood(neg_doc_clean, flatten_list([pos_doc_clean, neu_doc_clean]))
    top_neu_words = get_log_likelihood(neu_doc_clean, flatten_list([pos_doc_clean, neg_doc_clean]))
    top_pos_words = get_log_likelihood(pos_doc_clean, flatten_list([neu_doc_clean, neg_doc_clean]))

    # Generate a synthetic corpus using our log-likelihood values
    neg_doc_final = get_scaled_list(top_neg_words)
    neu_doc_final = get_scaled_list(top_neu_words)
    pos_doc_final = get_scaled_list(top_pos_words)

    # Visualise our synthetic corpus
    generate_word_clouds(neg_doc_final, neu_doc_final, pos_doc_final)
    
get_smart_clouds(vax_tweets)

This looks pretty good! The positive tweets appear to be from people who have just received their first vaccine or are grateful for the job scientists and healthcare workers are doing, whereas the negative tweets seem to be from people who have suffered adverse reactions to the vaccine. The neutral tweets read more like news, which could explain why neutral is the most prevalent sentiment; in fact, the vast majority of tweets contain URLs:

vax_tweets['has_url'] = np.where(vax_tweets['orig_text'].str.contains('http'), 'yes', 'no')
vax_tweets['has_url'].value_counts(normalize=True).plot.bar(title='Does the tweet contain a url?');

Interestingly, Canada shows up in the negative word cloud, as well as a couple of Canadian cities. Looking at a 'naive' word cloud for tweets containing 'Canada' shows us that this appears to be a political/economic issue:

def get_cloud(df, string, c_func):
    string_l = string.lower()
    df[string_l] = np.where(df['text'].str.lower().str.contains(string_l), 1, 0)
    cloud_df = df.copy()[df[string_l]==1]
    doc = flatten_list(cloud_df['words'])
    doc = [x for x in doc if is_acceptable(x)]
    doc = clean_document(doc)
    fig, axes = plt.subplots(figsize=(9,5))
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(doc))
    axes.imshow(wordcloud.recolor(color_func=c_func, random_state=3), interpolation='bilinear')
    axes.set_title("Naive word cloud for tweets containing '%s'" % (string))
    axes.axis("off")
    plt.show();
    
get_cloud(vax_tweets, 'Canada', red_color_func)

At the time of writing, Canada's vaccination progress has been slower than that of other developed nations, and people are predicting that it might have an impact on Canada's economic recovery:

countries = ['Canada', 'United Kingdom', 'United States', 'Chile', 'Singapore', 'Israel', 'Australia']
selected = vax_progress[vax_progress['country'].isin(countries)]
eu['country'] = 'EU median'
fig = px.line(pd.concat([selected, eu]), x='date', y='daily_vaccinations_per_million', color='country',
             title='Daily vaccinations per million (all vaccines) in Canada vs selected other developed nations')
fig.show()

Conclusion

We were able to gain some interesting insights here, so hopefully you found this useful! That said, there is still a lot left to explore, especially since vaccinations are ongoing and the dataset is still being updated at the time of writing (thanks once again to Gabriel Preda for providing the data).

If you made it this far, I encourage you to give this task a go yourself and see what you can find out! A few suggestions:

  1. Try to improve the accuracy of the fastai models we created in part 1.
  2. Instead of looking at each vaccine individually, investigate each vaccination scheme (most countries are using more than one vaccine).
  3. Dig deeper on the sentiment in a specific country and how that relates to vaccination progress (see the starter sketch after this list). You could even analyse a large dataset of all COVID-19 tweets, not just vaccine-specific ones!
  4. Investigate adverse reactions to the vaccine and how that is reflected in tweet sentiment. For instance, is blood clotting really a concern for patients who have received the Oxford/AstraZeneca vaccine?
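For suggestion 3, a crude location filter is enough to get started (user_location is free text, so expect some noise):

# Hypothetical starter: sentiment breakdown for users whose location mentions Canada
canada = vax_tweets[vax_tweets['user_location'].astype(str).str.contains('canada', case=False)]
canada['sentiment'].value_counts(normalize=True)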

Thanks for reading!