May 13, 2021
The assignment for our Advanced Data Science and Predictive Analytics class was to analyze Twitter sentiment toward supporting small businesses in Ontario. It is a relevant topic during COVID and one worth exploring, especially given all the lockdowns Ontario has faced over the past year and more.
The first thing you do when you want to analyze Twitter data is extract tweets. We started off trying to extract tweets using the Twitter API. These attempts generated approximately 600 tweets, specifically from Canada, using Twitter's geolocation field. As most Twitter users don't disclose this geographic information publicly, the number of tweets was a little sparse. Also, the free Twitter API account only returns tweets made in the last seven days, which again did not help us much in terms of volume. We wanted approximately 2,000 tweets.
Instead, we used a web scraping library in Python, known as Twint, to pull the tweets. To get the roughly 2,000 tweets we analyzed, we decided to use seven popular hashtags used when supporting small businesses. The hashtags were #shoplocal, #shopsmall, #smallbiz, #smallbusiness, #supportlocal, #supportlocalbusiness and #supportsmallbusiness. Using the hashtags alone was not enough though, as this would return tweets with these hashtags from all over the world. We wanted to find tweets specifically from Ontario. As a result, we combined the hashtag scrape with a list of the 100 largest cities within Ontario by population, excluding cities with possible cross-contamination from same-named places elsewhere, such as London, Windsor and Cambridge. This meant that all the tweets we collected would have one or more of the hashtags and mention an Ontario city somewhere.
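As a rough illustration of how hashtags and city names can be combined into search terms, here is a minimal Python sketch. It is not our actual scraping code: the `build_queries` helper, the short sample lists and the query format are all assumptions made for illustration, and the resulting strings would still need to be handed to a scraper such as Twint.

```python
from itertools import product

# A few of the hashtags named above (sample only, not the full list of seven).
HASHTAGS = ["#shoplocal", "#shopsmall", "#smallbusiness"]

# Hypothetical short sample; the real list had the 100 largest Ontario cities.
CITIES = ["Toronto", "Ottawa", "London", "Mississauga"]

# City names shared with well-known places elsewhere, left out of the search.
AMBIGUOUS = {"London", "Windsor", "Cambridge"}


def build_queries(hashtags, cities, excluded):
    """Return one search string per (hashtag, city) pair, skipping excluded cities."""
    kept = [city for city in cities if city not in excluded]
    # Quoting the city keeps multi-word names such as "Niagara Falls" together.
    return [f'{tag} "{city}"' for tag, city in product(hashtags, kept)]


queries = build_queries(HASHTAGS, CITIES, AMBIGUOUS)
for query in queries:
    print(query)
```

Each query asks for a tweet that carries one hashtag and mentions one city, so the same tweet can come back from several queries. That is one reason duplicates need to be removed afterwards.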
Using Twint, we were also able to grab tweets older than seven days. We decided to take a snapshot of February 2021 (this was when our assignment was due), using the dates February 1 to 23, 2021. This was done for two reasons. Firstly, the extended period gave us a much higher number of tweets, closer to the 2,000 we wanted. Secondly, we could conduct a sentiment trend analysis over a three-week period that included the lifting of the lockdown for much of Ontario as well as Valentine’s Day and Family Day.
The approach resulted in over 2,000 tweets. Once duplicates were removed and the language was filtered to English, the total number of tweets to be analysed was 1,960.
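The de-duplication and language filter can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not our actual code: it assumes each tweet is a dictionary with hypothetical `id`, `text` and `language` fields, and it treats two records as duplicates when they share the same `id`.

```python
def clean_tweets(tweets, language="en"):
    """Keep tweets in the given language, dropping repeats of the same tweet id."""
    seen_ids = set()
    kept = []
    for tweet in tweets:
        if tweet["language"] != language:
            continue  # not in the target language
        if tweet["id"] in seen_ids:
            continue  # same tweet already collected by another query
        seen_ids.add(tweet["id"])
        kept.append(tweet)
    return kept


# Made-up records for illustration only.
sample = [
    {"id": 1, "text": "Shop local in Ottawa!", "language": "en"},
    {"id": 1, "text": "Shop local in Ottawa!", "language": "en"},
    {"id": 2, "text": "Achetez local a Ottawa", "language": "fr"},
    {"id": 3, "text": "Support small business in Toronto", "language": "en"},
]

print(len(clean_tweets(sample)))
```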
One way to analyse text is to create a word cloud. A word cloud is a visualization of the most common words in a document or, in this case, a batch of tweets. The words appearing most often are larger, giving them a louder voice. To create a word cloud, the text first requires cleansing. All words are converted to lowercase, and punctuation, numbers, emojis and URLs are removed. Common words, such as "it", "we", "you" and "only", are removed as they do not add any insight to the analysis. The data cleansing and word clouds were run in RStudio.
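The cleansing steps described above can be sketched as follows. We did this part in R, so this Python version is only an illustration of the same idea. The tiny `STOPWORDS` set and the `extra_stopwords` parameter are assumptions for the example; the counts it produces are what a word cloud would then scale the words by.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"it", "we", "you", "only", "and", "the", "a", "to", "in", "i"}


def clean_words(text, extra_stopwords=frozenset()):
    """Lowercase the text and strip URLs, punctuation, numbers, emojis and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"['’]", "", text)           # join contractions: don't -> dont
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, digits, emojis
    return [w for w in text.split()
            if w not in STOPWORDS and w not in extra_stopwords]


def word_frequencies(tweets, extra_stopwords=frozenset()):
    """Count how often each remaining word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_words(tweet, extra_stopwords))
    return counts


tweet = "We LOVE #ShopLocal in Toronto!! 🎉 Visit https://example.com 2021"
print(clean_words(tweet))
# Passing extra stop words removes search terms or city names as well.
print(clean_words(tweet, extra_stopwords={"shoplocal", "toronto"}))
```

Passing hashtags or city names as extra stop words is one way to get several views of the same tweets, each with a different set of dominant words removed.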
Three word clouds were created for analysis. The first word cloud includes all the hashtags that were used in the twitter scrape. The hashtags were very dominant, especially shoplocal, supportlocal and smallbusiness. The second word cloud removed the seven hashtag terms used in the initial data scrape resulting in the city names being more dominant, such as Toronto, Ottawa, Niagara. The final word cloud excluded both the hashtags and the cities to give a much clearer view of what words are associated with small business such as online, handmade, support. The cloud also lists twitter campaigns such as dineniagara.
In the word cloud without hashtags or cities you can also see what services small businesses are using on twitter, such as instagram and etsy, and ways they are providing the services, including takeout and delivery. No real surprises, but I love a good word cloud, and think R makes some really pretty ones.
Regarding the middle word cloud, some props should be given to small cities such as Cornwall (280), Niagara (255) and Essex (239) as they had more mentions than much larger cities such as Mississauga (126) and Hamilton (91).
As the word cloud to the right demonstrates, another shout-out should go to the robust marketing on Twitter by individual companies, such as muttlifeinc (a dog store in Milton with 121 appearances) and janadiannation (a hemp and apparel company with 98 appearances).
Sentiment analysis, unlike word clouds, uses the entire tweet, including punctuation and emojis, which accentuate the text and affect how it is weighted. For our analysis we used three different types of sentiment analysis: base sentiment analysis, Naïve Bayes analysis and VADER analysis.
For base sentiment analysis we used TextBlob, an open-source Python library that uses a lexicon-based algorithm to determine whether a tweet is positive or negative. The result is a polarity score and a subjectivity score. Polarity ranges between -1 and 1, where 1 is the most positive and -1 the most negative. The graph below shows that while many of the tweets had a neutral score of 0, the majority fall on the positive side of 0, indicating positive feelings. The sentiment is positive towards small businesses in Ontario.
We also used Naïve Bayes analysis. Naïve Bayes is an algorithm that uses training data to learn how to classify text as either positive or negative, then determines the sentiment of each tweet based on its text, punctuation and emojis. When our small-business-in-Ontario tweets were run through the classifier, an overwhelming share of them, 80%, were classified as positive. Naïve Bayes analysis does not account for neutral tweets, but again, there is strong positive sentiment for small business.
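To show the idea behind Naïve Bayes, here is a toy classifier written from scratch in Python. It is a minimal sketch, not the classifier we used: the `NaiveBayes` class, the six made-up training sentences and the simple whitespace tokenizer are all assumptions for illustration. It uses add-one (Laplace) smoothing so that unseen words do not zero out a class.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Start from the log prior: how common this label is in training.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences, for illustration only.
TRAINING = [
    ("love this local shop", "positive"),
    ("great service and friendly staff", "positive"),
    ("support your amazing local bakery", "positive"),
    ("terrible service never again", "negative"),
    ("rude staff and awful food", "negative"),
    ("worst delivery ever", "negative"),
]

model = NaiveBayes()
model.train(TRAINING)
print(model.classify("love the friendly local bakery"))
print(model.classify("awful rude delivery"))
```

Because the model only knows the labels it was trained on, it always picks one of them. That is why a positive/negative classifier like this has no neutral category.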
The final sentiment analysis used was VADER analysis. This algorithm returns four values: positive, neutral, negative and a compound score. The positive, neutral and negative values are the proportions of the text that fall into each category, and they add up to 1. The compound score sums the valence scores of the words in the text and normalizes the result to be between -1 and 1. By the commonly used thresholds, a tweet is classified as positive when its compound score is greater than or equal to 0.05, as neutral when the compound score is between -0.05 and 0.05, and as negative when the compound score is less than or equal to -0.05. As you can see in the chart below, again, the majority of tweets, 69%, were considered positive, only 6% negative and 25% neutral.
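The thresholds above are applied to the compound score and can be sketched as a small Python function. This is a minimal sketch: the `classify_compound` and `label_shares` helpers and the sample scores are made up for illustration, and in practice the compound scores would come from a VADER analyzer rather than being typed in by hand.

```python
from collections import Counter


def classify_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_shares(compound_scores):
    """Return the fraction of scores falling under each label."""
    labels = Counter(classify_compound(score) for score in compound_scores)
    total = len(compound_scores)
    return {label: count / total for label, count in labels.items()}


# Hypothetical compound scores for five tweets.
scores = [0.81, 0.05, 0.0, -0.02, -0.64]
print(label_shares(scores))
```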
The three sentiment analysis charts.
Top chart is base sentiment, showing that while a lot of the tweets have a neutral score of 0, the majority of the tweets are positive (to the right of 0) showing that the sentiment towards supporting small business in Ontario is positive.
The middle chart shows the results from our Naïve Bayes analysis. Naïve Bayes does not account for neutral tweets, but it clearly shows that the vast majority of tweets in Ontario about supporting small business in February 2021 are positive.
The bottom chart shows our final sentiment analysis using VADER. Tweets were classified as positive, neutral or negative, and the vast majority of tweets, 69%, resulted in a positive classification.
The overall sentiment towards supporting small businesses in Ontario is positive. Positivity was confirmed by all the sentiment analysis methods used. The models disagree on how much negative sentiment exists, but even the highest estimate, the 20% implied by the Naïve Bayes result, suggests that negative sentiment is no greater than 20%.
Using the results of the VADER analysis above, we wanted to look at sentiment over time. We plotted the sentiment both by hour of the day and by day of the month. From the time-of-day chart, you can see that most tweets are positive in the morning and that very few tweets are made between 11pm and 3am. Surprisingly, tweets start to tick up at 4am and peak at around 9am.
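Counting tweets per hour or per day for each sentiment label is a simple grouping exercise. Here is a minimal Python sketch under assumptions: each tweet is represented as a hypothetical `(timestamp, label)` pair, the `counts_by` helper is made up for illustration, and the sample data is invented rather than taken from our results.

```python
from collections import Counter
from datetime import datetime


def counts_by(tweets, bucket):
    """Count tweets per (time bucket, sentiment label) pair."""
    counts = Counter()
    for created_at, label in tweets:
        counts[(bucket(created_at), label)] += 1
    return counts


# Invented (timestamp, label) pairs for illustration only.
SAMPLE = [
    (datetime(2021, 2, 13, 9, 15), "positive"),
    (datetime(2021, 2, 13, 9, 40), "positive"),
    (datetime(2021, 2, 13, 21, 5), "neutral"),
    (datetime(2021, 2, 14, 9, 30), "negative"),
]

by_hour = counts_by(SAMPLE, lambda ts: ts.hour)    # hour of day, 0 to 23
by_day = counts_by(SAMPLE, lambda ts: ts.date())   # calendar day

print(by_hour[(9, "positive")])
print(by_day[(datetime(2021, 2, 13).date(), "positive")])
```

Each label then becomes one line on the chart, with the bucket on the x-axis and the count on the y-axis.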
From the sentiment-per-day chart, we can see that Valentine’s Day had a big impact on positivity. The big snowfall, or the province’s decision to reopen most of Ontario (but not Toronto, Peel and York) around Feb 16, may have affected neutrality and negativity over the next couple of days. It's also fairly clear that Sundays and holidays are not popular days to tweet (Feb 7, 14 and 21 were Sundays, and Feb 15 was Family Day, a holiday in Ontario).
Top chart shows the number of positive, neutral and negative tweets per hour of the day for most of the month of February in Ontario regarding supporting small businesses. Keep in mind, the green line is positive, the red line is neutral and the blue line, at the bottom, is negative.
Bottom chart shows the number of positive, neutral and negative tweets per day for most of the month of February in Ontario regarding supporting small businesses. You can see the peak on February 13th, the day before Valentine's Day. You can also see dips on weekends (Feb 7 is a Sunday) and Family Day (February 15). Keep in mind, the green line is positive, the red line is neutral and the blue line, at the bottom, is negative.
Whatever your attitude is towards Twitter, it does offer a very useful, and otherwise hard to discover, metric. Using text-based data analysis, it is possible to find out what people are feeling about a particular topic, and even see how that sentiment changes over time. Our report analyses Twitter data on supporting small businesses in Ontario, and demonstrates where sentiment is, what cities are using Twitter well and even what day in February was the most positive day for small businesses.
The biggest flaw in this assignment is that the thing we were trying to measure sentiment towards, "supporting small businesses", itself contains a positive word: support. I have no doubt that this has an impact on our results. When the term you are trying to measure sentiment towards includes a positive word, I can't help but think this is going to bias your results. Imagine the results of trying to find sentiment towards "loving ice cream" or "puppies are the best". I think those too would be pretty positive. No, it's not the same, as puppies and ice cream are universally loved, but you get the idea. In the future, this assignment should instead investigate the term "small business" alone, without the implied support. I think it would offer different, and more accurate, results with regard to actual sentiment towards small businesses in Ontario.
Our Twitter scrape using all the Ontario cities (with a few exceptions such as London, Cambridge and Windsor) is not a perfect solution, as many of the remaining city names are still shared with places around the world. Still, to our surprise, it worked pretty well. We took a quick look through the 2,000 tweets collected and were fairly satisfied there weren't many tweets from places like Toronto, Ohio or Markham, Virginia.
This assignment was part of the York University School of Continuing Studies Certificate in Advanced Data Science and Predictive Analytics. Major credit should be given to my very talented classmates, including Chirag, Carole, Diana and Parita, who all contributed so much to this project. We were a solid team!