Read the data in fifa_world_cup_2022_tweets.csv into a dataframe named df, keeping only the columns Tweet and Sentiment.
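A minimal sketch of this step, assuming pandas is the dataframe library (the activity does not name it). A small in-memory sample stands in for the real file; its Source column and row values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for fifa_world_cup_2022_tweets.csv.
# The "Source" column and all row values are invented for this sketch.
sample_csv = io.StringIO(
    "Source,Tweet,Sentiment\n"
    "Twitter Web App,What are we drinking today #WorldCup2022,neutral\n"
    "Twitter for iPhone,Worth reading while watching #WorldCup2022,positive\n"
)

# usecols restricts parsing to the two columns the activity asks for.
# With the real file, pass the file name instead of sample_csv.
df = pd.read_csv(sample_csv, usecols=["Tweet", "Sentiment"])
print(df)
```

Selecting the columns at read time avoids loading data that is dropped immediately afterwards.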
Create a new column Tweet Lower that contains the contents of the Tweet column, but all lowercased.
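One possible way to do this, again assuming a pandas dataframe; the single-row df below is a made-up stand-in for the real data.

```python
import pandas as pd

# Made-up row for illustration only.
df = pd.DataFrame({"Tweet": ["What are we drinking today #WorldCup2022"]})

# The .str accessor applies the string method to every row of the column.
df["Tweet Lower"] = df["Tweet"].str.lower()
print(df["Tweet Lower"].iloc[0])
```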
URLs, hashtags, and mentions contribute little to sentiment analysis, so we'll remove them, starting with the URLs. Remove all the URLs from Tweet Lower and store your results in Tweet Clean.
Warning! Don't forget to remove any leading or trailing whitespace. For example, if you remove the URL from the following tweet:
what are we drinking today @tucantribe
@madbears_
@lkinc_algo
@al_goanna
#worldcup2022 https://t.co/oga3tzvg5h
The result should be:
"""what are we drinking today @tucantribe
@madbears_
@lkinc_algo
@al_goanna
#worldcup2022"""
Note that there is no trailing space after the #worldcup2022 hashtag.
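The example above can be reproduced with a short regular expression sketch. The pattern is an assumption (anything starting with http:// or https:// up to the next whitespace), and remove_urls is a hypothetical helper name, not part of the activity.

```python
import re

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete URLs, then strip leading and trailing whitespace."""
    return URL_PATTERN.sub("", text).strip()


tweet = (
    "what are we drinking today @tucantribe\n"
    "@madbears_\n"
    "@lkinc_algo\n"
    "@al_goanna\n"
    "#worldcup2022 https://t.co/oga3tzvg5h"
)
cleaned = remove_urls(tweet)
print(repr(cleaned))
```

On a pandas column, the same pattern could be applied with `df["Tweet Lower"].str.replace(r"https?://\S+", "", regex=True).str.strip()`.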
Still in Tweet Clean, remove any Twitter mentions (in the form @datawars_io). In this case, we're modifying the original column Tweet Clean in place, so if you make a mistake, you'll have to re-run your previous code and start over.
Remember to strip any leading or trailing whitespace.
Still in Tweet Clean, remove any hashtags.
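Mentions and hashtags can be removed the same way as URLs. The sketch below assumes that a mention or hashtag is an @ or # followed by letters, digits, or underscores; the helper names are hypothetical. Note that this simple pattern would also clip text such as e-mail addresses.

```python
import re

MENTION_PATTERN = re.compile(r"@\w+")  # e.g. @datawars_io
HASHTAG_PATTERN = re.compile(r"#\w+")  # e.g. #worldcup2022


def remove_mentions(text: str) -> str:
    """Delete @mentions, then strip leading and trailing whitespace."""
    return MENTION_PATTERN.sub("", text).strip()


def remove_hashtags(text: str) -> str:
    """Delete #hashtags, then strip leading and trailing whitespace."""
    return HASHTAG_PATTERN.sub("", text).strip()


tweet = (
    "what are we drinking today @tucantribe\n"
    "@madbears_\n"
    "@lkinc_algo\n"
    "@al_goanna\n"
    "#worldcup2022"
)
no_mentions = remove_mentions(tweet)
cleaned = remove_hashtags(no_mentions)
print(repr(cleaned))
```

Only the ends of the string are stripped; whitespace left in the middle of a tweet is handled later by tokenization.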
We'll now start using the nltk module. Don't worry if you've never used it before; these are all simple functions that don't require an NLP background.
We'll start by "tokenizing" the tweets. Tokenizing means splitting a text into individual words or tokens.
Your task is to use the word_tokenize function to create a list of tweet tokens and store the result in tokenized_tweets. This means that tokenized_tweets is a list of lists, with one list of tokens per tweet, in the following form:
[
['what', 'are', 'we', 'drinking', 'today'], # tweet
['worth', 'reading', 'while', 'watching'], # tweet
]
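A sketch of the tokenizing step on two hand-written tweets; in the activity the input would be the Tweet Clean column. word_tokenize needs nltk's tokenizer data to be downloaded first (via nltk.download; the resource name depends on the nltk version). So that the sketch runs anywhere, it falls back to a plain whitespace split when nltk or its data is missing. That fallback is an assumption for illustration only, not part of the activity.

```python
try:
    from nltk.tokenize import word_tokenize

    # Probe call: raises LookupError when the tokenizer data is not downloaded.
    word_tokenize("probe")
except (ImportError, LookupError):
    # Illustration-only fallback: whitespace split, which gives the same
    # tokens as word_tokenize on the simple inputs used below.
    def word_tokenize(text):
        return text.split()


# Hand-written stand-ins for the cleaned tweets.
tweets = ["what are we drinking today", "worth reading while watching"]

# One list of tokens per tweet, so the result is a list of lists.
tokenized_tweets = [word_tokenize(tweet) for tweet in tweets]
print(tokenized_tweets)
```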
Stop words are words that don't contribute much to the meaning of a sentence, such as conjunctions ("for", "and") or articles ("the", "a"). The nltk module provides a list of English stop words, which we can get with stopwords.words('english').
Your task is to remove any stop words from the tokens you have previously generated. Store your results in the variable filtered_tokenized_tweets, which is still a list of lists, but with the stop words filtered out.
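The filtering itself is a nested list comprehension. To keep this sketch independent of nltk's downloaded data, it uses a small hand-picked stop word set; in the activity you would pass stopwords.words('english') instead. remove_stop_words is a hypothetical helper name.

```python
def remove_stop_words(tokenized_tweets, stop_words):
    """Return a new list of lists with every stop word dropped."""
    stop_words = set(stop_words)  # set membership checks are fast
    return [
        [token for token in tokens if token not in stop_words]
        for tokens in tokenized_tweets
    ]


# Hand-picked subset for illustration; not the full nltk list.
stop_words = {"what", "are", "we", "while", "the", "a", "for", "and"}

tokenized_tweets = [
    ["what", "are", "we", "drinking", "today"],
    ["worth", "reading", "while", "watching"],
]
filtered_tokenized_tweets = remove_stop_words(tokenized_tweets, stop_words)
print(filtered_tokenized_tweets)
```

Converting the stop words to a set matters on a large dataset, because a list would be scanned once per token.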
Join the tokens we preprocessed in the previous tasks with a single space to rebuild each tweet. Store your results in the variable cleaned_tweets. This time it's no longer a list of lists but a list of strings (the reassembled tweets), and it'll look something like:
['drinking today',
'amazing launch video . shows much face canada men ’ national team changed since last world cup entry 1986. ’ wait see boys action ! canada : fifa world cup opening video',
'worth reading watching']
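Rebuilding the tweets is a one-line join per token list. The input below is a hand-written stand-in for filtered_tokenized_tweets.

```python
# Hand-written stand-in for the filtered tokens from the previous step.
filtered_tokenized_tweets = [
    ["drinking", "today"],
    ["worth", "reading", "watching"],
]

# " ".join puts exactly one space between consecutive tokens.
cleaned_tweets = [" ".join(tokens) for tokens in filtered_tokenized_tweets]
print(cleaned_tweets)
```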
Use the analyzer.polarity_scores method to perform sentiment analysis on all the tweets in cleaned_tweets. Store the list of results in the variable tweet_sentiment_scores.
As we mentioned before, this requires just a method invocation:
>>> analyzer.polarity_scores(YOUR_TWEET)
Your tweet_sentiment_scores variable will look something like:
[
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
{'neg': 0.0, 'neu': 0.864, 'pos': 0.136, 'compound': 0.6239},
{'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.2263},
...
]
The result of analyzer.polarity_scores is a dictionary with several keys:
>>> analyzer.polarity_scores("DataWars is awesome! I love it so much!")
{'neg': 0.0, 'neu': 0.36, 'pos': 0.64, 'compound': 0.8715}
The neg, neu and pos keys represent the proportions of the text that fall in each category (Negative, Neutral and Positive). They add up to 1.
But the key that we're really interested in is compound, which is a weighted composite score normalized between -1 (most extreme negative) and +1 (most extreme positive). In this case, the compound score of 0.8715 indicates a very positive sentiment.
The general rule of thumb for interpreting the compound score is:
Positive: compound score > 0.05
Neutral: compound score between -0.05 and 0.05
Negative: compound score < -0.05
Calculate the sentiment of each score and store it in the variable tweet_sentiment_results, which should look something like: ['neutral', 'positive', 'positive', ...].
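The thresholds translate directly into a small function. This sketch assumes the three ranges map to the labels 'positive', 'neutral' and 'negative', and that scores of exactly 0.05 or -0.05 count as neutral; classify_compound is a hypothetical helper name. The scores are the sample values shown earlier.

```python
def classify_compound(compound: float) -> str:
    """Map a compound score to a sentiment label using the 0.05 thresholds."""
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"  # -0.05 <= compound <= 0.05


# Sample scores copied from the example output above.
tweet_sentiment_scores = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.864, "pos": 0.136, "compound": 0.6239},
    {"neg": 0.0, "neu": 0.513, "pos": 0.487, "compound": 0.2263},
]

tweet_sentiment_results = [
    classify_compound(scores["compound"]) for scores in tweet_sentiment_scores
]
print(tweet_sentiment_results)
```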
Remove the columns we previously used (Tweet Lower, Tweet Clean) and create a new one named Calculated Sentiment with the results of tweet_sentiment_results.
Assuming the column Sentiment holds the correct sentiment, how many tweets did we classify incorrectly in our Calculated Sentiment column?
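The last two steps can be sketched as follows, assuming a pandas dataframe. The rows are made up, and the comparison assumes both columns spell their labels identically (same words, same case); if the real data differs, normalize the labels first.

```python
import pandas as pd

# Made-up rows for illustration; the real df comes from the earlier steps.
df = pd.DataFrame(
    {
        "Tweet": ["t1", "t2", "t3"],
        "Sentiment": ["neutral", "positive", "negative"],
        "Tweet Lower": ["t1", "t2", "t3"],
        "Tweet Clean": ["t1", "t2", "t3"],
    }
)
tweet_sentiment_results = ["neutral", "positive", "positive"]

# Drop the intermediate columns and attach the calculated labels.
df = df.drop(columns=["Tweet Lower", "Tweet Clean"])
df["Calculated Sentiment"] = tweet_sentiment_results

# Each True in the element-wise comparison is one misclassified tweet.
misclassified = int((df["Sentiment"] != df["Calculated Sentiment"]).sum())
print(misclassified)
```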