Introduction
Sentiment analysis is the process of identifying and classifying the opinions expressed in a body of text. It is often used to determine the emotional tone behind a piece of writing and to classify it as positive, negative, or neutral.
In the realm of business, sentiment analysis is generally used to analyze customer feedback, product reviews, and social media posts to help businesses understand their customers' opinions and emotions. In politics, it can be used to identify public sentiment about a particular candidate or policy. Simply put, there is a wide array of use cases where sentiment analysis can be applied.
Now, for performing sentiment analysis, there are many different techniques, ranging from simple keyword-based approaches to complex deep learning models. These methods typically rely on large datasets of human-labeled text to train the model to identify sentiment.
In general, sentiment analysis is a challenging problem due to the complex and varied ways in which people express emotions in language. However, with the development of advanced natural language processing techniques, sentiment analysis has become increasingly accurate and useful for a wide range of applications.
In this article, we will see how to perform sentiment analysis on a Twitter dataset using Python.
Note that, in this article, we will not go into the nitty-gritty of using Python for web scraping to obtain the data. If that's what you're interested in, it's been covered in detail here. Alternatively, you can use an advanced scraper to gather the data; there are a few advantages to this approach, but more on that later.
Without further ado, let's begin the sentiment analysis process.
Sentiment Analysis on Twitter Dataset
Step 1: Installation & Prerequisites
For this tutorial, you will need to have:
- Python installed and basic knowledge of the language.
- Basic working knowledge of Jupyter Notebook.
- An understanding of how Python libraries such as pandas, NumPy, and Seaborn work.
Let's install and set up Jupyter Notebook on our system using the instructions below:
pip install notebook
This command installs Jupyter Notebook on your system. Once it's installed, run jupyter notebook in your terminal to open it in the browser. We will use the pandas and NumPy libraries for the data analysis process.
Next, we'll load the dataset into the Jupyter Notebook and start cleaning it.
Step 2: Loading the Dataset
First, we will import pandas and NumPy into our Jupyter environment.
import pandas as pd
import numpy as np
The next step is to load our dataset into a variable using the read_json() function, which opens our dataset in our Jupyter environment.
data = pd.read_json('sentiment.json')
We will need to have an idea of what our dataset looks like and for that, we'll use the head() function as shown below:
data.head()
The head() function gives us a quick overview of what our dataset looks like by displaying its first five rows, as shown below:
We want to analyze the posts, so we'll need to separate that column from the table. To do that, we select the 'post body' column using square brackets:
posts = data['post body']
posts
After this, we'll clean the dataset.
Note that there is a way to completely avoid this step. We could use an advanced web scraping tool (more on this later) to gather clean and accurate data and then perform sentiment analysis on them.
But for now, let's continue.
Step 3: Cleaning the Dataset
The next step is to create functions to clean the dataset. Cleaning the dataset helps avoid errors when performing sentiment analysis. To create them, we will first import the re module, which will be used for cleaning our dataset.
import re
After importing the re module, we will now write our function for cleaning our dataset.
def get_url_pattern():
    return re.compile(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))'
                      r'[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})')

def get_hashtags_pattern():
    return re.compile(r'#\w*')

def get_single_letter_words_pattern():
    return re.compile(r'(?<![\w-])\w(?![\w-])')

def get_blank_spaces_pattern():
    return re.compile(r'\s{2,}|\t')

def get_twitter_reserved_words_pattern():
    return re.compile(r'(RT|rt|FAV|fav|VIA|via)')

def get_mentions_pattern():
    return re.compile(r'@\w*')
The six functions above define patterns for the special characters and shorthand that may cause errors during the analysis phase.
re.compile() compiles a regular expression into a pattern object that can then be used to find matching text in our dataset. For example:
def get_mentions_pattern():
    return re.compile(r'@\w*')
The function above matches mentions, i.e. words that start with "@", in our dataset. In the same way, the other functions find URLs, hashtags, single-letter words, blank spaces, and Twitter reserved words.
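To see what one of these patterns does in practice, you can try it on a small made-up string; the sample tweet below is purely illustrative:
# Uses the re module and get_mentions_pattern() defined above
sample = "Thanks @support for the quick reply!"
cleaned = re.sub(pattern=get_mentions_pattern(), repl="", string=sample)
print(cleaned)  # -> "Thanks  for the quick reply!" (the mention is stripped out)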
def process_tweet(word):
    word = re.sub(pattern=get_url_pattern(), repl="", string=word)
    word = re.sub(pattern=get_mentions_pattern(), repl="", string=word)
    word = re.sub(pattern=get_hashtags_pattern(), repl="", string=word)
    word = re.sub(pattern=get_twitter_reserved_words_pattern(), repl='', string=word)
    word = re.sub(r'http\S+', "", word)  # remove http links
    word = re.sub(r'bit\.ly/\S+', "", word)  # remove bitly links
    word = word.strip('[link]')  # remove [link] markers
    word = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', "", word)  # remove retweets
    word = re.sub(r'(@[A-Za-z]+[A-Za-z0-9-_]+)', "", word)  # remove tweeted-at handles
    word = word.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
    return word
The function above combines the six pattern functions to strip the special characters and shorthand from a tweet. It relies on re.sub(), a regex function that replaces every occurrence of the specified pattern with the given string. We will now apply the process_tweet() function to the dataset to remove the confusing characters.
raw_posts = [process_tweet(post) for post in posts]
print(raw_posts)
We'll observe that there are still some confusing characters in the dataset, despite the cleaning we've done on our data. The presence of the special characters means our data requires further cleaning, so we will use a Python library called little_mallet_wrapper.
So we will import little_mallet_wrapper into our Jupyter environment, along with tqdm and seaborn, which are libraries little_mallet_wrapper needs in order to work:
import little_mallet_wrapper
from tqdm import tqdm
import seaborn as sns
and then use the process_string() function from little_mallet_wrapper to clean the dataset:
training_data_posts = [little_mallet_wrapper.process_string(text, numbers='remove', remove_stop_words=False, remove_short_words=False)
                       for text in tqdm(raw_posts)]
The above process now leaves our dataset completely clean and free of special characters, making sentiment analysis easier to perform.
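If you'd prefer not to add an extra dependency, a rough equivalent of this cleaning step can be sketched with plain Python string handling; this is a simple approximation under my own assumptions, not what little_mallet_wrapper does internally:
import string

def basic_process_string(text):
    # Lowercase the text, drop punctuation and digits, and collapse extra whitespace
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))
    return ' '.join(text.split())

# Alternative to the little_mallet_wrapper step above
training_data_posts = [basic_process_string(text) for text in raw_posts]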
Now we can finally move on to performing sentiment analysis on our dataset.
Step 4: Performing Sentiment Analysis on the Dataset
The following steps show how we can perform sentiment analysis.
Step 4.1: Importing Necessary Libraries
The TextBlob and WordCloud libraries will be used for sentiment analysis. So, first, we will import them into our Jupyter environment as shown below:
from textblob import TextBlob
from wordcloud import WordCloud
Step 4.2: Calculating Polarity Score
Next, we'll create a function that calculates polarity score values:
def getPolarityScore(text):
    return TextBlob(text).sentiment.polarity
The function above is used to generate a polarity score. Remember that polarity ranges from -1 to 1: scores below 0 are negative, scores above 0 are positive, and 0 is neutral.
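As a quick sanity check, you can call the function on a couple of made-up phrases; the exact values depend on TextBlob's lexicon, so treat the comments below as indicative rather than exact:
print(getPolarityScore("I love this, it works great"))   # positive, well above 0
print(getPolarityScore("This is terrible and useless"))  # negative, below 0
print(getPolarityScore("The bus arrives at noon"))       # roughly neutral, near 0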
Step 4.3: Converting Polarity Score to Sentiment
So we will create a function that will convert our polarity score to a sentiment:
def getSentiment(polarity_score):
    if polarity_score < 0:
        return 'Negative'
    elif polarity_score == 0:
        return 'Neutral'
    else:
        return 'Positive'
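Putting the two helpers together on a single made-up post looks like this:
sample_post = "what a fantastic turnout at the community meeting"
score = getPolarityScore(sample_post)
print(round(score, 2), getSentiment(score))  # should print a positive score and the label 'Positive'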
Step 4.4: Arrange Results in a Table
Now we will need to arrange the results from our data so that they're easy to read and understand, so we'll put them in a table using the DataFrame() function.
sentiment_rows = []
for post in tqdm(training_data_posts):
    polarity = getPolarityScore(post)
    sentiment = getSentiment(polarity)
    sentiment_rows.append([round(polarity, 2), sentiment, post])

# Build the table in one go (DataFrame.append was removed in recent pandas versions)
sentiment_df = pd.DataFrame(sentiment_rows, columns=['Tweet_Polarity', 'Tweet_Sentiment', 'Tweet'])
sentiment_df.head(10)
The head(10) call displays the first ten rows of our table to give us a quick overview of the sentiment analysis results.
Output:
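If you want to keep these results around for later, you can optionally write the table to disk; the file name below is just a placeholder:
sentiment_df.to_csv('sentiment_results.csv', index=False)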
Step 4.5: Visualization of the Sentiment Results
The last step involves the visualization of our sentiment result, and we will use visualization libraries Seaborn and Matplotlib. We'll import the libraries into our Jupyter environment, as shown in the code below:
import seaborn as sns
import matplotlib.pyplot as plt
We can visualize the results with the code given below:
plt.figure(figsize=(8, 6))
sns.countplot(x='Tweet_Sentiment', data=sentiment_df)
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Count of sentiments in the dataset")
plt.show()
Output:
The graph for the sentiment analysis is shown below:
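Since we imported WordCloud earlier but haven't used it yet, here is an optional, minimal sketch of how you could also visualize the most frequent words in the cleaned posts; the size and color parameters are purely illustrative:
# Build a word cloud from all cleaned posts
all_text = ' '.join(training_data_posts)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()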
However, there is one big problem we haven't talked about until now - actually obtaining the data you need to analyze.
Obtaining Data for Sentiment Analysis
Sentiment analysis algorithms are highly dependent on the quality and context of the input data. If the training data is not representative of the real-world use cases, the algorithm may not perform well on new, unseen data.
Web scraping is a common means of extracting needed data, but that brings with it a few more challenges that one might encounter. We cover some of them in detail here.
Limitations of Conventional Web Scraping
In our current context, scraping to obtain data has the following limitations:
- Time-consuming: Creating a web scraper can be time-consuming, especially for a secure platform like Twitter. If you're starting from scratch, it will require hours of coding and debugging.
- Data accuracy: Twitter data extracted and used for sentiment analysis needs to be clean and structured for it to give accurate results. Tweets are anything but. They are unstructured and unsanitized and require a cleaning phase (refer to "Cleaning the Dataset").
- Security issues: When trying to get data from a website, your requests might get blocked or require some form of authentication before you can access the site's data. This is especially true in our case, Twitter.
- Dynamic websites: Twitter is a dynamic website built with complex APIs and client-side JavaScript, which can hide the data you want from simple web scraping tools. To scrape such a site, you typically need a tool like Selenium, which lets you control a web browser from Python and interact with the page much like a human user would (see the short sketch after this list). In short, this is yet another complex process that adds to the effort.
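To give a sense of what that involves, here is a minimal, hypothetical Selenium sketch; the search URL and the CSS selector are assumptions for illustration only and would need to be adapted (and kept up to date) to match Twitter's actual page structure:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a matching ChromeDriver installation
driver.get("https://twitter.com/search?q=%23ClevelandPolice")  # hypothetical search URL

# Wait for the client-side JavaScript to render, then collect the visible tweet text
driver.implicitly_wait(10)
tweets = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article")]  # selector is an assumption
driver.quit()

print(len(tweets), "tweets collected")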
To sum it up, obtaining the data we want for our use case depends on several factors, all of which need to fall into place for us to obtain accurate results. This is entirely possible, but it requires a lot of time, effort, resources, and manpower to produce the desired output.
These limitations can be easily overcome in the following way. Let's take a look.
Best Way of Obtaining Data for Sentiment Analysis - Using an Advanced Web Scraper
One way to overcome these limitations is to use an advanced web scraper to collect high-quality, context-rich data for training and evaluating your sentiment analysis models.
One such advanced web scraping tool is Bright Data's Web Scraper IDE.
By using Bright Data's ready-made, purpose-built, regularly updated web scraper, which is capable of handling dynamic websites with JavaScript as well as bypassing limits and geoblocks, you can more easily access a wider range of data sources - including social media platforms (in our case, Twitter), forums, and other online communities - for performing sentiment analysis.
Advantages of Using Bright Data's Web Scraper IDE
In brief, the Web Scraper IDE provides the following benefits over a conventional web scraping approach:
- Provides ready-made functions and coding templates for a variety of data sources (including Twitter) that save a lot of time required to build a scraper from scratch.
- Extracts publicly available data and presents it in a neat and structured form. This removes the need for the data-cleaning step that we saw earlier in this article, thereby saving plenty of time and resources, and improving the accuracy and reliability of your models.
- Comes with in-built features that help overcome CAPTCHAs and other blocks (covered in more detail here and here).
- Accesses otherwise difficult-to-access public sites, being built on a patented proxy network infrastructure.
How the Web Scraper IDE Works
For the present context, we can use Bright Data's Twitter Scraper to gather data for sentiment analysis. Let's look at the specific features and how it helps with sentiment analysis.
First, after signing up, click 'User Dashboard' at the top-right corner of Bright Data's home page.
Next, you'll see another page like this:
On this page, click 'View data products' and then click 'Get started' under 'Web Scraper IDE'.
This will take you to the next page, which shows a popup you might be familiar with from a previous article:
From this list of templates, choose 'Twitter hashtag search'. This fires up the Twitter Scraper.
And you can see below how this tool works:
First, input the hashtag of choice and hit 'Preview'. Here we go with the "#ClevelandPolice" hashtag.
The scraper begins its work; you can follow its progress in the 'Run log' tab while it runs.
Once the scraper has finished running, you can find the output in the 'Output' tab. The data is neatly arranged in a table, which you can download in JSON format.
The data obtained is clean, structured, 99.9% accurate, and ready for analysis. As you can see, with this advanced scraping tool, you can skip the entire cleaning process and dive straight into sentiment analysis with your library of choice.
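From there, you can load the downloaded file straight into pandas and jump directly to the analysis steps above; the file name below is just a placeholder for whatever you saved the export as:
import pandas as pd

data = pd.read_json('twitter_hashtag_data.json')  # placeholder for the downloaded JSON export
data.head()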
What's more, Bright Data provides 24/7 support if we require any help with data scraping or using their ready-made datasets. Their team of developers is always available to help users.
Conclusion
Sentiment analysis depends on the accuracy of the dataset. Obtaining datasets via web scraping can be a laborious process, involving a lot of hassles - legal or technical - which can be avoided by using an advanced scraper like Bright Data's Web Scraper IDE. Once accurate data is obtained this way, we can use one of many well-known Python libraries for sentiment analysis.