1. Introduction
In this tutorial, we’ll give an introduction to sentiment analysis. We’ll start with a definition, then discuss the types of sentiment analysis and the calculation techniques associated with them.
Finally, we’ll code a Python example of one of the approaches we’ve presented.
2. What Is Sentiment Analysis?
2.1. Sentiment Defined
We can define sentiment as an attitude, mental feeling, or emotion towards something or someone. Sentiment is typically expressed in the form of a review or an opinion, whether positive or negative.
When someone says, “I didn’t like the latest version of Android OS”, they’re expressing a bad or negative feeling about the latest version of Android OS.
2.2. Sentiment Analysis Defined
Sentiment analysis refers to the use of algorithms and tools from natural language processing (NLP), computational linguistics, machine learning, and statistics to determine the sentiment orientation of a text. Sentiment analysis is also known as opinion mining or emotion analysis.
The two basic sentiment orientations are positive and negative. When we’re dealing with only these, we say it’s a two-class sentiment analysis problem.
Some other problems add neutral to the mix, making it a three-class problem. We assign the neutral class to statements that don’t hold any opinion, such as facts, news, and religious texts. For example, “COVID-19 active cases are increasing” is a neutral statement without any opinion.
With the rise of neural networks and deep learning algorithms, sentiment analysis can go beyond the basic two or three classes to more fine-grained ones, such as angry, sad, fearful, happy, and surprised.
In this tutorial, we’ll stick to the basic three classes: positive, negative, and neutral.
2.3. Usage
The main commercial usage of sentiment analysis is monitoring users’ opinions on social media pages and accounts, in product and hotel reviews, and in non-numeric stock market analysis. This helps manufacturers, service providers, and public figures be proactive about the opinions of their followers and consumers.
As a business example, the ability to quickly understand consumer attitudes and react accordingly is something that Expedia Canada took advantage of in 2017, when they noticed a steady increase in negative feedback about the music used in one of their television advertisements.
3. Types of Sentiment Analysis
3.1. Subjectivity Identification
In this type, the task is to determine the sentiment class of a given text as a whole. The text is usually a sentence, a tweet, or a short paragraph.
The statement “Sadly, my flight has been canceled” holds a negative sentiment, while “I enjoyed riding the new electric scooter” holds a positive one.
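As a quick illustration (not part of the approach we’ll implement later), a ready-made lexicon-based scorer such as NLTK’s VADER can classify these example sentences. This sketch assumes NLTK and its vader_lexicon resource are installed:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# one-time download of the lexicon: import nltk; nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()
for sentence in ["Sadly, my flight has been canceled",
                 "I enjoyed riding the new electric scooter"]:
    scores = analyzer.polarity_scores(sentence)
    # the compound score lies in [-1, 1]; values below zero indicate negative sentiment
    label = 'positive' if scores['compound'] > 0 else 'negative'
    print(sentence, '->', label)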
3.2. Aspect-Based
In this type, the text expresses different sentiments towards different aspects or parts of the same thing or the same person.
Let’s explain it with an example product review: “After one year of using this mobile, I can say it has an excellent full HD screen and outstanding battery, but the camera is poor especially in low light conditions.”
In this case, the user writes a review about a mobile device, but the review covers three different aspects (screen, battery, and camera), each with its own sentiment. For the screen and the battery, the sentiment is positive, but for the camera, it’s negative.
This type of sentiment analysis usually comes up with product and hotel reviews, and we can’t assign a single overall sentiment to the reviewed object.
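To make the idea concrete, here’s a minimal, purely illustrative sketch that naively splits the review into clauses and scores each mentioned aspect with tiny hypothetical word lists; real aspect-based systems are far more sophisticated:
import re

# hypothetical word lists and aspect keywords for this one example
positive_words = {'excellent', 'outstanding', 'good'}
negative_words = {'poor', 'bad', 'terrible'}
aspects = {'screen', 'battery', 'camera'}

review = ("After one year of using this mobile, I can say it has an excellent "
          "full HD screen and outstanding battery, but the camera is poor "
          "especially in low light conditions.")

# naively split the review into clauses on coordinating conjunctions
for clause in re.split(r'\band\b|\bbut\b', review.lower()):
    words = set(re.findall(r'[a-z]+', clause))
    mentioned = aspects & words
    if not mentioned:
        continue
    # counter-style scoring per clause (see the lexicon-based approach below)
    score = len(words & positive_words) - len(words & negative_words)
    label = 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'
    print(mentioned, '->', label)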
4. Sentiment Analysis Methods
4.1. Lexicon-Based Approach
In this approach, we use a pre-defined dictionary (lexicon) of positive and negative words.
When we calculate the aggregated sentiment for the text, we divide the text into tokens or words, then match these tokens with the dictionary.
We also initialize a counter to zero. Whenever we encounter a positive word, we increase the counter by one, and whenever we encounter a negative word, we decrease it by one.
Finally, if the counter value is positive, the aggregated sentiment is positive. Likewise, when the counter value is negative, the aggregated sentiment is negative. If the counter value is zero, we say that the text has a neutral sentiment.
This approach is very basic and not widely used in industry-level solutions. In fact, we can’t use it for aspect-based sentiment analysis. Most of the time, we use this approach for verification.
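Here’s a minimal sketch of this counter-based scoring, using tiny hypothetical word lists in place of a real sentiment dictionary:
import re

positive_words = {'good', 'great', 'excellent', 'enjoyed', 'outstanding'}
negative_words = {'bad', 'poor', 'terrible', 'sadly', 'canceled'}

def lexicon_sentiment(text):
    counter = 0
    for token in re.findall(r'[a-z]+', text.lower()):
        if token in positive_words:
            counter += 1
        elif token in negative_words:
            counter -= 1
    if counter > 0:
        return 'positive'
    if counter < 0:
        return 'negative'
    return 'neutral'

print(lexicon_sentiment('I enjoyed riding the new electric scooter'))  # positive
print(lexicon_sentiment('Sadly, my flight has been canceled'))         # negative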
4.2. Supervised Approach
The supervised approach is a machine learning-based approach that depends on an annotated dataset.
An annotated dataset means that we have a file containing two columns: one for the text and one for its sentiment. We then feed this file to a machine learning pipeline. The pipeline typically involves four main steps (though it isn’t limited to them):
- Some preprocessing of the text, mainly tokenization, stemming, and stop-word removal (see the short sketch after this list).
- Text representation: machine learning algorithms can’t work with raw text, so we have to find a numeric, vectorized representation for it. Many text representations are available, but TF-IDF is one of the most common.
- Train the model using the data we got from steps 1 and 2.
- Test the model to make sure it performs well, given some data that wasn’t used in the training step.
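To illustrate the first step, here’s a minimal sketch of basic preprocessing with NLTK (our own choice of library; the full example below uses simpler regex-based cleaning and lets the vectorizer handle stop words). It assumes NLTK is installed along with its ‘punkt’ and ‘stopwords’ resources:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# one-time downloads: import nltk; nltk.download('punkt'); nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

text = "I enjoyed riding the new electric scooter"
tokens = word_tokenize(text.lower())                 # tokenization
tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
tokens = [stemmer.stem(t) for t in tokens]           # stemming
print(tokens)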
This approach is the most widely used in industry and production-ready solutions. We’ll have a full example of it in Python later.
4.3. Hybrid Approach
In this approach, we combine the previous two approaches. Both the dictionary and the annotated dataset are required. First, we follow the supervised approach, then we use the lexicon approach to verify the results.
We use the hybrid approach only if we have the traditional three classes: positive, negative, and neutral.
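As a rough, hypothetical sketch of that verification step, we can compare the labels produced by the supervised model with those produced by the lexicon approach and flag any disagreements for manual review:
# hypothetical label lists: one from a trained classifier, one from the lexicon approach
model_labels   = ['positive', 'negative', 'neutral', 'positive']
lexicon_labels = ['positive', 'positive', 'neutral', 'positive']

# indices where the two approaches disagree, to be reviewed manually
disagreements = [i for i, (m, l) in enumerate(zip(model_labels, lexicon_labels)) if m != l]
print('samples to review manually:', disagreements)  # [1]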
5. Sentiment Analysis in Action
Here, we present a Python script that performs a full sentiment analysis using the supervised approach. The dataset used in this script is “US Airline Tweets”:
import pandas as pd
import re
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import random
import sklearn.model_selection as model_selection
# load data from csv file into Pandas data frame
tweets_df = pd.read_csv('us_airline_tweets.csv')
# print info to check columns types, rows count, and if there are null values in the columns
print(tweets_df.info())
# drop all columns except the sentiment and tweet text columns
tweets_df.drop(columns=tweets_df.columns.difference(['airline_sentiment', 'text']), inplace=True)
# drop duplicated rows
tweets_df.drop_duplicates(inplace=True)
# check for null values in the final needed columns
print(tweets_df.isnull().sum())
# some basic pre-processing
# remove links
tweets_df['text'] = tweets_df['text'].apply(lambda x: re.sub(r"(www|http|https|pic)([a-zA-Z\.0-9:=\\~#/_\&%\?\-])*", ' ', x))
# remove mention symbol
tweets_df['text'] = tweets_df['text'].apply(lambda x: x.replace('@', ''))
# remove hashtag symbol
tweets_df['text'] = tweets_df['text'].apply(lambda x: x.replace('#', ''))
# convert all text to lower case (this helps in vectorization and training)
tweets_df['text'] = tweets_df['text'].apply(lambda x: x.lower())
# split dataset
X_train, X_test, y_train, y_test = model_selection.train_test_split(tweets_df['text'],
tweets_df['airline_sentiment'],
train_size=0.80, test_size=0.20, random_state=101)
# create a pipeline of two steps
# first: tf-idf vectorizer to represent textual data in numeric vectors
# second: MultinomialNB classifier, which is the scikit-learn implementation of Multinomial Naive Bayes
pipeline = Pipeline([
('vect', TfidfVectorizer(min_df=0.0001, max_df=0.95, analyzer='word', lowercase=True, ngram_range=(1, 3), stop_words='english')),
('clf', MultinomialNB()),
])
# train the model
pipeline.fit(X_train, y_train)
feature_names = pipeline.named_steps['vect'].get_feature_names_out()
# test the model
y_predicted = pipeline.predict(X_test)
# print the classification report
print(metrics.classification_report(y_test, y_predicted))
# print number of features and some samples
print('# of features:', len(feature_names))
print('sample of features:', random.sample(list(feature_names), 40))
# calculate and print the model testing metrics
accuracy = accuracy_score(y_test, y_predicted)
precision = precision_score(y_test, y_predicted, average='weighted')
recall = recall_score(y_test, y_predicted, average='weighted')
f1 = f1_score(y_test, y_predicted, average='weighted')
print('Accuracy: ', "%.2f" % (accuracy*100))
print('Precision: ', "%.2f" % (precision*100))
print('Recall: ', "%.2f" % (recall*100))
print('F1: ', "%.2f" % (f1*100))
When we execute the Python script, it’ll print several metrics:
- Accuracy: 69.60
- Precision: 72.80
- Recall: 69.60
- F1: 63.07
These results aren’t good enough for a production solution, but they give us an idea of the techniques involved. With more pre-processing steps and tuning of the model’s hyperparameters, we should get better results.
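Once trained, the same pipeline object can also classify new, unseen text directly. For example (the tweets below are made up):
# classify a couple of new, made-up tweets with the trained pipeline
new_tweets = ['my flight was delayed again, terrible service',
              'the crew was friendly and boarding was quick']
print(pipeline.predict(new_tweets))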
6. Conclusion
In this tutorial, we showed the definition, usage, types, and approaches of sentiment analysis.
We also showed a complete Python application that trains a supervised model for three-class sentiment classification.