Sentiment Analysis Based On Roberta For Amazon Review: An Empirical Study On Decision Making
by
Xinli GUO
Master of Science
Abstract
Attention Is All You Need.
Contents

1 Introduction
2 Literature Review
  2.1 NLP
  2.2 Deep Learning in NLP
  2.3 Transformer-based Models
  2.4 Applications in Sentiment Analysis
  2.5 Behavioral Economics in Sentiment Analysis
3 Research Model
  3.1 NLP
    3.1.1 Basic Concepts of NLP
  3.2 Transformer Architecture
    3.2.1 Architecture
    3.2.2 Self-Attention Mechanism
  3.3 BERT: Bidirectional Encoder Representations from Transformers
    3.3.1 Masked Language Model (MLM)
    3.3.2 Next Sentence Prediction (NSP)
    3.3.3 Total Loss
    3.3.4 Training Process
    3.3.5 Differences Between BERT and Transformer
  3.4 RoBERTa: A Robustly Optimized BERT
    3.4.1 Improvements Over BERT
    3.4.2 Architecture
    3.4.3 RoBERTa: Key Pre-Training Enhancements
    3.4.4 Performance
    3.4.5 Generalization
4 Method
  4.1 Sentiment Analysis
    4.1.1 Computing Environment
    4.1.2 Data Collection
    4.1.3 Model Training and Evaluation
5 Result and Analysis
  5.1 Data Processing for Data Analysis
  5.2 Analysis and Solution
    5.2.1 Case: B00R1TAN7I
    5.2.2 Further Analysis and Discussion
6 Discussion
  6.1 Summary of Findings
  6.2 Implications for Theory and Practice
    6.2.1 Theoretical Implications
    6.2.2 Practical Implications
  6.3 Limitations and Future Research Directions
  6.4 Conclusion
Chapter 1
Introduction
In recent years, the field of Natural Language Processing (NLP) has witnessed significant
advancements, particularly with the development of transformer-based models. These
models, such as RoBERTa and DistilBERT, have demonstrated remarkable capabilities
in various NLP tasks, including sentiment analysis. Sentiment analysis, the process of
determining the emotional tone behind a body of text, is crucial for understanding con-
sumer opinions and behaviors. In this study, we employ RoBERTa to perform sentiment
analysis on Amazon product reviews, aiming to derive sentiment scores that reflect the
underlying emotions expressed in the reviews.
In our research, we utilize these models to analyze a substantial dataset of Amazon prod-
uct reviews. By applying these state-of-the-art NLP techniques, we generate sentiment
scores for each review, quantifying the positivity or negativity of the expressed sentiments.
This allows us to evaluate the accuracy of the sentiment scores produced by RoBERTa.
Beyond merely obtaining sentiment scores, our study delves into data analysis and vi-
sualization to observe patterns and trends in review sentiments. Through this, we ex-
plore how these sentiment scores align with principles of behavioral economics, such as
electronic word-of-mouth (eWOM), consumer emotional reactions, and the confirmation
bias. eWOM refers to the influence of online user-generated content on consumer deci-
sions, consumer emotional reactions describe how emotions affect purchasing behaviors,
and confirmation bias highlights how individuals tend to favor information that confirms
their preexisting beliefs.
By integrating advanced NLP techniques with behavioral economics theories, this re-
search not only provides insights into consumer sentiment on Amazon but also demon-
strates the broader applicability of transformer-based models in understanding complex
human behaviors. The findings of this study have significant implications for businesses
and marketers aiming to leverage sentiment analysis for strategic decision-making and
consumer engagement.
Chapter 2
Literature Review
2.1 NLP
Natural Language Processing (NLP) has evolved significantly over the past few decades,
transitioning from rule-based systems to more sophisticated machine learning techniques.
Early sentiment analysis approaches relied heavily on lexicon-based methods, where pre-
defined dictionaries of positive and negative words were used to classify text. While
these methods were straightforward, they often struggled with context and nuances in
language.
Representative challenges include the following:

Word Sense Disambiguation: A single word can have multiple meanings depending on the context. For example, the word "bank" can mean a financial institution or the side of a river.

Ambiguity: For example, "I saw the man with a telescope" can mean either "I used a telescope to see the man" or "I saw a man who had a telescope."

Syntactic Analysis: Different languages have different syntactic rules, and parsing long and complex sentences can be difficult.

Semantic Analysis: Requires understanding implicit semantics and context. For example, "He sold the car he bought last week" implies understanding the temporal relationship.
Classical machine learning techniques, such as Naive Bayes and Support Vector Machines,
brought improvements by learning from labeled datasets. These methods could capture
some context and were more flexible than lexicon-based approaches. However, they still
had limitations, particularly in handling complex linguistic structures and long-range
dependencies in text.
Deep learning brought substantial further gains, culminating in transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers), which utilized bidirectional context to achieve state-of-the-art performance in various NLP tasks.
Specific to product reviews, researchers have utilized these models to gain insights into
consumer opinions. Liu et al. (2020) applied RoBERTa to Amazon reviews, highlighting
its effectiveness in capturing nuanced sentiments and outperforming older techniques.
These studies underline the models’ capabilities in understanding and interpreting com-
plex emotional expressions in text.
Sentiment analysis has been a valuable tool in studying these phenomena. For example,
research by Hu et al. (2014) demonstrated how sentiment trends in online reviews could
predict consumer purchasing behavior, illustrating the snowball effect. Similarly, studies
on the herd effect have used sentiment analysis to show how positive or negative reviews
can influence subsequent reviewers’ sentiments (Liu & Zhang, 2019).
Chapter 3
Research Model
3.1 NLP
3.1.1 Basic Concepts of NLP
Syntactic Analysis
Syntactic analysis involves breaking down sentences into their components and under-
standing the grammatical relationships between them. It includes both syntactic parsing
and semantic analysis:
Syntactic Parsing
Part-of-Speech (POS) Tagging: Assigns a part of speech (such as noun, verb, or adjective) to each word in a sentence.

Syntax Tree Construction: Builds a tree structure to represent the grammatical structure of a sentence, showing the relationships between words.
Semantic Analysis
Named Entity Recognition (NER): Identifies entities in the text, such as names of people, places, and organizations.

Semantic Role Labeling (SRL): Labels the roles words play in the sentence's meaning, such as the agent and patient of an action.
For the sentence "The cat sleeps on the table," syntactic analysis would identify "cat" as a noun (subject), "sleeps" as a verb (predicate), and "on the table" as a prepositional phrase indicating location.
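To make these steps concrete, here is a minimal sketch using the spaCy library (an illustrative toolkit choice, not one used elsewhere in this thesis) that performs POS tagging and named entity recognition:

import spacy

# Small English pipeline; assumes en_core_web_sm has been downloaded.
nlp = spacy.load("en_core_web_sm")

# Syntactic parsing: part-of-speech tags and dependency relations.
doc = nlp("The cat sleeps on the table.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Semantic analysis: named entity recognition.
doc = nlp("Apple was founded by Steve Jobs in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)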
Word Vector
Word vector representation converts words into numerical vectors so that computers can
process and understand them. These vectors capture semantic relationships between
words and are used in various NLP tasks.
Bag of Words (BoW): Represents text as vectors of word counts, ignoring word order and grammatical structure.

Word Embeddings: Map words into a continuous vector space, where semantically similar words are closer together. Common methods include Word2Vec and GloVe.

Contextual Word Representations: Use deep learning models (such as BERT and ELMo) to generate word vectors that take into account the word's context within a sentence.
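To illustrate the difference a context-sensitive representation makes, the sketch below extracts contextual vectors for the word "bank" from a BERT encoder (the bert-base-uncased checkpoint is an illustrative choice) and compares them; the two financial-sense occurrences should typically be more similar to each other than to the river-bank sense:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Return the contextual hidden state of the given word's token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_money1 = word_vector("He deposited cash at the bank.", "bank")
v_money2 = word_vector("She opened an account at the bank.", "bank")
v_river = word_vector("They walked along the river bank.", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(v_money1, v_money2, dim=0))  # typically higher: same sense
print(cos(v_money1, v_river, dim=0))   # typically lower: different sense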
3.2 Transformer Architecture

3.2.1 Architecture
The Transformer model consists of an Encoder and a Decoder. Both components are
composed of stacked layers, each containing sub-layers such as self-attention mechanisms
and feed-forward neural networks.
Encoder: The encoder is responsible for processing the input sequence and generating
a set of representations. The encoder consists of a stack of six identical layers (N = 6).
Each layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization:

LayerNorm(x + Sublayer(x))

where Sublayer(x) represents the function implemented by the sub-layer. To support these residual connections, all sub-layers and the embedding layers in the model produce outputs with a fixed dimension of d_model = 512.
Decoder: The decoder generates the output sequence, typically used for tasks like
machine translation. The decoder also consists of a stack of six identical layers (N = 6).
Each decoder layer contains three sub-layers: a masked multi-head self-attention mechanism over the previously generated outputs, an encoder-decoder attention mechanism over the encoder's representations, and a position-wise feed-forward network.
3.2.2 Self-Attention Mechanism
Self-attention allows the model to weigh the importance of different words in a sentence
when encoding a word. For instance, in the sentence “The cat sat on the mat,” the
word “cat” has a strong connection with “sat” and “mat.” The self-attention mechanism
allows the model to consider these relationships simultaneously rather than sequentially,
improving the handling of long-range dependencies and context.
1. Input Representation: Each word in the input sequence is converted into three vectors:
Query (Q), Key (K), and Value (V) using learned weight matrices.
Q = XW_Q,  K = XW_K,  V = XW_V    (3.1)
2. Scaled Dot-Product Attention: The attention scores are computed using the dot
product of the query and key vectors, scaled by the square root of the dimension of
the key vectors. These scores are then passed through a softmax function to obtain the
attention weights.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
3. Output: The weighted sum of the value vectors produces the output of the self-
attention mechanism.
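The three steps above can be written in a few lines of PyTorch; the following is a minimal sketch with toy dimensions (6 tokens, d_model = 512, d_k = 64):

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V  # weighted sum of the value vectors

X = torch.randn(6, 512)  # toy input: 6 tokens, d_model = 512
W_Q, W_K, W_V = (torch.randn(512, 64) for _ in range(3))  # learned in practice
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # torch.Size([6, 64])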
Multi-Head Attention
Multi-head attention enhances the model's ability to focus on different parts of the input sequence simultaneously by applying multiple attention mechanisms in parallel. This enables the model to capture diverse patterns and dependencies within the data and improves its capacity to model complex relationships in the input.
Figure 3.2: Scaled Dot-Product Attention
3.3 BERT: Bidirectional Encoder Representations from
Transformers
BERT (Bidirectional Encoder Representations from Transformers) revolutionized NLP
by introducing bidirectional context understanding. Traditional models like RNNs and
earlier transformers considered context either from left-to-right or right-to-left, but not
both simultaneously. BERT, however, reads the entire sentence at once, understanding the context from both directions. It is a pre-trained language model based on the Transformer architecture, trained with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks and then fine-tuned for specific downstream tasks.
BERT uses the encoder part of the Transformer architecture, which consists of multi-head
self-attention mechanisms and feed-forward neural networks.
3.3.1 Masked Language Model (MLM)

Masking
Randomly select some tokens in the input sequence and replace them with a [MASK] token. For a masked position i, the objective is to maximize the probability P(w_i | w_context) of recovering the original token given the surrounding context.
Loss Function
Use cross-entropy loss to measure the difference between the predicted and actual tokens.
The cross-entropy loss for predicting the masked token wi is given by:
L_MLM = − Σ_{i ∈ mask} log P(w_i | w_context)
where:

• P(w_i | w_context) is the predicted probability of the masked token w_i given the context.
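A toy sketch of this loss: the cross-entropy is evaluated only at the masked positions (the vocabulary size, sequence length, and masked positions below are illustrative):

import torch
import torch.nn.functional as F

vocab_size, seq_len = 30522, 8
logits = torch.randn(seq_len, vocab_size)           # model outputs per position
targets = torch.randint(0, vocab_size, (seq_len,))  # original token ids
masked_positions = torch.tensor([2, 5])             # positions replaced by [MASK]

# L_MLM = -sum over masked positions of log P(w_i | context),
# averaged here as is conventional.
log_probs = F.log_softmax(logits, dim=-1)
loss = -log_probs[masked_positions, targets[masked_positions]].mean()
print(loss)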
3.3.2 Next Sentence Prediction (NSP)
In the NSP task, the model is trained to predict whether two sentences are consecutive
in the original text. The cross-entropy loss for NSP is given by:
L_NSP = − [ y log P(IsNext | w_[CLS]) + (1 − y) log P(NotNext | w_[CLS]) ]
where:
• P(IsNext | w_[CLS]) is the predicted probability that the second sentence is the actual next sentence.
Sentence Pairs
Construct pairs of sentences (A, B) where 50% of the time B is the actual next sentence
following A, and 50% of the time B is a random sentence.
Classification Task
Use the representation of the [CLS] token to perform a binary classification task, aiming
to maximize the probability:
P(IsNext | [CLS])
Loss Function
Use cross-entropy loss to measure the difference between the predicted and actual la-
bels.
L = L_MLM + L_NSP

where L_MLM is the loss from the masked language model, and L_NSP is the loss from the next sentence prediction.

Figure 3.4: Overall pre-training and fine-tuning procedures for BERT.
BERT is trained on a large corpus using the unsupervised MLM and NSP tasks. MLM randomly masks some tokens in the input and trains the model to predict the masked tokens based on the context, while NSP trains the model to understand the relationship between two sentences by predicting whether a given sentence pair is consecutive in the text.
Fine-tuning
Table 3.1: Comparison of BERT and Transformer

Architecture
Transformer: Encoder-decoder structure. The encoder is a stack of identical layers with multi-head self-attention and a feed-forward neural network; the decoder is a stack of identical layers with multi-head self-attention, encoder-decoder attention, and a feed-forward neural network.
BERT: Uses only the encoder part of the Transformer, a stack of identical layers with multi-head self-attention and a feed-forward neural network, with bidirectional processing.

Training Objectives
Transformer: Typically trained for specific tasks like translation using teacher forcing; the objective is to minimize the loss for predicting the next word in the sequence given the previous words and the source sentence.
BERT: Pre-training with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, followed by fine-tuning on specific downstream tasks with labeled data.

Directionality
Transformer: Unidirectional in the decoder (left-to-right).
BERT: Bidirectional; processes the entire sequence of words simultaneously.

Use Cases and Applications
Transformer: Machine translation, text generation, sequence-to-sequence applications.
BERT: Understanding and classifying text, sentiment analysis, named entity recognition, question answering, information retrieval.
3.4 RoBERTa: A Robustly Optimized BERT

3.4.1 Improvements Over BERT

• Training Data Size: RoBERTa uses a significantly larger dataset, which includes additional data from Common Crawl and OpenWebText.
Table 3.2: Quantitative Improvements of RoBERTa over BERT
Characteristic BERT RoBERTa
Training Data Size 16GB 160GB
Training Steps 1 million 500,000
Batch Size 256 8,192
Learning Rate 1e−4 1e−4 with warm-up
Masking Strategy Static Masking Dynamic Masking
Next Sentence Prediction (NSP) Yes No
• Training Steps: BERT is trained for 1 million steps, whereas RoBERTa is trained for 500,000 steps with a far larger batch size, so it processes considerably more data overall.
• Batch Size: RoBERTa uses a much larger batch size of 8,192 compared to BERT’s
256.
• Learning Rate: Both use a learning rate of 1e−4 , but RoBERTa includes a warm-
up period.
• Masking Strategy: BERT uses static masking, while RoBERTa uses dynamic
masking, changing the masking pattern each epoch.
• Next Sentence Prediction (NSP): RoBERTa removes the NSP task, focusing
solely on the MLM task.
• Training Time: RoBERTa is trained for longer overall; although it uses fewer optimizer steps, its much larger batches and corpus mean it processes many more training examples than BERT.
• Pre-training Tasks: BERT is pre-trained on both MLM and NSP tasks, while
RoBERTa is pre-trained only on the MLM task.
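As an illustration of dynamic masking, the Hugging Face DataCollatorForLanguageModeling utility re-samples which tokens are masked every time a batch is assembled, so the same sentence is masked differently across epochs (a minimal sketch; the checkpoint name is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

enc = tokenizer("The quick brown fox jumps over the lazy dog.")
# Two calls on the same example typically produce different mask patterns.
for _ in range(2):
    batch = collator([{"input_ids": enc["input_ids"]}])
    print(tokenizer.decode(batch["input_ids"][0]))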
3.4.2 Architecture
RoBERTa retains the same architecture as BERT: a stack of Transformer encoder layers, each with multi-head self-attention and feed-forward sub-layers. Key components include:
• Layer Normalization and Residual Connections: Stabilizes and enhances the
training process.
Table 3.3: Pre-Training Enhancements in RoBERTa Compared to BERT

Pre-Training Task
RoBERTa: Uses only the Masked Language Model (MLM) task, which simplifies the training objective and improves the model's understanding of context and semantics.
BERT: Uses both MLM and Next Sentence Prediction (NSP) tasks.
Example for MLM:
Original Sentence: "The quick brown fox jumps over the lazy dog."
Masked Sentence: "The quick brown [MASK] jumps over the [MASK] dog."
Training Objective: Predict "fox" and "lazy" based on the context.

Dynamic Masking
RoBERTa: Generates a new masking pattern for each sequence during training, enhancing the model's ability to generalize by exposing it to a variety of masking patterns.
BERT: Uses static masking, where the masking pattern is fixed and reused throughout training.
Example for Dynamic Masking:
First Epoch: "The quick brown [MASK] jumps over the lazy [MASK]."
Second Epoch: "The [MASK] brown fox jumps [MASK] the lazy dog."
Benefit: Allows the model to learn robust contextual representations by varying the masking patterns.
3.4.4 Performance

Across standard benchmarks such as GLUE, SQuAD, and RACE, RoBERTa consistently matches or outperforms the original BERT, which motivates its use for the sentiment analysis task in this thesis.
3.4.5 Generalization
RoBERTa exhibits better generalization capabilities due to:
Larger and more diverse training data. Training on a larger and more varied dataset
allows RoBERTa to encounter a wider variety of linguistic contexts, improving its ability
to generalize to new, unseen data. The vast amount of data ensures that the model can
learn from more examples, reducing the likelihood of overfitting to specific patterns in
the training data.
Longer training with a more extensive schedule. Although RoBERTa uses fewer optimizer steps than BERT (500,000 versus 1 million), its much larger batches mean it processes far more training data overall, allowing the model to converge better. More training enables the model to better capture the underlying data distribution, leading to improved performance on downstream tasks, and allows RoBERTa to refine its internal representations, making them more robust and effective at generalizing to new tasks.
Removal of the NSP task, which focuses the model on a single, more effective pre-training
task. NSP has been found to be less relevant for many downstream tasks. Removing it
helps the model to avoid learning spurious correlations that do not generalize well.
Chapter 4
Method
User Reviews
Item Metadata
First, we merge the two datasets using parent_asin as the primary key and construct the following new features (see the pandas sketch after this list):

Calculating Review Length: Compute the length of each review in the text column and store the result in a new column, review_length.

Filling Missing Helpful Vote Values: Fill any missing values (NaN) in the helpful_vote column with 0.

Converting Verified Purchase to a Binary Variable: Convert the boolean values (True/False) in the verified_purchase column to integers, mapping True to 1 and False to 0.

Checking for Images: Convert the images column to a binary variable: 1 if the review contains images (a non-empty value), 0 otherwise.

Converting Timestamp to Datetime: Convert the timestamp column from milliseconds since the epoch to a datetime object, allowing further time-related operations.

Extracting Year, Month, and Day: Extract the year, month, and day from the timestamp column into new columns year, month, and day.

Extracting Weekday: Extract the weekday (0-6, representing Monday to Sunday) from the timestamp column into a new column, weekday.
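These operations amount to a few lines of pandas; the following sketch assumes df is the merged review/metadata DataFrame with the column names described above:

import pandas as pd

df["review_length"] = df["text"].str.len()         # length of each review
df["helpful_vote"] = df["helpful_vote"].fillna(0)  # missing votes -> 0
df["verified_purchase"] = df["verified_purchase"].apply(lambda x: 1 if x else 0)
df["has_images"] = df["images"].apply(lambda x: 1 if len(x) > 0 else 0)

# The raw timestamp is in milliseconds since the epoch.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["weekday"] = df["timestamp"].dt.weekday  # 0 = Monday, ..., 6 = Sunday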
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

# Load the RoBERTa pre-trained model and the corresponding tokenizer,
# placing the model on the CUDA device (GPU) for inference.
model_name = "Proggleb/roberta-base-bne-finetuned-amazon_reviews_multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to('cuda')

# truncate_text truncates text that exceeds the maximum length, which
# reduces GPU memory and RAM usage and speeds up inference.
def truncate_text(text, max_length=512):
    return text[:max_length]

# preprocess_and_analyze tokenizes and encodes the texts, feeds them into
# the model, applies softmax to obtain the probability of each sentiment
# class, and returns the difference between the positive and negative
# class probabilities.
def preprocess_and_analyze(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512,
                       return_tensors="pt").to('cuda')
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = torch.nn.functional.softmax(logits, dim=-1)
    sentiments = scores[:, 1] - scores[:, 0]  # POSITIVE score - NEGATIVE score
    return sentiments.cpu().numpy()

# parallel_processing handles one batch of texts: it calls
# preprocess_and_analyze and returns the sentiment scores. As with
# truncation, processing 20 texts per batch keeps memory usage manageable.
def parallel_processing(batch_texts):
    batch_predictions = preprocess_and_analyze(batch_texts)
    return np.array(batch_predictions)

batch_size = 20
all_predictions = []

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = []
    for i in range(0, len(df), batch_size):
        batch_texts = df['text'][i:i + batch_size].tolist()
        futures.append(executor.submit(parallel_processing, batch_texts))

    for future in tqdm(futures, desc="Processing batches"):
        all_predictions.extend(future.result())

df['sentiment_score'] = all_predictions
df.to_json('sentiment_analysis_results_roberta.jsonl', orient='records', lines=True)
print(df.head())
Listing 4.1: Training Process
Evaluation
For evaluation, we define sentiment classification as a binary classification problem, categorizing emotions into positive and negative. Based on the star-rating labels, we map ratings to categories: 1 to 3 stars represent negative emotions, while 4 and 5 stars represent positive emotions. By evaluating the distribution of the model's F1 scores under different thresholds (see the sketch below), we determined the optimal threshold to be -0.6. Accordingly, we treat sentiment scores in [-1, -0.6] as negative emotions and scores in [-0.6, 1] as positive emotions.
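The threshold search can be sketched as follows, assuming df carries the sentiment_score column produced in Listing 4.1 alongside the original star rating (the step size of the sweep is illustrative):

import numpy as np
from sklearn.metrics import f1_score

y_true = (df["rating"] >= 4).astype(int)  # 4-5 stars -> positive, 1-3 -> negative

best_threshold, best_f1 = None, -1.0
for t in np.arange(-0.95, 1.0, 0.05):
    y_pred = (df["sentiment_score"] >= t).astype(int)
    score = f1_score(y_true, y_pred, average="weighted")
    if score > best_f1:
        best_threshold, best_f1 = t, score

print(best_threshold, best_f1)  # the optimum reported above is near -0.6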
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Explanation: The proportion of correct predictions made by the model. The provided
accuracy is 0.8844, indicating that the model correctly predicts 88.44% of the cases.
Figure 4.2: Confusion Matrix at Best Threshold
Precision = TP / (TP + FP)
Explanation: The proportion of true positive predictions among all positive predictions
made by the model. The provided precision is 0.8836.
Recall = TP / (TP + FN)
Explanation: The proportion of true positive predictions among all actual positive cases.
The provided recall is 0.8844.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Explanation: The harmonic mean of precision and recall, providing a balance between
the two metrics. The provided F1 score is 0.8840.
Class          Precision   Recall   F1-score   Support
Negative       0.81        0.79     0.80       201421
Positive       0.91        0.92     0.92       500107
Accuracy                            0.88       701528
Macro avg      0.86        0.86     0.86       701528
Weighted avg   0.88        0.88     0.88       701528

In a binary classification problem, the construction of the ROC curve depends on the False Positive Rate (FPR) and True Positive Rate (TPR) calculated at different thresholds. Below are the specific steps for calculating FPR and TPR:
False Positive Rate (FPR): The proportion of actual negative samples incorrectly predicted as positive. The formula is:
FPR = FP / (FP + TN)
where FP (False Positive) is the number of false positives, and TN (True Negative) is the
number of true negatives.
True Positive Rate (TPR): The proportion of actual positive samples correctly predicted as positive. The formula is:
TPR = TP / (TP + FN)

where TP (True Positive) is the number of true positives, and FN (False Negative) is the number of false negatives.
For each possible threshold over the range of model scores, calculate the FPR and TPR at that threshold. At each threshold, if the model's predicted score is greater than or equal to the threshold, the sample is predicted as positive; otherwise, it is predicted as negative.
The Area Under the Curve (AUC) is the total area under the ROC curve. The closer the
AUC value is to 1, the better the model’s discriminative ability.
Using the FPR and TPR to construct the curve, calculate the AUC using the trapezoidal
rule. The formula for the trapezoidal rule is:
AUC = Σ_{i=1}^{n−1} [(TPR_{i+1} + TPR_i) / 2] × (FPR_{i+1} − FPR_i)
The AUC value is 0.93, indicating that the model performs very well in distinguishing
between negative and positive sentiments.
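A sketch of this computation with scikit-learn and NumPy, reusing y_true and the sentiment scores from the threshold search above (in practice sklearn.metrics.auc performs the trapezoidal step directly):

import numpy as np
from sklearn.metrics import roc_curve

# FPR and TPR at every distinct score threshold.
fpr, tpr, thresholds = roc_curve(y_true, df["sentiment_score"])

# Trapezoidal rule, matching the formula above.
auc_value = np.sum((tpr[1:] + tpr[:-1]) / 2 * np.diff(fpr))
print(auc_value)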
Chapter 5
Result and Analysis
Table 5.2 (continued)

H2: Long Sales Cycle with Stable Quality
Quality Consistency: The product maintains stable quality throughout its sales cycle, providing a consistent user experience, so review sentiments are not significantly altered by quality fluctuations.
Long-Term Reputation Impact: Stable long-term quality helps build a good reputation, making user reviews a more accurate reflection of the product's true quality.

H3: Basic Daily Necessities with Low Technological Content
Low Impact of Technological Progress: Due to low technological content, advances in technology have minimal impact on product quality and sales volume. Users focus more on the product's practicality and cost-effectiveness.
Stable User Expectations: Users have stable expectations for basic daily necessities, which are not significantly altered by technological advancements, making review sentiments more reflective of the actual usage experience.
Next, we conduct a case analysis, randomly drawn from one of the four products.
5.2 Analysis and Solution
5.2.1 Case: B00R1TAN7I
Empirical Inference From the graph, we can see that the average sentiment score
fluctuates significantly in different periods. It can be roughly divided into the following
stages:
2015 to Early 2017: Sentiment scores fluctuate significantly, with many negative scores, and the overall average score is relatively low.
Product Quality Issues: There may have been quality issues during this period, leading to many negative reviews.
Market Adaptation Period: The product was new to the market, and there might have been a significant gap between user expectations and actual experiences.
Consumer Sentiment Response: Consumers' emotional responses directly influence their reviews, and concentrated negative sentiment can lead others to give negative reviews.

Mid-2017 to 2018: Sentiment scores rise, generally trending positive, but there are still large fluctuations.
Product Improvements: The manufacturer may have improved the product during this period, increasing user satisfaction.
Positive Promotion: Effective marketing and promotion may have led to more positive reviews.
Electronic Word-of-Mouth (eWOM): Concentrated bursts of negative reviews and their spread may lead to significant drops in sentiment scores, while the spread of positive reviews can cause scores to rise.

Late 2018 to Early 2019: Average sentiment scores drop significantly, with a notable increase in negative sentiment scores.
Quality Issues or Service Failures: There might have been major quality issues or service failures, leading to a surge in negative reviews.
Negative Word-of-Mouth Effect: Negative reviews spread quickly through electronic word-of-mouth (eWOM), further amplifying negative sentiment.
Confirmation Bias: Users may be inclined to make reviews consistent with existing sentiment scores, amplifying a particular sentiment trend.

2019 to 2020: Sentiment scores gradually recover, but fluctuations remain noticeable.
Quality Improvements: The manufacturer may have implemented a series of improvements after identifying issues, gradually regaining user trust.
Market Competition: Intense market competition may have brought more user experiences, but also more review volatility.
Anchoring Effect: Consumers may be influenced by previous reviews, forming expectations about the product that affect their subsequent sentiment scores, explaining sustained high or low scores during certain periods.

2021 to 2022: Sentiment scores rise again but do not reach previous peaks, with many negative sentiments still present.
Stabilization Period: The product and service became more stable, and sentiment scores gradually rose, but previous negative impacts could not be entirely eliminated.
Market Saturation: The market reached saturation, and differences in expectations between new and existing users might have resulted in more negative sentiments.
Electronic Word-of-Mouth (eWOM): Concentrated bursts of negative reviews and their spread may lead to significant drops in sentiment scores, while the spread of positive reviews can cause scores to rise.

Late 2022 to Early 2023: Average sentiment scores drop again, and fluctuations decrease, tending towards a stable negative trend.
Reoccurrence of Issues: There might have been recurring product quality or service issues, leading to a drop in sentiment scores.
Lowered User Expectations: Due to previous negative impacts, user expectations may have lowered, resulting in more negative reviews even without major issues.
Confirmation Bias: Users may be inclined to make reviews consistent with existing sentiment scores, amplifying a particular sentiment trend.
Solution
Monitor and Respond
Establish a Data Platform: Set up a centralized data platform that collects and processes user feedback in real time. Integrate data from various sources such as online reviews, social media comments, and customer service interactions.
Implement Anomaly Detection Mechanisms: Develop algorithms to detect significant deviations in sentiment scores. When a cluster of negative sentiment is detected, the system should automatically trigger alerts.
Real-Time Tracking and Response: Create dashboards that provide real-time insights into user sentiment trends. Set up a dedicated team to monitor these dashboards and respond to negative feedback promptly.
Targeted Product Optimization: Analyze negative feedback to identify specific issues with the product or service. Use this information to make targeted improvements. For instance, if many users report a specific functionality issue, prioritize fixing it in the next update.

Promotion and Marketing
Identify Peak Sentiment Periods: Use historical data to identify periods when sentiment scores are typically high. Plan promotional campaigns during these times to maximize their impact.
Highlight Positive Reviews: During marketing campaigns, showcase positive user reviews and testimonials. This can enhance credibility and attract more consumers.
Engage Influencers: Partner with influencers who have a positive view of the product. Their endorsements can amplify positive sentiment and reach a wider audience.
Offer Incentives for Positive Feedback: Encourage satisfied customers to leave positive reviews by offering incentives such as discounts or loyalty points.

Continuous Improvement
Regularly Collect Feedback: Use surveys, user interviews, and feedback forms to gather continuous input from users. Ensure that feedback collection is an ongoing process rather than a one-time event.
Analyze Feedback for Insights: Use text analytics and sentiment analysis tools to extract actionable insights from user feedback. Identify common themes and recurring issues.
Implement a Feedback Loop: Establish a process for turning user feedback into actionable improvements. Prioritize changes based on their impact on user satisfaction and the feasibility of implementation.
Test and Iterate: Before rolling out major changes, test them with a small user group to gather feedback and refine the improvements. This iterative process ensures that the changes meet user needs.
Communicate Improvements to Users: When you make improvements based on user feedback, communicate these changes to users. This shows that you value their input and are committed to enhancing their experience.
Based on the regression results, there is indeed a negative correlation between average
sentiment score and purchase count, but this negative correlation is not significant. The
correlation coefficient is -0.140007, and the R-squared value from the regression analysis
is 0.020, indicating that the average sentiment score only explains 2% of the variance in
purchase count. Additionally, the p-value is 0.114, which is greater than 0.05, suggesting
that the impact of sentiment score on purchase count is not statistically significant.
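A sketch of this regression with statsmodels, assuming a hypothetical DataFrame monthly holding the per-month average sentiment score and purchase count (the column names are illustrative):

import statsmodels.api as sm

X = sm.add_constant(monthly["avg_sentiment_score"])  # intercept + predictor
ols_result = sm.OLS(monthly["purchase_count"], X).fit()
print(ols_result.summary())  # reports R-squared, coefficients, and p-values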
Figure 5.2: OLS on How SA Score Affects Purchase Number by Month
Diverse Experiences from High Purchase Volumes When the purchase quantity
of a product increases, the buyer group becomes more diverse. Among these buyers, some
may have higher expectations, or their needs and preferences might not completely align
with the product, leading to lower ratings.
Expectation Effect Popular products often come with high expectations. When peo-
ple purchase popular products, they tend to expect them to perfectly meet their needs. If
the product fails to meet these high expectations, buyers may give negative reviews.
Quality Control Challenges When the sales volume of a product increases signif-
icantly, the pressure on production and supply chains also rises. Quality control can
become more challenging, leading to some quality issues and defects. These problems are
likely to be reflected in negative reviews.
Increased Visibility of Negative Reviews The sheer volume of reviews for popular
products increases the visibility of negative feedback. Some buyers, after seeing existing
negative reviews, may become more aware of the product's shortcomings and be more inclined to leave negative feedback after their purchase.

Figure 5.3: Timeseries of SA Score and Purchase Count
Competitor and Malicious Reviews Hot-selling products often attract the atten-
tion of competitors, who may leave negative reviews to undermine the competition. Ad-
ditionally, some buyers might leave unfair negative reviews due to other reasons, such as
delivery issues or customer service problems.
Echo Effect In cases of high sales volumes, the exchange of opinions among buyers
becomes more frequent. If some early reviews are negative, subsequent buyers who read
these reviews may be influenced. Even if their experience is neutral or positive, they
might still be inclined to give a negative review.
PCA and OLS Analysis
Component (Influencing Features): Interpretation

x1 (sc_roberta, sc_roberta_sq, rating_number): x1 can be seen as a combined score of sentiment and rating numbers. A high x1 suggests high sentiment scores and rating numbers.

x2 (review_length, helpful_vote, has_images): x2 may represent the engagement and helpfulness aspect of reviews. A high x2 indicates long, helpful reviews with images.

x3 (helpful_vote, review_length, year): x3 captures the temporal and helpfulness aspects of reviews. A high x3 indicates helpful, detailed reviews over the years.

x4 (rating, sc_roberta_sq, month): x4 may represent a periodic or seasonal sentiment and rating trend. A high x4 suggests high ratings and sentiment scores during certain months.

x5 (weekday, average_rating, has_images): x5 might capture weekly patterns in reviews and average ratings. A high x5 indicates high average ratings on specific days of the week.

x6 (review_length, helpful_vote, rating): x6 represents the detailed and helpful nature of reviews along with their ratings. A high x6 indicates detailed, highly rated, and helpful reviews.

x7 (has_images, sc_roberta_sq, weekday): x7 might represent the visual aspect and sentiment of reviews on specific days. A high x7 indicates positive reviews with images on certain weekdays.

x8 (month, average_rating, rating_number): x8 captures monthly trends in average ratings and rating numbers. A high x8 indicates high ratings and rating numbers during certain months.
Significant Predictors: x1, x2, and x3 are statistically significant and have a meaningful impact on purchase count.

Marginally Significant Predictors: x5, x6, and x8 are marginally significant, suggesting a potential influence on purchase count.
Partial Autocorrelation
From Figure 5.5, we can see that the partial autocorrelation coefficients after lag 1 are close to 0 and lie within the blue confidence intervals. This means the partial autocorrelations at these lags are not significant and suggests that sentiment scores at different lags do not have significant direct linear relationships. In other words, the current sentiment score is primarily influenced by the present time point rather than directly by previous time points, which fits H1.

Figure 5.5: Partial Autocorrelation of Sentiment Scores for B00R1TAN7I
Since the PACF plot shows no significant partial autocorrelation coefficients beyond lag 0, a high-order autoregressive term may not be necessary, and it is advisable to reduce the order of the AR term, for example by trying ARIMA(1, 1, 1). Given that the time series may still be non-stationary, retaining a first-order differencing term (d = 1) is reasonable. Also, as there are no significant lagged autocorrelations in the PACF plot, the MA term can be kept at a low order.
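A sketch of this diagnostic-and-fit workflow with statsmodels, where scores is assumed to be the aggregated sentiment-score series for this product:

from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima.model import ARIMA

plot_pacf(scores)  # partial autocorrelations with confidence bands

# Reduced-order model: AR(1), first-order differencing, MA(1).
arima_result = ARIMA(scores, order=(1, 1, 1)).fit()
print(arima_result.summary())  # coefficients plus Ljung-Box, JB, and H tests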
Parameter Estimates The AR(1) coefficient is not significant (P > 0.05), indicating
a weak linear relationship between the current value and the first lagged value.
The MA(1) coefficient is highly significant (P < 0.05), indicating a strong linear relation-
ship between the current value and the first lagged error term.
The residual variance sig2 is significant, indicating variability in the residuals.
Figure 5.6: SARIMAX
Model Diagnostics Ljung-Box Q Test (L1): The P value is much greater than 0.05,
indicating that the residuals do not have significant autocorrelation and are white noise.
Jarque-Bera (JB) Test: The P value is 0.00, indicating that the residuals deviate from a normal distribution, though the deviation is slight.

Heteroskedasticity Test (H): The P value is greater than 0.05, indicating no significant heteroskedasticity and that the residual variance is relatively stable.
Skewness: -0.09, indicating a slight left skew in the residual distribution.
Kurtosis: 1.71, lower than 3, indicating the residual distribution is less peaked than a normal distribution.
Conclusion The AR(1) term is not significant, whereas the MA(1) term is highly
significant, suggesting that the current value of the time series is mainly influenced by
the previous error term rather than the previous value directly. Residuals are white
noise, with no significant autocorrelation. Residual variance is stable, with no significant
heteroskedasticity. Despite residuals deviating slightly from normality, this non-normality
typically has a minimal impact on the time series model.
Chapter 6
Discussion
Behavioral Economics
The study provides actionable insights for product managers to monitor and respond to
sentiment trends effectively. Understanding the factors influencing sentiment scores can
help in making informed decisions about product improvements and marketing strategies.
For example, by identifying and addressing recurring issues in user feedback, product
managers can improve product quality and user satisfaction. Additionally, promptly
responding to negative reviews and engaging positively with customers can build brand
loyalty and trust.
Marketing Strategies
The role of positive and negative sentiment in shaping consumer perception highlights the
need for strategic marketing efforts. Leveraging positive reviews and managing negative
feedback promptly can significantly impact a product’s market success. Companies can
enhance their brand image by promoting positive reviews, using influencers, and engaging
with customers on social media. Additionally, establishing positive customer relationships
and providing excellent after-sales service can enhance customer satisfaction and word-
of-mouth promotion.
Continuous Improvement
Implementing a robust feedback loop based on sentiment analysis can lead to continuous
product and service enhancements. This approach ensures that consumer expectations
are met, thereby fostering loyalty and positive word-of-mouth. Companies should regu-
larly collect user feedback, analyze the data to identify improvement opportunities, and
take swift action. Moreover, transparently communicating improvements to users can
further build trust and satisfaction.
Data Limitations
The analysis was conducted on a specific dataset (Amazon reviews in the beauty cate-
gory). Future research could explore diverse datasets across different product categories
and platforms to generalize the findings. Different product categories may have distinct
user bases and review habits, and studying these differences can provide a more compre-
hensive sentiment analysis model. Additionally, cross-platform data integration can help
verify the universality and stability of the model.
Model Limitations
Although RoBERTa demonstrated high accuracy, exploring other advanced models like
GPT-3 or newer versions of BERT could provide additional insights and potentially better
performance. With the rapid development of NLP technology, new models and algorithms
are continually emerging. Researchers should stay updated on these technologies to ensure
the best performance in sentiment analysis.
Behavioral Insights
Temporal Dynamics
Further research could explore more sophisticated time-series models to capture the tem-
poral dynamics of sentiment scores and their causal relationships with external factors
like market trends and promotional activities. Understanding the time-based patterns
and trends in sentiment scores can help businesses predict future consumer behavior and
adjust their strategies accordingly.
6.4 Conclusion
This research underscores the transformative impact of advanced NLP models on sen-
timent analysis. By leveraging transformer-based models, businesses can gain nuanced
insights into consumer sentiment, enabling more informed decision-making. The integra-
tion of behavioral economics principles further enriches our understanding of consumer
behavior in the digital age. As NLP technologies continue to evolve, their applications
in sentiment analysis and beyond will undoubtedly expand, offering new avenues for re-
search and practical innovation. The findings from this study provide a solid foundation
for future explorations in the intersection of NLP, sentiment analysis, and behavioral
economics, ultimately contributing to more effective and consumer-centric business prac-
tices.