
Sentiment Analysis Based on RoBERTa for Amazon Review:

An Empirical Study on Decision Making

by

Xinli GUO
arXiv:2411.00796v1 [cs.LG] 18 Oct 2024

Master of Science

Department of UFR de mathématiques et informatique (UFR27)

University of Paris 1: Panthéon-Sorbonne

© Xinli GUO, 2024


Abstract

In this study, we leverage state-of-the-art Natural Language Processing (NLP) techniques
to perform sentiment analysis on Amazon product reviews. By employing a transformer-based
model, RoBERTa, we analyze a large dataset to derive sentiment scores that reflect
the emotional tones of the reviews. We explain the underlying principles of the model
in depth and evaluate its performance in generating sentiment scores. Further, we
conduct data analysis and visualization to identify patterns and trends in sentiment
scores, examining their alignment with behavioral economics principles such as
electronic word of mouth (eWOM), consumer emotional reactions, and the confirmation bias.
Our findings demonstrate the efficacy of advanced NLP models in sentiment analysis and
offer insights into consumer behavior, with implications for strategic decision-making
and marketing practices.

Attention Is All You Need.

– Vaswani et al. 2017.

Contents

1 Introduction
2 Literature Review
2.1 NLP
2.2 Deep Learning in NLP
2.3 Transformer-based Models
2.4 Applications in Sentiment Analysis
2.5 Behavioral Economics in Sentiment Analysis
3 Research Model
3.1 NLP
3.1.1 Basic Concepts of NLP
3.2 Transformer Architecture
3.2.1 Architecture
3.2.2 Self-Attention Mechanism
3.3 BERT: Bidirectional Encoder Representations from Transformers
3.3.1 Masked Language Model (MLM)
3.3.2 Next Sentence Prediction (NSP)
3.3.3 Total Loss
3.3.4 Training Process
3.3.5 Differences Between BERT and Transformer
3.4 RoBERTa: A Robustly Optimized BERT
3.4.1 Improvements Over BERT
3.4.2 Architecture
3.4.3 RoBERTa: Key Pre-Training Enhancements
3.4.4 Performance
3.4.5 Generalization
4 Method
4.1 Sentiment Analysis
4.1.1 Computing Environment
4.1.2 Data Collection
4.1.3 Model Training and Evaluation
5 Result and Analysis
5.1 Data Processing for Data Analysis
5.2 Analysis and Solution
5.2.1 Case: B00R1TAN7I
5.2.2 Further Analysis and Discussion
6 Discussion
6.1 Summary of Findings
6.2 Implications for Theory and Practice
6.2.1 Theoretical Implications
6.2.2 Practical Implications
6.3 Limitations and Future Research Directions
6.4 Conclusion
List of Tables

3.1 Comparison of BERT and Transformer
3.2 Quantitative Improvements of RoBERTa over BERT
3.3 Pre-Training Enhancements in RoBERTa Compared to BERT
3.4 Performance Comparison on NLP Benchmarks
4.1 User Reviews
4.2 Item Metadata
4.4 Training Hyperparameters
4.5 Confusion Matrix
4.6 Classification Report
5.5 Interpretation of Principal Components
List of Figures

2.1 History of NLP
3.1 Transformer Model Architecture
3.2 Scaled Dot-Product Attention
3.3 Multi-Head Attention
3.4 Overall pre-training and fine-tuning procedures for BERT
4.1 Threshold vs F1 Score
4.2 Confusion Matrix at Best Threshold
4.3 Receiver Operating Characteristic Curve of Model
5.1 Time-series of B00R1TAN7I Sentiment Score
5.2 OLS on How SA Score Affects Purchase Number by Month
5.3 Timeseries of SA Score and Purchase Count
5.4 OLS with More Variables and PCA
5.5 Partial Autocorrelation of Sentiment Scores for B00R1TAN7I
5.6 SARIMAX
Chapter 1

Introduction

In recent years, the field of Natural Language Processing (NLP) has witnessed significant
advancements, particularly with the development of transformer-based models. These
models, such as RoBERTa and DistilBERT, have demonstrated remarkable capabilities
in various NLP tasks, including sentiment analysis. Sentiment analysis, the process of
determining the emotional tone behind a body of text, is crucial for understanding con-
sumer opinions and behaviors. In this study, we employ RoBERTa to perform sentiment
analysis on Amazon product reviews, aiming to derive sentiment scores that reflect the
underlying emotions expressed in the reviews.

RoBERTa (Robustly Optimized BERT Pre-training Approach) is an enhanced version of
BERT (Bidirectional Encoder Representations from Transformers), designed to improve
performance by training with larger mini-batches and longer sequences. It leverages the
transformer architecture, which relies on self-attention mechanisms to process and encode
the context of words in a sentence effectively.

In our research, we use RoBERTa to analyze a substantial dataset of Amazon product
reviews. By applying this state-of-the-art NLP technique, we generate a sentiment
score for each review, quantifying the positivity or negativity of the expressed sentiment.
This allows us to evaluate the accuracy of the sentiment scores produced by RoBERTa.

Beyond merely obtaining sentiment scores, our study delves into data analysis and vi-
sualization to observe patterns and trends in review sentiments. Through this, we ex-
plore how these sentiment scores align with principles of behavioral economics, such as
electronic word-of-mouth (eWOM), consumer emotional reactions, and the confirmation
bias. eWOM refers to the influence of online user-generated content on consumer decisions;
consumer emotional reactions describe how emotions affect purchasing behaviors;
and confirmation bias highlights how individuals tend to favor information that confirms
their preexisting beliefs.

By integrating advanced NLP techniques with behavioral economics theories, this re-
search not only provides insights into consumer sentiment on Amazon but also demon-
strates the broader applicability of transformer-based models in understanding complex
human behaviors. The findings of this study have significant implications for businesses
and marketers aiming to leverage sentiment analysis for strategic decision-making and
consumer engagement.

Chapter 2

Literature Review

2.1 NLP
Natural Language Processing (NLP) has evolved significantly over the past few decades,
transitioning from rule-based systems to more sophisticated machine learning techniques.
Early sentiment analysis approaches relied heavily on lexicon-based methods, where pre-
defined dictionaries of positive and negative words were used to classify text. While
these methods were straightforward, they often struggled with context and nuances in
language.

Figure 2.1: History of NLP

NLP applications nevertheless face several challenges:

| Challenge | Description |
| --- | --- |
| Word Sense Disambiguation | A single word can have multiple meanings depending on the context. For example, the word “bank” can mean a financial institution or the side of a river. |
| Ambiguity | For example, “I saw the man with a telescope” can mean either “I used a telescope to see the man” or “I saw a man who had a telescope.” |
| Syntactic Analysis | Different languages have different syntactic rules, and parsing long and complex sentences can be difficult. |
| Semantic Analysis | Requires understanding implicit semantics and context. For example, “He sold the car he bought last week” implies understanding the temporal relationship. |

Classical machine learning techniques, such as Naive Bayes and Support Vector Machines,
brought improvements by learning from labeled datasets. These methods could capture
some context and were more flexible than lexicon-based approaches. However, they still
had limitations, particularly in handling complex linguistic structures and long-range
dependencies in text.

2.2 Deep Learning in NLP


Deep learning has revolutionized the field of NLP by providing powerful tools to model
complex patterns and representations in language. Unlike traditional methods, deep
learning models can automatically learn features from raw text data, making them highly
effective for a wide range of NLP tasks such as language translation, sentiment analysis,
and text generation. CNNs and RNNs have been instrumental in various tasks, with RNNs
and their variant LSTMs being particularly effective for sequential data. While plain
RNNs suffer from vanishing gradients, LSTMs mitigate this problem, enabling better
handling of long-term dependencies in sequences. Despite their complexity, LSTMs became
a cornerstone of NLP applications due to their robustness and effectiveness.

2.3 Transformer-based Models


The advent of transformer models marked a significant breakthrough in NLP. Introduced
by Vaswani et al. (2017), the transformer architecture relies on self-attention mecha-
nisms to process and encode the context of words in a sentence more effectively. This
architecture paved the way for models like BERT (Bidirectional Encoder Representations
from Transformers), which utilized bidirectional context to achieve state-of-the-art
performance in various NLP tasks.

RoBERTa (Robustly Optimized BERT Pre-training Approach) improved upon BERT
by training with larger mini-batches, longer sequences, and more data, leading to better
performance on several benchmarks (Liu et al., 2019).

2.4 Applications in Sentiment Analysis


Transformer-based models have been extensively applied to sentiment analysis. Studies
have demonstrated that these models outperform traditional methods in various con-
texts, including social media, movie reviews, and product reviews. For instance, Sun
et al. (2019) showed that BERT-based models achieved superior accuracy in classifying
the sentiment of tweets compared to previous methods. Similarly, RoBERTa has been
successfully used to analyze sentiments in diverse domains, proving its robustness and
adaptability.

Specific to product reviews, researchers have utilized these models to gain insights into
consumer opinions. Liu et al. (2020) applied RoBERTa to Amazon reviews, highlighting
its effectiveness in capturing nuanced sentiments and outperforming older techniques.
These studies underline the models’ capabilities in understanding and interpreting com-
plex emotional expressions in text.

2.5 Behavioral Economics in Sentiment Analysis


Behavioral economics principles such as electronic word of mouth (eWOM), the snowball
effect, and the herd effect are critical in understanding consumer behavior. eWOM refers
to the influence of online user-generated content on consumer decisions, a phenomenon
extensively studied in the context of online reviews (Cheung & Thadani, 2012). The
snowball effect describes how information dissemination grows exponentially, and the
herd effect highlights how individuals often follow the behavior of the majority (Banerjee,
1992).

Sentiment analysis has been a valuable tool in studying these phenomena. For example,
research by Hu et al. (2014) demonstrated how sentiment trends in online reviews could
predict consumer purchasing behavior, illustrating the snowball effect. Similarly, studies
on the herd effect have used sentiment analysis to show how positive or negative reviews
can influence subsequent reviewers’ sentiments (Liu & Zhang, 2019).

Chapter 3

Research Model

3.1 NLP
3.1.1 Basic Concepts of NLP
Syntactic Analysis

Syntactic analysis involves breaking down sentences into their components and under-
standing the grammatical relationships between them. It includes both syntactic parsing
and semantic analysis:

Syntactic Parsing

Part-of-Speech Tagging (POS Tagging) Assigns a part of speech (like noun, verb,
adjective) to each word in a sentence.

Syntax Tree Construction Builds a tree structure to represent the grammatical
structure of a sentence, showing the relationships between words.

Semantic Analysis

Named Entity Recognition (NER) Identifies entities in the text, such as names of
people, places, organizations, etc.

Semantic Role Labeling (SRL) Labels the roles words play in the sentence’s mean-
ing, such as the agent and patient of an action.

For the sentence ”The cat sleeps on the table,” syntactic analysis would identify ”cat”
as a noun (subject), ”sleeps” as a verb (predicate), and ”on the table” as a prepositional
phrase indicating location.

Word Vector

Word vector representation converts words into numerical vectors so that computers can
process and understand them. These vectors capture semantic relationships between
words and are used in various NLP tasks.

Bag of Words (BoW) Represents text as vectors of word counts, ignoring word order
and grammatical structure.

Word Embeddings Maps words into a continuous vector space, where semantically
similar words are closer together. Common methods include Word2Vec and GloVe.

Contextual Word Representations Uses deep learning models (like BERT, ELMo)
to generate word vectors that take into account the word’s context within a sentence.
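
As a small illustration of the Bag of Words representation described above, the following sketch builds count vectors for two toy sentences. The sentences and the helper name `bag_of_words` are hypothetical and chosen only for illustration; they are not taken from the dataset used in this study.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the word-count vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Two toy reviews with the same words in a different order (hypothetical data).
doc_a = "great product great price"
doc_b = "price great product great"

# The vocabulary is the sorted set of all words seen in the toy corpus.
vocabulary = sorted(set((doc_a + " " + doc_b).split()))

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)

print(vocabulary)
print(vec_a, vec_b)
```

Both sentences map to the same vector because BoW discards word order, which is the limitation that embeddings and contextual representations are meant to address.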

3.2 Transformer Architecture


The Transformer model was introduced to address the limitations of Recurrent Neural
Networks (RNNs), especially in handling long-range dependencies and parallelization.
RNNs, despite their effectiveness in sequence modeling, suffer from issues like gradient
vanishing/exploding and slow training times due to their sequential nature. Transformers
overcome these limitations by relying entirely on self-attention mechanisms, allowing for
better parallelization and handling of long-range dependencies in sequences.

3.2.1 Architecture
The Transformer model consists of an Encoder and a Decoder. Both components are
composed of stacked layers, each containing sub-layers such as self-attention mechanisms
and feed-forward neural networks.

Encoder: The encoder is responsible for processing the input sequence and generating
a set of representations. The encoder consists of a stack of six identical layers (N = 6).
Each layer contains two sub-layers:

• A multi-head self-attention mechanism.

• A position-wise fully connected feed-forward network.

Each sub-layer is wrapped with a residual connection and followed by layer
normalization. The output of each sub-layer is computed as:

LayerNorm(x + Sublayer(x))

where Sublayer(x) represents the function implemented by the sub-layer. To support
these residual connections, all sub-layers and the embedding layers in the model produce
outputs with a fixed dimension of $d_{\text{model}} = 512$.
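
The residual-plus-normalization step can be sketched for a single position as follows. This is a minimal sketch under simplifying assumptions: the vector has 4 dimensions rather than 512, the layer normalization has no learned gain or bias, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x):
    """Hypothetical sub-layer: simply scales its input by 0.5."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]  # toy input with 4 dimensions instead of 512
out = residual_block(x, toy_sublayer)
print(out)
```

Because the sub-layer output is added to its input, the sub-layer must return a vector of the same size as `x`, which is why every sub-layer shares one fixed output dimension.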

Decoder: The decoder generates the output sequence, typically used for tasks like
machine translation. The decoder also consists of a stack of six identical layers (N = 6).
Each decoder layer contains three sub-layers:

• A masked multi-head self-attention mechanism.

• A multi-head attention mechanism that attends to the output of the encoder stack.

• A position-wise fully connected feed-forward network.

Similar to the encoder, residual connections are employed around each of the sub-layers,
followed by layer normalization. Additionally, the self-attention sub-layer in the
decoder is modified to prevent positions from attending to subsequent positions. This
masking, combined with the offset of output embeddings by one position, ensures that
predictions for position i depend only on the known outputs at positions less than i.
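
The decoder masking can be illustrated with a small sketch. It assumes the common implementation in which disallowed attention scores are set to negative infinity before the softmax; the function names and the all-zero toy scores are hypothetical. Positions index the decoder input, which is already shifted by one, so allowing position i to attend to positions up to and including i exposes only outputs that precede the one being predicted.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over `scores` in which disallowed positions get zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)

# With equal raw scores, each row spreads its weight evenly over the
# positions it is allowed to see and gives zero weight to later ones.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
for row in weights:
    print([round(w, 2) for w in row])
```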

Figure 3.1: Transformer Model Architecture

3.2.2 Self-Attention Mechanism
Self-attention allows the model to weigh the importance of different words in a sentence
when encoding a word. For instance, in the sentence “The cat sat on the mat,” the
word “cat” has a strong connection with “sat” and “mat.” The self-attention mechanism
allows the model to consider these relationships simultaneously rather than sequentially,
improving the handling of long-range dependencies and context.

Scaled Dot-Product Attention

1. Input Representation: Each word in the input sequence is converted into three vectors:
Query (Q), Key (K), and Value (V) using learned weight matrices.

\[
Q = XW_Q, \quad K = XW_K, \quad V = XW_V \tag{3.1}
\]

where $X$ is the input matrix, and $W_Q$, $W_K$, $W_V$ are weight matrices.

2. Scaled Dot-Product Attention: The attention scores are computed using the dot
product of the query and key vectors, scaled by the square root of the dimension of
the key vectors. These scores are then passed through a softmax function to obtain the
attention weights.

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\]

3. Output: The weighted sum of the value vectors produces the output of the
self-attention mechanism.
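
The three steps above can be sketched in plain Python. This is an illustrative sketch only: the input has 3 tokens with 2 dimensions each, and the projection matrices are set to the identity so the numbers are easy to follow, whereas a real model learns them.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 tokens, embedding size 2 (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity projections for readability; real models learn W_Q, W_K, W_V.
w_q = w_k = w_v = [[1.0, 0.0], [0.0, 1.0]]

q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
output, weights = attention(q, k, v)
print(weights)
print(output)
```

Each row of `weights` sums to 1, and each output row is a weighted average of the value vectors.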

Multi-Head Attention

Multi-head attention enhances the model’s ability to focus on different parts of the
input sequence simultaneously by applying multiple attention mechanisms in parallel. It
enables the model to capture diverse patterns and dependencies within the data and
improves its capacity to model complex relationships in the input.

\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) W_O
\]

where each head is computed as:

\[
\text{head}_i = \text{Attention}(QW_{Q_i}, KW_{K_i}, VW_{V_i})
\]

Figure 3.2: Scaled Dot-Product Attention

Figure 3.3: Multi-Head Attention
3.3 BERT: Bidirectional Encoder Representations from Transformers
BERT (Bidirectional Encoder Representations from Transformers) revolutionized NLP
by introducing bidirectional context understanding. Traditional models like RNNs and
earlier transformers considered context either from left-to-right or right-to-left, but not
both simultaneously. BERT, however, reads the entire sentence at once, understanding
the context from both directions. It is a language model based on the Transformer
architecture that is pre-trained with the Masked Language Model (MLM) and Next Sentence
Prediction (NSP) tasks and then fine-tuned for specific downstream tasks.

BERT uses the encoder part of the Transformer architecture, which consists of multi-head
self-attention mechanisms and feed-forward neural networks.

3.3.1 Masked Language Model (MLM)


In the MLM task, some of the input words are randomly masked, and the model is trained
to predict these masked words based on the context.

Masking

Randomly select some tokens in the input sequence and replace them with a [MASK]
token. For a masked position i, the objective is to maximize the probability:

\[
P(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n)
\]

Loss Function

Use cross-entropy loss to measure the difference between the predicted and actual tokens.
The cross-entropy loss for predicting the masked token wi is given by:

\[
L_{\text{MLM}} = -\sum_{i \in \text{mask}} \log P(w_i \mid w_{\text{context}})
\]

where:

• $L_{\text{MLM}}$ is the MLM loss.

• mask represents the positions of the masked tokens.

• $P(w_i \mid w_{\text{context}})$ is the predicted probability of the masked token $w_i$ given the
context.
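
As a worked example of the MLM loss, the sketch below sums the negative log-probabilities that a model assigns to the correct tokens at two masked positions. The positions and probability values are hypothetical and chosen only to make the arithmetic easy to check.

```python
import math


def mlm_loss(predicted_probs, masked_positions):
    """Sum of -log P(w_i | context) over the masked positions."""
    return -sum(math.log(predicted_probs[i]) for i in masked_positions)


# Hypothetical probabilities assigned to the correct token at each masked position.
predicted_probs = {3: 0.5, 7: 0.25}
masked_positions = [3, 7]

loss = mlm_loss(predicted_probs, masked_positions)
print(round(loss, 4))  # -(ln 0.5 + ln 0.25) = ln 8
```

The loss shrinks toward zero as the model assigns probabilities closer to 1 to the correct tokens.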

3.3.2 Next Sentence Prediction (NSP)
In the NSP task, the model is trained to predict whether two sentences are consecutive
in the original text. The cross-entropy loss for NSP is given by:

 
\[
L_{\text{NSP}} = -\left[ y \log P(\text{IsNext} \mid w_{[\text{CLS}]}) + (1 - y) \log P(\text{NotNext} \mid w_{[\text{CLS}]}) \right]
\]

where:

• $L_{\text{NSP}}$ is the NSP loss.

• $y$ is a binary indicator (1 if the second sentence is the actual next sentence, 0
otherwise).

• $P(\text{IsNext} \mid w_{[\text{CLS}]})$ is the predicted probability that the second sentence is the actual
next sentence.

• $P(\text{NotNext} \mid w_{[\text{CLS}]})$ is the predicted probability that the second sentence is a
random sentence.

Sentence Pairs

Construct pairs of sentences (A, B) where 50% of the time B is the actual next sentence
following A, and 50% of the time B is a random sentence.

Classification Task

Use the representation of the [CLS] token to perform a binary classification task, aiming
to maximize the probability:
P (IsNext | [CLS])

Loss Function

Use cross-entropy loss to measure the difference between the predicted and actual la-
bels.

3.3.3 Total Loss


The total loss for BERT is the sum of the MLM loss and the NSP loss:

\[
L = L_{\text{MLM}} + L_{\text{NSP}}
\]

where $L_{\text{MLM}}$ is the loss from the masked language model, and $L_{\text{NSP}}$ is the loss from the
next sentence prediction.

Figure 3.4: Overall pre-training and fine-tuning procedures for BERT.
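
The NSP loss and the total loss can be sketched for a single sentence pair as follows. The sketch assumes that the two class probabilities sum to one, so that P(NotNext) = 1 - P(IsNext); the MLM loss value and the predicted probability are hypothetical numbers used only for illustration.

```python
import math


def nsp_loss(y, p_is_next):
    """Binary cross-entropy for one sentence pair, with P(NotNext) = 1 - P(IsNext)."""
    return -(y * math.log(p_is_next) + (1 - y) * math.log(1.0 - p_is_next))


l_mlm = 2.0                            # hypothetical MLM loss for the same example
l_nsp = nsp_loss(y=1, p_is_next=0.8)   # B really follows A; the model says 0.8
total = l_mlm + l_nsp

print(round(l_nsp, 4), round(total, 4))
```

If the same prediction of 0.8 were made for a pair where B is a random sentence (`y=0`), the NSP loss would be larger, since the model would be confidently wrong.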

3.3.4 Training Process


Pre-training

BERT is pre-trained on a large corpus using two unsupervised tasks: Masked Language Model
(MLM) and Next Sentence Prediction (NSP). MLM randomly masks some tokens
in the input and trains the model to predict the masked tokens based on the context,
while NSP trains the model to understand the relationship between two sentences by
predicting whether the second sentence of a given pair follows the first in the text.

Fine-tuning

This pre-training allows BERT to capture rich linguistic representations. However, to
perform well on specific downstream tasks (e.g., question answering, text classification,
named entity recognition, sentiment analysis), it must be fine-tuned. After the model is
initialized from the pre-trained weights, a task-specific output layer is added on top of
BERT. The specific architecture of this layer depends on the task at hand. For sentiment
analysis, we add a fully connected layer followed by a softmax function to output
sentiment probabilities (e.g., positive, negative, neutral).
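
A minimal sketch of such an output layer is shown below: a fully connected layer maps the [CLS] representation to one logit per class, and a softmax turns the logits into probabilities. The 3-dimensional input vector and the weights are hypothetical values for illustration; a real encoder produces a far larger vector and the weights are learned during fine-tuning.

```python
import math

LABELS = ["negative", "neutral", "positive"]


def linear(features, weights, bias):
    """Fully connected layer: one output logit per row of `weights`."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical [CLS] representation and classifier parameters.
cls_vector = [0.2, -0.1, 0.4]
weights = [[-1.0, 0.5, -2.0],   # negative
           [0.1, 0.1, 0.1],     # neutral
           [1.0, -0.5, 2.0]]    # positive
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(cls_vector, weights, bias))
prediction = LABELS[probs.index(max(probs))]
print(dict(zip(LABELS, (round(p, 3) for p in probs))), prediction)
```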

3.3.5 Differences Between BERT and Transformer

Table 3.1: Comparison of BERT and Transformer

| Aspect | Transformer | BERT |
| --- | --- | --- |
| Architecture | Encoder-decoder structure. Encoder: stack of identical layers with multi-head self-attention and feed-forward neural network. Decoder: stack of identical layers with multi-head self-attention, encoder-decoder attention, and feed-forward neural network. | Uses only the encoder part of the Transformer. Encoder: stack of identical layers with multi-head self-attention and feed-forward neural network. Bidirectional processing. |
| Training Objectives | Typically trained for specific tasks like translation using teacher forcing. Objective: minimize loss for predicting the next word in sequence given previous words and source sentence. | Pre-training: Masked Language Model (MLM) and Next Sentence Prediction (NSP). Fine-tuning: on specific downstream tasks with labeled data. |
| Directionality | Unidirectional in decoder (left-to-right). | Bidirectional: processes the entire sequence of words simultaneously. |
| Use Cases and Applications | Machine Translation; Text Generation; Sequence-to-sequence applications. | Understanding and classifying text; Sentiment Analysis; Named Entity Recognition; Question Answering; Information Retrieval. |

3.4 RoBERTa: A Robustly Optimized BERT


RoBERTa is an optimized version of BERT introduced by Facebook AI. It builds upon
the original BERT architecture by making several key modifications to improve its per-
formance and robustness.

3.4.1 Improvements Over BERT


Table 3.2: Quantitative Improvements of RoBERTa over BERT

| Characteristic | BERT | RoBERTa |
| --- | --- | --- |
| Training Data Size | 16GB | 160GB |
| Training Steps | 1 million | 500,000 |
| Batch Size | 256 | 8,192 |
| Learning Rate | 1e−4 | 1e−4 with warm-up |
| Masking Strategy | Static Masking | Dynamic Masking |
| Next Sentence Prediction (NSP) | Yes | No |

Explanation of Differences

• Training Data Size: RoBERTa uses a significantly larger dataset, which includes
additional data from Common Crawl and OpenWebText.

• Training Steps: BERT is trained for 1 million steps, whereas RoBERTa is trained
for 500,000 steps.

• Batch Size: RoBERTa uses a much larger batch size of 8,192 compared to BERT’s
256.

• Learning Rate: Both use a learning rate of 1e−4 , but RoBERTa includes a warm-
up period.

• Masking Strategy: BERT uses static masking, while RoBERTa uses dynamic
masking, changing the masking pattern each epoch.

• Next Sentence Prediction (NSP): RoBERTa removes the NSP task, focusing
solely on the MLM task.

• Training Time: Although RoBERTa takes fewer steps than BERT, each step uses a
much larger batch, so it processes far more training sequences overall.

• Pre-training Tasks: BERT is pre-trained on both MLM and NSP tasks, while
RoBERTa is pre-trained only on the MLM task.
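
The step counts and batch sizes listed above can be combined into a rough count of training sequences processed. The sketch below only multiplies the figures from Table 3.2; it ignores differences in sequence length, so it is an approximation rather than an exact measure of compute.

```python
# Figures taken from Table 3.2.
bert_steps, bert_batch = 1_000_000, 256
roberta_steps, roberta_batch = 500_000, 8_192

bert_sequences = bert_steps * bert_batch            # 256,000,000
roberta_sequences = roberta_steps * roberta_batch   # 4,096,000,000
ratio = roberta_sequences / bert_sequences          # 16.0

print(f"BERT:    {bert_sequences:,} sequences")
print(f"RoBERTa: {roberta_sequences:,} sequences ({ratio:.0f}x BERT)")
```

This is how RoBERTa can use half as many optimization steps as BERT and still see roughly 16 times as many training sequences.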

3.4.2 Architecture
RoBERTa retains the same architecture as BERT:

• Encoder-Only Architecture: Uses the Transformer encoder.

• Multi-Head Self-Attention: Focuses on different parts of the input sequence
simultaneously.

• Feed-Forward Neural Networks: Applied after the self-attention mechanism in
each layer.

• Layer Normalization and Residual Connections: Stabilizes and enhances the
training process.

3.4.3 RoBERTa: Key Pre-Training Enhancements

Table 3.3: Pre-Training Enhancements in RoBERTa Compared to BERT

| Aspect | Description and Examples |
| --- | --- |
| Pre-Training Task | RoBERTa: Uses only the Masked Language Model (MLM) task. Simplifies the training objective and improves the model’s understanding of context and semantics. BERT: Uses both MLM and Next Sentence Prediction (NSP) tasks. Example for MLM: Original Sentence: “The quick brown fox jumps over the lazy dog.” Masked Sentence: “The quick brown [MASK] jumps over the [MASK] dog.” Training Objective: Predict “fox” and “lazy” based on the context. |
| Dynamic Masking | RoBERTa: Generates a new masking pattern for each sequence during training, enhancing the model’s ability to generalize by exposing it to a variety of masking patterns. BERT: Uses static masking, where the masking pattern is fixed and reused throughout training. Example for Dynamic Masking: First Epoch: “The quick brown [MASK] jumps over the lazy [MASK].” Second Epoch: “The [MASK] brown fox jumps [MASK] the lazy dog.” Benefit: Allows the model to learn robust contextual representations by varying the masking patterns. |
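
The difference between static and dynamic masking can be sketched as follows. This is a simplified illustration, not the actual pre-training code: it masks a fixed number of tokens per sequence instead of selecting tokens by probability, and the helper `mask_tokens` is a hypothetical name.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, num_masks, rng):
    """Replace `num_masks` randomly chosen tokens with the [MASK] token."""
    positions = set(rng.sample(range(len(tokens)), num_masks))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "The quick brown fox jumps over the lazy dog".split()

# Static masking: one pattern is drawn once and reused in every epoch.
static_pattern = mask_tokens(tokens, 2, random.Random(0))
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 2, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model always predicts the same hidden words for a given sentence; with dynamic masking the hidden words are redrawn on each pass and will usually differ, so the same sentence yields several distinct training examples.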

3.4.4 Performance

Table 3.4: Performance Comparison on NLP Benchmarks

| Benchmark | Metric | BERT | RoBERTa |
| --- | --- | --- | --- |
| GLUE | Average Score | 79.6 | 88.5 |
| SQuAD 1.1 | F1 Score | 93.2 | 94.6 |
| SQuAD 2.0 | F1 Score | 83.0 | 89.8 |
| RACE | Accuracy | 66.8 | 83.2 |

3.4.5 Generalization
RoBERTa exhibits better generalization capabilities due to:

• Larger and more diverse training data. Training on a larger and more varied dataset
allows RoBERTa to encounter a wider variety of linguistic contexts, improving its ability
to generalize to new, unseen data. The vast amount of data ensures that the model can
learn from more examples, reducing the likelihood of overfitting to specific patterns in
the training data.

• More extensive training. RoBERTa processes far more training data than BERT during
pre-training. While BERT was trained for 1 million steps with a batch size of 256,
RoBERTa is trained for 500,000 steps with a batch size of 8,192, allowing the model to
converge better. More training enables the model to better capture the underlying data
distribution, leading to improved performance on downstream tasks. It also allows
RoBERTa to refine its internal representations, making them more robust and effective
at generalizing to new tasks.

• Removal of the NSP task. Removing NSP focuses the model on a single, more effective
pre-training task. NSP has been found to be less relevant for many downstream tasks.
Removing it helps the model to avoid learning spurious correlations that do not
generalize well.

Chapter 4

Method

4.1 Sentiment Analysis


4.1.1 Computing Environment
The model training was conducted on Google Colab Pro, utilizing an NVIDIA A100
GPU. The system had 51GB of system RAM and 15GB of GPU RAM available. For
data analysis, Python was employed on a local machine equipped with an Apple M1 Pro
chip and 16GB of system RAM.

4.1.2 Data Collection


This paper uses the All Beauty category of the open-source Amazon Reviews’23 dataset
from the McAuley Lab at the University of California San Diego. The dataset includes
rich features:

User Reviews

Table 4.1: User Reviews


Field | Type | Explanation
rating | float | Rating of the product (from 1.0 to 5.0).
title | str | Title of the user review.
text | str | Text body of the user review.
images | list | Images that users post after they have received the product.
parent_asin | str | Parent ID of the product.
user_id | str | ID of the reviewer.
timestamp | int | Time of the review (unix time).
verified_purchase | bool | User purchase verification.
helpful_vote | int | Helpful votes of the review.

Item Metadata

Table 4.2: Item Metadata


Field | Type | Explanation
main_category | str | Main category (i.e., domain) of the product.
title | str | Name of the product.
average_rating | float | Rating of the product shown on the product page.
rating_number | int | Number of ratings in the product.
features | list | Bullet-point format features of the product.
description | list | Description of the product.
price | float | Price in US dollars (at time of crawling).
images | list | Images of the product.
videos | list | Videos of the product including title and url.
store | str | Store name of the product.
categories | list | Hierarchical categories of the product.
details | dict | Product details, including materials, brand, sizes, etc.
parent_asin | str | Parent ID of the product.
bought_together | list | Recommended bundles from the websites.

First, we merge the two datasets using parent_asin as the join key and construct
the following new features:

Operation | Description
Calculating Review Length | Calculates the length of each review in the text column and stores the result in a new column called review_length.
Filling Missing Helpful Vote Values | Fills any missing values (NaN) in the helpful_vote column with 0.
Converting Verified Purchase to Binary Variable | Using a lambda function, converts the boolean values (True/False) in the verified_purchase column to integers: True becomes 1 and False becomes 0.
Checking for Images | Using a lambda function, converts the values in the images column to a binary variable: 1 if there are images (non-empty value), 0 otherwise.
Converting Timestamp to Datetime | Converts the timestamp column from milliseconds since epoch to a datetime object, allowing for further time-related operations.
Extracting Year | Extracts the year from the timestamp column and stores it in a new column called year.
Extracting Month | Extracts the month from the timestamp column and stores it in a new column called month.
Extracting Day | Extracts the day from the timestamp column and stores it in a new column called day.
Extracting Weekday | Extracts the weekday (0-6, representing Monday to Sunday) from the timestamp column and stores it in a new column called weekday.
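The feature-construction steps listed above can be sketched in pandas as follows. This is a minimal illustration rather than the code used in the study: the function name add_review_features is hypothetical, review length is assumed to be measured in characters, and the image indicator is stored in a separate has_images column (the column name that appears later in Listing 5.1).

```python
import pandas as pd


def add_review_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the review-level features described in the operations table.

    Assumes the merged frame has the columns text, helpful_vote,
    verified_purchase, images and timestamp (milliseconds since epoch).
    """
    out = df.copy()
    # Review length, assumed here to be the number of characters.
    out["review_length"] = out["text"].fillna("").str.len()
    # Missing helpful votes are treated as zero votes.
    out["helpful_vote"] = out["helpful_vote"].fillna(0)
    # Boolean purchase flag -> 1/0.
    out["verified_purchase"] = out["verified_purchase"].apply(lambda v: 1 if v else 0)
    # 1 if the review carries at least one image, else 0.
    out["has_images"] = out["images"].apply(
        lambda imgs: int(isinstance(imgs, (list, tuple)) and len(imgs) > 0)
    )
    # Milliseconds since epoch -> datetime, then calendar parts.
    out["timestamp"] = pd.to_datetime(out["timestamp"], unit="ms")
    out["year"] = out["timestamp"].dt.year
    out["month"] = out["timestamp"].dt.month
    out["day"] = out["timestamp"].dt.day
    out["weekday"] = out["timestamp"].dt.weekday  # 0 = Monday, 6 = Sunday
    return out
```

Working on a copy keeps the merged frame unchanged, so the raw timestamp values remain available if a later step needs them.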

4.1.3 Model Training and Evaluation


Training: The model selected from Hugging Face is an open-source RoBERTa model that
has already been fine-tuned on Amazon review data. According to its publisher, the
fine-tuned model reaches a high accuracy (91.85%) on the publisher’s dataset.
The following hyperparameters were used during training:

Table 4.4: Training Hyperparameters


Hyperparameter | Value
Learning Rate | 2 × 10^-5
Train Batch Size | 16
Evaluation Batch Size | 16
Seed | 42
Optimizer | Adam with β1 = 0.9, β2 = 0.999, and ϵ = 1 × 10^-8
Learning Rate Scheduler Type | Linear
Number of Epochs | 2

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned RoBERTa model and the corresponding tokenizer,
# then move the model onto the CUDA device (GPU) for inference.
model_name = "Proggleb/roberta-base-bne-finetuned-amazon_reviews_multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to('cuda')

# truncate_text shortens text that exceeds the maximum length (in characters),
# which reduces GPU memory and RAM usage and speeds up processing.
def truncate_text(text, max_length=512):
    return text[:max_length]

# preprocess_and_analyze tokenizes the text, encodes it, and feeds it into the
# model to compute sentiment scores. The softmax function gives the probability
# of each sentiment class, and the score is the difference between the positive
# and negative class probabilities.
def preprocess_and_analyze(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512,
                       return_tensors="pt").to('cuda')
    with torch.no_grad():
        logits = model(**inputs).logits
        scores = torch.nn.functional.softmax(logits, dim=-1)
        sentiments = scores[:, 1] - scores[:, 0]  # POSITIVE score - NEGATIVE score
    return sentiments.cpu().numpy()

# parallel_processing handles one batch of text data: it calls
# preprocess_and_analyze and returns the sentiment scores. Like truncation,
# batching (20 reviews per batch) keeps memory usage under control.
def parallel_processing(batch_texts):
    batch_predictions = preprocess_and_analyze(batch_texts)
    return np.array(batch_predictions)

batch_size = 20
all_predictions = []

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = []
    for i in range(0, len(df), batch_size):
        batch_texts = df['text'][i:i + batch_size].tolist()
        futures.append(executor.submit(parallel_processing, batch_texts))

    for future in tqdm(futures, desc="Processing batches"):
        all_predictions.extend(future.result())

df['sentiment_score'] = all_predictions
df.to_json('sentiment_analysis_results_roberta.jsonl', orient='records', lines=True)
print(df.head())
Listing 4.1: Training Process
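To make the score definition concrete, the following NumPy sketch reproduces the softmax-difference step on hand-written logits, without loading the model. It assumes, as the comment in Listing 4.1 does, that column 0 holds the negative class and column 1 the positive class; the function name sentiment_score is illustrative only.

```python
import numpy as np


def sentiment_score(logits: np.ndarray) -> np.ndarray:
    """Return P(positive) - P(negative) for each row of two-class logits.

    Assumes column 0 is the negative class and column 1 the positive class.
    """
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[:, 1] - probs[:, 0]


example_logits = np.array([[0.0, 0.0],    # undecided review
                           [-2.0, 2.0],   # clearly positive review
                           [3.0, -1.0]])  # clearly negative review
print(sentiment_score(example_logits))
```

Because the two probabilities sum to 1, the score always lies in [-1, 1]: values near 1 indicate a confidently positive review, values near -1 a confidently negative one, and 0 an undecided one.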

Batch Mapping Function for Parallel Processing


import numpy as np
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def parallel_processing(batch_texts):
    """
    This function preprocesses and analyzes a batch of texts.

    Args:
        batch_texts (list of str): A list of text samples to be processed.

    Returns:
        np.ndarray: An array of predictions resulting from the analysis of the batch.
    """
    batch_predictions = preprocess_and_analyze(batch_texts)
    return np.array(batch_predictions)

# Define the batch size for processing
batch_size = 20
# Initialize an empty list to store all predictions
all_predictions = []

# Utilize ThreadPoolExecutor to parallelize the processing
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = []
    # Split the dataset into batches and submit them for parallel processing
    for i in range(0, len(df), batch_size):
        batch_texts = df['text'][i:i + batch_size].tolist()
        futures.append(executor.submit(parallel_processing, batch_texts))

    # Collect the results from the futures in submission order
    for future in tqdm(futures, desc="Processing batches"):
        all_predictions.extend(future.result())
Listing 4.2: Batch Mapping Function for Parallel Processing

Evaluation

For evaluation, we treat sentiment classification as a binary classification problem
with two classes, positive and negative. The labels are derived from the star ratings:
1 to 3 stars represent negative emotions, while 4 to 5 stars represent positive
emotions. By evaluating the model’s F1 score under different thresholds on the
sentiment score, we determined the optimal threshold to be -0.6. Accordingly, sentiment
scores in [-1, -0.6) are classified as negative and scores in [-0.6, 1] as positive.
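The threshold search described above can be sketched as a simple grid sweep. This is an illustration under stated assumptions, not the study's implementation: the function names and the grid spacing are hypothetical, and the F1 score is computed for the positive class, since the text does not say which F1 variant was optimized.

```python
import numpy as np


def f1_at_threshold(scores, ratings, threshold):
    """F1 of the positive class when score >= threshold is predicted positive."""
    y_true = np.asarray(ratings) >= 4          # 4-5 stars -> positive label
    y_pred = np.asarray(scores) >= threshold   # score at or above threshold -> positive
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(scores, ratings, grid=None):
    """Sweep candidate thresholds over [-1, 1] and return (threshold, F1)."""
    grid = np.linspace(-1.0, 1.0, 41) if grid is None else np.asarray(grid)
    f1s = [f1_at_threshold(scores, ratings, t) for t in grid]
    best = int(np.argmax(f1s))  # first maximum if several thresholds tie
    return float(grid[best]), float(f1s[best])


# Toy data: five reviews with sentiment scores and star ratings.
toy_scores = [-0.9, -0.7, -0.5, 0.2, 0.8]
toy_ratings = [1, 2, 4, 5, 5]
print(best_threshold(toy_scores, toy_ratings))
```

In the toy data the 4-star review has a mildly negative score (-0.5), so a threshold below zero separates the classes best, which mirrors how a negative optimum such as -0.6 can arise.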

Figure 4.1: Threshold vs F1 Score

Predicted \ True | Negative | Positive
Negative | TN | FN
Positive | FP | TP

Table 4.5: Confusion Matrix

Interpretation of Classification Report:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Explanation: The proportion of correct predictions made by the model. The reported
accuracy is 0.8844, indicating that the model correctly predicts 88.44% of the cases.

Figure 4.2: Confusion Matrix at Best Threshold

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Explanation: The proportion of true positive predictions among all positive predictions
made by the model. The reported precision is 0.8836.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

Explanation: The proportion of true positive predictions among all actual positive cases.
The reported recall is 0.8844.

\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Explanation: The harmonic mean of precision and recall, providing a balance between
the two metrics. The reported F1 score is 0.8840.
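As a worked illustration of the four formulas, the sketch below computes them from confusion-matrix counts. The function name and the counts in the example are invented for illustration and are not the study's figures.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

With these counts, accuracy is 170/200 = 0.85, precision is 90/110, recall is 90/100, and the F1 score equals 2·TP / (2·TP + FP + FN) = 180/210.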

Class | Precision | Recall | F1-score | Support
Negative | 0.81 | 0.79 | 0.80 | 201421
Positive | 0.91 | 0.92 | 0.92 | 500107
Accuracy | | | 0.88 | 701528
Macro avg | 0.86 | 0.86 | 0.86 | 701528
Weighted avg | 0.88 | 0.88 | 0.88 | 701528

Table 4.6: Classification Report

Figure 4.3: Receiver Operating Characteristic Curve of Model

In a binary classification problem, the construction of the ROC curve depends on the False
Positive Rate (FPR) and True Positive Rate (TPR) calculated at different thresholds.
Below are the specific steps for calculating FPR and TPR:

False Positive Rate (FPR): The proportion of actual negative samples incorrectly
predicted as positive. The formula is:

\[ \text{FPR} = \frac{FP}{FP + TN} \]

where FP (False Positive) is the number of false positives, and TN (True Negative) is the
number of true negatives.

True Positive Rate (TPR): The proportion of actual positive samples correctly
predicted as positive. The formula is:

\[ \text{TPR} = \frac{TP}{TP + FN} \]

where TP (True Positive) is the number of true positives, and FN (False Negative) is the
number of false negatives.

For each candidate threshold across the range of the sentiment score (here from -1 to 1),
calculate the FPR and TPR at that threshold. At each threshold:

If the model’s predicted score is greater than or equal to the threshold, predict
positive. Otherwise, predict negative.

The Area Under the Curve (AUC) is the total area under the ROC curve. The closer the
AUC value is to 1, the better the model’s discriminative ability.

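Using the FPR and TPR to construct the curve, calculate the AUC using the trapezoidal
rule. The formula for the trapezoidal rule is:

\[ \text{AUC} = \sum_{i=1}^{n-1} \frac{\text{TPR}_{i+1} + \text{TPR}_i}{2} \times (\text{FPR}_{i+1} - \text{FPR}_i) \]

where n is the number of thresholds and the points are ordered by increasing FPR.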

The AUC value is 0.93, indicating that the model performs very well in distinguishing
between negative and positive sentiments.
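The ROC construction and the trapezoidal rule can be sketched in NumPy as follows. The function names and toy data are illustrative assumptions; the thresholds span the score range [-1, 1] plus one value above the maximum score so that the curve starts at the point (0, 0).

```python
import numpy as np


def roc_points(scores, labels, thresholds):
    """FPR and TPR at each threshold; score >= threshold is predicted positive.

    Requires at least one positive and one negative label.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        tn = np.sum(~pred & ~labels)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    order = np.lexsort((tpr, fpr))  # sort by FPR, ties broken by TPR
    f, t = fpr[order], tpr[order]
    return float(np.sum((t[1:] + t[:-1]) / 2 * (f[1:] - f[:-1])))


# Toy example: two negative and two positive reviews, perfectly separated.
toy_scores = [-0.8, -0.2, 0.1, 0.9]
toy_labels = [0, 0, 1, 1]
thresholds = np.append(np.linspace(-1.0, 1.0, 21), 1.1)
fpr, tpr = roc_points(toy_scores, toy_labels, thresholds)
print(auc_trapezoid(fpr, tpr))
```

Sorting the points by FPR before summing matters because the thresholds are generated in increasing order, which yields the ROC points in decreasing FPR order.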

Chapter 5

Result and Analysis

5.1 Data Processing for Data Analysis


For the data analysis, we grouped the dataset by parent_asin and filtered out products
with fewer than 1100 reviews as well as those with missing price values. After this
filtering step, we selected the four products with the highest number of reviews.
The details of these products are as follows:

parent_asin | Product Name
B00R1TAN7I | GranNaturals Boar Bristle Smoothing Hair Brush for Women and Men - Medium/Soft Bristles - Natural Wooden Large Flat Square Paddle Hairbrush for Fine, Thin, Straight, Long, or Short Hair
B019GBG0IE | Collapsible Hair Diffuser by The Curly Co. with The Curly Co. Satisfaction Guarantee
B01M1OFZOG | Bed Head Curve Check Curling Wand for Tousled Waves and Texture, Jumbo Barrel
B085BB7B1M | Salux Nylon Japanese Beauty Skin Bath Wash Cloth/towel (3) Blue Yellow and Pink

For further discussion, we need three hypotheses:

Aspect | Detailed Explanation
H1: Purchase and Review Occur Simultaneously | Immediate Feedback: User reviews are an immediate reflection of their purchase experience, accurately capturing their satisfaction or dissatisfaction. Timeliness of Sentiment Scores: Sentiment scores can be considered a true reflection of users’ immediate experience with the product, without the emotional changes that time lags might bring.
H2: Long Sales Cycle with Stable Quality | Quality Consistency: The product maintains stable quality throughout its sales cycle, providing consistent user experience and not significantly altering review sentiments due to quality fluctuations. Long-Term Reputation Impact: Stable long-term quality helps build a good reputation, making user reviews a more accurate reflection of the product’s true quality.
H3: Basic Daily Necessities with Low Technological Content | Low Impact of Technological Progress: Due to low technological content, advancements in technology have minimal impact on product quality and sales volume. Users focus more on the product’s practicality and cost-effectiveness. Stable User Expectations: Users have stable expectations for basic daily necessities, which are not significantly altered by technological advancements, making review sentiments more reflective of the actual usage experience.

Table 5.2

Next, we analyze one case drawn at random from the four products.

5.2 Analysis and Solution
5.2.1 Case: B00R1TAN7I

Figure 5.1: Time-series of B00R1TAN7I Sentiment Score

Empirical Inference From the graph, we can see that the average sentiment score
fluctuates significantly in different periods. It can be roughly divided into the following
stages:

Time Period Description
2015 to Early 2017 Sentiment scores fluctuate significantly, with many
negative scores, and the overall average score is
relatively low.
Product Quality Issues: There may have been
quality issues during this period, leading to many
negative reviews.
Market Adaptation Period: The product was new
to the market, and there might have been a significant
gap between user expectations and actual experiences.
Consumer Sentiment Response: Consumers’
emotional responses directly influence their reviews,
and concentrated negative sentiment can lead others to
give negative reviews.
Mid-2017 to 2018 Sentiment scores rise, generally trending positive, but
there are still large fluctuations.
Product Improvements: The manufacturer may
have improved the product during this period,
increasing user satisfaction.
Positive Promotion: Effective marketing and
promotion may have led to more positive reviews.
Electronic Word-of-Mouth (eWOM):
Concentrated bursts of negative reviews and their
spread may lead to significant drops in sentiment
scores, while the spread of positive reviews can cause
scores to rise.
Late 2018 to Early 2019 Average sentiment scores drop significantly, with a
notable increase in negative sentiment scores.
Quality Issues or Service Failures: There might
have been major quality issues or service failures,
leading to a surge in negative reviews.
Negative Word-of-Mouth Effect: Negative reviews
spread quickly through electronic word-of-mouth
(eWOM), further amplifying negative sentiment.
Confirmation Bias: Users may be inclined to make
reviews consistent with existing sentiment scores,
amplifying a particular sentiment trend.
2019 to 2020 Sentiment scores gradually recover, but fluctuations
remain noticeable.
Quality Improvements: The manufacturer may
have implemented a series of improvements after
identifying issues, gradually regaining user trust.
Market Competition: Intense market competition
may have brought more user experiences, but also
more review volatility.
Anchoring Effect: Consumers may be influenced by
previous reviews, forming expectations about the
product that affect their subsequent sentiment scores,
explaining sustained high or low scores during certain
periods.
2021 to 2022 Sentiment scores rise again but do not reach previous
peaks, with many negative sentiments still present.
Stabilization Period: The product and service
became more stable, and sentiment scores gradually
rose, but previous negative impacts could not be
entirely eliminated.
Market Saturation: The market reached saturation,
and differences in expectations between new and
existing users might have resulted in more negative
sentiments.
Electronic Word-of-Mouth (eWOM):
Concentrated bursts of negative reviews and their
spread may lead to significant drops in sentiment
scores, while the spread of positive reviews can cause
scores to rise.
Late 2022 to Early 2023 Average sentiment scores drop again, and fluctuations
decrease, tending towards a stable negative trend.
Reoccurrence of Issues: There might have been
recurring product quality or service issues, leading to a
drop in sentiment scores.
Lowered User Expectations: Due to previous
negative impacts, user expectations may have lowered,
resulting in more negative reviews even without major
issues.
Confirmation Bias: Users may be inclined to make
reviews consistent with existing sentiment scores,
amplifying a particular sentiment trend.

Solution
Strategy Detailed Implementation
Monitor and Respond
Establish a Data Platform: Set up a centralized data platform that collects and
processes user feedback in real-time. Integrate data from various sources such as
online reviews, social media comments, and customer service interactions.
Implement Anomaly Detection Mechanisms: Develop algo-
rithms to detect significant deviations in sentiment scores. When a
cluster of negative sentiment is detected, the system should auto-
matically trigger alerts.
Real-Time Tracking and Response: Create dashboards that
provide real-time insights into user sentiment trends. Set up a ded-
icated team to monitor these dashboards and respond to negative
feedback promptly.
Targeted Product Optimization: Analyze negative feedback to
identify specific issues with the product or service. Use this infor-
mation to make targeted improvements. For instance, if many users
report a specific functionality issue, prioritize fixing it in the next
update.

Promotion and Marketing
Identify Peak Sentiment Periods: Use historical data to identify periods when
sentiment scores are typically high. Plan promotional campaigns during these times
to maximize their impact.
Highlight Positive Reviews: During marketing campaigns,
showcase positive user reviews and testimonials. This can enhance
credibility and attract more consumers.
Engage Influencers: Partner with influencers who have a posi-
tive view of your product. Their endorsements can amplify positive
sentiment and reach a wider audience.
Offer Incentives for Positive Feedback: Encourage satisfied
customers to leave positive reviews by offering incentives such as
discounts or loyalty points.

Continuous Improvement
Regularly Collect Feedback: Use surveys, user interviews, and feedback forms to
gather continuous input from users. Ensure that feedback collection is an ongoing
process rather than a one-time event.
Analyze Feedback for Insights: Use text analytics and senti-
ment analysis tools to extract actionable insights from user feed-
back. Identify common themes and recurring issues.
Implement a Feedback Loop: Establish a process for turn-
ing user feedback into actionable improvements. Prioritize changes
based on the impact on user satisfaction and the feasibility of im-
plementation.
Test and Iterate: Before rolling out major changes, test them
with a small user group to gather feedback and refine the improve-
ments. This iterative process ensures that the changes meet user
needs.
Communicate Improvements to Users: When you make im-
provements based on user feedback, communicate these changes to
users. This shows that you value their input and are committed to
enhancing their experience.
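As one possible reading of the anomaly detection mechanism proposed in the table above, the sketch below flags days whose average sentiment score falls far below a trailing baseline. It is an assumption-laden illustration: the function name, the 30-day window and the z-score cutoff of -2 are hypothetical choices, not values taken from the study.

```python
import pandas as pd


def flag_sentiment_drops(daily_mean: pd.Series, window: int = 30,
                         z_threshold: float = -2.0) -> pd.Series:
    """Flag days whose mean sentiment is unusually low versus the trailing window.

    daily_mean: mean sentiment score per day, indexed by date.
    Returns a boolean Series that is True where an alert would be raised.
    """
    # The baseline excludes the current day (shift by one) so that a sudden
    # drop does not inflate its own reference mean and standard deviation.
    baseline = daily_mean.shift(1).rolling(window, min_periods=window)
    z = (daily_mean - baseline.mean()) / baseline.std()
    # Days without a full baseline window yield NaN and are never flagged.
    return z < z_threshold


# Toy series: stable sentiment for 40 days, then a sharp drop.
values = [0.5, 0.6] * 20 + [-0.8]
dates = pd.date_range("2022-01-01", periods=len(values), freq="D")
alerts = flag_sentiment_drops(pd.Series(values, index=dates))
print(alerts[alerts].index.tolist())
```

A rolling z-score is only one of many detectors; it is used here because it needs no training data and its two parameters are easy to explain to a monitoring team.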

5.2.2 Further Analysis and Discussion


OLS

Based on the regression results, the average sentiment score and purchase count are
negatively correlated, but the relationship is weak and not statistically significant.
The correlation coefficient is -0.140007, and the R-squared value from the regression
analysis is 0.020, indicating that the average sentiment score explains only 2% of the
variance in purchase count. Additionally, the p-value is 0.114, which is greater than
0.05, so the effect of sentiment score on purchase count is not statistically significant.
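The reported correlation and R-squared are consistent with each other: in a simple regression with one predictor, R-squared equals the squared correlation coefficient, and (-0.140007)^2 ≈ 0.0196, which rounds to 0.020. The NumPy sketch below illustrates this identity on invented data; the function name and the toy values are hypothetical.

```python
import numpy as np


def simple_ols(x, y):
    """Fit y = intercept + slope * x and return slope, intercept, r and R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # least-squares line
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)              # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)         # total variation
    r_squared = 1.0 - ss_res / ss_tot
    r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
    return float(slope), float(intercept), float(r), float(r_squared)


# Invented monthly values: average sentiment score vs. review count.
toy_sentiment = [0.62, 0.55, 0.70, 0.48, 0.66]
toy_counts = [40, 55, 38, 47, 52]
slope, intercept, r, r2 = simple_ols(toy_sentiment, toy_counts)
print(round(r, 4), round(r2, 4), round(r ** 2, 4))

# Consistency check on the figures reported in the text.
print(round((-0.140007) ** 2, 3))
```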

Figure 5.2: OLS on How SA Score Affects Purchase Number by Month

Diverse Experiences from High Purchase Volumes When the purchase quantity
of a product increases, the buyer group becomes more diverse. Among these buyers, some
may have higher expectations, or their needs and preferences might not completely align
with the product, leading to lower ratings.

Expectation Effect Popular products often come with high expectations. When peo-
ple purchase popular products, they tend to expect them to perfectly meet their needs. If
the product fails to meet these high expectations, buyers may give negative reviews.

Quality Control Challenges When the sales volume of a product increases signif-
icantly, the pressure on production and supply chains also rises. Quality control can
become more challenging, leading to some quality issues and defects. These problems are
likely to be reflected in negative reviews.

Increased Visibility of Negative Reviews The sheer volume of reviews for popular
products increases the visibility of negative feedback. Some buyers, after seeing existing
negative reviews, may become more aware of the product’s shortcomings and be more
inclined to leave negative feedback after their purchase.

Figure 5.3: Timeseries of SA Score and Purchase Count

Competitor and Malicious Reviews Hot-selling products often attract the atten-
tion of competitors, who may leave negative reviews to undermine the competition. Ad-
ditionally, some buyers might leave unfair negative reviews due to other reasons, such as
delivery issues or customer service problems.

Echo Effect In cases of high sales volumes, the exchange of opinions among buyers
becomes more frequent. If some early reviews are negative, subsequent buyers who read
these reviews may be influenced. Even if their experience is neutral or positive, they
might still be inclined to give a negative review.

PCA OLS

Figure 5.4: OLS with More Variables and PCA

df_monthly = df_filtered.groupby(['year', 'month']).agg({
    'parent_asin': 'count',  # Review Counts = Purchase Counts
    'sentiment_score_roberta': 'mean',
    'review_length': 'mean',
    'helpful_vote': 'mean',
    'rating': 'mean',
    'has_images': 'mean',
    'weekday': 'mean',
    'average_rating': 'mean',
    'rating_number': 'mean'
}).reset_index()
Listing 5.1: Variables Processing

Component | Influencing Features | Interpretation
x1 | sc_roberta, sc_roberta_sq, rating_number | x1 can be seen as a combined score of sentiment and rating numbers. A high x1 suggests high sentiment scores and rating numbers.
x2 | review_length, helpful_vote, has_images | x2 may represent the engagement and helpfulness aspect of reviews. A high x2 indicates long, helpful reviews with images.
x3 | helpful_vote, review_length, year | x3 captures the temporal and helpfulness aspects of reviews. A high x3 indicates helpful, detailed reviews over the years.
x4 | rating, sc_roberta_sq, month | x4 may represent a periodic or seasonal sentiment and rating trend. A high x4 suggests high ratings and sentiment scores during certain months.
x5 | weekday, average_rating, has_images | x5 might capture weekly patterns in reviews and average ratings. A high x5 indicates high average ratings on specific days of the week.
x6 | review_length, helpful_vote, rating | x6 represents the detailed and helpful nature of reviews along with their ratings. A high x6 indicates detailed, highly-rated, and helpful reviews.
x7 | has_images, sc_roberta_sq, weekday | x7 might represent the visual aspect and sentiment of reviews on specific days. A high x7 indicates positive reviews with images on certain weekdays.
x8 | month, average_rating, rating_number | x8 captures monthly trends in average ratings and rating numbers. A high x8 indicates high ratings and rating numbers during certain months.

Table 5.5: Interpretation of Principal Components

Significant Predictors x1, x2, x3 are statistically significant and have a meaningful
impact on purchase count.

Marginally Significant Predictors x5, x6, and x8 are marginally significant, sug-
gesting a potential influence on purchase count.

Non-Significant Predictors x4 and x7 do not show significant contributions to the
model.
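The general technique behind this section, regressing the outcome on principal components instead of the raw variables, can be sketched with NumPy alone. The sketch assumes the features are standardized before PCA, which the text does not state; the function names and the random demonstration data are purely illustrative.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project standardized features onto their first principal components.

    Returns the component scores, the loadings (one row per component) and
    the share of variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each feature
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components]
    scores = Z @ loadings.T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]


def ols_fit(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] for y on X."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return beta


# Demonstration on random data standing in for the monthly features.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
scores, loadings, explained = pca_scores(features, n_components=3)
target = 2.0 + 3.0 * scores[:, 0]                 # depends on component 1 only
print(np.round(ols_fit(scores, target), 6))
print(np.round(explained, 3))
```

Because principal component scores are mutually uncorrelated, each regression coefficient can be read independently of the others, which is the usual motivation for applying PCA before OLS when the raw features are collinear. The loadings show which original features drive each component, in the spirit of Table 5.5.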

Partial Autocorrelation

Figure 5.5: Partial Autocorrelation of Sentiment Scores for B00R1TAN7I

From Figure 5.5, we can see that the partial autocorrelation coefficients after lag 1 are
close to 0 and lie within the blue confidence intervals. The partial autocorrelations at
these lags are therefore not significant, which suggests that sentiment scores at
different lags do not have significant direct linear relationships. In other words, the
current sentiment score is primarily influenced by the present time point rather than
directly by previous time points, which is consistent with H1.
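For readers unfamiliar with the partial autocorrelation function, the sketch below estimates it by regression: the value at lag k is the coefficient on the k-th lag when the series is regressed on its first k lags. This is a generic NumPy illustration on simulated data, not the routine used to produce the figure; the function name and the simulated series are assumptions.

```python
import numpy as np


def pacf_ols(series, max_lag):
    """Partial autocorrelations for lags 0..max_lag, estimated by regression."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    values = [1.0]                                 # lag 0 is 1 by definition
    for k in range(1, max_lag + 1):
        y = x[k:]
        # Column j-1 holds the series lagged by j, aligned with y.
        lags = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(lags, y, rcond=None)
        values.append(coef[-1])                    # coefficient on lag k
    return np.array(values)


# Simulated AR(1) series with coefficient 0.7, for demonstration only.
rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
sim = np.zeros(5000)
sim[0] = noise[0]
for t in range(1, 5000):
    sim[t] = 0.7 * sim[t - 1] + noise[t]

pacf = pacf_ols(sim, max_lag=5)
band = 1.96 / np.sqrt(len(sim))                    # approximate 95% band
print(np.round(pacf, 3), round(band, 3))
```

Values inside the band of roughly ±1.96/√n are conventionally treated as not significant, which corresponds to the shaded interval in a typical PACF plot.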

Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX)

Since the PACF plot shows no significant partial autocorrelation coefficients beyond lag
0, a high-order autoregressive term may not be necessary. It is advisable to reduce the
order of the AR term, for example by trying ARIMA(1, 1, 1). Because the time series may
still be non-stationary, retaining a first-order differencing term (d = 1) is reasonable.
Likewise, as there are no significant lagged autocorrelations in the PACF plot, the MA
term can be kept at a low order.

Parameter Estimates The AR(1) coefficient is not significant (P > 0.05), indicating
a weak linear relationship between the current value and the first lagged value.
The MA(1) coefficient is highly significant (P < 0.05), indicating a strong linear
relationship between the current value and the first lagged error term.
The residual variance (sigma2) is significant, indicating variability in the residuals.

Figure 5.6: SARIMAX

Model Diagnostics Ljung-Box Q Test (L1): The P value is much greater than 0.05,
indicating that the residuals show no significant autocorrelation and are consistent
with white noise.
Jarque-Bera (JB) Test: The P value is 0.00, indicating that the residuals deviate from
a normal distribution.
Heteroskedasticity Test (H): The P value is greater than 0.05, indicating no significant
heteroskedasticity and that the residual variance is relatively stable.
Skewness: -0.09, indicating a slight left skew in the residual distribution.
Kurtosis: 1.71, lower than 3, indicating that the residual distribution is flatter
(less peaked, with lighter tails) than a normal distribution.

Conclusion The AR(1) term is not significant, whereas the MA(1) term is highly
significant, suggesting that the current value of the time series is mainly influenced by
the previous error term rather than the previous value directly. Residuals are white
noise, with no significant autocorrelation. Residual variance is stable, with no significant
heteroskedasticity. Despite residuals deviating slightly from normality, this non-normality
typically has a minimal impact on the time series model.

Discussion Electronic Word-of-Mouth (eWOM): The significant MA(1) term indicates
that sentiment scores are influenced by prior errors, possibly reflecting the eWOM effect,
where sentiment scores at one time impact the sentiment errors in subsequent evaluations.
Consumer Emotional Reactions: The primary influence of the previous error term on
current sentiment scores suggests that users’ ratings are adjusted based on the overall
sentiment errors from the preceding period.
Confirmation Bias: The non-significance of the AR(1) term indicates that users’ ratings
are relatively independent and not directly influenced by the ratings of the preceding
period, but rather by the preceding errors.

Chapter 6

Discussion

6.1 Summary of Findings


This research explored the evolution and applications of NLP, particularly focusing on
sentiment analysis using advanced transformer-based models like BERT and RoBERTa.
The literature review highlighted the significant advancements in NLP from rule-based
systems to sophisticated deep learning techniques. The introduction of transformer mod-
els, especially BERT and its optimized version RoBERTa, has revolutionized the field by
providing powerful tools to model complex language patterns.

In the empirical analysis, the application of RoBERTa to sentiment analysis on Amazon
reviews demonstrated its effectiveness in capturing nuanced sentiments and outperforming
traditional methods. The case study of the product B00R1TAN7I provided insights
into how sentiment scores fluctuate over time and highlighted the impact of product
quality, market adaptation, and electronic word-of-mouth on user reviews.

6.2 Implications for Theory and Practice


6.2.1 Theoretical Implications
NLP and Sentiment Analysis

The findings confirm the robustness of transformer-based models in sentiment analysis,
supporting their theoretical superiority over traditional methods. This emphasizes the
importance of continuous advancements in NLP techniques to capture the intricacies of
human language. Future research should continue to explore and optimize these technolo-
gies to further enhance the accuracy and application scope of sentiment analysis.

Behavioral Economics

The analysis of sentiment trends in the context of electronic word-of-mouth (eWOM),
consumer emotional reactions, and the confirmation bias underscores the
interconnectedness of consumer behavior and online reviews. These behavioral economics principles
are vital for understanding how sentiment can influence purchasing decisions and overall
market dynamics. Integrating sentiment analysis with behavioral economics can provide
deeper insights into the psychological and social factors in consumer decision-making,
enriching consumer behavior theory.

6.2.2 Practical Implications


Product Management

The study provides actionable insights for product managers to monitor and respond to
sentiment trends effectively. Understanding the factors influencing sentiment scores can
help in making informed decisions about product improvements and marketing strategies.
For example, by identifying and addressing recurring issues in user feedback, product
managers can improve product quality and user satisfaction. Additionally, promptly
responding to negative reviews and engaging positively with customers can build brand
loyalty and trust.

Marketing Strategies

The role of positive and negative sentiment in shaping consumer perception highlights the
need for strategic marketing efforts. Leveraging positive reviews and managing negative
feedback promptly can significantly impact a product’s market success. Companies can
enhance their brand image by promoting positive reviews, using influencers, and engaging
with customers on social media. Additionally, establishing positive customer relationships
and providing excellent after-sales service can enhance customer satisfaction and word-
of-mouth promotion.

Continuous Improvement

Implementing a robust feedback loop based on sentiment analysis can lead to continuous
product and service enhancements. This approach ensures that consumer expectations
are met, thereby fostering loyalty and positive word-of-mouth. Companies should regu-
larly collect user feedback, analyze the data to identify improvement opportunities, and
take swift action. Moreover, transparently communicating improvements to users can
further build trust and satisfaction.

6.3 Limitations and Future Research Directions


While this study provides valuable insights, several limitations need to be addressed in
future research:

Data Limitations

The analysis was conducted on a single dataset (Amazon reviews in the beauty category).
Future research could test whether the findings generalize by using datasets from other
product categories and platforms. Different product categories may have distinct user
bases and review habits, and studying these differences could inform a more comprehensive
sentiment analysis model. Additionally, cross-platform data integration can help verify
the generality and stability of the model.

Model Limitations

Although RoBERTa demonstrated high accuracy, exploring other advanced models such as
GPT-3 or more recent BERT-family variants could provide additional insights and
potentially better performance. With the rapid development of NLP technology, new models
and algorithms are continually emerging. Researchers should track these developments so
that sentiment analysis performance remains competitive.
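
One way to make such model comparisons concrete is to score every candidate on the same held-out set of labelled reviews. The sketch below is only an illustration: the model names and label strings are hypothetical placeholders, and accuracy is used because it is the metric mentioned above, not because it is necessarily the best choice.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_models(y_true, predictions_by_model):
    """Return (model_name, accuracy) pairs, best model first.

    predictions_by_model -- dict mapping a model name (hypothetical)
                            to its predicted labels on the same reviews
    """
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Evaluating all models on identical reviews keeps the comparison fair; differences in accuracy then reflect the models rather than the data split.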

Behavioral Insights

The study primarily focused on quantitative sentiment analysis. Incorporating qualitative
analyses and user interviews could provide deeper behavioral insights into why certain
sentiments prevail at different times. By deeply understanding users’ emotions and
behavioral motivations, companies can develop more effective marketing strategies and
product improvement plans, further enhancing user satisfaction and brand loyalty.

Temporal Dynamics

Further research could explore more sophisticated time-series models to capture the tem-
poral dynamics of sentiment scores and their causal relationships with external factors
like market trends and promotional activities. Understanding the time-based patterns
and trends in sentiment scores can help businesses predict future consumer behavior and
adjust their strategies accordingly.
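
Before turning to more sophisticated time-series models, a simple baseline is to aggregate sentiment scores by month and smooth them with a moving average. The sketch below assumes each review carries an ISO-formatted date string (YYYY-MM-DD) and a numeric sentiment score; both the record layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def monthly_mean_sentiment(records):
    """Average sentiment per calendar month.

    records -- iterable of (date_string, score) pairs, date as 'YYYY-MM-DD'
    Returns a list of (month, mean_score) pairs sorted by month.
    """
    by_month = defaultdict(list)
    for date_str, score in records:
        by_month[date_str[:7]].append(score)  # 'YYYY-MM' prefix is the month key
    return [(month, mean(scores)) for month, scores in sorted(by_month.items())]


def moving_average(values, window):
    """Trailing moving average; output has len(values) - window + 1 items."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]
```

The smoothed series gives a rough view of trend direction; it does not establish causal links to promotions or market events, which would require the richer models discussed above.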

6.4 Conclusion

This research underscores the transformative impact of advanced NLP models on sen-
timent analysis. By leveraging transformer-based models, businesses can gain nuanced
insights into consumer sentiment, enabling more informed decision-making. The integra-
tion of behavioral economics principles further enriches our understanding of consumer
behavior in the digital age. As NLP technologies continue to evolve, their applications
in sentiment analysis and beyond are likely to expand, offering new avenues for research
and practical innovation. The findings from this study provide a solid foundation
for future explorations in the intersection of NLP, sentiment analysis, and behavioral
economics, ultimately contributing to more effective and consumer-centric business prac-
tices.
