Detailed Report
III. DATASET ANALYSIS

A. Data Acquisition

For this project we are using the Amazon Baby Product Review dataset available at http://jmcauley.ucsd.edu/data/amazon/. The dataset consists of 160,000 entries, of which we use the review text and the rating attributes. The dataset was acquired in JSON format and we labelled it based on the ratings: every review with a rating greater than 3 was labelled 1, representing a positive review, and every review with a rating less than 3 was labelled 0, representing a negative review. Reviews with a rating equal to 3 were removed from the dataset, as we consider them neutral.

B. Data Preprocessing

Careful preprocessing of the review text is an important step towards getting good results for our dataset.
• Removing Stop Words: Stop words add little to the text mining process and can serve to reduce our accuracy, so we are removing them from our corpus. The NLTK package for Python provides a dictionary of stop words, which we use for the removal. We did not remove the words "not" and "no", as removing them could change the context of a sentence: "not good" would become "good".
• Removing Hyperlinks: We noticed that our reviews also contained hyperlinks. These are not needed in our resulting feature set, so we removed them using the Beautiful Soup module.
• Removing Punctuation: Punctuation marks are likewise unnecessary for our analysis. We remove them from the corpus with the help of regular expressions.
• Lemmatization: Given a word as input, lemmatization returns its lemma, i.e. the base form of the word. This ensures that we build a meaningful corpus for our analysis.
A minimal sketch of this acquisition and preprocessing pipeline is shown below.
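The following sketch illustrates the steps described above. It is a hypothetical reconstruction, not the authors' exact code: the file name, the column names (reviewText, overall) and the helper clean_review are illustrative assumptions, and it requires pandas, beautifulsoup4 and NLTK (with the stopwords and wordnet corpora downloaded).

```python
# Hypothetical sketch of the acquisition, labelling and cleaning steps described above.
import re
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Load the JSON reviews (assumed field names: 'reviewText' and 'overall').
df = pd.read_json("reviews_Baby.json.gz", lines=True, compression="gzip")
df = df[["reviewText", "overall"]].dropna()

# Label: rating > 3 -> 1 (positive), rating < 3 -> 0 (negative), rating == 3 dropped.
df = df[df["overall"] != 3]
df["label"] = (df["overall"] > 3).astype(int)

# Stop-word list from NLTK, keeping the negations 'not' and 'no'.
stop_words = set(stopwords.words("english")) - {"not", "no"}
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()    # strip HTML markup
    text = re.sub(r"http\S+|www\.\S+", " ", text)            # drop raw hyperlinks
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())         # remove punctuation and digits
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()
              if tok not in stop_words]                      # lemmatize, drop stop words
    return " ".join(tokens)

df["clean_review"] = df["reviewText"].apply(clean_review)
```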
C. Feature Extraction

Fig. 3: Review word cloud. (a) Features from the positive reviews; (b) features from the negative reviews.
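A word cloud like Fig. 3 can be produced from the cleaned reviews. The sketch below is a hypothetical illustration: the report does not state which library was used, so the wordcloud package and the df/clean_review names from the earlier sketch are assumptions.

```python
# Hypothetical sketch of generating the positive/negative review word clouds (Fig. 3).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

positive_text = " ".join(df.loc[df["label"] == 1, "clean_review"])
negative_text = " ".join(df.loc[df["label"] == 0, "clean_review"])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, text, title in [(axes[0], positive_text, "Positive reviews"),
                        (axes[1], negative_text, "Negative reviews")]:
    ax.imshow(WordCloud(background_color="white").generate(text))
    ax.set_title(title)
    ax.axis("off")
plt.show()
```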
1) Bag of Words:
• Count Vectorizer: It converts a collection of text documents into a matrix of token counts. This implementation builds a sparse representation of the token occurrences.
• TF-IDF: Known as term frequency-inverse document frequency, it is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. If a positive word occurs in a negative review multiple times, or vice versa, the weight of such a word is reduced in the TF-IDF representation. A minimal sketch of both vectorizers follows this list.
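The following is a minimal sketch of the two bag-of-words representations using scikit-learn. Parameter choices such as max_features and ngram_range are illustrative assumptions, not the values used in the report.

```python
# Hypothetical sketch of the Count Vectorizer and TF-IDF feature extraction.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df["clean_review"]          # cleaned reviews from the preprocessing sketch

# Sparse matrix of raw token counts.
count_vec = CountVectorizer(max_features=20000, ngram_range=(1, 2))
X_counts = count_vec.fit_transform(corpus)

# TF-IDF weighting: words that occur across both classes receive lower weights.
tfidf_vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X_tfidf = tfidf_vec.fit_transform(corpus)

print(X_counts.shape, X_tfidf.shape)
```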
2) Word Embedding:
Word embeddings capture the context of a word in a document, its semantic and syntactic similarity, its relation to other words, and so on. They are a type of word representation that allows words with similar meaning to have a similar representation (for example, under cosine similarity). For our word embeddings we use two models:
• Word2Vec: Word2vec is a two-layer neural network that processes text. It takes a text corpus as input and outputs feature vectors for all the words in that corpus. The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space; that is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words (https://skymind.ai/wiki/word2vec).
• GloVe: GloVe is an unsupervised learning algorithm from Stanford for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space (https://nlp.stanford.edu/projects/glove/).
• t-SNE for Word Embedding Visualisation: t-Distributed Stochastic Neighbor Embedding is a technique for dimensionality reduction that is particularly well suited to visualising high-dimensional datasets. For our project we used t-SNE to visualise our high-dimensional word embeddings, and obtained interesting visualisations of the relations between the vectors captured from our texts.
1) From figure 5(a) it can be seen that in the word2vec model fewer related words lie near "love", whereas in figure 5(b) more similar words are captured near it.
2) Similarly, from figures 6(a) and 6(b), for the word "hate" we see fewer nearby words in the word2vec model than in the GloVe model.
We use these trained models as the pre-trained embedding layer of our LSTM network, and we suspected that the GloVe model might produce better results than the word2vec model. A minimal sketch of this embedding setup is given below.
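The sketch below illustrates the setup just described: training word2vec on our cleaned reviews with gensim, loading pre-trained GloVe vectors, and feeding the embedding matrix into an LSTM as a frozen embedding layer. File names, dimensions and hyper-parameters are illustrative assumptions (gensim >= 4 and TensorFlow 2.x APIs assumed), not the report's exact configuration.

```python
# Hypothetical sketch of the word-embedding plus LSTM pipeline.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

EMB_DIM, MAX_LEN = 100, 200
sentences = [review.split() for review in df["clean_review"]]

# (a) word2vec trained on our own corpus.
w2v = Word2Vec(sentences, vector_size=EMB_DIM, window=5, min_count=2, workers=4)

# (b) pre-trained GloVe vectors (glove.6B.100d.txt assumed to be downloaded).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Integer-encode the reviews and build the embedding matrix (GloVe shown here;
# using w2v.wv instead gives the word2vec variant).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["clean_review"])
X = pad_sequences(tokenizer.texts_to_sequences(df["clean_review"]), maxlen=MAX_LEN)

vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMB_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# LSTM classifier with the frozen pre-trained embedding layer.
model = Sequential([
    Embedding(vocab_size, EMB_DIM,
              embeddings_initializer=Constant(embedding_matrix), trainable=False),
    LSTM(128),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, df["label"].values, validation_split=0.2, epochs=3, batch_size=256)
```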
B. Precision:
Precision is the number of correct positive predictions our model makes compared to the total number of positive predictions it makes. Precision is a measure of exactness, quality, or accuracy. High precision means that most or all of the predicted positives are correct: a precision score of 1.0 means that every item labelled positive does indeed belong to the positive class. A precision score by itself, however, does not say how many items of that class were left unlabelled. It is defined as follows:

Precision = TP / (TP + FP)    (2)

C. Recall:
Recall is the number of positives our model predicts compared to the actual number of positives in our data. Recall is a measure of completeness. High recall means that our model classified most or all of the possible positive elements as positive: a recall score of 1.0 means that every item from the class was labelled as belonging to it. However, from the recall score alone we cannot know how many other items were incorrectly labelled.

Recall = TP / (TP + FN)    (3)

D. F1 Score:
Precision and recall are often used together because they complement each other in describing the effectiveness of a model. The F1 score combines the two as the weighted harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)    (4)

A short sketch of how these metrics can be computed for one of our classifiers is given below.
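The following sketch evaluates one bag-of-words classifier with the metrics defined above. The TF-IDF plus LinearSVC pipeline and the GridSearchCV grid are illustrative assumptions, not the exact settings behind the reported tables.

```python
# Hypothetical sketch: tune a TF-IDF + linear SVM pipeline and report the metrics.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_review"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
grid = GridSearchCV(pipeline, param_grid={"svm__C": [0.1, 1, 10]},
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_test, y_pred))          # harmonic mean of the two
```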
TABLE IV: Results for the original data distribution with the Count Vectorizer approach.

TABLE V: Results for the original data distribution with the TF-IDF approach.

Several classifiers were used in our experiments: Support Vector Machine, Multinomial Naive Bayes, K-Nearest Neighbours, Logistic Regression and AdaBoost. Additionally, we used an LSTM recurrent neural network for the word-embedding approach. Across all experiments, after obtaining the hyper-parameters listed in the tables through GridSearchCV, the linear Support Vector Machine with the TF-IDF approach on the original data distribution gave the best result, with a testing accuracy of 93.39%; with Count Vectorization, SVM likewise performed much better than the other classifiers, achieving a testing accuracy of 92.72%. We also found that Logistic Regression, while being the fastest to compute, came second best with 87.8% testing accuracy for the Count Vectorizer and 88.64% for TF-IDF on the balanced data distribution. On the original data distribution it showed 92.71% testing accuracy for the Count Vectorizer and 93.26% for TF-IDF, almost as good as SVM.
However, kNN gave extremely poor results on the balanced data distribution for both the Count Vectorizer and TF-IDF, with 63.48% and 58.26% testing accuracy respectively. It gave somewhat better results on the original data distribution, as expected from our initial hypothesis, thereby failing to generalize. For the balanced data distribution with the LSTM approach on word embeddings, we obtained good results with the GloVe model, achieving 92.38% testing accuracy. We did not, however, get the results we expected from the word2vec model: word2vec with the LSTM only matched the best classical classifier, i.e. SVM. On the original data distribution both the GloVe and word2vec models performed much better than the other classifiers, showing 94.25% and 95.05% testing accuracy respectively. The training and testing accuracies can be seen in figures 8 and 9, where it can be observed that the LSTM with the GloVe model performed best.

Fig. 8: Training and testing accuracy for the balanced data distribution.

Fig. 9: Training and testing accuracy for the original data distribution.

VIII. CONCLUSION

In this project we applied a supervised learning approach to detect the polarity of the reviews in our dataset, classifying the reviews under both the balanced and the original data distribution. After evaluating our approaches with 5-fold cross validation, we reached some interesting results. We found that the LSTM approach using GloVe embeddings performed best for our dataset under both the balanced and the original distribution of the data. Among the BoW approaches, SVM with TF-IDF outperformed all other classifiers. It is worth noting that the MNB and LR classifiers were extremely fast to compute and provided decent results, though below our best classifier. On the whole we conclude that kNN is the worst-performing model for this kind of application, due to the high variance in the data. We also saw that the distribution of ratings in the data has a meaningful impact on model performance, with the original distribution giving better performance than the balanced data.
IX. FUTURE WORK

In the future we would like to apply our techniques to multiclass classification over the full rating range (1-5). We also intend to examine the performance of the classifiers when over-sampling techniques are used. Our future work further includes performing text summarization of the reviews. We would also like to improve our models through more extensive hyper-parameter tuning and by adding more LSTM layers, and to see how the models behave on reviews that are sarcastic or longer than assumed within the scope of this project.