Comparative Analysis
of YouTube Comments
using ML Techniques
AKSHATA KOMAL 20010180
RIYA 20010080
SWAPNAJIT TRIPATHY 20010133
ASHUTOSH BEHERA 20010156
Under the Guidance of
Dr. Bichitrananda Behera
TABLE OF CONTENTS
01 Problem Statement
02 Introduction
03 Literature Surveys
04 Proposed Solution with Block Diagram
05 Result & Discussion
06 Reference
PROBLEM STATEMENT
As a YouTube channel grows, the influx of comments becomes valuable viewer
feedback. This motivates a sentiment analysis model that classifies comments
automatically, offering insights for content enhancement, engagement
strategies, and effective moderation.
INTRODUCTION
YouTube's comments section is a crucial part of user interaction, allowing
content creators, businesses, and individuals to understand audience sentiment
and engagement. Machine learning techniques can classify comments into
positive, negative, or neutral sentiments, providing valuable insights for content
creators, marketers, and platform administrators. This understanding helps tailor
content to audience preferences, fostering viewer engagement and loyalty. It
also helps marketers assess the effectiveness of advertising campaigns and helps
platform administrators maintain a positive online environment. The project aims to create a machine learning
model that can autonomously process and classify comments based on
sentiment, providing rapid insights for content creators.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
1 | 2021 | Classifying YouTube Comments Based on Sentiment and Type of Sentence | Rhitabrat Pokharel, Dixit Bhatta | ScienceDirect
MAJOR FINDINGS
METHODOLOGY: Linear SVC, Logistic Regression, Multinomial NB, Random Forest, Decision Tree
DATASET: YouTube Reviews
MERIT: It classified the comments using 5 different models with 2 feature selection methods. The
experiments showed that the best cross-validation and F1 scores were obtained by Logistic Regression.
DEMERIT: The number of classes and sub-classes can be increased to represent a more comprehensive
comment classification. Likewise, the classification models and overall feature selection approach can be
further improved for comments that belong to more than one class.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
2 | 2019 | Sentiment Analysis of Positive and Negative of YouTube Comments Using Naïve Bayes – Support Vector Machine (NBSVM) Classifier | Nizar Muhammad, Saiful Bukhori, Priza Pandunata | IEEE Xplore
MAJOR FINDINGS
METHODOLOGY: Naïve Bayes & Support Vector Machine
DATASET: YouTube Reviews
MERIT: On data obtained from YouTube video comments, the combination of Naïve Bayes and Support Vector Machine
methods produced better accuracy and stronger performance with a 7:3 split (70% training data, 30% testing data).
It reached the highest test values: precision of 91%, recall of 83% and F1 score of 87%.
DEMERIT: The authors did not explore the classification of different types of sentences (like imperative, question, corrective, and
sentimental). The work is likewise limited to either sentiment analysis or question classification.
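The reported precision, recall and F1 values can be cross-checked, since F1 is the harmonic mean of precision and recall. A minimal sketch follows; the function name is illustrative and not taken from the cited paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the NBSVM classifier on the 70/30 split.
reported_f1 = f1_score(0.91, 0.83)
print(f"F1 = {reported_f1:.2f}")  # F1 = 0.87
```

The result agrees with the 87% F1 score quoted above.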
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
3 | 2010 | How Useful Are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings | Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, and Jose San Pedro | ResearchGate
MAJOR FINDINGS
METHODOLOGY: Linear SVM & Thesaurus
DATASET: YouTube Comments
MERIT: It offers objective analysis, a practical predictive model, scalability insights, statistical rigor,
and actionable use cases for improving YouTube's user engagement and content quality.
DEMERIT: It used a linear support vector machine along with a thesaurus to obtain the degree of negativity
and positivity of each word in the comments. The accuracy of this approach stands at 0.72. Even without
using a thesaurus, we were able to increase the accuracy of our model for sentiment analysis.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
4 | 2006 | Experiments with Sentence Classification | Anthony Khoo, Yuval Marom, and David Albrecht | ACL Anthology
MAJOR FINDINGS
METHODOLOGY: Naive Bayes, Decision Tree, SVM
DATASET: Email conversation
MERIT: In most existing work, sentences of the imperative class have not been researched
adequately. The authors ran experiments with different models on 14 different classes of sentences (including
imperative sentence types like request, instruction and suggestion).
DEMERIT: They only chose standard response emails because these emails have well-structured
sentences and few grammatical errors, which eases the classification task.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
5 | 2017 | Deep learning for sentence classification | A. Hassan and A. Mahmood | IEEE (LISAT 2017)
MAJOR FINDINGS
METHODOLOGY: Recurrent NN
DATASET: IMDB Reviews, Stanford Sentiment Treebank
MERIT:
● The ensemble learning-based model can help make better predictions than a single model trained
independently.
● TCN (Temporal Convolutional Network) is an excellent alternative to recurrent architectures and has been
proven effective in classifying text data.
DEMERIT: The authors did not explore the classification of different types of sentences (like imperative,
question, corrective, and sentimental). The work is likewise limited to either sentiment analysis or question
classification.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
6 | 2019 | Twitter Sentiment Analysis using XGBoost and Logistic Regression: A Hybrid Approach | Ashwini M Joshi, Sameer Prabhune | International Journal of Computer Sciences and Engineering
MAJOR FINDINGS
METHODOLOGY: XGBoost and Logistic Regression
DATASET: Collection of Tweets
MERIT:
● Twitter data often contain noise in the form of misspellings, slang, abbreviations, and emoticons. Random
Forest can handle such noisy data effectively, as it averages the predictions of multiple decision trees,
reducing the impact of outliers and irrelevant features.
● XGBoost's efficiency allows it to handle large datasets with high dimensionality, making it suitable for
processing the massive volume of tweets generated daily.
DEMERIT: Logistic Regression has limited ability to capture complex non-linear relationships, while XGBoost's
demerits include potential overfitting and the need for careful hyperparameter tuning.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
7 | 2019 | Twitter Sentiment Analysis using XG Boost and Random Forest Classification Algorithms: A Hybrid Approach | Ashwini M. Joshi, Sameer Prabhune, Nesara B R | International Journal of Computer Sciences and Engineering
MAJOR FINDINGS
METHODOLOGY : XGBoost and Random Forest
DATASET : Collections of Tweets
MERIT :
XGBoost efficiency allows it to handle large datasets with high dimensionality, making it suitable for
processing the massive volume of tweets generated daily.
Twitter data often contain noise in the form of misspellings, slang, abbreviations, and emoticons. Random
Forest can handle such noisy data effectively, as it averages the predictions of multiple decision trees,
reducing the impact of outliers and irrelevant features.
DEMERIT: While Random Forest provides feature importance scores, the model's overall decision-making
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
8 | 2021 | A Pretrained YouTuber Embeddings for Improving Sentiment Classification of YouTube Comments | Ching-Wen Hsu, Hsuan Liu, and Jheng-Long Wu | ACL Anthology
MAJOR FINDINGS
METHODOLOGY: Machine learning-based models (Random Forest, XGBoost, and SVM) and deep
learning (BERT).
DATASET: YouTube Comments
MERIT: SVMs perform well in high-dimensional spaces, making them suitable for tasks with a large number of
features. BERT captures contextual information from text by considering the entire input sequence
bidirectionally, leading to better representations of word meanings.
DEMERIT: BERT does not present a better prediction score on sentiment polarity problems.
LITERATURE SURVEY
S.No. | Year | Title of the Paper | Author | Publisher
9 | 2023 | Machine learning-based approach for sentiment analysis from recorded social media movie reviews | Mustafa Abdalrassual Jassim, Dhafar Hamed Abd, Mohamed Nazih Omr | Research Square
MAJOR FINDINGS
METHODOLOGY: Machine learning; XGBoost; artificial neural network; stochastic gradient descent.
DATASET: IMDB Reviews
MERIT: XGBoost is a popular algorithm for gradient boosting, a technique used for improving the accuracy of
decision trees. XGBoost can handle both regression and classification problems. ANN is especially useful for
image and speech recognition, natural language processing, and other complex data processing tasks.
Stochastic gradient descent performs updates on the model parameters with small batches of data, rather than
using the entire dataset at once.
DEMERIT:
● XGBoost: Less interpretable due to its complex ensemble of decision trees.
● Artificial neural network: Prone to overfitting with large numbers of parameters and requires significant
computational resources for training.
● Stochastic gradient descent: May converge to suboptimal solutions due to its dependence on the learning rate,
and is sensitive to feature scaling.
PROPOSED SOLUTION
● The outlined process involves:
1. Preprocessing: Collect diverse YouTube comments and clean them for consistency.
2. Feature Extraction: Use TF-IDF to identify important words in comments and convert text to numerical form.
3. Sentence Vector Representation: Capture context and meaning using advanced techniques like Word2Vec.
4. Training Multiple Machine Learning Models: Train models (e.g., Random Forest, Naive Bayes, KNN) on
labeled data for sentiment classification.
5. Model Selection: Choose the best-performing model based on metrics like accuracy and F1-score.
6. Model Saving: Save the selected model for future use.
7. Web Application Development: Create a user-friendly web app with Streamlit to input comments and provide
real-time sentiment analysis results.
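To make the training step more concrete, here is a minimal standard-library sketch of one of the model families named above, a multinomial Naive Bayes classifier with Laplace smoothing. The toy comments, labels and function names are hypothetical and only illustrate the idea; the project itself would more plausibly rely on a library implementation and its full labeled dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data; not taken from the project's dataset.
TRAIN = [
    ("great video loved it", "positive"),
    ("really helpful tutorial thanks", "positive"),
    ("terrible audio waste of time", "negative"),
    ("bad explanation did not work", "negative"),
    ("video was uploaded on monday", "neutral"),
    ("this is part two of the series", "neutral"),
]

def train_naive_bayes(examples):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAIN)
print(predict(model, "great tutorial"))        # positive
print(predict(model, "terrible explanation"))  # negative
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.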
PROPOSED SOLUTION
FIG. 1: BLOCK DIAGRAM
DATASET
● The dataset integrates data from diverse YouTube
categories, comprising around 50,000 annotated
remarks gathered from various sources to ensure a
diverse representation of comment types and
classifications.
Class | Content
Positive | Valuation, Recognition
Negative | Receiving reprimands for not adhering to the instructions provided in the video
Neutral | Factual descriptions, or statements without emotional overtones
Table 6: Classes of comments with content type
Figure 2: Number of comments in each class
DATA PREPROCESSING
● The data pre-processing step handles the following factors that make the classification process
difficult:
(1) Non-standard language
(2) Spelling errors
(3) Unformatted texts
(4) Trivial comments
These issues are common on platforms like YouTube because of the informal nature of
communication. We addressed them using the following pre-processing steps:
• lowercasing
• removing URLs
• removing newline characters ("\n")
• removing punctuation
• removing integers
• removing emojis
• correcting spelling errors
• lemmatizing
• removing stopwords
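Several of the cleaning steps listed above can be sketched with the standard library alone. The version below is an illustration, not the project's actual code: the stopword list and emoji ranges are small assumptions chosen for the example, and spelling correction and lemmatization are left out because they need external language resources.

```python
import re
import string

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "this", "that", "to", "of", "and", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(comment: str) -> str:
    text = comment.lower()                      # lowercasing
    text = URL_RE.sub(" ", text)                # removing URLs
    text = text.replace("\n", " ")              # removing newline characters
    text = EMOJI_RE.sub(" ", text)              # removing emojis
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", " ", text)            # removing integers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopwords
    return " ".join(tokens)

print(preprocess("This video is GREAT!!! 😀\nWatch part 2 at https://example.com/abc"))
# video great watch part at
```

URLs are removed before punctuation so that the URL pattern can still match on "://" and dots.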
FEATURE EXTRACTION
● We selected well-known techniques for turning a corpus of text into numerical features: Word2Vec,
TF-IDF, and PCA, which reduces the dimensionality of vectorized features. Using these three methods
we can study the behaviour of different classification models.
1. Word2Vec:
● Word2Vec excels in sentiment analysis by capturing semantic relationships between words and
providing efficient vector representations, enhancing contextual understanding and improving
model performance in natural language processing tasks.
● Note that standard Word2Vec has no vector for out-of-vocabulary words, so such words must be
skipped or given a fallback vector when extracting features from text data.
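One common way to turn word vectors into a single sentence vector is to average them. The sketch below assumes this averaging scheme, which the slides do not specify, and uses a tiny hypothetical embedding table in place of a trained Word2Vec model.

```python
# Hypothetical 3-dimensional embeddings; a trained Word2Vec model would supply
# these vectors, usually with a few hundred dimensions.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "video": [0.1, 0.8, 0.1],
    "boring": [-0.8, 0.2, 0.1],
}

def sentence_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(sentence_vector(["great", "video", "unseenword"], EMBEDDINGS))
```

Every comment maps to a vector of the same length regardless of how many words it contains, which is what the downstream classifiers require.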
FEATURE EXTRACTION
2. TF-IDF
● It is advantageous in sentiment analysis for its ability to highlight the importance of
words by considering both their frequency and rarity across documents, enabling the
identification of key terms that contribute to sentiment.
● It effectively reduces the impact of common words, emphasizing the significance of
unique and relevant terms in characterizing the sentiment of text data.
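The weighting idea can be illustrated with a small from-scratch sketch. It assumes the plain definitions tf = count / document length and idf = log(N / df); library vectorizers often use smoothed or normalized variants, so their numbers will differ. The example documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["great video", "boring video", "great tutorial great examples"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", w)
```

In the second document, "boring" receives a higher weight than "video" because "video" also occurs in another document, while a term that appears in every document gets a weight of zero.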
3. PCA
● It is advantageous in sentiment analysis for reducing the dimensionality of feature
space, simplifying the representation of data while retaining its essential information.
● This leads to improved computational efficiency and a clearer understanding of the
underlying patterns in sentiment-related features.
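As an illustration of the dimensionality reduction, the sketch below estimates the first principal component of a small feature matrix using power iteration on the sample covariance matrix. The data and function names are hypothetical, and a library PCA implementation would normally be used instead; this version only shows the mechanics and needs at least two rows.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the top principal component with power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # starting guess; must not be orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # degenerate case: the guess carries no variance
        vec = [x / norm for x in nxt]
    return means, vec

def project(row, means, component):
    """Coordinate of one row along the principal component."""
    return sum((row[j] - means[j]) * component[j] for j in range(len(row)))

# Hypothetical 2-D features that vary mostly along the diagonal.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
means, pc1 = first_principal_component(data)
reduced = [project(r, means, pc1) for r in data]
print(pc1)
print(reduced)
```

Each two-dimensional row is replaced by a single coordinate along the direction of greatest variance, which is the sense in which PCA simplifies the representation while retaining most of the information.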
RESULT
●TF-IDF:
Performance: The models with the highest accuracy and F1 scores were
Random Forest and Linear SVC, demonstrating their resilience in identifying
patterns in TF-IDF vectors.
Additionally, logistic regression performed well, suggesting that the classes are
close to linearly separable in the TF-IDF feature space.
FIG. 3: PERFORMANCE MATRIX OF THE TWO BEST MODELS
RESULT
●Word2Vec:
Performance: Word2Vec features produced relatively lower accuracy and F1 scores
than the other feature extraction methods.
Among Word2Vec-based models, Logistic Regression performed the best,
followed by Random Forest.
FIG. 4: PERFORMANCE MATRIX OF THE TWO BEST MODELS
RESULT
●PCA:
Performance: With moderate to good accuracy and F1 scores, PCA's
performance was consistent across models.
Higher performance was shown by Random Forest and Naïve Bayes when
using PCA-based features.
FIG. 5: PERFORMANCE MATRIX OF THE TWO BEST MODELS
MODEL SELECTION
● TF-IDF performed best across models, indicating that it captures the information
most relevant to classification.
● Random Forest's performance was comparatively consistent when experimenting
with different feature extraction techniques, demonstrating its adaptability to a
range of data representation formats.
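Choosing a model by F1 score can be sketched as follows. The example assumes macro-averaged F1 (the slides only say "F1"), and the labels, predictions and model names are hypothetical placeholders rather than the project's results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical validation labels and predictions from two candidate models.
y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
predictions = {
    "model_a": ["pos", "neg", "neu", "pos", "neu", "neu"],
    "model_b": ["pos", "pos", "neu", "neg", "neg", "pos"],
}

scores = {name: macro_f1(y_true, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # model_a 0.822
```

Macro averaging gives each sentiment class equal weight, which matters when the classes are not equally represented in the dataset.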
DEVELOPMENT OF WEB APPLICATION
● The web application's sentiment analysis is built around the saved model.
● Users can enter their comments for sentiment analysis in this application.
● The saved machine learning model is incorporated into the web application to
provide real-time sentiment analysis of comments submitted by users.
● The result is then displayed clearly as positive, negative, or neutral.
Fig. 6, 7, 8: Website showing whether comments are positive, negative or neutral
CONCLUSION
● Categorizing comments efficiently helps creators work through different kinds of
comments. Distinguishing between positive and negative sentiments reflects audience
feelings, while question and suggestion categories help creators identify viewer
queries and improve content.
● This streamlined approach eliminates the need for producers to navigate lengthy
comment threads, enhancing interaction.
● TF-IDF proved the most effective feature extraction method and performed consistently
across models. The Random Forest model stayed consistent across feature extraction
techniques and was integrated into a web app for real-time sentiment analysis.
● This user-friendly tool extracts insights from YouTube comments, providing a
consolidated platform for enhanced accessibility and actionable insights.
REFERENCES
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors
for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150.
Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, and Jose San Pedro. 2010. How Useful Are Your Comments? Analyzing and
Predicting YouTube Comments and Comment Ratings. In Proceedings of the 19th International Conference on World Wide Web
(Raleigh, North Carolina, USA) (WWW ’10). Association for Computing Machinery, New York, NY, USA, 891–900.
[Link] 1772690.1772781
Anthony Khoo, Yuval Marom, and David Albrecht. 2006. Experiments with Sentence Classification. In Proceedings of the
Australasian Language Technology Workshop 2006. Sydney, Australia, 18–25. [Link] U06-1005
A. Hassan and A. Mahmood. 2017. Deep learning for sentence classification. In 2017 IEEE Long Island Systems, Applications and
Technology Conference (LISAT). 1–5.
A. N. Muhammad, S. Bukhori and P. Pandunata, "Sentiment Analysis of Positive and Negative of YouTube Comments Using Naïve
Bayes – Support Vector Machine (NBSVM) Classifier," 2019 International Conference on Computer Science, Information
Technology, and Electrical Engineering (ICOMITEE), Jember, Indonesia, 2019, pp. 199-205, doi:
10.1109/ICOMITEE.2019.8920923