
Fake News Detection Model Using Machine Techniques

This document presents a final year project by Ahmed Umar on developing a fake news detection model using machine learning techniques. The study aims to leverage various classification algorithms to effectively distinguish between real and fake news articles, utilizing a dataset from Kaggle. The results indicate that while all models performed well, the Decision Tree model achieved the highest accuracy, though it showed signs of overfitting, highlighting the need for further improvements in model robustness.


FAKE NEWS DETECTION MODEL USING MACHINE

LEARNING TECHNIQUES

BY

AHMED UMAR
(20/03/03/034)

DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE, FACULTY OF SCIENCE
BORNO STATE UNIVERSITY, MAIDUGURI

DECEMBER, 2024
FAKE NEWS DETECTION MODEL USING MACHINE

LEARNING TECHNIQUES

BY

AHMED UMAR

20/03/03/034

BEING A FINAL YEAR PROJECT SUBMITTED TO THE DEPARTMENT OF
MATHEMATICS AND COMPUTER SCIENCE, FACULTY OF SCIENCE, BORNO
STATE UNIVERSITY MAIDUGURI, BORNO STATE, IN PARTIAL FULFILMENT
OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF BACHELOR
OF SCIENCE IN COMPUTER SCIENCE

DECEMBER 2024

DECLARATION
I, Ahmed Umar, solemnly declare that I carried out the project titled "Fake News Detection Model
Using Machine Learning Techniques" under the supervision of Mal. Abdullahi Isa. This
project has not been submitted for any other degree elsewhere. I have acknowledged all the
sources I used by referencing them.

……………………………… ………………………………

Signature Date

………………………………. …………………………………

Signature. Date

CERTIFICATION
This is to certify that the final year project titled "Fake News Detection Model Using Machine
Learning Techniques" by Ahmed Umar (20/03/03/034), submitted to the Department of
Mathematics and Computer Science, Borno State University, Maiduguri, meets the regulations
governing the award of the degree of B.Sc. in Computer Science and is hereby approved,
having been completed under my close guidance and supervision. This work has not been
submitted elsewhere for the award of any other degree.

…………………………………………. ……………………………

Mal. Abdullahi Isa. Signature/Date

(Supervisor)

……….............................................. ..………………………….

Prof. Shu’aibu Garba Ngulde Signature/Date

(Head of Department)

---------------------------------------------------- -------------------------
Prof. P.B Zirra Signature/Date
(External Examiner)

DEDICATION
I dedicate this project to my beloved parents, Hon. Umar Yaro Bida and Hajja Aja Muhammad,
whose unwavering support, guidance, and love have been a constant source of inspiration and
motivation throughout my life and academic journey.

ACKNOWLEDGEMENT

In the Name of Allah, the Most Gracious, the Most Merciful. All praise is due to Almighty
Allah (Subhanahu wa Ta'ala), the Sustainer and Cherisher of the Universe, who has granted me
the strength, wisdom, and guidance to undertake and complete this project. Without His
boundless mercy and blessings, none of this would have been possible.

My profound appreciation goes to my project supervisor in the person of Mal. Abdullahi Isa
and coordinator Malm. Aishatu Ibrahim Birma for their invaluable guidance, patience and
advice. May Allah (SWT) reward them for their efforts in nurturing knowledge.

I extend my sincere gratitude to my beloved parents, whose prayers, sacrifices, and unwavering
support have been my greatest source of motivation. May Allah (SWT) reward them. To my
siblings Dr. Modu Buzu Umar, Lawan Mustapha Umar, Muhammad Aja Umar, Mohammed
Umar (Ibnul Qayyum), Maina Umar and Falmata Umar, thank you for being a source of
encouragement and joy throughout this journey. To my wonderful friends, I am deeply thankful
for their constant encouragement and companionship throughout this journey. To my mentors
Abubakar Musa Saulawa and Abubakar Sadiq, your wisdom and inspiration have been a
guiding light in my academic and personal growth. Lastly, I express my appreciation to my
lecturers, whose dedication and knowledge have shaped me into the person I am today. May
Allah (SWT) bless and reward you all. Ameen.

TABLE OF CONTENTS

DECLARATION
CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
TABLE OF CONTENTS
ABSTRACT

CHAPTER ONE
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
1.2 STATEMENT OF THE PROBLEM
1.3 AIM AND OBJECTIVES
1.4 SCOPE AND LIMITATIONS OF THE STUDY
1.5 SIGNIFICANCE OF THE STUDY
1.6 OPERATIONAL DEFINITION OF TERMS

CHAPTER TWO
LITERATURE REVIEW
2.1 INTRODUCTION
2.2 IMPORTANCE OF FAKE NEWS DETECTION
2.3 MACHINE LEARNING APPROACHES
2.4 RELATED WORK

CHAPTER THREE
METHODOLOGY
3.1 INTRODUCTION
3.2 DATASET DESCRIPTION
3.3 PROPOSED SYSTEM REQUIREMENT
3.3.1 HARDWARE REQUIREMENT
3.3.2 SOFTWARE REQUIREMENT
3.4 PROPOSED OVERVIEW MODEL AND METHODS
3.4.2 DATA DIVISION
3.4.3 FEATURE EXTRACTION
3.4.4 MODEL SELECTION AND TRAINING
3.4.5 MODEL EVALUATION

CHAPTER FOUR
RESULTS AND DISCUSSION
4.1 RESULTS AND DISCUSSION
4.2 CLASSIFICATION MODELS
4.2.1 LOGISTIC REGRESSION
4.2.2 RANDOM FOREST CLASSIFIER
4.2.3 DECISION TREE CLASSIFIER
4.3 COMPARISON OF THE ALGORITHMS
4.4 PERFORMANCE EVALUATION
4.4.1 PRECISION
4.4.2 RECALL
4.4.3 ACCURACY
4.4.4 F1-SCORE
4.5 REAL-WORLD TESTING

CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 SUMMARY
5.2 CONCLUSION
5.3 RECOMMENDATIONS FOR FUTURE WORKS

REFERENCES
APPENDIX A (SOURCE CODE)
APPENDIX B (OUTCOMES)

LIST OF TABLES
TABLE 1 DATASET DESCRIPTION
TABLE 2 COMPARISON OF ALGORITHMS USING DIFFERENT METRICS

LIST OF FIGURES
FIGURE 1 WORKING METHODOLOGY OF THE PROPOSED SYSTEM
FIGURE 2 CONFUSION MATRIX FOR LOGISTIC REGRESSION ALGORITHM
FIGURE 3 CONFUSION MATRIX FOR RANDOM FOREST CLASSIFIER
FIGURE 4 CONFUSION MATRIX FOR RANDOM FOREST CLASSIFIER
FIGURE 5 COMPARISON OF ALGORITHMS USING DIFFERENT METRICS
FIGURE 6 CONFUSION MATRIX
ABSTRACT

The rise of the digital age has facilitated the rapid dissemination of information, but it has also
amplified the spread of fake news, which can undermine societal trust, distort public opinion,
and lead to harmful consequences. Addressing this issue requires automated and efficient
detection systems. This study focuses on leveraging machine learning (ML) techniques to
develop a robust fake news detection model capable of accurately distinguishing between real
and fake news articles.
A publicly available dataset from Kaggle, comprising 44,921 labeled news articles, was used
for this study. Data preprocessing techniques, including removal of irrelevant characters,
punctuation, and stop words, as well as normalization of text, were applied to ensure data
quality. Features were extracted using term frequency-inverse document frequency (TF-IDF)
vectorization, which transformed textual data into numerical representations suitable for ML
analysis. The dataset was divided into training (80%) and testing (20%) subsets to evaluate
model performance.
Three ML classification algorithms (Logistic Regression, Decision Tree, and Random Forest)
were implemented and trained. Their performance was assessed using evaluation metrics such
as accuracy, precision, recall, F1-score, and confusion matrices. Results revealed that all
models achieved high accuracy, with the Decision Tree model achieving the highest at 99.96%.
However, the Decision Tree exhibited signs of overfitting, limiting its ability to generalize to
new, unseen data. Logistic Regression, with an accuracy of 98.74%, demonstrated balanced
performance across all metrics and outperformed other models in real-world testing scenarios.
The Random Forest model with an accuracy of 98.76% also performed well but faced
challenges in specific misclassification instances.
The study highlights the potential of machine learning techniques to combat the spread of fake
news effectively. While the selected algorithms showed promising results, further
improvements such as hyperparameter tuning, data augmentation, and the integration of
ensemble methods could enhance the robustness and reliability of these models. The findings
underscore the importance of using multiple performance metrics to evaluate model
effectiveness comprehensively.

CHAPTER ONE

INTRODUCTION

1.1 Background of the Study


The rise of the digital age has significantly changed how information is disseminated, allowing
for quick sharing across various platforms, including social media, blogs, and news websites.
While this accessibility has its benefits, it also creates considerable challenges, particularly the
spread of fake news—misleading or false information presented as legitimate news content.
The term "fake news" gained prominence during the 2016 U.S. presidential election, where
misinformation campaigns influenced public opinion and voter behavior (Ghafoor et al., 2022).

Fake news can take many forms, including entirely fabricated stories, sensationalized
headlines, and manipulated images or videos designed to mislead readers or generate clicks
(Zhang & Wang, 2021). The repercussions of fake news are extensive; they can undermine trust in
media institutions, distort public perception on critical issues such as health and politics, and
even incite violence or unrest within communities (Adebimpe et al., 2023). For instance, during the
COVID-19 pandemic, misinformation regarding the virus's origins and treatment options led
to public confusion and contributed to health crises worldwide. Given the speed at which
information spreads online, traditional fact-checking methods often prove inadequate in
effectively combating fake news (Jain A., 2019). Manual verification is time-consuming and
struggles to keep pace with the rapid dissemination of false information across social networks.
Consequently, there is an urgent need for automated solutions capable of identifying fake news
with high accuracy. Machine learning (ML) algorithms have emerged as powerful tools for
detecting fake news by analyzing textual content and identifying patterns that distinguish
credible sources from unreliable ones (Mahmud et al., 2021). ML techniques can process vast
amounts of data quickly and efficiently, making them well-suited for this task. Typically, the
process involves training machine learning models on large datasets containing examples of
both real and fake news articles (Ahmed, 2017). These models learn to recognize features
associated with misinformation—such as specific linguistic patterns, emotional language
usage, and source credibility—enabling them to classify new articles as either true or false.
Natural language processing (NLP) plays a crucial role in this context by transforming textual
data into numerical representations that machine learning algorithms can interpret (Norman,
2023). Techniques such as tokenization, stemming, lemmatization, and vectorization are
employed to prepare the data for analysis. Once the data is pre-processed, various ML
models, such as Naïve Bayes, Logistic Regression, Random Forests, and more advanced deep
learning architectures like Convolutional Neural Networks (CNNs) and Long Short-Term
Memory (LSTM) networks, can be trained effectively to detect fake news. The effectiveness
of these models depends on several factors: the quality of the training data, the choice of
features used for classification, and the specific algorithms employed (Thaher et al., 2021). Recent
advancements in deep learning have shown promising results in improving detection accuracy
by capturing complex patterns within text data that traditional methods may overlook (Khan et al.,
2022).

1.2 Statement of the Problem


Fake news spreads rapidly online and causes real-world problems, so better tools are needed
to stop it. Many machine learning algorithms (Random Forest, Support Vector Machine,
Decision Tree, etc.) can be used to spot fake news, but it is not known which one performs
best. To address the issue of fake news, this project utilizes machine learning techniques,
specifically classification algorithms, and compares their performance.

1.3 Aim and Objectives


This study aims to utilize machine learning techniques to develop an effective fake news
detection model. The objectives of this study include:

i. Select and apply optimal machine learning algorithms for fake news detection.
ii. Implement and train the selected algorithms on a relevant dataset.
iii. Evaluate the performance of the implemented algorithms.

1.4 Scope and Limitations of the Study


This study focuses on developing a fake news detection model using machine learning
techniques. Specifically, it explores the application of classification algorithms, namely
Logistic Regression, Random Forest, and Decision Tree, to distinguish between real and fake
news articles using a publicly available dataset from Kaggle, comprising a diverse range of real
and fake news articles. The study is limited to the aforementioned three classification
algorithms for fake news detection. While machine learning offers a wide range of algorithms
for this task, this project concentrates on the performance of these selected algorithms,
evaluated using metrics such as accuracy, precision, recall, F1-score, and confusion matrix.

1.5 Significance of the Study


This project is crucial in combating the growing problem of fake news, which can have serious
consequences for individuals and society. By identifying effective machine learning techniques
to detect misinformation, this study will empower people to make informed decisions and
protect themselves from being misled. This study can inform policymakers, social media
platforms, and individuals about the best strategies to combat fake news, ultimately
contributing to a more trustworthy and reliable digital landscape.

1.6 Operational Definition of terms


Fake News: News messages that contain wrong or false information but do not disclose that
the information is incorrect.
Machine Learning: A subfield of artificial intelligence (AI) that involves training computer
systems to learn from data without being explicitly programmed.
News article: A unit of text content that can be classified as real or fake.
Label: A binary variable indicating whether a news article is real (1) or fake (0).
Dataset: A collection of news articles with corresponding labels.
Algorithm: A set of rules or instructions used to process data and make predictions.
Model: A trained instance of an algorithm that can be used to make predictions on new data.
Training set: A subset of the dataset used to train the model.
Testing set: A subset of the dataset used to evaluate the model's performance.
Accuracy: The proportion of all predictions that are correct.
Precision: The proportion of predicted positive cases that are actually positive.
Recall: The proportion of actual positive cases that are correctly identified.
F1-score: The harmonic mean of precision and recall.

CHAPTER TWO

LITERATURE REVIEW

2.1 Introduction
Fake news encompasses various forms of misinformation that can mislead readers or
manipulate public opinion (Zhang & Wang, 2021). This includes entirely fabricated stories
designed to deceive readers into believing false narratives or sensationalized headlines that
exaggerate facts to attract attention (Ghafoor et al., 2022). The rise of social media platforms
has facilitated the rapid spread of such misinformation; users can share content without
verifying its authenticity.

2.2 Importance of Fake News Detection


The ability to accurately identify fake news is crucial for maintaining an informed citizenry
and fostering trust in media institutions (Adebimpe et al., 2023). Misinformation can lead to
harmful consequences; for example, misleading information about vaccines during health
crises can result in public hesitancy toward vaccination efforts (Khan et al., 2021). Moreover,
political misinformation can influence elections and policy decisions by distorting public
perceptions.

2.3 Machine Learning Approaches


Numerous studies have investigated ML techniques for detecting fake news:

i. Naïve Bayes: This probabilistic classifier is favored for its simplicity and effectiveness
in text classification tasks (Rashid & Khan, 2020). It applies Bayes' theorem to estimate
the likelihood that a given article belongs to a specific category based on its features.

ii. Support Vector Machines (SVM): SVM has demonstrated high accuracy in
differentiating between real and fake news articles by identifying the optimal
hyperplane that separates different classes (Zhang & Wang, 2021). It is particularly
effective in high-dimensional spaces typical of text data.

iii. Deep Learning: Neural networks such as recurrent neural networks (RNNs) and
convolutional neural networks (CNNs) have been utilized due to their ability to capture
complex patterns within data (Khan et al., 2022). These models can develop
hierarchical representations of text data that enhance classification performance.

iv. Ensemble Methods: Techniques like Random Forests combine multiple classifiers to
improve accuracy and robustness (Ahmed & Mahmood, 2020). By aggregating
predictions from various models, ensemble methods help reduce overfitting and
improve generalization.

2.4 Related Work


The domain of fake news detection through machine learning has received significant
contributions from researchers striving to create effective strategies for addressing
misinformation online. Ghafoor et al. (2022) performed an extensive analysis comparing
different machine learning techniques for identifying fake news on social media platforms.
Their research highlighted the value of hybrid approaches that integrate multiple algorithms to
improve detection accuracy beyond what individual models could achieve.

In another important study, Khan et al. (2021) established a benchmark comparing various
machine learning models for online fake news detection across diverse datasets. This study
revealed performance differences among models like SVMs and deep learning techniques such
as CNNs and LSTMs when applied to various types of misinformation.

Thaher et al. (2021) investigated intelligent detection methods specifically aimed at false
information in Arabic tweets, using hybrid feature selection techniques combined with machine
learning models. Their findings indicated that language-specific nuances could be effectively
captured through customized algorithms designed for particular linguistic contexts.

In a practical, application-focused study, Adebimpe et al. (2023) implemented a Long
Short-Term Memory (LSTM) model for detecting fake news within Nigerian social media
contexts, using indigenous datasets sourced from local newspapers alongside Kaggle datasets.
Their results showed that LSTM outperformed traditional machine learning models like SVMs
in terms of accuracy, achieving an average detection rate exceeding 92%.

Furthermore, recent developments have seen researchers exploring deep learning frameworks
that combine CNNs with boosted trees to enhance feature extraction when analysing textual
data for fake news detection (ProjectPro, 2024). This hybrid strategy allows for more robust
classification outcomes by leveraging strengths from both convolutional architectures and
decision tree methodologies.

Another emerging area involves utilizing network analysis techniques alongside NLP methods.
Javatpoint (2024) emphasizes how examining networks of social media accounts disseminating
specific pieces of information can uncover patterns indicative of coordinated disinformation
campaigns often associated with fake news propagation.

Additionally, fact-checking databases play a crucial role in enhancing machine learning
capabilities. Algorithms can cross-reference claims made within articles against verified facts
stored in these databases. This approach enables systems not only to classify content but also
to assess its credibility through factual accuracy checks conducted automatically during
processing.

CHAPTER THREE

METHODOLOGY

3.1 INTRODUCTION
This chapter presents the project design, research approach, and methodology chosen for this
study, and covers the operational framework as a whole. It describes the dataset chosen from
Kaggle, the preprocessing steps, the models utilized to meet the study objectives, model
training and testing, data analysis, and performance evaluation.

3.2 DATASET DESCRIPTION

The dataset utilized in this study is referred to as "real and fake news" and was obtained
from Kaggle. All the required Python libraries (NumPy, pandas, Matplotlib, re) were imported.
This dataset was taken from Kaggle in December 2017, with a total of 44,921 records and
three attributes: title, text, and label. The table below shows sample records from the dataset.
Table 1 Dataset Description

S/NO | TITLE | TEXT | LABEL
1 | University Of Texas Police Kick Humiliated Nazis Off Campus Before They Could Start Trouble | This is how police should deal with Nazis and white supremacists. The tiki torches were doused early at the University… | Fake
2 | Facebook Just Confirmed Russia Spent At Least $100k On Ads to Influence Election | question of how big a role social media played in the dissemination of fake… | Fake
3 | Saudi police release teenager detained for dancing in street | DUBAI (Reuters) - A 14-year-old boy who was detained by Saudi police for dancing to the song… | Real
4 | Silicon Valley blasts Senate proposal to tax startup options | would mean is every month, when your equity compensation vests… | Real
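As a minimal sketch of the structure described above, the snippet below builds a small pandas DataFrame from the four sample rows in Table 1 (text truncated as in the table) and encodes the label as real = 1 and fake = 0, following the definition in Section 1.6. The variable names and the `label_id` column are illustrative assumptions, not the project's actual loading code; the full study would read the Kaggle file instead.

```python
import pandas as pd

# Sample rows copied from Table 1 (article text truncated as in the table).
# Assumption: the real project would load the full Kaggle file rather than
# typing rows by hand; this only illustrates the three attributes.
rows = [
    ("University Of Texas Police Kick Humiliated Nazis Off Campus Before They Could Start Trouble",
     "This is how police should deal with Nazis and white supremacists.", "Fake"),
    ("Facebook Just Confirmed Russia Spent At Least $100k On Ads to Influence Election",
     "question of how big a role social media played in the dissemination of fake", "Fake"),
    ("Saudi police release teenager detained for dancing in street",
     "DUBAI (Reuters) - A 14-year-old boy who was detained by Saudi police", "Real"),
    ("Silicon Valley blasts Senate proposal to tax startup options",
     "would mean is every month, when your equity compensation vests", "Real"),
]
news = pd.DataFrame(rows, columns=["title", "text", "label"])

# Encode the label as a binary variable: real = 1, fake = 0 (Section 1.6).
news["label_id"] = news["label"].map({"Real": 1, "Fake": 0})

print(news.shape)                    # number of rows and columns
print(news["label"].value_counts())  # class balance
```

Checking the class balance in this way is a common first step, since a heavily imbalanced dataset would make accuracy alone a misleading metric.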

3.3 PROPOSED SYSTEM REQUIREMENT
3.3.1 Hardware Requirement
To ensure the successful completion of this project with improved performance, the necessary
specifications include:
i. Processor: Intel with a speed of at least 1.80GHz
ii. Memory: 4 GB RAM
iii. Disk space: 500 GB

3.3.2 Software requirement


The software support for the design of the proposed system involves:
Operating System: Windows or Linux (e.g., Ubuntu).
Platform: Jupyter Notebook or Google Colab.

3.4 PROPOSED OVERVIEW MODEL AND METHODS

[Figure 1: flowchart of the proposed pipeline. Stages, in order: Dataset (fake and real news, from Kaggle) -> Preprocessing (remove irrelevant characters) -> Division of dataset (training 80%, testing 20%) -> Feature extraction (TF-IDF vectorization) -> Model training (LR, DT, RF) -> Evaluation (accuracy, recall, F1, precision)]

Figure 1 Working Methodology of the proposed system
3.4.1 DATA PREPROCESSING

The dataset consists mainly of text, along with additional information that this study does
not require. We remove the irrelevant information and clean the dataset through the
following steps.

Special characters: These lack specific meanings and can interfere with the analysis process.
To avoid this, we remove them.

Uppercase to lowercase: Computers treat uppercase and lowercase characters as distinct. To
guarantee consistency and prevent issues with predictions, we convert all the text to
lowercase.

Punctuation marks: Punctuation marks such as question marks, colons, commas, and
exclamation points are present in all documents. We remove them, as is common practice in
text preprocessing.

Stop words: These are a set of common words in a particular language (for example, "the",
"is", and "and" in English). We remove them from the text.
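The cleaning steps above can be sketched as a single function. This is an illustrative version using only the Python standard library; the function name `clean_text` and the short `STOP_WORDS` set are assumptions made for the example, and a real run would normally use a fuller stop-word list (for instance from an NLP library).

```python
import re
import string

# Assumption: a deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {
    "a", "an", "and", "the", "is", "are", "was", "were", "of", "to",
    "in", "on", "at", "for", "with", "this", "that", "how",
}

# Translation table that maps every punctuation mark to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and special characters, drop stop words."""
    text = text.lower()                       # uppercase to lowercase
    text = text.translate(_PUNCT_TO_SPACE)    # punctuation marks
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remaining special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    return " ".join(tokens)


sample = "This is how police should deal with Nazis and white supremacists!"
print(clean_text(sample))
```

Replacing punctuation with spaces rather than deleting it keeps hyphenated tokens such as "14-year-old" from collapsing into one word.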

3.4.2 DATA DIVISION


Our dataset comprises 44,921 entries containing both real and fake news articles. The dataset
is split into 80% for training and 20% for testing, so that the models are evaluated on
articles they did not see during training.
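A minimal sketch of an 80/20 split using scikit-learn's `train_test_split` is shown below on toy data. The `stratify` and `random_state` arguments are assumptions added for the example (they keep the class ratio equal in both subsets and make the split repeatable); the source does not state whether the project used them. Applied to 44,921 articles, an 80/20 split gives roughly 35,900 training and 9,000 testing articles.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned articles and their labels (1 = real, 0 = fake).
texts = [f"article {i}" for i in range(10)]
labels = [0, 1] * 5

# test_size=0.2 reserves 20% of the data for testing.
# Assumption: stratify and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```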

3.4.3 FEATURE EXTRACTION


Feature extraction converts textual data into numerical formats that machine learning
algorithms can analyse. In this study, features are extracted using term frequency-inverse
document frequency (TF-IDF) vectorization, which transforms each article into a numerical
vector suitable for the classification models. Training on a large dataset of labelled
examples drawn from multiple sources exposes the models to the varying writing styles
encountered online, on social media platforms as well as in traditional media outlets, which
is intended to improve the robustness of the resulting classifiers.

3.4.4 Model Selection and Training
Various models can be trained on the extracted features:

i. Logistic Regression: A simple yet effective model for binary classification tasks.

ii. Random Forest: An ensemble method that reduces overfitting by averaging multiple
decision trees.

iii. Deep Learning Models: Implement RNNs or CNNs for capturing sequential
dependencies in text data.

Training involves splitting the dataset into training and testing subsets, typically using an
80/20 split. Each model is fitted on the training subset and then evaluated on the held-out
testing subset.
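
The training step can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: the four toy sentences and their labels are made up for illustration, and chaining the vectorizer and classifier with `make_pipeline` is a convenience chosen here, not necessarily how the project code is organised.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up examples purely for illustration (1 = real, 0 = fake)
texts = [
    "government publishes annual budget report",
    "central bank holds interest rates steady",
    "aliens secretly control the stock market",
    "miracle pill cures every disease overnight",
]
labels = [1, 1, 0, 0]

# Vectorize the text and fit the classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify an unseen sentence
prediction = model.predict(["bank publishes budget report"])
print(prediction)
```

Swapping `LogisticRegression()` for another scikit-learn classifier such as `RandomForestClassifier()` leaves the rest of the sketch unchanged.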

3.4.5 MODEL EVALUATION


Model performance is evaluated using several metrics:

• Accuracy: The proportion of correctly classified instances.

• Precision: The ratio of true positive predictions to the total predicted positives.

• Recall: The ratio of true positive predictions to the total actual positives.

• F1 Score: The harmonic mean of precision and recall, providing a balance between the
two metrics.

CHAPTER FOUR

RESULTS AND DISCUSSION


The aim of this study was to evaluate the effectiveness of Decision Tree, Random Forest, and
Logistic Regression algorithms in detecting fake news. In this chapter we will discuss the
results of our experiment.

4.1 RESULTS AND DISCUSSION


Python, Jupyter Notebook, and supporting libraries were used for data cleansing,
visualization, pre-processing, and machine learning modelling in this project. The results of
this study demonstrate the effectiveness of machine learning techniques in detecting fake news
articles. The performance of Logistic Regression (LR), Decision Tree (DT), and Random Forest
Classifier (RFC) was evaluated using accuracy, precision, recall, F1-score and the
confusion matrix.

4.2 CLASSIFICATION MODELS


Classification is a type of machine learning task in which a model predicts a category or
label for a given instance. In the context of fake news detection, classification models are
employed to predict whether a news article is fake or real. I evaluated the results of three
classification models: Decision Tree, Random Forest, and Logistic Regression. The figures
below show the confusion matrix of each model.

4.2.1 LOGISTIC REGRESSION


The Logistic Regression algorithm achieved an accuracy of 98.74%, indicating its ability to
accurately classify fake and real news articles. The algorithm demonstrated high precision and
recall scores for both fake and real news articles. Precision refers to the proportion of true
positives among all predicted positive instances, while recall refers to the proportion of true
positives among all actual positive instances. The Logistic Regression algorithm’s high
precision and recall scores indicate that it is effective in detecting both fake and real news
articles. The F1-score for the Logistic Regression algorithm was also high, indicating that it
achieved a good balance between precision and recall.

Figure 2 Confusion Matrix for Logistic Regression Algorithm

4.2.2 RANDOM FOREST CLASSIFIER


The Random Forest Classifier achieved an accuracy of 98.76%, indicating its ability to
accurately classify fake and real news articles. The algorithm demonstrated high precision and
recall scores for both fake and real news articles. However, the Random Forest Classifier’s
performance was not as strong as that of the Logistic Regression algorithm in the real-world
testing described in Section 4.5.

Figure 3 Confusion Matrix for Random Forest Classifier

4.2.3 DECISION TREE CLASSIFIER
The Decision Tree Classifier achieved an impressive accuracy of 99.96%, indicating its ability
to accurately classify fake and real news articles. However, a closer examination of the
confusion matrix revealed that the algorithm may be prone to overfitting. Overfitting occurs
when a model is too complex and learns the noise in the training data, resulting in poor
generalization to new, unseen data. The Decision Tree Classifier’s high accuracy score may be
due to its ability to fit the training data closely, but this may not translate to good
performance on new, unseen data.

Figure 4 Confusion Matrix for Decision Tree Classifier
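
One common way to check for the overfitting described above is to compare a model's accuracy on the data it was trained on with its accuracy on held-out data. The sketch below does this for an unconstrained decision tree. It is an illustration only: it uses synthetic data from scikit-learn's `make_classification` rather than the news dataset, and the sample size, noise level and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree can keep splitting until it memorizes the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers suggests that the tree has fitted noise in the training data. Limiting the tree with parameters such as `max_depth` or `min_samples_leaf` is a usual first remedy.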

4.3 COMPARISON OF THE ALGORITHMS
Table 2 Comparison of algorithms using different metrics

Algorithms            Accuracy   Precision   Recall   F1-score
Logistic Regression   98.74%     99%         99%      99%
Random Forest         98.76%     99%         99%      99%
Decision Tree         99.96%     100%        100%     100%

Figure 5 Comparison of Algorithms using different metrics

4.4 PERFORMANCE EVALUATION


The performance of the three machine learning models, Decision Tree, Logistic Regression
and Random Forest, was evaluated using a comprehensive set of performance metrics,
including accuracy, precision, recall, F1-score and the confusion matrix. The purpose of this
evaluation was to determine which model performed best in detecting fake news articles. The
results showed that all three models achieved high accuracy scores, indicating their potential
effectiveness in fake news detection tasks. The value of each measure is obtained from the
Confusion Matrix (CM). The CM is widely used in classification problems and can be applied to
binary as well as multiclass classification. A sample CM is presented in Figure 6, where each
cell shows the value of TP, FP, FN, or TN. TP (True Positive) is the number of positive
examples classified correctly. FP (False Positive) is the number of actual negative examples
classified as positive. FN (False Negative) is the number of actual positive examples
classified as negative. TN (True Negative) is the number of negative examples classified
correctly.

Figure 6 Confusion Matrix
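
The four cells of the confusion matrix can be counted directly from the true and predicted labels. The sketch below is a minimal standard-library illustration; the function name `confusion_counts` and the two short label lists are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # positive example classified as positive
        elif predicted == positive:
            fp += 1  # negative example classified as positive
        elif actual == positive:
            fn += 1  # positive example classified as negative
        else:
            tn += 1  # negative example classified as negative
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
# prints: (3, 1, 1, 3)
```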

4.4.1 PRECISION

Precision is the ratio of correctly predicted positive observations to all predicted positive
observations. It is also known as positive predictive value and is the fraction of relevant
instances among the retrieved instances. Equation 1 shows the formula for precision.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

4.4.2 RECALL

Recall, also known as sensitivity, hit rate, or the true positive rate (TPR), is the proportion
of all relevant instances that were actually retrieved. It is defined as the number of true
positives divided by the number of true positives plus the number of false negatives.
Equation 2 shows the formula for recall.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

4.4.3 ACCURACY

Accuracy is one metric for evaluating classification models. Informally, it is the fraction of
predictions that the model got right, that is, the proportion of all instances that the model
classified correctly. Equation 3 shows the formula for accuracy.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$

4.4.4 F1-SCORE

The F1 score, also known as the F-measure or F-score, takes both precision and recall into
consideration when measuring the performance of an algorithm. Mathematically, it is the
harmonic mean of precision and recall. It has a maximum value of 1 (perfect precision and
recall) and a minimum of 0, and it is commonly used to evaluate binary classification systems.
Equation 4 shows the formula for the F1 score.

$$\text{F-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
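
Equations 1 to 4 can be checked with a small worked example. The sketch below computes all four metrics from confusion-matrix counts. The function name `classification_metrics` and the counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Fall back to 0.0 when a denominator is zero (an assumed convention)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Made-up counts: 200 examples in total
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# prints: accuracy=0.800 precision=0.900 recall=0.750 f1=0.818
```

In this example precision (0.900) and recall (0.750) differ, and the F1 score (0.818) lies between them but closer to the lower value, which is the behaviour expected of a harmonic mean.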

4.5 REAL-WORLD TESTING

To evaluate the performance of the models in real-world scenarios, manual testing was
conducted using news articles from reputable websites such as the BBC (British Broadcasting
Corporation), Daily Trust and other news sources. The results of the real-world testing
revealed that the Decision Tree Classifier misclassified real news articles as fake news, and
fake news articles as real news. Similarly, the Random Forest Classifier also misclassified
real news articles as fake news. In contrast, the Logistic Regression algorithm performed
relatively well in the real-world testing, with fewer misclassifications.

CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATION

5.1 SUMMARY
Fake news detection is a critical task in today’s information-rich environment. This study evaluated the
performance of three machine learning models, Decision Tree, Logistic Regression, and Random Forest,
in detecting fake news articles. To evaluate the models, we utilized performance metrics such as
accuracy, precision, recall, F1-score and confusion matrices. The results showed that all three models
achieved high accuracy scores, but the Decision Tree algorithm was prone to overfitting. Real-world
testing revealed that the Decision Tree and Random Forest algorithms misclassified real news articles
as fake news.

5.2 CONCLUSION
The study concludes that while machine learning models can be effective in detecting fake
news articles, they require careful evaluation and tuning to ensure that they generalize well to
new, unseen data. The findings highlight the importance of considering multiple performance
metrics, including accuracy, precision, recall and F1-score, when evaluating machine learning
models for fake news detection tasks.

5.3 RECOMMENDATIONS FOR FUTURE WORKS


➢ Model Selection: The Logistic Regression algorithm is recommended for fake news
detection tasks due to its balanced performance and ability to handle high-dimensional
data.
➢ Hyperparameter Tuning: Further hyperparameter tuning should be performed to
optimize the performance of the selected model.
➢ Data Augmentation: Data augmentation techniques should be explored to increase the
size and diversity of the training dataset.
➢ Ensemble Methods: Ensemble methods, such as bagging and boosting, should be
explored to improve the performance of the model.
➢ Real-world Testing: The selected model should be tested on real-world data to evaluate
its performance in practical scenarios.


APPENDIX A
(source code)

IMPORTING LIBRARIES

import pandas as pd

import numpy as np

fake = pd.read_csv('[Link]')

true= pd.read_csv('[Link]')

[Link]()

[Link]()

true['label'] = 1

fake['label'] = 0

[Link]()

[Link]()

news = pd.concat([fake, true], axis=0)

[Link]()

[Link]()

[Link]().sum()

news = news.drop(['title', 'subject', 'date'], axis=1)

[Link]()

news = news.sample(frac=1)

[Link]()

news.reset_index(inplace=True)

[Link]()

news.drop(['index'], axis=1, inplace=True)

[Link]()

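import re

def wordopt(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags (done before punctuation removal, which would strip '<' and '>')
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d', '', text)
    # Remove newlines
    text = re.sub(r'\n', '', text)
    return text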

news['text'] = news['text'].apply(wordopt)

news['text']

x = news['text']

y = news['label']

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

x_train.shape

x_test.shape

FEATURE EXTRACTION

from sklearn.feature_extraction.text import TfidfVectorizer


vectorization = TfidfVectorizer()

xv_train = vectorization.fit_transform(x_train)

xv_test = vectorization.transform(x_test)

xv_train

xv_test

MODEL TRAINING

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Logistic Regression
LR = LogisticRegression()
LR.fit(xv_train, y_train)
pred_lr = LR.predict(xv_test)
LR.score(xv_test, y_test)
print(classification_report(y_test, pred_lr))

# Decision Tree
DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)
pred_dt = DT.predict(xv_test)
DT.score(xv_test, y_test)
print(classification_report(y_test, pred_dt))

# Random Forest
RFC = RandomForestClassifier(random_state=0)
RFC.fit(xv_train, y_train)
pred_rfc = RFC.predict(xv_test)
RFC.score(xv_test, y_test)
print(classification_report(y_test, pred_rfc))

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion matrix for Logistic Regression
cm = confusion_matrix(y_test, LR.predict(xv_test))
plt.figure(figsize=(7,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', xticklabels=['Fake', 'Real'],
            yticklabels=['Fake', 'Real'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Logistic Regression Model')
plt.show()

# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, RFC.predict(xv_test))
plt.figure(figsize=(7,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Random Forest Model')
plt.show()

# Confusion matrix for Decision Tree
cm = confusion_matrix(y_test, DT.predict(xv_test))
plt.figure(figsize=(7,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.xlabel('Predicted values')
plt.ylabel('True Labels')
plt.title('Confusion Matrix for Decision Tree')
plt.show()

import matplotlib.pyplot as plt
import numpy as np

# Scores from Table 2 (percentages)
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
LR_scores = [98.74, 99, 99, 99]
DTC_scores = [99.96, 100, 100, 100]
RFC_scores = [98.76, 99, 99, 99]

# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 7))

# Plot the bars
width = 0.30
x = np.arange(len(metrics))
ax.bar(x, LR_scores, width, label='LR', color='red')
ax.bar(x + width, DTC_scores, width, label='DTC', color='green')
ax.bar(x + 2 * width, RFC_scores, width, label='RFC')

# Add percentage labels on top of the bars
for i in range(len(metrics)):
    ax.text(x[i], LR_scores[i] + 2, f'{LR_scores[i]}%', ha='center')
    ax.text(x[i] + width, DTC_scores[i] + 2, f'{DTC_scores[i]}%', ha='center')
    ax.text(x[i] + 2 * width, RFC_scores[i] + 2, f'{RFC_scores[i]}%', ha='center')

# Set labels and title
ax.set_xticks(x + width)
ax.set_xticklabels(metrics)
ax.set_ylabel('Score')
ax.set_title('Algorithms Performance Comparison')
ax.legend()

# Show the plot
plt.show()

def output_lable(n):
    # Map the numeric class label to a readable name
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Real News"

import pandas as pd

def manual_testing(news):
    # Wrap the article in a DataFrame and apply the same cleaning used for training
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    # Reuse the fitted TF-IDF vectorizer, then predict with each model
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_RFC = RFC.predict(new_xv_test)
    print("\n\nLR Prediction: {} \nDT Prediction: {} \nRFC Prediction: {}".format(
        output_lable(pred_LR[0]),
        output_lable(pred_DT[0]),
        output_lable(pred_RFC[0])))

news = str(input())
manual_testing(news)

APPENDIX B

(Outcomes)
