FAKE NEWS DETECTION MODEL USING MACHINE
LEARNING TECHNIQUES
BY
AHMED UMAR
20/03/03/034
DECEMBER 2024
DECLARATION
I, Ahmed Umar, hereby declare that the project titled "Fake News Detection Model
Using Machine Learning Techniques" was carried out by me under the supervision of Mal. Abdullahi Isa. This
project has not been submitted for the award of any other degree elsewhere, and all the
sources used have been duly acknowledged by references.
……………………………… ………………………………
Signature Date
………………………………. …………………………………
Signature. Date
CERTIFICATION
This is to certify that Ahmed Umar (20/03/03/034) has completed his final year project titled
"Fake News Detection Model Using Machine Learning Techniques", submitted to the
Department of Mathematics and Computer Science, Borno State University, Maiduguri. It meets
the regulations governing the award of the degree of [Link] in Computer Science and is hereby
approved under my close guidance and supervision. This work has not been submitted
elsewhere for the award of any other degree.
…………………………………………. ……………………………
(Supervisor)
……….............................................. ..………………………….
(Head of Department)
---------------------------------------------------- -------------------------
Prof. P.B Zirra Signature/Date
(External Examiner)
DEDICATION
I dedicate this project to my beloved parents, Hon. Umar Yaro Bida and Hajja Aja Muhammad,
whose unwavering support, guidance, and love have been a constant source of inspiration and
motivation throughout my life and academic journey.
ACKNOWLEDGEMENT
In the Name of Allah, the Most Gracious, the Most Merciful. All praise is due to Almighty
Allah (Subhanahu wa Ta’ala), the Sustainer and Cherisher of the Universe, who has granted me
the strength, wisdom, and guidance to undertake and complete this project. Without His
boundless mercy and blessings, none of this would have been possible.
My profound appreciation goes to my project supervisor in the person of Mal. Abdullahi Isa
and coordinator Malm. Aishatu Ibrahim Birma for their invaluable guidance, patience and
advice. May Allah (SWT) reward them for their efforts in nurturing knowledge.
I extend my sincere gratitude to my beloved parents, whose prayers, sacrifices, and unwavering
support have been my greatest source of motivation. May Allah (SWT) reward them. To my
siblings Dr. Modu Buzu Umar, Lawan Mustapha Umar, Muhammad Aja Umar, Mohammed
Umar (Ibnul Qayyum), Maina Umar and Falmata Umar, thank you for being a source of
encouragement and joy throughout this journey. To my wonderful friends, I am deeply thankful
for your constant encouragement and companionship throughout this journey. To my mentors,
Abubakar Musa Saulawa and Abubakar Sadiq, your wisdom and inspiration have been a
guiding light in my academic and personal growth. Lastly, I express my appreciation to my
lecturers, whose dedication and knowledge have shaped me into the person I am today. May
Allah (SWT) bless and reward you all. Ameen
TABLE OF CONTENTS
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
1.2 STATEMENT OF THE PROBLEM
1.3 AIM AND OBJECTIVES
1.4 SCOPE AND LIMITATIONS OF THE STUDY
1.5 SIGNIFICANCE OF THE STUDY
1.6 OPERATIONAL DEFINITION OF TERMS
METHODOLOGY
3.1 INTRODUCTION
3.2 DATASET DESCRIPTION
3.3 PROPOSED SYSTEM REQUIREMENT
3.3.1 HARDWARE REQUIREMENT
3.3.2 SOFTWARE REQUIREMENT
3.4 PROPOSED OVERVIEW MODEL AND METHODS
3.4.2 DATA DIVISION
3.4.3 FEATURE EXTRACTION
3.4.4 MODEL SELECTION AND TRAINING
3.4.5 MODEL EVALUATION
CHAPTER FOUR
REFERENCES
APPENDIX B (Outcomes)
LIST OF TABLES
Table 1: Dataset Description
Table 2: Comparison of Algorithms Using Different Metrics
LIST OF FIGURES
ABSTRACT
The rise of the digital age has facilitated the rapid dissemination of information, but it has also
amplified the spread of fake news, which can undermine societal trust, distort public opinion,
and lead to harmful consequences. Addressing this issue requires automated and efficient
detection systems. This study focuses on leveraging machine learning (ML) techniques to
develop a robust fake news detection model capable of accurately distinguishing between real
and fake news articles.
A publicly available dataset from Kaggle, comprising 44,921 labeled news articles, was used
for this study. Data preprocessing techniques, including removal of irrelevant characters,
punctuation, and stop words, as well as normalization of text, were applied to ensure data
quality. Features were extracted using term frequency-inverse document frequency (TF-IDF)
vectorization, which transformed textual data into numerical representations suitable for ML
analysis. The dataset was divided into training (80%) and testing (20%) subsets to evaluate
model performance.
Three ML classification algorithms (Logistic Regression, Decision Tree, and Random Forest)
were implemented and trained. Their performance was assessed using evaluation metrics such
as accuracy, precision, recall, F1-score, and confusion matrices. Results revealed that all
models achieved high accuracy, with the Decision Tree model achieving the highest at 99.96%.
However, the Decision Tree exhibited signs of overfitting, limiting its ability to generalize to
new, unseen data. Logistic Regression, with an accuracy of 98.74%, demonstrated balanced
performance across all metrics and outperformed other models in real-world testing scenarios.
The Random Forest model with an accuracy of 98.76% also performed well but faced
challenges in specific misclassification instances.
The study highlights the potential of machine learning techniques to combat the spread of fake
news effectively. While the selected algorithms showed promising results, further
improvements such as hyperparameter tuning, data augmentation, and the integration of
ensemble methods could enhance the robustness and reliability of these models. The findings
underscore the importance of using multiple performance metrics to evaluate model
effectiveness comprehensively.
CHAPTER ONE
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
Fake news can take many forms, including entirely fabricated stories, sensationalized
headlines, and manipulated images or videos designed to mislead readers or generate clicks
(Zhang & Wang, 2021). The repercussions of fake news are extensive; they can undermine trust in
media institutions, distort public perception on critical issues such as health and politics, and
even incite violence or unrest within communities (Adebimpe et al., 2023). For instance, during the
COVID-19 pandemic, misinformation regarding the virus's origins and treatment options led
to public confusion and contributed to health crises worldwide. Given the speed at which
information spreads online, traditional fact-checking methods often prove inadequate in
effectively combating fake news (Jain et al., 2019). Manual verification is time-consuming and
struggles to keep pace with the rapid dissemination of false information across social networks.
Consequently, there is an urgent need for automated solutions capable of identifying fake news
with high accuracy. Machine learning (ML) algorithms have emerged as powerful tools for
detecting fake news by analyzing textual content and identifying patterns that distinguish
credible sources from unreliable ones (Mahmud et al., 2021). ML techniques can process vast
amounts of data quickly and efficiently, making them well-suited for this task. Typically, the
process involves training machine learning models on large datasets containing examples of
both real and fake news articles (Ahmed, 2017). These models learn to recognize features
associated with misinformation—such as specific linguistic patterns, emotional language
usage, and source credibility—enabling them to classify new articles as either true or false.
Natural language processing (NLP) plays a crucial role in this context by transforming textual
data into numerical representations that machine learning algorithms can interpret (Norman,
2023). Techniques such as tokenization, stemming, lemmatization, and vectorization are
employed to prepare the data for analysis. Once the data is pre-processed, various ML
models—such as Naïve Bayes, Logistic Regression, Random Forests, and more advanced deep
learning architectures like Convolutional Neural Networks (CNNs) and Long Short-Term
Memory (LSTM) networks—can be trained effectively to detect fake news. The effectiveness
of these models depends on several factors: the quality of the training data, the choice of
features used for classification, and the specific algorithms employed (Thaher et al., 2021). Recent
advancements in deep learning have shown promising results in improving detection accuracy
by capturing complex patterns within text data that traditional methods may overlook (Khan et al., 2022).
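To make the preprocessing and vectorization pipeline described above concrete, the following is a minimal, illustrative sketch of tokenization, stemming, and TF-IDF vectorization. It is not the project's own code (the project's pipeline appears in Chapter Three and Appendix A); the use of NLTK and the two sample sentences are assumptions introduced purely for illustration.
# Illustrative preprocessing sketch (not the project's own pipeline): tokenize,
# stem, then vectorize two made-up sentences with TF-IDF. NLTK's 'punkt' data
# must be downloaded once before word_tokenize can be used.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Scientists confirm the new vaccine is safe.",
    "SHOCKING!!! Miracle cure hidden by doctors!!!",
]

stemmer = PorterStemmer()
# Tokenization and stemming: split each document into words and reduce them to stems.
stemmed_docs = [
    " ".join(stemmer.stem(tok) for tok in word_tokenize(doc.lower()))
    for doc in docs
]

# Vectorization: TF-IDF turns the cleaned text into numerical feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(stemmed_docs)
print(features.shape)   # (2, number_of_distinct_terms)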
1.3 AIM AND OBJECTIVES
The specific objectives of this study are to:
i. Select and apply optimal machine learning algorithms for fake news detection.
ii. Implement and train the selected algorithms on a relevant dataset.
iii. Evaluate the performance of the implemented algorithms.
The scope of this project is limited to the machine learning algorithms selected for this task; it concentrates on the performance of these selected algorithms, evaluated using metrics such as accuracy, precision, recall, F1-score, and the confusion matrix.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
Fake news encompasses various forms of misinformation that can mislead readers or
manipulate public opinion (Zhang & Wang, 2021). This includes entirely fabricated stories
designed to deceive readers into believing false narratives or sensationalized headlines that
exaggerate facts to attract attention (Ghafoor et al., 2022). The rise of social media platforms
has facilitated the rapid spread of such misinformation; users can share content without
verifying its authenticity.
i. Naïve Bayes: This probabilistic classifier is favored for its simplicity and effectiveness
in text classification tasks (Rashid & Khan, 2020). It applies Bayes' theorem to estimate
the likelihood that a given article belongs to a specific category based on its features.
ii. Support Vector Machines (SVM): SVM has demonstrated high accuracy in
differentiating between real and fake news articles by identifying the optimal
hyperplane that separates different classes (Zhang & Wang, 2021). It is particularly
effective in high-dimensional spaces typical of text data.
iii. Deep Learning: Neural networks such as recurrent neural networks (RNNs) and
convolutional neural networks (CNNs) have been utilized due to their ability to capture
complex patterns within data (Khan et al., 2022). These models can develop
hierarchical representations of text data that enhance classification performance.
iv. Ensemble Methods: Techniques like Random Forests combine multiple classifiers to
improve accuracy and robustness (Ahmed & Mahmood, 2020). By aggregating
predictions from various models, ensemble methods help reduce overfitting and
improve generalization.
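As a minimal illustration of the classical classifiers surveyed above, the sketch below fits a Naïve Bayes model, a linear SVM, and a Random Forest using scikit-learn. The synthetic placeholder features (standing in for TF-IDF vectors) and the model settings are assumptions for demonstration only, not the configuration used in this project.
# Minimal sketch of the classical classifiers surveyed above, trained on
# synthetic placeholder features (standing in for TF-IDF vectors).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))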
indicative of coordinated disinformation campaigns often associated with fake news
propagation. Additionally, fact-checking databases play a crucial role in enhancing machine
learning capabilities; algorithms can cross-reference claims made within articles against
verified facts stored within these databases. This approach enables systems not only to classify
content but also to assess its credibility based on factual accuracy checks conducted automatically
during processing.
CHAPTER THREE
METHODOLOGY
3.1 INTRODUCTION
This chapter focuses on the project design, research approach, and methodology chosen for this
study, and covers the operational framework as a whole. It discusses the preprocessing steps,
model training and testing, performance evaluation, data analysis, the models utilized to meet
the study objectives, and the dataset obtained from Kaggle.
3.2 DATASET DESCRIPTION
The dataset utilized in this study, referred to as "real and fake news", was obtained from Kaggle
in December 2017. It contains a total of 44,921 records with three attributes: title, text, and label.
All required Python libraries (NumPy, pandas, Matplotlib, re) were imported for the analysis.
The table below gives the full description of the dataset.
Table 1 Dataset Description
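A short sketch of how the dataset described in Table 1 can be loaded and inspected with pandas is given below. The combined file name news.csv is an assumption for illustration; the actual listing in Appendix A loads separate fake and real CSV files.
import pandas as pd

news = pd.read_csv('news.csv')         # hypothetical combined file name
print(news.shape)                      # expected (44921, 3)
print(news.columns.tolist())           # expected ['title', 'text', 'label']
print(news['label'].value_counts())    # class balance between real and fake articles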
3.3 PROPOSED SYSTEM REQUIREMENT
3.3.1 Hardware Requirement
To ensure the successful completion of this project with improved performance, the necessary
specifications include:
i. Processor: an Intel processor with a clock speed of at least 1.80 GHz
ii. Memory: 4 GB RAM
iii. Disk space: 500 GB
3.4 PROPOSED OVERVIEW MODEL AND METHODS
[Figure: Overview of the proposed model, from data preprocessing (removal of irrelevant characters) through evaluation (accuracy, recall, F1, precision).]
3.4.1 DATA PREPROCESSING
The raw dataset consists mainly of text, together with additional information that is not required
for this study. Irrelevant information is eliminated and the dataset is meticulously cleansed. The
following kinds of unnecessary data are removed:
Special characters: these lack specific meaning and can interfere with the analysis process, so
they are removed.
Punctuation marks: punctuation such as question marks, colons, commas, and exclamation
points appears in all documents but carries little value for classification, so it is stripped from
the text.
Stop words: a set of common words in a particular language (for example "the", "is", and "and")
that carry little discriminative information and are therefore removed; a short cleaning sketch
follows below.
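The sketch below is a minimal illustration of the three cleaning steps described above, assuming scikit-learn's built-in English stop-word list; the project's own cleaner in Appendix A uses a simpler regular-expression approach.
# Minimal cleaning sketch: strip URLs, punctuation, special characters and stop words.
import re
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)              # strip URLs
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)   # punctuation marks
    text = re.sub(r"[^a-z\s]", "", text)                            # special characters and digits
    tokens = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

print(clean_text("BREAKING!!! Read this at https://example.com before it's deleted..."))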
3.4.4 Model Selection and Training
Various models can be trained on the extracted features:
i. Logistic Regression: A simple yet effective model for binary classification tasks.
ii. Random Forest: An ensemble method that reduces overfitting by averaging multiple
decision trees.
iii. Deep Learning Models: Implement RNNs or CNNs for capturing sequential
dependencies in text data.
Training involves splitting the dataset into training and testing subsets, typically using an 80/20
split. The models are trained on the training subset, using the extracted features as inputs, and
their classification performance is then assessed on the held-out testing subset.
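A minimal sketch of this split-and-train step is given below, assuming a feature matrix X (for example the TF-IDF vectors from the feature extraction step) and a label vector y are already available. The random_state and max_iter values are illustrative choices, not the project's exact configuration.
# Sketch of the 80/20 split and training of the three models used in this study.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_models(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)        # 80% training, 20% testing
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: test accuracy = {model.score(X_test, y_test):.4f}")
    return models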
3.4.5 MODEL EVALUATION
The trained models are evaluated using the following metrics; a short computation sketch follows the list.
• Precision: The ratio of true positive predictions to the total predicted positives.
• Recall: The ratio of true positive predictions to the total actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two
metrics for a comprehensive evaluation.
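The computation sketch referred to above: each of these metrics can be obtained directly from scikit-learn, here applied to a small set of made-up labels purely for illustration.
# Computing the evaluation metrics with scikit-learn on made-up labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = real news, 0 = fake news (assumed encoding)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))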
CHAPTER FOUR
Figure 2 Confusion Matrix for Logistic Regression Algorithm
4.2.3 DECISION TREE CLASSIFIER
The Decision Tree Classifier achieved an impressive accuracy of 99%, indicating its ability to
accurately classify fake and real news articles. However, a closer examination of the confusion
matrix revealed that the algorithm may be prone to overfitting. Overfitting occurs when a model
is too complex and learns the noise in the training data, resulting in poor generalization to new,
unseen data. The Decision Tree Classifier’s high accuracy score may be due to its ability to fit
the training data closely, but this may not translate to good performance on new, unseen data.
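One simple way to check for the overfitting described above is to compare training accuracy against test accuracy: a large gap suggests the tree has memorised the training data. The sketch below assumes the TF-IDF matrices and labels produced by the pipeline in Appendix A; the max_depth suggestion is an illustrative remedy, not a tuning step performed in this study.
# Overfitting check: compare training accuracy against held-out test accuracy.
from sklearn.tree import DecisionTreeClassifier

def check_overfitting(xv_train, y_train, xv_test, y_test):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(xv_train, y_train)
    train_acc = tree.score(xv_train, y_train)   # typically close to 1.0 for an unpruned tree
    test_acc = tree.score(xv_test, y_test)
    print(f"train accuracy = {train_acc:.4f}, test accuracy = {test_acc:.4f}")
    # Restricting tree depth (e.g. max_depth=20) or using cross-validation are
    # common ways to narrow the gap between the two scores.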
4.3 COMPARISON OF THE ALGORITHMS
Table 2 Comparison of algorithms using different metrics
The results showed that all three models achieved high accuracy scores, indicating their potential
effectiveness in fake news detection tasks. The outcome for each measure is obtained with the
help of the confusion matrix (CM). The CM is a very popular measure used when solving
classification problems; it can be applied to binary classification as well as multiclass
classification problems. A sample CM is presented in Figure 4.5, where each cell shows the
values of TP, FP, FN, and TN. TP stands for True Positive, which indicates the number of
positive examples classified accurately. FP stands for False Positive, which represents the
number of actual negative examples classified as positive. FN means False Negative, which is
the number of actual positive examples classified as negative, and TN indicates True Negative,
which shows the number of negative examples classified accurately.
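As a brief illustration of how these four counts are read off a confusion matrix in practice, the sketch below uses scikit-learn with made-up labels (not the project's outputs).
# Reading TN, FP, FN and TP from a scikit-learn confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")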
4.4.1 PRECISION
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations. It is also known as the positive predictive value: the fraction of relevant instances
among the retrieved instances. Equation 1 shows the formula for precision.

Precision = TP / (TP + FP)    (1)
4.4.2 Recall
Recall is the proportion of actual positive instances that are successfully retrieved. In binary
classification, recall is also referred to as sensitivity, hit rate, or true positive rate (TPR); it can
be interpreted as the probability that a query returns a relevant document. Recall is defined as
the number of true positives divided by the number of true positives plus the number of false
negatives. Equation 2 shows the formula for recall.

Recall = TP / (TP + FN)    (2)
4.4.3 Accuracy
Accuracy is the proportion of all predictions, both positive and negative, that are classified
correctly. Equation 3 shows the formula for accuracy.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)
4.4.4 F1-score
The F1 score, also known as the F-measure, takes both precision and recall into consideration
when assessing the performance of an algorithm; mathematically, it is the harmonic mean of
precision and recall. It can have a maximum score of 1 (perfect precision and recall) and a
minimum of 0, and is a measure of a model's preciseness and robustness commonly used to
evaluate binary classification systems. Equation 4 shows the formula for the F1 score.

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (4)
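A small worked example ties Equations 1 to 4 together. The counts below are made up for illustration and are not results from this study.
# Worked example for Equations (1)-(4) using made-up counts, not project results.
TP, FP, FN, TN = 90, 10, 5, 95

precision = TP / (TP + FP)                                   # Eq. (1): 90/100 = 0.900
recall    = TP / (TP + FN)                                   # Eq. (2): 90/95  ≈ 0.947
accuracy  = (TP + TN) / (TP + FP + TN + FN)                  # Eq. (3): 185/200 = 0.925
f1        = 2 * precision * recall / (precision + recall)    # Eq. (4): ≈ 0.923
print(precision, recall, accuracy, f1)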
4.5 REAL-WORLD TESTING
To evaluate the performance of the models in real-world scenarios, manual testing was
conducted using news articles from reputable websites such as the BBC (British Broadcasting
Corporation), Daily Trust, and other news sources. The results of the real-world testing
revealed that the Decision Tree Classifier misclassified real news articles as fake news and
fake news articles as real news. Similarly, the Random Forest Classifier misclassified real news
articles as fake news. In contrast, the Logistic Regression algorithm performed relatively well in
the real-world testing, with fewer misclassifications.
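The manual real-world test described above can be expressed as a small helper like the sketch below. It assumes the fitted TF-IDF vectorizer and the trained LR, DT, and RFC models from Appendix A are available; the function name and the simplified cleaning step are illustrative.
# Classify a single pasted news article with each trained model.
import re

def predict_article(text, vectorizer, models):
    cleaned = re.sub(r"https?://\S+|www\.\S+", "", text.lower())   # simplified cleaner
    features = vectorizer.transform([cleaned])
    for name, model in models.items():
        label = "Real News" if model.predict(features)[0] == 1 else "Fake News"
        print(f"{name}: {label}")

# Hypothetical usage with the objects trained in Appendix A:
# predict_article(article_text, vectorization, {"LR": LR, "DT": DT, "RFC": RFC})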
CHAPTER FIVE
5.1 SUMMARY
Fake news detection is a critical task in today's information-rich environment. This study evaluated the
performance of three machine learning models (Decision Tree, Logistic Regression, and Random Forest)
in detecting fake news articles. To evaluate the models, we utilized performance metrics such as
accuracy, precision, recall, F1-score, and confusion matrices. The results showed that all three models
achieved high accuracy scores, but the Decision Tree algorithm was prone to overfitting. Real-world
testing revealed that the Decision Tree and Random Forest algorithms misclassified real news articles as
fake news.
5.2 CONCLUSION
The study concludes that while machine learning models can be effective in detecting fake
news articles, they require careful evaluation and tuning to ensure that they generalize well to
new, unseen data. The findings highlight the importance of considering multiple performance
metrics, including accuracy, precision, recall and F1-score, when evaluating machine learning
models for fake news detection tasks.
REFERENCES
Adebimpe, A. O.-H. (2023). "Long Short-Term Memory Model for Fake News Detection in
Nigeria". Lanna Journal of Interdisciplinary Studies, 5(1), pp. 167-180.
doi: 10.28991/ESJ-2023-07-04-015.
Ahmed, H., T. I. (2017). "Detection of Online Fake News Using N-Gram Analysis and Machine
Learning Techniques". In Intelligent, Secure, and Dependable Systems in Distributed and
Cloud Environments (ISDDC 2017), Lecture Notes in Computer Science, vol. 10618. Cham,
Switzerland: Springer. doi: 10.1007/978-3-319-69155-8_9.
E, N. (2023). "Detecting Fake News Using Machine Learning". Journal of Student Research,
12(1). doi: [Link].
Ghafoor, J. A. (2022). "Fake News Identification on Social Media Using Machine Learning
Techniques". In Proceedings of the International Conference on Information Technology and
Applications, Lecture Notes in Networks and Systems, vol. 350. Singapore: Springer.
doi: 10.1007/978-981-16-7618-5_8.
Jain, A., K. R. (2019). "A Smart System for Fake News Detection Using Machine Learning".
ResearchGate. doi: [Link].
Khan, A. (2022). "Deep Learning Techniques for Fake News Detection on Social Media". IEEE
Access, 10, pp. 12358. doi: [Link].
Thaher, T., S. M. (2021). "Intelligent Detection of False Information in Arabic Tweets Utilizing
Hybrid Harris Hawks Based Feature Selection and Machine Learning Models". Symmetry,
13(4), 556.
Zhang, & Wang. (2021). "A Support Vector Machine Approach to Fake News Detection". Journal
of Information Science, 47(1), pp. 56-67.
APPENDIX A
(source code)
IMPORTING LIBRARIES
import pandas as pd
import numpy as np

# NOTE: many method calls in the original listing were replaced by '[Link]'
# placeholders during document conversion; the calls below are reconstructed
# from context and variable names, not copied verbatim from the original notebook.
fake = pd.read_csv('Fake.csv')   # assumed file name for the fake-news articles
true = pd.read_csv('True.csv')   # assumed file name for the real-news articles
fake.head()
true.head()

# Label the classes: 1 = real news, 0 = fake news
true['label'] = 1
fake['label'] = 0

# Merge the two files into a single frame and inspect it
news = pd.concat([fake, true], axis=0)
news.head()
news.shape
news.isnull().sum()

# Shuffle the rows and rebuild a clean index
news = news.sample(frac=1)
news.reset_index(inplace=True)
news.drop(['index'], axis=1, inplace=True)
news.head()
# Basic text cleaning: lower-case the text and strip URLs
import re

def wordopt(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    return text

news['text'] = news['text'].apply(wordopt)
news['text']

x = news['text']
y = news['label']

# Split into training and test subsets (test_size=0.3 here)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
x_train.shape
x_test.shape
FEATURE EXTRACTION
# TF-IDF vectorization (the vectorizer object was not defined in the original
# listing, so its creation is reconstructed here)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
xv_train
xv_test

MODEL TRAINING
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Logistic Regression
LR = LogisticRegression()
LR.fit(xv_train, y_train)
pred_lr = LR.predict(xv_test)
LR.score(xv_test, y_test)
print(classification_report(y_test, pred_lr))

# Decision Tree
DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)
pred_dt = DT.predict(xv_test)
DT.score(xv_test, y_test)
print(classification_report(y_test, pred_dt))

# Random Forest
RFC = RandomForestClassifier(random_state=0)
RFC.fit(xv_train, y_train)
pred_rfc = RFC.predict(xv_test)
RFC.score(xv_test, y_test)
print(classification_report(y_test, pred_rfc))
# Confusion matrix plots. The plotting calls were lost to '[Link]' placeholders
# in the original listing; the heatmap call and the model assigned to each plot
# are reconstructed here with seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Logistic Regression
cm = confusion_matrix(y_test, LR.predict(xv_test))
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

# Decision Tree
cm = confusion_matrix(y_test, DT.predict(xv_test))
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

# Random Forest
cm = confusion_matrix(y_test, RFC.predict(xv_test))
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
# Grouped bar chart comparing the models on each metric. The metric names and
# score values were lost to '[Link]' placeholders; zeros stand in as placeholders here.
import numpy as np
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
scores = {'LR': [0.0]*4, 'DT': [0.0]*4, 'RFC': [0.0]*4}   # placeholder values
width = 0.30
x = np.arange(len(metrics))
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(scores.items()):
    ax.bar(x + i*width, vals, width, label=name)
ax.set_xticks(x + width)
ax.set_xticklabels(metrics)
ax.set_ylabel('Score')
ax.legend()
plt.show()
def output_lable(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Real News"   # return value reconstructed; lost in the original listing

import pandas as pd

def manual_testing(news):
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_RFC = RFC.predict(new_xv_test)
    # print statement reconstructed; the original was truncated in the listing
    print("LR Prediction: {} \nDT Prediction: {} \nRFC Prediction: {}".format(
        output_lable(pred_LR[0]),
        output_lable(pred_DT[0]),
        output_lable(pred_RFC[0])))

news = str(input())
manual_testing(news)
APPENDIX B
(Outcomes)