Spam News Detection Report

Mani Kiran
Major Project
Table of Contents
1. Introduction

2. Problem Statement
2.1 Objectives

3. Dataset Overview

4. Data Preprocessing

4.1 Tokenization
4.2 Lowercasing
4.3 Stop Word Removal
4.4 TF-IDF Vectorization

5. Logistic Regression Model

5.1 Introduction to Logistic Regression
5.2 Model Training

6. Model Evaluation

6.1 Confusion Matrix
6.2 Accuracy
6.3 Precision and Recall
6.4 F1 Score

7. Results and Analysis

8. Conclusion

9. References

1. Introduction
In the digital age, the consumption of online news has surged, making it one of the primary sources
of information for millions of people worldwide. With this increase, however, comes the growing
issue of misinformation, particularly in the form of spam or fake news. Such articles can
manipulate public opinion, spread false narratives, and cause confusion among readers. Therefore,
identifying and preventing the dissemination of fake news has become a crucial challenge.

Spam news detection refers to the process of automatically classifying news articles as either
legitimate or fraudulent. The use of machine learning, especially in recent years, has proven
effective in tackling this issue. Machine learning models can be trained on large datasets of news
articles to identify patterns and characteristics associated with fake news. In this report, we develop
and evaluate a machine learning model based on logistic regression to classify news articles as
either true news or spam.

This report will explain the problem of spam news, describe the dataset used, detail the
preprocessing steps taken, and provide insights into the model’s performance.

2. Problem Statement
With the rise of online platforms, fake news or spam news has become a pervasive problem. Fake
news articles often appear legitimate and are shared rapidly across social media, making it difficult
for the general public to discern what is true and what is not. The potential impact of spam news
includes misinformation in areas such as politics, health, and financial markets, which can lead to
serious societal and economic consequences.

The key challenge lies in distinguishing between real news and spam news, as spam news can be
carefully crafted to resemble credible news articles. Manual identification is impractical given the
sheer volume of content published daily, which makes machine learning models an attractive
solution for this problem.

Machine learning enables the automatic classification of news articles based on text features,
allowing for scalable and efficient detection of spam news. This problem can be framed as a binary
classification task, where the model must predict whether a given news article is true news or
spam.

2.1 Objectives
The primary objectives of this project are:

• To design a machine learning model that can accurately classify news articles as either
spam or true.

• To preprocess the dataset of news articles for effective model training.

• To evaluate the performance of the model using various evaluation metrics, such as
accuracy, precision, recall, and F1 score.

• To identify potential improvements based on the results and suggest directions for
future development.

3. Dataset Overview
The dataset used in this project consists of news articles, each labeled as either true or spam. The
data is collected from various online sources and includes articles on topics such as politics,
business, and health. The dataset is structured with two primary columns:

• Text: This column contains the full content of each news article, providing the input
features for the machine learning model.

• Label: The target variable, where 1 represents true news and 0 represents spam.

The dataset is balanced, with an equal number of true and spam articles, so the model is not
pushed toward one class by class frequencies alone. Balance also keeps accuracy meaningful as a
metric, since a classifier that always predicts a single class would score only 50%.

Dataset Characteristics:

• Total number of articles: 10,000

• True news articles: 5,000

• Spam news articles: 5,000

Each article varies in length, style, and subject matter, making it a diverse and challenging dataset
for spam detection. The variety in the dataset is beneficial for creating a model that generalizes
well across different types of news.

4. Data Preprocessing
Data preprocessing is a critical step in preparing the raw text for machine learning algorithms. The
text data must be converted into a structured format that can be used as input for the model. Several
preprocessing techniques were applied to transform the text data into numerical features.

4.1 Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens
can be individual words, phrases, or even characters, depending on the level of granularity
required. In this project, word-level tokenization was used, which divides the text into individual
words.

For example, the sentence "Spam news detection is essential" would be tokenized as:
["Spam", "news", "detection", "is", "essential"]
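
As a minimal sketch of word-level tokenization, the snippet below splits text on runs of word characters using Python's standard `re` module. The report does not say which tokenizer it used, so the `tokenize` helper is a hypothetical stand-in rather than the project's actual code.

```python
import re


def tokenize(text):
    """Split text into word-level tokens, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)


tokens = tokenize("Spam news detection is essential")
print(tokens)  # ['Spam', 'news', 'detection', 'is', 'essential']
```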

4.2 Lowercasing

To maintain consistency, all text was converted to lowercase. This ensures that words like "News"
and "news" are treated as the same word, avoiding unnecessary duplication in the vocabulary.
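
The effect on vocabulary size can be illustrated with a few tokens. This is only a sketch on made-up tokens, not data from the project.

```python
# Hypothetical tokens that differ only in letter case.
tokens = ["News", "news", "NEWS", "Spam", "spam"]

vocab_before = set(tokens)                          # case-sensitive vocabulary
vocab_after = {token.lower() for token in tokens}   # vocabulary after lowercasing

print(len(vocab_before), len(vocab_after))  # 5 2
```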

4.3 Stop Word Removal

Stop words are common words like "the", "is", "and", and "in" that do not carry significant
meaning in text classification tasks. Removing these words reduces the dimensionality of the data
and helps focus on more informative words.
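
A minimal sketch of stop word removal is shown below. The report does not state which stop-word list was used, so `STOP_WORDS` here is a small illustrative set and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop-word list (an assumption; the report does not name its list).
STOP_WORDS = {"the", "is", "and", "in", "a", "of", "to"}


def remove_stop_words(tokens):
    """Keep only tokens that are not stop words; expects lowercased tokens."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["spam", "news", "detection", "is", "essential"]
print(remove_stop_words(tokens))  # ['spam', 'news', 'detection', 'essential']
```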

4.4 TF-IDF Vectorization

The final step in preprocessing was to convert the text into numerical features using Term
Frequency-Inverse Document Frequency (TF-IDF) vectorization. TF-IDF is a statistical measure
used to evaluate the importance of a word in a document relative to a collection of documents.
Words that appear frequently in a specific document but are rare across the entire dataset are
assigned higher weights.
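
The formula for TF-IDF is:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{DF(t)}\right)$$

Where:

• TF(t, d) = frequency of the term t in document d

• t = term (word)

• d = document (news article)

• N = total number of documents

• DF(t) = number of documents containing the term t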

Code Example:

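from sklearn.feature_extraction.text import TfidfVectorizer

# X_train and X_test hold the raw article texts from the train/test split (Section 5.2)
# Keep only the 5,000 most frequent terms as features
vectorizer = TfidfVectorizer(max_features=5000)

# Learn the vocabulary and IDF weights from the training articles only
X_train_tfidf = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary on the test articles so no test information leaks into training
X_test_tfidf = vectorizer.transform(X_test)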
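
To make the weighting concrete, the sketch below computes TF-IDF by hand on a toy corpus, assuming the common definition TF-IDF(t, d) = TF(t, d) × log(N / DF(t)) with TF taken as the term's relative frequency in the document. The corpus and the `tf_idf` helper are hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each document vector by default, so its values will differ from this plain version.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document using TF(t, d) * log(N / DF(t))."""
    n_docs = len(corpus)
    # DF(t): number of documents that contain term t at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # TF is the term's share of the document's tokens (one common choice).
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of already tokenized, lowercased documents (made-up data).
corpus = [
    ["shocking", "miracle", "cure", "revealed"],
    ["government", "announces", "new", "budget"],
    ["new", "miracle", "diet", "revealed"],
]
weights = tf_idf(corpus)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "shocking" and "cure" appear nowhere else and so receive a higher weight than "miracle" and "revealed", which also occur in the third document.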

5. Logistic Regression Model

5.1 Introduction to Logistic Regression

Logistic Regression is a popular algorithm used for binary classification problems. It works by
fitting a logistic function (also known as the sigmoid function) to the data, which outputs
probabilities for each class. These probabilities are then used to assign the final classification
(spam or true news).

The logistic function is defined as:

$$p = \frac{1}{1 + e^{-z}}$$

Where p is the probability of the target variable being true (i.e., real news) and z is the
weighted sum of the input features plus a bias term. Logistic regression is especially useful for
text classification tasks, as it is simple, interpretable, and performs well with
high-dimensional data such as text.
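
The mapping from a score to a class can be sketched as follows, assuming the standard logistic function p = 1 / (1 + e^(-z)) and a decision threshold of 0.5. The `sigmoid` and `predict_label` helpers are illustrative and are not taken from the project code.

```python
import math


def sigmoid(z):
    """Logistic function: map a real-valued score z to a probability in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_label(z, threshold=0.5):
    """Return 1 (true news) if the probability reaches the threshold, else 0 (spam)."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Scores far below zero map to probabilities near 0 (spam), scores far above zero map to probabilities near 1 (true news), and a score of exactly zero gives a probability of 0.5.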

5.2 Model Training

The dataset was split into training and testing sets, with 80% of the data used for training the model
and 20% for evaluating its performance. The training set was used to fit the logistic regression
model, while the test set was used to evaluate the model’s generalization ability.
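
The report does not show how the split was performed. The sketch below is one possible approach using only the standard library: it splits each class separately so that both sets stay balanced. The `stratified_split` helper, the seed, and the placeholder articles are all assumptions made for illustration. In a scikit-learn workflow, `train_test_split` with `test_size=0.2` and `stratify` is the usual route.

```python
import random


def stratified_split(texts, labels, test_fraction=0.2, seed=42):
    """Split texts and labels into train and test sets, preserving class proportions."""
    rng = random.Random(seed)
    # Group example indices by class label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved within each set.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Placeholder data with the same shape as the dataset: 5,000 true (1) and 5,000 spam (0).
texts = [f"article {i}" for i in range(10_000)]
labels = [1] * 5_000 + [0] * 5_000

X_train, X_test, y_train, y_test = stratified_split(texts, labels)
print(len(X_train), len(X_test))  # 8000 2000
```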

Code Example:

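from sklearn.linear_model import LogisticRegression

# Fit a logistic regression classifier on the TF-IDF training features
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict labels (1 = true news, 0 = spam) for the held-out test articles
y_pred = model.predict(X_test_tfidf)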

6. Model Evaluation

Model evaluation is critical to understanding the performance of the logistic regression model.
Several evaluation metrics were used to assess the model’s performance, including accuracy,
precision, recall, and F1 score. These metrics provide a comprehensive view of the model’s ability
to correctly classify news articles.

6.1 Confusion Matrix

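The confusion matrix is a table that summarizes the performance of the model by comparing the
predicted classes to the actual classes. Spam is treated as the positive class here, even though it
is encoded as label 0 in the dataset. The matrix contains four elements: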

• True Positives (TP): Correctly classified spam news articles.

• True Negatives (TN): Correctly classified true news articles.

• False Positives (FP): True news articles incorrectly classified as spam.

• False Negatives (FN): Spam news articles incorrectly classified as true.

Code Example:

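from sklearn.metrics import confusion_matrix

# Rows are actual labels and columns are predicted labels, both in sorted order (0 = spam, 1 = true),
# so with spam as the positive class the layout is [[TP, FN], [FP, TN]]
conf_matrix = confusion_matrix(y_test, y_pred)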
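
The four counts can also be tallied directly, which makes the choice of positive class explicit. The sketch below treats spam (label 0) as the positive class, following the definitions in this section. The `confusion_counts` helper and the toy labels are hypothetical.

```python
SPAM, TRUE = 0, 1  # label encoding from the dataset description


def confusion_counts(y_true, y_pred, positive=SPAM):
    """Return (TP, TN, FP, FN), counting `positive` as the class of interest."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # spam correctly flagged as spam
            else:
                fp += 1  # true news wrongly flagged as spam
        else:
            if actual == positive:
                fn += 1  # spam wrongly passed as true news
            else:
                tn += 1  # true news correctly passed
    return tp, tn, fp, fn


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```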

6.2 Accuracy

Accuracy is the proportion of correctly classified articles (both true and spam) out of the total
number of articles. It is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

6.3 Precision and Recall

• Precision: The proportion of predicted spam news that is actually spam, calculated as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall: The proportion of actual spam news that was correctly classified by the model,
calculated as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

6.4 F1 Score
The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the
model’s performance, especially when there is an uneven class distribution. It is calculated as:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

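
The four metrics in this section follow directly from the confusion matrix counts. The sketch below uses the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of precision and recall. The counts passed in are made up for illustration and are not the project's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
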
7. Results and Analysis
The logistic regression model achieved the following results on the test set:

• Accuracy: 92.5%

• Precision: 91.8%

• Recall: 93.2%

• F1 Score: 92.5%

These results suggest that the model is effective at detecting spam news on this dataset. Precision
(91.8%) and recall (93.2%) differ by only 1.4 percentage points, which indicates that the model
neither over-flags true news as spam nor misses a disproportionate share of spam articles.
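
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall, assuming F1 is their harmonic mean:

```python
# Precision and recall as reported in the results above.
precision = 0.918
recall = 0.932

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # 92.5%
```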

8. Conclusion
In this project, we developed a machine learning model using logistic regression to classify news
articles as either spam or true. The model was trained on a balanced dataset of 10,000 articles and
achieved 92.5% accuracy and a 92.5% F1 score on the held-out test set. The results demonstrate the
potential of logistic regression for detecting spam news in online media.

Future improvements could include experimenting with more advanced models such as neural
networks or leveraging additional text preprocessing techniques. Furthermore, expanding the
dataset to include more diverse sources of news could help improve the generalization of the model
to unseen data.

9. References
• Scikit-learn documentation: https://scikit-learn.org/stable/documentation.html

• Spam News Detection Dataset: https://example.com/dataset

• Kaggle: https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets
