
Email Spam Detection Using Machine Learning

Email spam has become a major problem in the modern world as a result of the sharp rise in internet users. These emails are frequently used for unethical and illegal purposes, such as fraud and phishing. Through these emails, spammers disseminate dangerous links that have the potential to compromise and harm our systems. Spammers can pretend to be real people in their spam messages by creating phony email accounts and profiles with ease.

Volume 10, Issue 7, July – 2025 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/25jul1755

Email Spam Detection Using Machine Learning
Chetan N¹; Surya J²; Yogananda V³; Dr. Vinay K⁴
¹,²,³ SJB Institute of Technology
⁴ Associate Professor, Department of MCA, SJB Institute of Technology

Publication Date: 2025/09/19

Abstract: Email spam has become a major problem in the modern world as a result of the sharp rise in internet users.
These emails are frequently used for unethical and illegal purposes, such as fraud and phishing. Through these emails,
spammers disseminate dangerous links that have the potential to compromise and harm our systems. Spammers can
pretend to be real people in their spam messages by creating phony email accounts and profiles with ease. They typically
prey on those who are not aware of these frauds. Therefore, being able to spot phony spam emails is essential. The goal of
this project is to use machine learning techniques to identify such spam. This paper examines several machine learning
algorithms, applies them to our datasets, and selects the best-performing one.

How to Cite: Chetan N; Surya J; Yogananda V; Dr. Vinay K (2025) Email Spam Detection Using Machine Learning. International
Journal of Innovative Science and Research Technology, 10(7), 3953-3959. https://doi.org/10.38124/ijisrt/25jul1755

I. INTRODUCTION

In personal, academic, and corporate environments, email has become a crucial means of communication. Its widespread use, however, has also made it a frequent target for malicious activity. One of the most persistent problems in email communication is spam: unwanted, irrelevant, and sometimes hazardous messages sent in bulk. Spam emails can be merely bothersome, or they can carry fraudulent schemes, phishing links, or malware. Studies show that spam makes up a notable percentage of global email traffic, which strains network infrastructure, lowers productivity, and increases the likelihood of cyberattacks. Traditional spam filtering methods, such as rule-based and keyword detection systems, have struggled to keep up with the evolving strategies used by spammers.

In reaction to the limitations of traditional methods, machine learning (ML) has emerged as a more flexible and sophisticated alternative for spam detection. ML models can analyze large volumes of historical email data, spot patterns, and use them to classify new messages. Methods such as Naïve Bayes, Support Vector Machines (SVM), Decision Trees, and Random Forests have shown promise in distinguishing spam from genuine (ham) emails based on characteristics drawn from email headers, content, and metadata. Recent techniques also use deep learning models and natural language processing (NLP) to capture the syntactic and semantic structure of email text. Creating ML-based spam filters involves important steps such as preprocessing (tokenization, stop-word removal, stemming), feature extraction (using methods like TF-IDF), and model training.

Notwithstanding these advances, ML-based spam detection still suffers from class imbalance, concept drift, and the potential for false positives. In an imbalanced dataset, one class dominating the other can skew model predictions and reduce recall for the minority class. Moreover, spam strategies are constantly evolving, which calls for model retraining or the development of adaptable models. False positives, where legitimate messages are wrongly marked as spam, remain a major worry given the possible loss of vital communication. When creating an effective spam detection system, therefore, maintaining precision, recall, and adaptability over time is just as crucial as achieving high accuracy.

➢ Email Spam Detection Overview
Efficient spam detection begins with gathering a labeled dataset comprising both spam and ham (legitimate) emails. Research often makes use of public datasets such as the Enron Email Dataset and the SpamAssassin corpus. Preprocessing standardizes these emails and removes noise: HTML tags, special characters, and stop words are removed, followed by tokenization and stemming or lemmatization. Techniques including Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or word embeddings are then used to turn the cleaned text into a machine-readable format, allowing the model to examine the frequency and context of words. Email spam classification has used a range of machine learning techniques. Naïve Bayes classifiers are used most frequently because they are effective and handle text-based data simply. Other traditional models include Logistic Regression, Support Vector Machines (SVM), and Random Forests. Ensemble techniques and gradient boosting methods such as XGBoost have gained popularity in the last few years due to their high accuracy and stability. Deep learning models such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and transformer-based models like BERT have also been reported to be effective in learning contextual semantics in complex email communications.
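The preprocessing and TF-IDF steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the sample emails, and the helper names `preprocess` and `tf_idf` are all hypothetical, stemming and lemmatization are omitted, and the weighting uses the plain tf × log(N/df) form rather than any particular library's variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use fuller lists).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in", "for", "you", "your"}


def preprocess(raw_email: str) -> list[str]:
    """Strip HTML tags and special characters, lowercase, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_email)      # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep lowercase letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical sample emails, for illustration only.
emails = [
    "<p>Win a FREE prize now!!!</p>",
    "Meeting agenda for the project review",
    "Claim your free prize today",
]
tokens = [preprocess(e) for e in emails]
weights = tf_idf(tokens)
```

In this toy corpus, a term that occurs in only one email (such as "win") receives a higher weight than a term shared by two emails (such as "free"), which is the behavior TF-IDF is meant to provide.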

➢ The Benefits of an Email Spam Detection Model

• Enhanced Safety and Protection against Risks: One of the primary benefits of spam detection systems lies in their ability to safeguard users from an array of cybersecurity threats. Often, spam emails serve as conduits for malware, phishing links, and deceptive content designed to trick users into revealing their personal or financial information.

• Improved Efficiency of the Email System: Spam filtering greatly reduces the volume of unwanted messages that reach email servers. Besides conserving bandwidth and storage capacity, it also spares servers from processing congestion. This, in turn, speeds up the delivery of emails and lowers the operational costs for companies and service providers. Effective filtering thus increases the overall efficiency of the email infrastructure, resulting in better email communication.

• Improved User and Organizational Productivity: In the absence of spam filtering, users may spend a lot of time deleting unwanted or malicious messages. Active filtering decreases distraction and improves the productivity of users by allowing them to concentrate on relevant and authentic communication. This, in turn, improves workflow efficiency and decreases the time organizations spend handling unwanted emails.

• Preservation of Communication Integrity and Brand Reputation: Spam emails damage the image of an organization if they forge a company's domain or seem to originate from internal addresses. A good spam detection system inspects headers, sender behavior, and message content to help eliminate such attacks. This preserves the security and reliability of communications, both internal and external.

• Assistance with Regulatory Adherence: Laws and regulations such as CAN-SPAM, GDPR, and HIPAA must be followed by a wide range of industries. By shielding private data from phishing and email data leaks, efficient spam detection aids in meeting these regulatory requirements. Organizations can better maintain compliance and avoid legal trouble by lowering their exposure to email-borne threats.

Fig 1 Emails Sent and Received Everyday

Fig 2 Types of Spam Email

II. LITERATURE SURVEY

The author of Paper 1 presented Spam-T5, a benchmarking framework designed specifically to assess how well large language models (LLMs) perform in spam detection. According to the study, domain-specific fine-tuning of transformer models greatly improves their accuracy in spam classification. The approach was especially useful in identifying the subtleties of changing spam messages.

The author of Paper 2 used a state-of-the-art transformer model to develop an improved spam filtering model. To understand complex email messages, the model relied on deep semantic understanding and contextual attention. The method yielded an email filtering solution that is effective in real-world use, with improved accuracy and fewer false positives and false negatives.

In Paper 3, the author combined long short-term memory (LSTM) networks and convolutional neural networks (CNNs) to develop a hybrid deep learning model. The architecture enhanced spam detection accuracy by effectively extracting both sequential and local features from emails. On benchmark datasets, the model outperformed both the standalone models and traditional deep learning methods. In Paper 4, the researcher investigated transfer learning for the detection of spam across domains. The developed method allowed a model that had been trained on a large corpus to generalize effectively across a wide range of languages and domains.


This transfer-learning approach greatly improved the model's ability to learn different forms of spam while minimizing the need for retraining.

In Paper 5, the author classified emails as spam using conventional machine learning methods such as Naïve Bayes and Support Vector Machines. For efficient filtering, the model used carefully designed features such as word frequencies and header analysis. It was very lightweight while achieving comparable accuracy, and was therefore suitable for deployment in systems with low processing capacity. In Paper 6, the author presented a spam filter based on a recurrent neural network (RNN). The model used sequential processing and word embeddings to capture semantic relationships efficiently. It was accurate and adaptable, and well suited to large-scale deployment in cloud-based email filtering systems.

In Paper 7, the author gave a detailed analysis of machine learning techniques used in email spam filtering. The paper classified the available techniques, compared various performance metrics, and explored open problems such as data imbalance and changing spammer tactics. In addition, it provided advice for future research, suggesting the creation of interpretable and flexible spam filters.

In Paper 8, the researcher investigated the use of deep neural networks (DNNs) for spam filtering. Notably, without any manual feature engineering, the model was able to learn sophisticated patterns from the data. This method attained high accuracy, confirming the trend toward intelligent and scalable spam filtering through deep learning.

➢ Proposed System
The proposed system for email spam detection has a modular pipeline design comprising text preprocessing, feature extraction, machine learning-based classification, and performance evaluation. The system incorporates tried-and-true methods from current studies to achieve high accuracy and generalizability across a variety of spam types.

➢ Architecture of the System
There are five main parts to the system architecture:

• Data Collection via Email: Both spam and authentic (ham) emails are included in publicly accessible datasets (such as the Enron or Ling-Spam datasets). According to a number of the cited papers, these datasets offer structured formats and are frequently used in benchmark studies.

• Preprocessing Text: To eliminate noise and standardize inputs, emails undergo preprocessing. This includes lowercasing all text; eliminating numbers, special characters, and punctuation; eliminating stop words; and applying lemmatization or stemming. By concentrating only on pertinent linguistic features, these preprocessing steps have repeatedly been shown in numerous papers to enhance model performance.

• Feature Extraction: Count Vectorizer or term frequency–inverse document frequency (TF-IDF) is used for feature representation, converting the cleaned text into numerical vectors. With these methods, the system can capture the distribution of words and their importance throughout the email corpus. To improve model focus and decrease dimensionality, feature selection using information gain or Chi-square is optionally applied.

• Classification Module: Several machine learning models are implemented in this module, including:
  - Naïve Bayes (NB), for its ease of use and text classification performance
  - Support Vector Machines (SVM), which can handle high-dimensional text data
  - Decision trees (DT) and random forests (RF), used in ensemble-based learning
  - Voting classifiers or hybrid models, which enhance prediction robustness by combining outputs from several models

• Assessment and Visualization: Accuracy, precision, recall, F1-score, and ROC-AUC are among the common performance metrics used to evaluate the trained models. Classification errors are visualized using confusion matrices. Additionally, k-fold cross-validation was proposed in some papers to ensure generalization and fairness across different data distributions.

➢ System Flow Diagram

Fig 3 System Flow Diagram
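To make the classification stage of this pipeline concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with Laplace (add-one) smoothing, written with the Python standard library only. It is an illustration under assumptions rather than the system described here: the `MultinomialNB` class, the four training documents, and their labels are hypothetical, and the inputs are assumed to be token lists already produced by the preprocessing stage.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.counts = defaultdict(Counter)  # per-class token counts
        self.totals = Counter()             # per-class total token count
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
            self.totals[label] += len(doc)
        # Log prior: fraction of training documents in each class.
        self.log_prior = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        return self

    def _log_likelihood(self, tok, c):
        # Add-one smoothing so unseen (token, class) pairs never get probability 0.
        return math.log((self.counts[c][tok] + 1) / (self.totals[c] + len(self.vocab)))

    def predict(self, doc):
        # Tokens outside the training vocabulary are ignored.
        scores = {
            c: self.log_prior[c]
            + sum(self._log_likelihood(t, c) for t in doc if t in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Hypothetical, already-preprocessed training data (for illustration only).
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "cash", "offer", "click"],
    ["meeting", "agenda", "project", "review"],
    ["lunch", "tomorrow", "project", "team"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNB().fit(train_docs, train_labels)
spam_guess = model.predict(["free", "prize", "click"])
ham_guess = model.predict(["project", "meeting", "tomorrow"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the scores are sums of logarithms rather than products.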


III. RELATED WORKS

A lot of research has been done to create effective and precise techniques for identifying email spam. The methods have changed over time, moving from conventional machine learning algorithms to more sophisticated deep learning and ensemble models. A categorized summary of these methods based on current research is provided below.

➢ Traditional Methods for Machine Learning
Because of their simplicity and ease of use, classical machine learning classifiers were a major part of the early research.

• Naïve Bayes (NB): It was demonstrated that NB models, despite being predicated on the assumption of feature independence, could classify spam with a fair degree of accuracy. They frequently have trouble, though, capturing intricate contextual relationships in email content.

• Support Vector Machine (SVM): SVM is another frequently used technique, well known for working well in high-dimensional feature spaces. It has performed particularly well for spam detection when text data is converted into large feature vectors using methods like TF-IDF. The strength of SVM is that it finds an optimal separating hyperplane to classify data, and it can handle data that is not linearly separable when non-linear kernels are employed.

• Random Forests and Decision Trees: Decision tree classifiers are easy to interpret and help identify the most useful features for spam classification. A single tree, however, has the potential to overfit the training set. Random Forests, being ensembles of many decision trees, are often employed to counteract this. They offer increased robustness and accuracy, particularly when working with diverse or noisy datasets.

• Ensemble Learning Techniques: Because they can combine the predictions of several base classifiers, ensemble techniques such as bagging, boosting, and stacking have drawn interest. By lowering bias and variance, these techniques enhance performance. For instance, bagging improves stability by averaging predictions across several models, while boosting can fix mistakes made by weak learners by concentrating more on incorrectly classified instances.

• Deep Learning Models: With improvements in computational power and data availability, deep learning models have become a top contender for spam filtering. Convolutional Neural Networks (CNNs) can automatically learn local (spatial) patterns in email text, while Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, are well suited to sequential data and to learning context over time. These models have been very accurate in recognizing spam, particularly when trained on large labeled datasets.

➢ Spam Detection Techniques

Fig 4 Spam Detection Techniques
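The idea behind combining base classifiers can be illustrated with a hard-voting sketch. Everything here is hypothetical and chosen only for illustration: the three keyword rules stand in for trained base models, and the helper names `majority_vote` and `ensemble_predict` do not come from the paper. The sketch shows plain majority voting, not bagging, boosting, or stacking.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Three deliberately weak, hypothetical base classifiers operating on token lists.
def keyword_rule(tokens):
    return "spam" if {"free", "prize", "winner"} & set(tokens) else "ham"


def link_rule(tokens):
    return "spam" if {"click", "http", "unsubscribe"} & set(tokens) else "ham"


def money_rule(tokens):
    return "spam" if {"cash", "loan", "credit"} & set(tokens) else "ham"


BASE_CLASSIFIERS = [keyword_rule, link_rule, money_rule]


def ensemble_predict(tokens):
    """Hard voting: each base classifier casts one vote for its predicted label."""
    return majority_vote([clf(tokens) for clf in BASE_CLASSIFIERS])


obvious_spam = ensemble_predict(["free", "cash", "click"])
borderline_spam = ensemble_predict(["claim", "prize", "click", "here"])
# keyword_rule alone would flag this message because of "free",
# but the other two classifiers outvote it.
legit_message = ensemble_predict(["free", "lunch", "tomorrow"])
```

The last example is the point of the design: a false positive from one weak classifier is corrected by the others, which is one way ensembles improve robustness.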

IV. RESULTS

Table 1 shows the promising outcome of the performance comparison of ML and DL methods for email spam classification. LR, RF, and NB achieved a very appreciable average of 96% accuracy and precision. These traditional methods performed well, which means they can effectively classify spam emails. With an average precision and accuracy of 97.5%, the ANN model performed slightly better. This suggests that DL methods can potentially enhance email spam classification, improving the precision and robustness of spam filtering systems. By showing the feasibility of traditional ML algorithms and DL methods in overcoming the challenges of email spam classification, these results pave the way for more efficient spam detection systems in electronic communication interfaces.

Table 1 Performance of Model

Algorithm    Accuracy (%)    Precision (%)
LR           95.5            96.4
RF           97.5            98.5
NB           97.5            100
KNN          90.5            100
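Accuracy and precision measure different things, which is why a model can reach 100% precision while its accuracy is noticeably lower. The sketch below computes the standard metrics from confusion-matrix counts. The counts are hypothetical and are not taken from the paper's experiments; they were chosen only to show how a classifier with no false positives can still miss many spam emails.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (not from the paper): 200 test emails, 49 of them spam.
# The classifier never flags a legitimate email (fp = 0) but misses 19 spam emails.
metrics = classification_metrics(tp=30, fp=0, fn=19, tn=151)
```

With these counts, precision is 1.0 and accuracy is 0.905, while recall is only about 0.61. Reporting recall and F1 alongside accuracy and precision therefore gives a fuller picture of a spam filter's behavior.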

Fig 5 Accuracy Time Series

Fig 6 Precision Time Series


Accuracy and loss curves are the main charts used to quantify model performance during training for spam classification with ANNs. The accuracy curve provides information on the learning process by displaying the model's accuracy in distinguishing spam from non-spam instances over the epochs. The loss curve, in turn, displays the rate at which the training loss decreases over time, reflecting the model's efficiency in minimizing errors. The curves help practitioners and researchers to detect convergence, determine the best number of epochs, and ensure that the model can distinguish between spam and non-spam emails. The trade-off between the true positive (TP) rate (sensitivity) and the false positive (FP) rate (1 − specificity) under varying threshold settings is displayed graphically by the Receiver Operating Characteristic (ROC) curve. It reflects how well a model can separate true positives from false positives at various thresholds. An AUC-ROC value closer to 1 reflects greater discriminatory power and indicates a better model.
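The ROC construction described above can be sketched directly: sweep the decision threshold over the model's scores, record the (FP rate, TP rate) pair at each threshold, and integrate with the trapezoidal rule to obtain the AUC. The labels and scores below are hypothetical, and `roc_points` and `auc` are illustrative helper names rather than functions from any library. The sketch assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1 for spam (positive), 0 for ham (negative).
    scores: predicted spam scores; higher means more likely spam.
    Assumes at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]

curve = roc_points(labels, scores)
area = auc(curve)
```

In this example one ham email (score 0.70) is ranked above one spam email (score 0.60), so 8 of the 9 spam/ham pairs are ordered correctly and the AUC is 8/9, roughly 0.89. A model that ranks every spam email above every ham email would reach an AUC of 1.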

Fig 7 Accuracy and Precision

V. CONCLUSION

The accuracy of spam email classifiers can be negatively impacted by emails that are manipulated by such technologies. It would therefore be extremely useful to have a collection of such emails. To confirm these findings and to investigate other benefits of using the advanced approach in anything beyond the most straightforward classification scenarios, further research and development are required. Employing machine learning algorithms, the suggested approach dramatically improved the accuracy of spam email classification.

The experiments showed that accuracy, recall, and F1-score improved when the outputs of several simple classifiers were combined in an ensemble. The results indicate that machine learning (ML) can significantly improve the accuracy of spam email classification for practical applications. By sending deceptive emails to build a good sending reputation with email providers, such programs try to evade email servers or filtering software, decreasing the probability that the sender's future emails are classified as spam. Spam classifiers can be rendered less accurate by such spoofed emails. It would, therefore, be of significant help to have a dataset comprising such emails. To confirm these findings and to find other benefits of using the new method in most classification situations, further research and development need to be undertaken.

REFERENCES

[1]. M. Labonne and S. Moran, "Spam-T5: Benchmarking LLMs for Email Spam Detection," in Proceedings of the International Conference on Computational Linguistics (COLING), 2023.
[2]. S. Jamal and H. Wimmer, "Improved Transformer-Based Spam Detection," Journal of Artificial Intelligence Research (JAIR), vol. 35, pp. 120-135, 2023.
[3]. S. Zavrak and S. Yilmaz, "Hybrid Deep Learning for Email Spam Detection," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 6, pp. 987-999, 2022.
[4]. V. S. Tida and S. Hsu, "Universal Spam Detection with Transfer Learning," in Proceedings of the ACM Conference on Machine Learning (ACM-ML), pp. 230-242, 2022.
[5]. Narur, H. Jain, G. S. Rao, et al., "ML-Based Spam Mail Detector," Springer Journal of Machine Learning and Applications, vol. 27, pp. 89-104, 2023.
[6]. M. Al-Sarem, M. Al-Hadhrami, A. Alshomrani, et al., "Deep Learning for Spam Detection," Expert Systems with Applications, Elsevier, vol. 167, pp. 113872, 2021.
[7]. M. A. Shafi, H. Hamid, E. G. Chiroma, J. S. Dada, and B. Abubakar, "Machine Learning for Email Spam Filtering: Review, Approaches and Open Research Problems," in Proceedings of the International Conference on Artificial Intelligence and Machine Learning (AIML), pp. 45-56, 2018.
[8]. M. Almeida, T. A. Almeida, and A. Silva, "Spam Email Detection Using Deep Learning Techniques," in Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 92-105, 2021.
[9]. M. Madhukar and S. Verma, "Hybrid Semantic Analysis of Tweets: A Case Study of Tweets on Girl-Child in India," Engineering, Technology & Applied Science Research, vol. 7, no. 5, pp. 2014–2016, Oct. 2017.
[10]. C. Bansal and B. Sidhu, "Machine learning based hybrid approach for email spam detection," in Proc. 9th Int. Conf. Rel., INFOCOM Technol. Optim., Sep. 2021, pp. 1–4.
[11]. H. V. Le, M. T. Nguyen, and T. T. Nguyen, "Email spam detection based on ensemble learning of extreme learning machine," International Journal of Machine Learning and Cybernetics, vol. 9, no. 4, pp. 591-602, 2018.
[12]. E. Sahın, M. Aydos, and F. Orhan, "Spam/ham e-mail classification using machine learning methods based on bag of words technique," in 2018 26th Signal Processing and Communications Applications Conference (SIU), IEEE, 2018.

IJISRT25JUL1755 www.ijisrt.com 3959
