Email Spam Detection Using Machine Learning
Abstract: Email spam has become a major problem in the modern world as a result of the sharp rise in internet users.
These emails are frequently used for unethical and illegal purposes, such as fraud and phishing. Through these emails,
spammers disseminate dangerous links that have the potential to compromise and harm our systems. Spammers can
pretend to be real people in their spam messages by creating phony email accounts and profiles with ease. They typically
prey on those who are not aware of these frauds. Therefore, the ability to identify spam emails is essential. The goal of
this project is to use machine learning techniques to identify such spam. This paper examines several machine learning
algorithms, applies them to our datasets, and selects the best-performing one.
How to Cite: Chetan N; Surya J; Yogananda V; Dr. Vinay K (2025) Email Spam Detection Using Machine Learning. International
Journal of Innovative Science and Research Technology, 10(7), 3953-3959. https://doi.org/10.38124/ijisrt/25jul1755
This improvement greatly enhanced the model's ability to learn different forms of spam while minimizing the need for retraining.

In Paper 5, the author classified emails as spam using conventional machine learning methods such as Naïve Bayes and Support Vector Machines. For efficient filtering, the model used carefully designed features such as word frequencies and header analysis. Despite achieving comparable accuracy, it was very lightweight and therefore suitable for implementation in systems with low processing capacity.

In Paper 6, the author presented a spam filter based on a deep learning recurrent neural network (RNN). The model used sequential processing and word embeddings to capture semantic relationships efficiently. It was accurate and adaptable, and well suited to large-scale deployment in cloud-based email filtering systems.

In Paper 7, the author gave a detailed analysis of machine learning techniques used in email spam filtering. The paper classified the available techniques, compared performance metrics, and explored open problems such as data imbalance and changing spammer tactics. In addition, it provided advice for future research, suggesting the creation of interpretable and flexible spam filters.

In Paper 8, the researcher investigated the use of deep neural networks (DNNs) for spam filtering. Notably, without any manual feature engineering, the model was able to learn sophisticated patterns from the data. This method attained high accuracy, confirming the trend toward intelligent and scalable spam filtering through deep learning.

Proposed System
The proposed system for email spam detection is designed as a modular pipeline comprising text preprocessing, feature extraction, machine learning-based classification, and performance evaluation. The system incorporates established methods from current studies to achieve high accuracy and generalizability across a variety of spam types.

System Flow Diagram

Architecture of the System
There are five main parts to the system architecture:

Email Data Collection: Both spam and authentic (ham) emails are included in publicly accessible datasets (such as the Enron or Ling-Spam datasets). According to a number of cited papers, these datasets offer structured formats and are frequently used in benchmark studies.

Text Preprocessing: To eliminate noise and standardize inputs, emails undergo preprocessing. This includes lowercasing all text; eliminating numbers, special characters, and punctuation; eliminating stop words; and applying lemmatization or stemming. By concentrating only on pertinent linguistic features, these preprocessing steps have been repeatedly shown in numerous papers to enhance model performance.

Feature Extraction: Count Vectorizer or term frequency–inverse document frequency (TF-IDF) is used for feature representation, converting the cleaned text into numerical vectors. With these methods, the system can record the distribution of words and their importance throughout the email corpus. To improve model focus and decrease dimensionality, feature selection using information gain or Chi-square is optionally applied.

Classification Module: Several machine learning models are implemented in this module, including Naïve Bayes (NB), for its ease of use and text classification performance; Support Vector Machines (SVM), which can handle high-dimensional text data; decision trees (DT) and random forests (RF), which are used in ensemble-based learning; and voting classifiers or hybrid models, which enhance prediction robustness by combining outputs from several models.

Assessment and Visualization: Accuracy, precision, recall, F1-score, and ROC-AUC are among the common performance metrics used to evaluate the trained models. Classification errors are visualized using confusion matrices. Additionally, k-fold cross-validation was proposed in some papers to ensure generalization and fairness across different data distributions.
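The text preprocessing and feature extraction stages of this architecture can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names `preprocess` and `tfidf_vectors`, and the sample emails are assumptions made for illustration, stemming and lemmatization are omitted, and the plain textbook weighting tf × log(N/df) is used. In practice a library implementation (for example scikit-learn's `TfidfVectorizer`, which applies a smoothed variant) would normally be preferred.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is",
              "you", "your", "for", "at", "on"}


def preprocess(text):
    """Lowercase, drop digits and punctuation, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return the vocabulary and one sparse TF-IDF vector (dict) per document."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vocabulary, vectors


# Hypothetical sample emails, used only to exercise the functions.
emails = [
    "WIN a FREE prize now!!! Click here",
    "Meeting moved to 3pm, see agenda attached",
    "Free entry: claim your prize today",
]
vocabulary, vectors = tfidf_vectors(emails)
print(preprocess(emails[0]))
print(vectors[0])
```

Terms that occur in fewer documents receive larger weights, so in the first email a word such as "win" is weighted above "free", which also appears in the third email.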
III. RELATED WORKS

A great deal of research has been done to create effective and precise techniques for identifying email spam. The methods have changed over time, moving from conventional machine learning algorithms to more sophisticated deep learning and ensemble models. A categorized summary of these methods based on current research is provided below.

Traditional Methods for Machine Learning
Because of their simplicity and ease of use, machine learning classifiers like Naïve Bayes (NB) were a major part of the early research. It was demonstrated that NB models, despite being predicated on the assumption of feature independence, could classify spam with a fair degree of accuracy. However, they frequently have trouble capturing intricate contextual relationships in email content.

The Support Vector Machine (SVM) is another frequently used technique that is well known for working well in high-dimensional feature spaces. It has performed particularly well on spam detection tasks when text data is converted into large feature vectors using methods like TF-IDF. The strength of SVM is that it uses optimal hyperplanes to classify data, and non-linear kernels extend it to data that is not linearly separable.

Random Forests and Decision Trees
Decision tree classifiers are easy to interpret and help identify the most useful features for spam classification. Each tree, however, has the potential to overfit the training set. Random Forests, being ensembles of many decision trees, are often employed to counteract this. They offer increased robustness and accuracy, particularly when working with diverse or noisy datasets.

Ensemble Learning Techniques
Because they can combine the predictions of several base classifiers, ensemble techniques like bagging, boosting, and stacking have drawn interest. By lowering bias and variance, these techniques enhance performance. For instance, bagging improves stability by averaging predictions across several models, while boosting corrects mistakes made by weak learners by concentrating more on incorrectly classified instances.

Deep Learning Models
With improvements in computational power and data availability, deep learning models have become a top contender for spam filtering. Convolutional Neural Networks (CNNs) can automatically learn spatial patterns in email text, while Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, are well suited to sequential data and learn context over time. These models have been very accurate in recognizing spam, particularly when trained on large labeled datasets.

Spam Detection Techniques
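The feature-independence assumption behind the Naïve Bayes classifiers discussed in this section can be made concrete with a small sketch. This is an illustrative multinomial Naïve Bayes with add-one (Laplace) smoothing, not the implementation used in any of the cited papers; the class name `NaiveBayesSpamFilter` and the toy training data are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocabulary:
                    continue  # ignore words never seen during training
                # Independence assumption: per-word log-likelihoods add up.
                likelihood = (self.word_counts[label][tok] + 1) / (total_words + vocab_size)
                score += math.log(likelihood)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical, already-tokenized training emails.
train_tokens = [
    ["win", "free", "prize", "now"],
    ["free", "cash", "offer"],
    ["meeting", "agenda", "attached"],
    ["project", "meeting", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_tokens, train_labels)
print(model.predict(["free", "prize"]))
print(model.predict(["meeting", "tomorrow"]))
```

Because each word contributes to the score independently, the model is cheap to train and evaluate, but word order and context are ignored, which is the limitation noted above.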
IV. RESULTS

The table shows promising outcomes in the performance comparison of ML and DL methods for email spam classification. LR, RF, and NB achieved an appreciable average accuracy and precision of 96%. These traditional methods performed well, which means they can effectively classify spam emails. With an average precision and accuracy of 97.5%, the ANN model performed slightly better. This suggests that DL methods can potentially enhance email spam classification, improving the precision and robustness of spam filtering systems. These results pave the way for more efficient spam detection systems in electronic communication interfaces by showing the feasibility of traditional ML algorithms and DL methods in overcoming the challenges of email spam classification.
Accuracy and Loss Curves are the main charts used to quantify model performance during training for spam classification with ANNs. The Accuracy Curve provides information on the learning process by displaying, epoch by epoch, the model's accuracy in distinguishing spam from non-spam instances. The Loss Curve, in turn, displays how quickly the training loss decreases over time, reflecting the model's efficiency in minimizing errors. The curves help practitioners and researchers detect convergence, determine the best number of epochs, and ensure that the model can distinguish between spam and non-spam emails.

The Receiver Operating Characteristic (ROC) curve graphically displays the trade-off between the TP rate (sensitivity) and the FP rate (1 − specificity) across threshold settings. It reflects how well a model can separate spam from non-spam emails at various thresholds. An AUC-ROC value closer to 1 reflects greater discriminatory power.
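The evaluation metrics used in this section can be computed directly from confusion-matrix counts. The following sketch uses made-up labels and scores, not the paper's results; the helper names are assumptions. The AUC is computed through its rank interpretation: the probability that a randomly chosen spam email receives a higher score than a randomly chosen non-spam email, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of spam (1) and non-spam (0) scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = spam) and classifier scores for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold (0.5 here), whereas the AUC summarizes ranking quality across all thresholds.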
As a result of the experiments, it was found that the accuracy, recall, and F1-score metrics were enhanced by using an ensemble of the outputs from a variety of simple classifiers. The results indicate that machine learning (ML) can significantly improve the accuracy of spam e-mail classification for practical applications. With the practice of

[1] M. Labonne and S. Moran, "Spam-T5: Benchmarking LLMs for Email Spam Detection," in Proceedings of the International Conference on Computational Linguistics (COLING), 2023.
[2] S. Jamal and H. Wimmer, "Improved Transformer-Based Spam Detection," Journal of Artificial