SMS SPAM DETECTION
USING MACHINE LEARNING
APPLIED DATA SCIENCE PROJECT
Abstract
With the rapid rise in mobile communication, SMS
has become a common mode of interaction.
Unfortunately, it is also a target for spam. This
project presents an intelligent spam detection
system that classifies SMS messages as "spam" or
"ham" using machine learning models.
We applied preprocessing and TF-IDF vectorization
to a labelled dataset and trained three models:
Multinomial Naive Bayes, Linear SVM, and
Random Forest.
Model performance was evaluated using accuracy
and F1-score, with visual comparisons. The
system supports real-time predictions and can be
Introduction
SMS is a quick and widely used communication
method, but the rise in spam has made it both a
security risk and a nuisance for users.
This project addresses the problem using
machine learning techniques to build a system
that detects spam messages automatically.
We used a real, labelled SMS dataset and applied
Naive Bayes, SVM, and Random Forest models.
The system can also visualize model
performance, predict spam probability, and
save/load the trained models.
Project Background
As spam grows, traditional keyword-based filters are no longer
effective.
Spam messages evolve constantly, making static rules unreliable.
Machine learning provides a dynamic solution that learns from data
and adapts.
This project aims to test and compare different models to find the
most accurate spam classifier.
Problem and Objective
Problem: SMS spam is increasing, affecting communication and
security.
Objective: To build a machine learning-based system that can
classify SMS messages into spam or ham with high accuracy.
Dataset Description
•Dataset: spam.csv
•Shape: 5572 rows × 2 columns
•Columns:
•label: "spam" or "ham"
•message: Actual SMS content
Challenges:
•Class imbalance: more ham than spam
•Noisy language, slang, and typos
•Evolving spam patterns
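The class imbalance noted above can be checked before training. The following is a minimal sketch, assuming the CSV has a header row with the column names label and message as listed above; the function name label_distribution and the three sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file):
    """Count rows per label in a CSV with 'label' and 'message' columns."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"].strip().lower() for row in reader)


# Tiny in-memory stand-in for spam.csv (made-up rows, not real data).
sample = io.StringIO(
    "label,message\n"
    "ham,See you at lunch\n"
    "spam,WIN a FREE prize now\n"
    "ham,Running late sorry\n"
)

counts = label_distribution(sample)
total = sum(counts.values())
for label, n in counts.items():
    # Report each class with its share of the dataset.
    print(f"{label}: {n} ({n / total:.0%})")
```

To run this on the real file, pass an open file handle instead of the StringIO object. The file's actual encoding and header names may differ from what is assumed here.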
Text Preprocessing
•Removed special characters, numbers, and punctuation
•Converted text to lowercase
•Tokenized messages into words
•Removed stopwords
•Applied TF-IDF vectorization for numerical representation
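The preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual code: the stopword list is a small made-up set, tokenization is a simple whitespace split, and the TF-IDF weighting uses one common variant (term frequency times ln(N / df) + 1). A library vectorizer may apply different smoothing and normalization.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption; the project's list is not given).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "you", "your",
             "for", "of", "in", "at"}


def preprocess(message):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stopwords."""
    text = message.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes digits and punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf = count / document length, idf = ln(N / df) + 1.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


messages = ["WIN a FREE prize now!!! Call 0800", "Are you free for lunch at 1?"]
tokens = [preprocess(m) for m in messages]
vectors = tfidf(tokens)
print(tokens[0])
print(vectors[0])
```

In this example the word "free" occurs in both messages, so it receives a lower weight than "win", which occurs in only one.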
Proposed Solution and Pipeline
End-to-End Pipeline for Spam Detection
Data Collection and Cleaning
TF-IDF Vectorization
Model Training (Naive Bayes, SVM, Random Forest)
Evaluation using Accuracy, F1-Score, Spam F1-Score, and Ham F1-Score
Best Model Selection
Real-time Prediction and Visualization
Model Saving and Deployment
Model 1 – Multinomial Naive Bayes
•Probabilistic classifier based on Bayes’ theorem
•Assumes word independence
•Fast and efficient for text classification
•Outputs class label and prediction probability
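To make the points above concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes with Laplace (add-one) smoothing. It is an illustration only: the class name TinyMultinomialNB and the toy training messages are hypothetical, and it works on raw word counts rather than the TF-IDF features used in the project.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes on token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, tokens):
        """Return {class: posterior probability} for one tokenized message."""
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                # Word independence: add one log-likelihood term per word.
                score += math.log(
                    (self.word_counts[c][tok] + 1)
                    / (total_words + len(self.vocab))
                )
            log_scores[c] = score
        # Normalize in log space to avoid underflow on long messages.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}

    def predict(self, tokens):
        proba = self.predict_proba(tokens)
        return max(proba, key=proba.get)


# Toy training data (made up for illustration).
docs = [["win", "free", "prize"], ["free", "cash", "now"],
        ["lunch", "today"], ["see", "you", "tomorrow"],
        ["call", "me", "later"]]
labels = ["spam", "spam", "ham", "ham", "ham"]

model = TinyMultinomialNB().fit(docs, labels)
proba = model.predict_proba(["free", "prize"])
print(model.predict(["free", "prize"]), proba)
```

The predict_proba method returns the probability for each class, and predict returns the class label with the highest probability.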
Model 2 – Linear SVM
•Finds the maximum-margin hyperplane separating the two classes
•Performs well on sparse, high-dimensional data
•Works well with TF-IDF vectors
•Achieved the best F1-score of the three models
Model 3 – Random Forest Classifier
•Ensemble model with multiple decision trees
•Reduces overfitting through averaging
•Handles complex, non-linear feature interactions
•Slower than the other two models and slightly
less effective on this SMS data
Model Evaluation Metrics
Accuracy: Correct predictions / Total
messages
F1-Score: Harmonic mean of precision and
recall
Spam F1-Score: F1 computed with spam as the
positive class (balance between spam
precision and recall)
Ham F1-Score: F1 computed with ham as the
positive class (balance between ham
precision and recall)
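The metrics above can be computed directly from true and predicted labels. The sketch below is a hedged illustration: the helper names precision_recall_f1 and accuracy and the six example labels are made up, and the F1 value is computed separately for each class by treating that class as the positive one.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration (not project results).
y_true = ["ham", "ham", "ham", "ham", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "ham"]

acc = accuracy(y_true, y_pred)
spam_p, spam_r, spam_f1 = precision_recall_f1(y_true, y_pred, "spam")
ham_p, ham_r, ham_f1 = precision_recall_f1(y_true, y_pred, "ham")
print(f"accuracy={acc:.3f} spam_f1={spam_f1:.3f} ham_f1={ham_f1:.3f}")
```

Because ham messages outnumber spam messages, accuracy alone can look high even when many spam messages are missed, which is why the per-class F1 values are reported alongside it.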
Real-Time Spam Prediction
•New SMS input by the user
•Preprocessed and vectorized like training data
•Prediction made using best-performing model
•Outputs:
• Spam or Ham label
• Probability score (e.g., 85% spam likelihood)
Model Deployment
Best model and TF-IDF vectorizer saved using
Pickle
Can be reloaded anytime for predictions
Suitable for integration in real-world apps (e.g.,
mobile, web)
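Saving and reloading with pickle can be sketched as follows. This is an illustration under assumptions: the helper names save_artifacts and load_artifacts, the file name spam_model.pkl, and the placeholder dictionaries are hypothetical. In the project the two objects would be the trained classifier and the fitted TF-IDF vectorizer.

```python
import os
import pickle
import tempfile


def save_artifacts(path, model, vectorizer):
    """Store the model and vectorizer together so they stay in sync."""
    with open(path, "wb") as f:
        pickle.dump({"model": model, "vectorizer": vectorizer}, f)


def load_artifacts(path):
    """Reload the pair saved by save_artifacts."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["model"], bundle["vectorizer"]


# Placeholders standing in for the trained classifier and fitted vectorizer.
model = {"name": "placeholder-model"}
vectorizer = {"vocabulary": ["free", "prize", "win"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spam_model.pkl")
    save_artifacts(path, model, vectorizer)
    loaded_model, loaded_vectorizer = load_artifacts(path)

print(loaded_model, loaded_vectorizer)
```

Keeping both objects in one file ensures that new messages are vectorized with the same vocabulary the model was trained on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.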
Visualizations
•Bar chart of model performance (accuracy, F1-score)
•Spam probability comparison for a sample message across all
models
•Helps in model selection and understanding model
confidence
Conclusion
Machine learning models can effectively classify SMS as spam
or ham.
SVM delivered the best performance, followed by Naive
Bayes.
The project showcases how practical and scalable ML
solutions can help combat spam in real-time environments.
Future Work
Expand dataset (different languages, larger samples)
Thank You
Presented by:
•B. Lakshmi Thirupathamma – AP22110010472
•K. Sai Nikhitha – AP22110010498
•K. Mohana Samanya – AP22110010523
•B. Sai Sushanth – AP22110010590