
Spam Email Classifier Using SVM

Made by: Ahannach Yassine
         EL Garte Mouhcine

1. Introduction
This report presents the development and evaluation of a Spam Email Classifier
based on a Support Vector Machine (SVM). The results include key evaluation
metrics such as precision, recall, and F1-score, along with insights from the
confusion matrix and a time analysis used to assess the model's efficiency and
reliability in classifying spam and non-spam emails.

2. Feature Extraction and Preprocessing


This script processes an email dataset to identify unique words and their
frequency of occurrence. The cleaned and processed data is then saved to a CSV
file, which can be used as a feature set for machine learning tasks like spam
detection.

1. text_cleanup(text):
   o Removes punctuation and stopwords (e.g., "the", "is").
   o Converts words to lowercase for consistency.
   o Output: a list of cleaned words from the input text.
2. extract_unique_words(data_path):
   o Reads the dataset: assumes the emails are in the text column of a CSV file.
   o Processes each email:
      - Cleans the text with text_cleanup.
      - Lemmatizes words (e.g., "running" → "run").
      - Filters out short words (≤ 2 characters) and numbers.
   o Counts occurrences of unique words in a dictionary.
   o Saves the word frequencies as [Link].
3. Key Logic:
   o Lemmatization ensures related word forms are grouped together.
   o Short words and digits are ignored to focus on meaningful tokens.
   o Sorting by frequency helps identify the most important words.
4. Output:
   o A CSV file ([Link]) with two columns:
      - word: the unique word.
      - count: its frequency in the dataset.
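
The report does not reproduce the script itself, so the following is a minimal sketch of the two functions described above, assuming NLTK for stopword removal and lemmatization; the output filename unique_words.csv is a placeholder, since the actual name is not given here.

```python
import csv
import string
from collections import Counter

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def text_cleanup(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]


def extract_unique_words(data_path):
    """Count lemmatized words across all emails and save them to a CSV."""
    data = pd.read_csv(data_path)  # emails assumed to be in a 'text' column
    counts = Counter()
    for email in data["text"].astype(str):
        for word in text_cleanup(email):
            lemma = lemmatizer.lemmatize(word)
            # Filter out short words (<= 2 characters) and numbers
            if len(lemma) > 2 and not lemma.isdigit():
                counts[lemma] += 1
    # Sort by frequency so the most frequent words come first
    with open("unique_words.csv", "w", newline="") as f:  # placeholder filename
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common())
```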

Important Lines
- Text cleaning: punctuation and stopword removal in text_cleanup.
- Lemmatization and counting: reducing words to their base forms and tallying occurrences.
- Save to CSV: writing the sorted word frequencies to disk.

Role in the Lab


This script extracts features (unique words with frequencies) from email text,
which will later be used as input for machine learning models in spam
classification.

Preprocessing and Feature Extraction


- Processes the email data to generate a feature matrix where:
   o rows represent individual emails;
   o columns represent the frequency of predefined words in each email.
- Labels emails as ham (1) or spam (-1).
- Saves the feature matrix as [Link] for use in machine learning.

Key Components
1. Input Files:
   o data_path: CSV file containing the emails (text column) and their labels (label column, "ham" or "spam").
   o words: list of unique words (from [Link]) used as features.
2. process_emails Function:
   o Initial setup: reads the email dataset and initializes a zero matrix for word counts.
   o Word processing: splits each email into words, lemmatizes each word to its base form, filters out stopwords, punctuation, short words, and numbers, and counts occurrences of each word from the words list.
   o Assign labels: maps "ham" to 1 and "spam" to -1.
   o Save results: writes the word frequencies and labels to [Link].
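
A minimal sketch of process_emails under the same assumptions, reusing text_cleanup and the lemmatizer from the previous sketch; the output filename features.csv is again a placeholder:

```python
import csv

import numpy as np
import pandas as pd

# text_cleanup and lemmatizer as defined in the previous sketch


def process_emails(data_path, words):
    """Build a matrix of word counts per email plus a numeric label column."""
    data = pd.read_csv(data_path)
    word_index = {word: i for i, word in enumerate(words)}
    counts = np.zeros((len(data), len(words)), dtype=int)

    for row, email in enumerate(data["text"].astype(str)):
        for token in text_cleanup(email):
            lemma = lemmatizer.lemmatize(token)
            # Count only words that appear in the predefined feature list
            if len(lemma) > 2 and not lemma.isdigit() and lemma in word_index:
                counts[row, word_index[lemma]] += 1

    # Map "ham" -> 1 and "spam" -> -1
    labels = data["label"].map({"ham": 1, "spam": -1})

    with open("features.csv", "w", newline="") as f:  # placeholder filename
        writer = csv.writer(f)
        writer.writerow(list(words) + ["output"])
        for row in range(len(data)):
            writer.writerow(list(counts[row]) + [labels.iloc[row]])
```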

Key Logic
1. Feature Extraction: updates the count for each word found in the predefined words list.
2. Label Assignment: assigns a numeric label to each email for classification tasks.
3. Save to CSV: combines the word frequencies and label into a single row per email.

Output
- [Link]:
   o columns for each word (from words);
   o an additional output column for the email label (1 for ham, -1 for spam).
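
As an illustration, with three hypothetical feature words the file would look like this (the first data row is a spam email, the second a ham email):

```
free,offer,meeting,output
2,1,0,-1
0,0,3,1
```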

Important Lines
1. Reading Unique Words: ensures feature consistency by using the predefined words list.
2. Filtering and Lemmatization: ensures only meaningful words are included in the count.
3. Writing Results: saves the feature matrix for further analysis.

Role in the Lab


- Converts cleaned email text into the structured feature matrix required for training machine learning models.
- Bridges the gap between raw data and supervised learning tasks.

Support Vector Machine (SVM) Model


Training and Evaluation
- Implements a Support Vector Machine (SVM) model to classify emails as spam or ham.
- Uses the feature matrix generated from the word frequencies ([Link]).
- Trains the model, evaluates it, and saves it for future predictions.

Key Components
1. Input File:
   o [Link]: contains the word frequency features for each email and a label (1 for ham, -1 for spam).
2. Functions:
   o train_svm(X_train, y_train): trains an SVM model using a linear kernel and returns the trained model.
   o save_model(model, vectorizer): saves the trained SVM model and (optionally) a vectorizer to disk using pickle. Why save? It enables reuse without retraining.
   o load_model(): loads the SVM model and vectorizer from disk for predictions.
main() Function
1. Data Preparation:
   o Reads the [Link] file.
   o Splits the data into features (X), the word frequency counts for each email, and labels (y), 1 for ham and -1 for spam.
   o Divides the dataset into training (70%) and testing (30%) subsets.
2. Training: trains the SVM model on the training data.
3. Evaluation: predicts labels for the test data and reports the results.

Output
1. Console Output: a detailed classification report showing model performance.
2. Saved Files:
   o svm_spam_classifier.pkl: the trained SVM model.
   o (Optionally) tfidf_vectorizer.pkl: for text preprocessing (not used here).
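
A sketch of main() under the 70/30 split described above, reusing train_svm and save_model from the previous sketch; the input filename features.csv is a placeholder:

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def main():
    data = pd.read_csv("features.csv")  # placeholder filename
    X = data.drop(columns=["output"])   # word frequency counts
    y = data["output"]                  # 1 for ham, -1 for spam

    # 70% training / 30% testing split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = train_svm(X_train, y_train)
    y_pred = model.predict(X_test)

    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

    save_model(model)


if __name__ == "__main__":
    main()
```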

Key Lines
1. Splitting the Dataset: ensures a 70% training and 30% testing split.
2. Training the SVM: uses a linear kernel, which is well suited to text classification.
3. Classification Report: provides metrics such as precision, recall, and F1-score.
4. Saving the Model: persists the trained classifier to disk with pickle.

Role in the Lab


- This script performs the final step of the pipeline:
   o builds the machine learning model;
   o evaluates its performance;
   o saves the model for deployment or future predictions.

Model Performance and Evaluation Results


Key Metrics Explained:
1. Precision:
   o Measures how many of the emails classified as spam are actually spam.
   o High precision (up to 0.99) indicates the model is good at avoiding false positives (ham misclassified as spam).
2. Recall:
   o Measures how many of the actual spam emails were correctly classified.
   o High recall (up to 0.96) shows the model effectively captures most spam emails.
3. Trade-off:
   o Precision and recall balance the model's ability to avoid false positives against its ability to avoid false negatives.
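
In terms of true positives (TP), false positives (FP), and false negatives (FN) for the spam class, these metrics follow the standard definitions:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```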

Confusion Matrix:
The table shows:
- Ham (non-spam): correctly classified (Ham-Ham) or misclassified as spam (Ham-Spam).
- Spam: correctly classified (Spam-Spam) or misclassified as non-spam (Spam-Ham).
For instance, in one model:
- Ham correctly classified: 1095 emails.
- Ham misclassified as spam: 16 emails.
- Spam correctly classified: 389 emails.
- Spam misclassified as ham: 52 emails.
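
Plugging these counts into the definitions above for the spam class (TP = 389, FP = 16, FN = 52) gives Precision = 389 / (389 + 16) ≈ 0.96 and Recall = 389 / (389 + 52) ≈ 0.88 for this particular model.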

Performance Trends:
- Precision remains consistently high across all models (0.92-0.99).
- Recall varies between 0.91 and 0.96, indicating some fluctuation in capturing all spam.
- The best results show a precision of 0.99 and a recall of 0.95, with ~190 seconds spent.

Time Analysis:
- Each model spends ~189-280 seconds on training and evaluation.
- Total time for processing and evaluating all models: 5302.99 seconds (~1.5 hours).

Conclusion:
The results demonstrate that the spam classifier performs well, with consistently high precision and recall, making it reliable for distinguishing between spam and ham. The variation in metrics across models suggests room for optimization, balancing precision and recall while reducing processing time.
