Practical File
Machine Learning
B.E. (IT) 6th Semester
Submitted To: Dr. Mandeep Kaur
Submitted By: Pranav Kumar (UEM228120)
IT Section-2
Department of Information Technology
University Institute of Engineering and Technology
Panjab University Chandigarh
INDEX

S.No.  Practical Name
1. Introduction to the steps of machine learning
2. Solving a regression problem using machine learning: finding the least squares regression line, calculating correlation, plotting, and predicting values.
3. Applying regression analysis to a given dataset and performing hypothesis testing.
4. Perform regression on a random dataset using machine learning libraries.
5. Perform binary classification on a random dataset using SVM (Support Vector Machine).
6. Train SVM on raw, normalized, and balanced data. Analyze accuracy, precision, recall, and F1-score.
7. Train KNN on raw, normalised and balanced data on a random dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
8. Train Random Forest on raw, normalised and balanced data on a random dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
Practical Number - 01
AIM: Introduction to the steps of machine learning
Description:
This practical introduces the basic steps in a machine learning workflow: reading a
CSV file, performing descriptive data exploration (head and summary statistics),
plotting data distributions using histograms and scatter plots, checking linear
relationships between features, and splitting the dataset into training and testing
sets (70% training, 30% testing) using train_test_split from scikit-learn.
1. Read a CSV file
Read data/csv file: We start by importing the dataset into our Python environment.
The dataset is usually in CSV (Comma-Separated Values) format. Reading the data
is the first step to begin any kind of analysis or modeling.
You can read a CSV file in Python using the pandas library. Here’s how you can do
it:
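A minimal sketch (the file name dataset.csv is illustrative; substitute your own CSV path):

import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("dataset.csv")

# Check how many rows and columns were loaded
print(df.shape)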
2. Perform descriptive exploration (head, summary statistics)
Descriptive Exploration of Data: Descriptive exploration involves getting a basic
understanding of the dataset — such as how many rows and columns it has, what
type of data each column contains, and checking for missing or unusual values.
View First Few Rows (head): The .head() function lets us see the first few rows of
the dataset. This gives us a quick look at the data structure and values.
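A short sketch of these exploration steps, assuming the DataFrame df loaded above:

# First five rows of the dataset
print(df.head())

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())

# Column data types, non-null counts, and missing values per column
print(df.info())
print(df.isnull().sum())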
3. Plot feature distributions (histograms, scatter plots)
Plot Feature Relationships (Scatter Plots): Scatter plots help us visually analyze the
relationship between two variables. If the points are close to a straight line, there
may be a linear relationship.
Histogram Graph
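A minimal plotting sketch, assuming df has two numeric columns; the column names feature1 and feature2 are illustrative placeholders:

import matplotlib.pyplot as plt

# Histogram: distribution of a single feature
df["feature1"].plot(kind="hist", bins=20, title="Distribution of feature1")
plt.show()

# Scatter plot: relationship between two features
plt.scatter(df["feature1"], df["feature2"])
plt.xlabel("feature1")
plt.ylabel("feature2")
plt.title("feature1 vs feature2")
plt.show()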
4. Check linear relationship between two features
Plot Feature 1 vs Feature 2 to Check Linearity: We specifically pick two important
features and create a scatter plot to check if there is a linear relationship between
them. A clear linear pattern suggests that models like Linear Regression or SVM
may work better.
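Besides the visual check, the Pearson correlation coefficient can quantify how linear the relationship is (a sketch using the same illustrative column names):

# Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one
correlation = df["feature1"].corr(df["feature2"])
print("Correlation between feature1 and feature2:", correlation)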
5. Split the dataset into 70% training and 30% testing data
Random Split: 70% Train and 30% Test: We split the data into training and testing sets to evaluate model performance properly. 70% of the data is randomly selected as the training set, and the remaining 30% is held out as the test set. This ensures that the model learns on one part of the data and is evaluated on unseen data.
train_test_split from sklearn.model_selection is used to split a dataset into training and testing subsets, as shown in the sketch below.
· train_test_split() randomly divides New_Data into:
· 70% training data (train_data)
· 30% test data (test_data)
· test_size=0.3 → 30% of the data is used for testing.
· random_state=42 ensures that the split is reproducible (same split every time you run it).
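A minimal sketch of the split (New_Data stands for the DataFrame loaded earlier; the variable names are illustrative):

from sklearn.model_selection import train_test_split

# 70% of the rows go to train_data and 30% to test_data; random_state fixes the shuffle
train_data, test_data = train_test_split(New_Data, test_size=0.3, random_state=42)
print("Training set shape:", train_data.shape)
print("Test set shape:", test_data.shape)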
Practical Number - 02
Aim: Solving a regression problem using machine learning: finding the least
squares regression line, calculating correlation, plotting, and predicting values.
Description:
You work with two variables (Driving Experience vs. Insurance Premium) to
perform a simple linear regression. This involves calculating statistical summaries
(like SSxx, SSyy, SSxy), finding the regression line equation, plotting the scatter
diagram and regression line, computing correlation coefficients, predicting values,
estimating standard deviation of errors, and constructing a 90% confidence interval
for the slope.
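For reference, the standard least squares formulas that the code below computes numerically are:

SSxx = Σ(xᵢ − x̄)²,  SSyy = Σ(yᵢ − ȳ)²,  SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ)
slope b = SSxy / SSxx,  intercept a = ȳ − b·x̄
correlation r = SSxy / √(SSxx · SSyy),  R² = r²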
Python Code:
1. Importing Important Libraries and Loading Datasets
In any data analysis or machine learning task, the first step is to import the
necessary libraries and load the datasets for analysis.
1.1 Libraries Used:
· NumPy: NumPy (Numerical Python) is a powerful library for numerical computing in Python. It provides support for arrays, matrices, and a large number of mathematical functions to operate on these arrays. Imported as: import numpy as np
· Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Here, we import the pyplot module for plotting graphs. Imported as: import matplotlib.pyplot as plt
· SciPy (stats module): SciPy is an open-source Python library used for scientific and technical computing. Its stats module contains a large number of probability distributions and statistical functions. Imported as: import scipy.stats as stats
1.2 Loading the Dataset:
In this step, we define two arrays:
· X: Represents Driving Experience in years.
· Y: Represents the Monthly Auto Insurance Premium (in dollars) for each driver.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Given data
X = np.array([5, 2, 12, 9, 15, 6, 25, 16]) # Driving Experience (years)
Y = np.array([64, 87, 50, 71, 44, 56, 42, 60])  # Monthly Auto Insurance Premium ($)
(2.) Compute SSₓₓ, SSᵧᵧ, and SSₓᵧ
x_mean = np.mean(X)
y_mean = np.mean(Y)
print("X_Mean:", x_mean)
print("Y_Mean:", y_mean)

# Compute SSxx, SSyy, SSxy
SSxx = np.sum((X - x_mean) ** 2)
SSyy = np.sum((Y - y_mean) ** 2)
SSxy = np.sum((X - x_mean) * (Y - y_mean))

print("Value of SSxx:", SSxx)
print("Value of SSyy:", SSyy)
print("Value of SSxy:", SSxy)
OUTPUT:
(3.) Find the least squares regression line by choosing appropriate dependent
and independent variables based on your answer to part 1.
# Compute regression coefficients
b = SSxy / SSxx          # Slope
a = y_mean - b * x_mean  # Intercept

print("Value of a:", a)
print("Value of b:", b)

# Print regression equation
print(f"Least Squares Regression Line: Y = {a:.2f} + {b:.2f}X")
OUTPUT:
(4.)Plots the scatter diagram and regression line
# Compute predicted values on the fitted line (needed for plotting)
y_predicted = a + b * X

# Plot scatter plot and regression line
plt.scatter(X, Y, color='blue', label='Data Points')
plt.plot(X, y_predicted, color='red', label=f'Regression Line: Y = {a:.2f} + {b:.2f}X')
plt.xlabel("Driving Experience (years)")
plt.ylabel("Monthly Auto Insurance Premium ($)")
plt.title("Linear Regression: Driving Experience vs Insurance Premium")
plt.legend()
plt.show()
OUTPUT:
(5.)Computes correlation coefficient (r) and R²
# Compute correlation coefficient (r) and coefficient of determination (R²)
r = SSxy / np.sqrt(SSxx * SSyy)
r_squared = r ** 2
print(f"Correlation Coefficient (r): {r:.4f}")
print(f"Coefficient of Determination (R²): {r_squared:.4f}")
OUTPUT:
(6.)Predicts insurance premium for 10 years of driving experience
# Predict insurance premium for 10 years of experience
x_pred = 10
y_pred = a + b * x_pred
print(f"Predicted insurance premium for 10 years of experience: ${y_pred:.2f}")
OUTPUT:
(7.)Computes standard deviation of errors
# Compute standard deviation of errors
y_predicted = a + b * X
residuals = Y - y_predicted
s_e = np.sqrt(np.sum(residuals ** 2) / (len(X) - 2))
print(f"Standard Deviation of Errors: {s_e:.2f}")
OUTPUT:
(8.)Constructs a 90% confidence interval for B (slope)
# Compute 90% confidence interval for B
n = len(X)
SE_b = s_e / np.sqrt(SSxx)           # Standard error of slope
t_value = stats.t.ppf(0.95, df=n-2)  # t critical value for a 90% confidence level
conf_interval = (b - t_value * SE_b, b + t_value * SE_b)
print(f"90% Confidence Interval for B: ({conf_interval[0]:.4f}, {conf_interval[1]:.4f})")
OUTPUT:
Practical Number - 03
Aim: Applying regression analysis to a given dataset and performing hypothesis
testing.
Description:
This task continues the regression work: calculating regression parameters, finding
correlation, plotting data, predicting cholesterol levels, calculating the standard
deviation of errors, constructing confidence intervals for the slope, and hypothesis
testing (both on the slope and the correlation coefficient) at the 5% and 2.5%
significance levels.
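For reference, the test statistics the code below computes are the standard ones for simple linear regression (stated here for clarity):

t for the slope:        t = b / SE_B,  where SE_B = s / √SS_xx
t for the correlation:  t = r / √((1 − r²) / (n − 2))

Both use n − 2 degrees of freedom; H0 is rejected when t exceeds the one-tailed critical value at the chosen significance level.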
Python Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
x = np.array([58, 69, 43, 39, 63, 52, 47, 31, 74, 36])             # Age (years)
y = np.array([189, 235, 193, 177, 154, 191, 213, 165, 198, 181])   # Cholesterol level
n = len(x)
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_x2 = np.sum(x**2)
sum_y2 = np.sum(y**2)
sum_xy = np.sum(x * y)
# Compute SS_xx, SS_yy, SS_xy
SS_xx = sum_x2 - (sum_x**2) / n
SS_yy = sum_y2 - (sum_y**2) / n
SS_xy = sum_xy - (sum_x * sum_y) / n
print(f"SS_xx = {SS_xx:.2f}, SS_yy = {SS_yy:.2f}, SS_xy = {SS_xy:.2f}")
# Find the regression line: y = a + b*x
b = SS_xy / SS_xx
a = (sum_y / n) - b * (sum_x / n)
print(f"Regression Equation: Y = {a:.2f} + ({b:.2f}X)")
# Calculate r and r^2
r = SS_xy / np.sqrt(SS_xx * SS_yy)
r_squared = r**2
print(f"Correlation Coefficient (r): {r:.3f}")
print(f"Coefficient of Determination (r^2): {r_squared:.3f}")
# Plot the scatter diagram and regression line
x_range = np.linspace(min(x), max(x), 100)
y_line = a + b * x_range
plt.scatter(x, y, color='blue', label='Data Points')
plt.plot(x_range, y_line, color='red', label='Regression Line')
plt.xlabel("Age") plt.ylabel("Cholesterol Level")
plt.title("Regression: Age vs Cholesterol") plt.legend()
plt.grid() plt.show()
# Predict cholesterol for a 60-year-old man
age_pred = 60
cholesterol_pred = a + b * age_pred
print(f"Predicted cholesterol level for a 60-year-old man: {cholesterol_pred:.2f}")

# Standard deviation of errors
y_hat = a + b * x
residuals = y - y_hat
sum_squared_errors = np.sum(residuals**2)
s = np.sqrt(sum_squared_errors / (n - 2))
print(f"Standard Deviation of Errors: {s:.3f}")
# 95% Confidence interval for B
SE_B = s / np.sqrt(SS_xx)
alpha = 0.05
t_critical_95 = t.ppf(1 - alpha/2, df=n-2)
lower_bound = b - t_critical_95 * SE_B
upper_bound = b + t_critical_95 * SE_B
print(f"95% Confidence Interval for B: ({lower_bound:.3f}, {upper_bound:.3f})")

# Test at 5% significance whether B is positive
t_stat_b = b / SE_B
t_critical_b = t.ppf(1 - 0.05, df=n-2)
print(f"t-statistic for B: {t_stat_b:.3f}")
print(f"Critical t-value (one-tailed, alpha=0.05): {t_critical_b:.3f}")
if t_stat_b > t_critical_b:
    print("Reject H0: There is sufficient evidence that B is positive.")
else:
    print("Fail to reject H0: Insufficient evidence that B is positive.")
# Test if correlation coefficient r is positive at α = 0.025
SE_r = np.sqrt((1 - r**2) / (n - 2))
t_stat_r = r / SE_r
t_critical_r = t.ppf(1 - 0.025, df=n-2)
print(f"t-statistic for r: {t_stat_r:.3f}")
print(f"Critical t-value (one-tailed, alpha=0.025): {t_critical_r:.3f}")
if t_stat_r > t_critical_r:
    print("Reject H0: There is sufficient evidence that the correlation coefficient is positive.")
else:
    print("Fail to reject H0: Insufficient evidence that the correlation coefficient is positive.")
OUTPUT:
Practical Number - 04
Aim: Perform regression on a random dataset using machine learning libraries.
Description:
A real-world dataset (Experience vs. Salary) is used. The dataset is split into
training and testing sets, a Linear Regression model is fitted, predictions are made,
and errors (Root Mean Square Error, Mean Absolute Error) are calculated.
Visualizations like histogram plots and Actual vs Predicted scatter plots are
generated to analyze model performance.
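For reference, the error metrics reported by the code below are (ŷ denotes a predicted value, y the actual value, n the number of test samples):

RMSE = √( (1/n) · Σ(yᵢ − ŷᵢ)² )
MAE  = (1/n) · Σ|yᵢ − ŷᵢ|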
Python Code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load your dataset
df = pd.read_csv("Experience-Salary.csv") # <-- your path
print(df.head())
# Columns in this dataset: "exp(in months)" (feature) and "salary(in thousands)" (target)
target = df['salary(in thousands)']
data = df['exp(in months)']
# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(
data.values.reshape(-1, 1), target, train_size=0.7, random_state=0)
print("X_train.shape, Y_train.shape", X_train.shape, Y_train.shape)
print("X_test.shape, Y_test.shape", X_test.shape, Y_test.shape)
# Model Training
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Parameters
print("Slope:", regressor.coef_)
print("y_Intercept:", regressor.intercept_)
# Predictions
y_test_pred = regressor.predict(X_test)
temp_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_test_pred})
print(temp_df)
# Histograms
sns.histplot(Y_test, color='red', alpha=0.5, bins=5)
sns.histplot(y_test_pred, color='blue', alpha=0.5, bins=5)
plt.show()
# Normalization (illustration only: the scaled values are printed but not used by the model above)
scaler = MinMaxScaler(feature_range=(0, 1))
normalized_data = scaler.fit_transform(df)
print("Normalized Data:\n", normalized_data)
# Error Metrics
print("Root Mean Square Error =", np.sqrt(metrics.mean_squared_error(Y_test, y_test_pred)))
print("Mean Absolute Error =", metrics.mean_absolute_error(Y_test, y_test_pred))
print("R2 Score:", metrics.r2_score(Y_test, y_test_pred))
# Scatter Plot (Actual vs Predicted)
plt.scatter(Y_test, y_test_pred, color='blue', marker='o', alpha=0.7, edgecolors='black')
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], '--', color='red')
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual VS Predicted Salary")
plt.show()
OUTPUT:
Practical Number - 05
Aim: Perform binary classification on a random dataset using SVM (Support
Vector Machine).
Description:
This practical covers the complete classification pipeline: loading a dataset related
to Breast cancer detection, preprocessing it (handling missing values and encoding
categorical variables), splitting into train/test sets, training an SVM classifier, and
evaluating its performance using metrics like accuracy, precision, recall, F1 score,
and ROC-AUC curve.
DATASET:
Python Code:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, f1_score, roc_curve, auc)
# Load dataset
file_path = "data.csv" # Change to your actual file
df = pd.read_csv(file_path)
# Drop 'id' and unnecessary unnamed columns
df_cleaned = df.drop(columns=["id"])
df_cleaned = df_cleaned.loc[:, ~df_cleaned.columns.str.contains('^Unnamed')]
# Handle missing values
print("Missing values before handling:\n", df_cleaned.isnull().sum())
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
df_cleaned[numeric_cols] = df_cleaned[numeric_cols].fillna(df_cleaned[numeric_cols].median())
# Encode 'diagnosis' column (M=1, B=0)
if 'diagnosis' in df_cleaned.columns:
    df_cleaned['diagnosis'] = df_cleaned['diagnosis'].map({'M': 1, 'B': 0})
# Confirm no missing values
print("Missing values after handling:\n", df_cleaned.isnull().sum())
# Separate features and target
X = df_cleaned.drop(columns=["diagnosis"])
y = df_cleaned["diagnosis"]
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train SVM model
model = SVC(kernel='linear', probability=True)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
y_probs = model.predict_proba(X_test)[:, 1]
# Evaluation metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall: {rec:.3f}")
print(f"F1 Score: {f1:.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, label=f"SVM (AUC = {roc_auc:.2f})", color='darkorange')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()
OUTPUT:
Practical Number - 06
Aim: Train SVM on raw, normalized, and balanced data. Analyze accuracy,
precision, recall, and F1-score.
Description:
Here, the impact of data preprocessing is studied. The SVM model is trained on
raw, normalized, and oversampled (balanced) data to handle class imbalance.
Model evaluation is done using performance metrics, and a comparison table
summarizes the effects of preprocessing on the model's effectiveness.
DATASET:
Python Code:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
import matplotlib.pyplot as plt
# Load dataset
data = pd.read_csv('data.csv')  # <-- set your correct file path here
# Drop 'id' and unnamed columns if exist
data = data.drop(columns=['id'], errors='ignore')
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]
# Handle missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].median())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Confirm clean dataset
print("Dataset after cleaning:\n", data.head())
# Separate features and target
X = data.drop(columns=['diagnosis']) # Change 'diagnosis' to your target column if different
y = data['diagnosis']
# Dictionary to store results
results = {}
# Function to train and evaluate SVM
def train_evaluate_svm(X, y, description):
    print(f"\nTraining SVM on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = SVC(kernel='linear', probability=True)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# 1. SVM on Raw Data
train_evaluate_svm(X, y, "Raw Data")
# 2. Normalize the data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
# SVM on Normalized Data
train_evaluate_svm(X_normalized, y, "Normalized Data")
# 3. Balance the raw data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
# SVM on Balanced Data
train_evaluate_svm(X_balanced, y_balanced, "Balanced Data")
# Final results table
final_results = pd.DataFrame(results).T
print("\nFinal Comparison Table:")
print(final_results)
# Optional: Plot Accuracy, Precision, Recall, F1-Score
final_results.plot(kind='bar', figsize=(10,6))
plt.title('Comparison of SVM Models')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.grid()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
Practical Number - 07
Aim: Train a KNN classifier under three conditions (raw, normalized, balanced
data) and compare results with SVM.
DESCRIPTION:
This practical applies both algorithms (KNN and SVM) to the same dataset and
compares how their performance varies across the different preprocessing conditions.
DATASET:
Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
# Load the dataset
data = pd.read_csv("diabetes.csv")
# Fill missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].mean())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Dictionaries to store results
knn_results = {}
svm_results = {}
# Function to train and evaluate model
def train_evaluate(model, X, y, description, results_dict):
    print(f"\nTraining {model.__class__.__name__} on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results_dict[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# ------------------- Apply SVM -------------------
# Raw Data
train_evaluate(SVC(), X, y, "Raw Data (SVM)", svm_results)
# Normalized Data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
train_evaluate(SVC(), X_normalized, y, "Normalized Data (SVM)", svm_results)
# Balanced Data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
train_evaluate(SVC(), X_balanced, y_balanced, "Balanced Data (SVM)", svm_results)
# ------------------- Apply KNN -------------------
# Raw Data
train_evaluate(KNeighborsClassifier(), X, y, "Raw Data (KNN)", knn_results)
# Normalized Data
train_evaluate(KNeighborsClassifier(), X_normalized, y, "Normalized Data (KNN)", knn_results)
# Balanced and Normalized Data
X_balanced_normalized = scaler.fit_transform(X_balanced) # Normalize balanced data
train_evaluate(KNeighborsClassifier(), X_balanced_normalized, y_balanced, "Balanced Data (KNN)",
knn_results)
# ------------------- Results -------------------
# KNN Results
print("\nKNN Results Table:")
print(pd.DataFrame(knn_results).T)
# SVM Results
print("\nSVM Results Table:")
print(pd.DataFrame(svm_results).T)
# Comparison Table: SVM vs KNN (Accuracy only)
comparison_df = pd.DataFrame({
'SVM Accuracy': [
svm_results['Raw Data (SVM)']["Accuracy"],
svm_results['Normalized Data (SVM)']["Accuracy"],
svm_results['Balanced Data (SVM)']["Accuracy"]
],
'KNN Accuracy': [
knn_results['Raw Data (KNN)']["Accuracy"],
knn_results['Normalized Data (KNN)']["Accuracy"],
knn_results['Balanced Data (KNN)']["Accuracy"]
]
}, index=["Raw Data", "Normalized Data", "Balanced Data"])
print("\nComparison between SVM and KNN (Accuracy only):")
print(comparison_df)
OUTPUT:
Practical Number - 08
AIM: Train Random Forest on raw, normalised and balanced data on a random
dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
DESCRIPTION:
This practical trains a Random Forest classifier on raw, normalized, and balanced
versions of the dataset, evaluates accuracy, precision, recall, and F1-score, and
compares the results with SVM, showing how an ensemble method can outperform
SVM in some cases.
Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
# Load the dataset
data = pd.read_csv("diabetes.csv")
# Fill missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].mean())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Dictionaries to store results
rf_results = {}
svm_results = {}
# Function to train and evaluate model
def train_evaluate(model, X, y, description, results_dict):
    print(f"\nTraining {model.__class__.__name__} on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results_dict[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# ------------------- Apply SVM -------------------
# Raw Data
train_evaluate(SVC(), X, y, "Raw Data (SVM)", svm_results)
# Normalized Data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
train_evaluate(SVC(), X_normalized, y, "Normalized Data (SVM)", svm_results)
# Balanced Data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
train_evaluate(SVC(), X_balanced, y_balanced, "Balanced Data (SVM)", svm_results)
# ------------------- Apply Random Forest -------------------
# Raw Data
train_evaluate(RandomForestClassifier(random_state=42), X, y, "Raw Data (Random Forest)",
rf_results)
# Normalized Data
train_evaluate(RandomForestClassifier(random_state=42), X_normalized, y, "Normalized Data
(Random Forest)", rf_results)
# Balanced and Normalized Data
X_balanced_normalized = scaler.fit_transform(X_balanced)
train_evaluate(RandomForestClassifier(random_state=42), X_balanced_normalized, y_balanced,
"Balanced Data (Random Forest)", rf_results)
# ------------------- Final Results -------------------
# Random Forest Results
print("\nRandom Forest Results Table:")
print(pd.DataFrame(rf_results).T)
# SVM Results
print("\nSVM Results Table:")
print(pd.DataFrame(svm_results).T)
# Comparison Table: SVM vs Random Forest (Accuracy only)
comparison_df = pd.DataFrame({
'SVM Accuracy': [
svm_results['Raw Data (SVM)']["Accuracy"],
svm_results['Normalized Data (SVM)']["Accuracy"],
svm_results['Balanced Data (SVM)']["Accuracy"]
],
'Random Forest Accuracy': [
rf_results['Raw Data (Random Forest)']["Accuracy"],
rf_results['Normalized Data (Random Forest)']["Accuracy"],
rf_results['Balanced Data (Random Forest)']["Accuracy"]
]
}, index=["Raw Data", "Normalized Data", "Balanced Data"])
print("\nComparison between SVM and Random Forest (Accuracy only):")
print(comparison_df)
OUTPUT: