Practical File
Machine Learning
B.E. (IT) 6th Semester
Submitted To: Dr. Mandeep Kaur
Submitted By: Pranav Kumar (UEM228120)
IT Section-2
Department of Information Technology
University Institute of Engineering and Technology
Panjab University Chandigarh
INDEX

S.No.  Practical Name
1. Introduction to the steps of machine learning
2. Solving a regression problem using machine learning: finding the least squares regression line, calculating correlation, plotting, and predicting values.
3. Applying regression analysis to a given dataset and performing hypothesis testing.
4. Perform regression on a random dataset using machine learning libraries.
5. Perform binary classification on a random dataset using SVM (Support Vector Machine).
6. Train SVM on raw, normalized, and balanced data. Analyze accuracy, precision, recall, and F1-score.
7. Train KNN on raw, normalised and balanced data on a random dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
8. Train Random Forest on raw, normalised and balanced data on a random dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
Practical Number - 01
AIM: Introduction to the steps of machine learning
Description:
This practical introduces the basic steps in a machine learning workflow: reading a
CSV file, performing descriptive data exploration (head and summary statistics),
plotting data distributions using histograms and scatter plots, checking linear
relationships between features, and splitting the dataset into training and testing
sets (70% training, 30% testing) using train_test_split from scikit-learn.
1. Read a CSV file
Read data/csv file: We start by importing the dataset into our Python environment.
The dataset is usually in CSV (Comma-Separated Values) format. Reading the data
is the first step to begin any kind of analysis or modeling.
You can read a CSV file in Python using the pandas library. Here’s how you can do
it:
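A minimal sketch (the file name dataset.csv is illustrative; substitute your own CSV path):

import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("dataset.csv")

# Check how many rows and columns were loaded
print(df.shape)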
2. Perform descriptive exploration (head, summary statistics)
Descriptive Exploration of Data: Descriptive exploration involves getting a basic
understanding of the dataset — such as how many rows and columns it has, what
type of data each column contains, and checking for missing or unusual values.
View First Few Rows (head): The .head() function lets us see the first few rows of
the dataset. This gives us a quick look at the data structure and values.
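A short sketch of these exploration steps, assuming the DataFrame df loaded above:

# First five rows of the dataset
print(df.head())

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())

# Column data types, non-null counts, and missing values per column
print(df.info())
print(df.isnull().sum())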
3. Plot feature distributions (histograms, scatter plots)
Plot Feature Relationships (Scatter Plots): Scatter plots help us visually analyze the
relationship between two variables. If the points are close to a straight line, there
may be a linear relationship.
Histogram Graph
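A minimal plotting sketch, assuming df has two numeric columns; the column names feature1 and feature2 are illustrative placeholders:

import matplotlib.pyplot as plt

# Histogram: distribution of a single feature
df["feature1"].plot(kind="hist", bins=20, title="Distribution of feature1")
plt.show()

# Scatter plot: relationship between two features
plt.scatter(df["feature1"], df["feature2"])
plt.xlabel("feature1")
plt.ylabel("feature2")
plt.title("feature1 vs feature2")
plt.show()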
4. Check linear relationship between two features
Plot Feature 1 vs Feature 2 to Check Linearity: We specifically pick two important
features and create a scatter plot to check if there is a linear relationship between
them. A clear linear pattern suggests that models like Linear Regression or SVM
may work better.
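Besides the visual check, the Pearson correlation coefficient can quantify how linear the relationship is (a sketch using the same illustrative column names):

# Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one
correlation = df["feature1"].corr(df["feature2"])
print("Correlation between feature1 and feature2:", correlation)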
5. Split the dataset into 70% training and 30% testing data
Random Split: 70% Train and 30% Test: We split the data into training and testing sets to evaluate model performance properly. 70% of the data is randomly selected as the training set, and the remaining 30% is held out as the test set. This ensures that the model learns on one part of the data and is evaluated on unseen data.
train_test_split from sklearn.model_selection is used to split a dataset into training and testing subsets, as shown in the sketch below.
· train_test_split() randomly divides New_Data into:
· 70% training data (train_data)
· 30% test data (test_data)
· test_size=0.3 → 30% of the data is used for testing.
· random_state=42 ensures that the split is reproducible (same split every time you run it).
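A minimal sketch of the split (New_Data stands for the DataFrame loaded earlier; the variable names are illustrative):

from sklearn.model_selection import train_test_split

# 70% of the rows go to train_data and 30% to test_data; random_state fixes the shuffle
train_data, test_data = train_test_split(New_Data, test_size=0.3, random_state=42)
print("Training set shape:", train_data.shape)
print("Test set shape:", test_data.shape)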
Practical Number - 02
Aim: Solving a regression problem using machine learning: finding the least
squares regression line, calculating correlation, plotting, and predicting values.
Description:
You work with two variables (Driving Experience vs. Insurance Premium) to
perform a simple linear regression. This involves calculating statistical summaries
(like SSxx, SSyy, SSxy), finding the regression line equation, plotting the scatter
diagram and regression line, computing correlation coefficients, predicting values,
estimating standard deviation of errors, and constructing a 90% confidence interval
for the slope.
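For reference, the standard least squares formulas that the code below computes numerically are:

SSxx = Σ(xᵢ − x̄)²,  SSyy = Σ(yᵢ − ȳ)²,  SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ)
slope b = SSxy / SSxx,  intercept a = ȳ − b·x̄
correlation r = SSxy / √(SSxx · SSyy),  R² = r²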
Python Code:
1. Importing Important Libraries and Loading Datasets
In any data analysis or machine learning task, the first step is to import the
necessary libraries and load the datasets for analysis.
1.1 Libraries Used:
· NumPy: NumPy (Numerical Python) is a powerful library for numerical computing in Python. It provides support for arrays, matrices, and a large number of mathematical functions to operate on these arrays. Imported as: import numpy as np
· Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Here, we import the pyplot module for plotting graphs. Imported as: import matplotlib.pyplot as plt
· SciPy (stats module): SciPy is an open-source Python library used for scientific and technical computing. Its stats module contains a large number of probability distributions and statistical functions. Imported as: import scipy.stats as stats
1.2 Loading the Dataset:
In this step, we define two arrays:
· X: Represents Driving Experience in years.
· Y: Represents the Monthly Auto Insurance Premium (in dollars) for each driver.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Given data
X = np.array([5, 2, 12, 9, 15, 6, 25, 16]) # Driving Experience (years)
Y = np.array([64, 87, 50, 71, 44, 56, 42, 60])  # Monthly Auto Insurance Premium ($)
(2.) Compute SSₓₓ, SSᵧᵧ, and SSₓᵧ
x_mean = np.mean(X)
y_mean = np.mean(Y)
print("X_Mean:", x_mean)
print("Y_Mean:", y_mean)

# Compute SSxx, SSyy, SSxy
SSxx = np.sum((X - x_mean) ** 2)
SSyy = np.sum((Y - y_mean) ** 2)
SSxy = np.sum((X - x_mean) * (Y - y_mean))

print("Value of SSxx:", SSxx)
print("Value of SSyy:", SSyy)
print("Value of SSxy:", SSxy)
OUTPUT:
(3.) Find the least squares regression line by choosing appropriate dependent
and independent variables based on your answer to part 1.
# Compute regression coefficients
b = SSxy / SSxx          # Slope
a = y_mean - b * x_mean  # Intercept

print("Value of a:", a)
print("Value of b:", b)

# Print regression equation
print(f"Least Squares Regression Line: Y = {a:.2f} + {b:.2f}X")
OUTPUT:
(4.)Plots the scatter diagram and regression line
# Compute predicted values on the fitted line (needed for plotting)
y_predicted = a + b * X

# Plot scatter plot and regression line
plt.scatter(X, Y, color='blue', label='Data Points')
plt.plot(X, y_predicted, color='red', label=f'Regression Line: Y = {a:.2f} + {b:.2f}X')
plt.xlabel("Driving Experience (years)")
plt.ylabel("Monthly Auto Insurance Premium ($)")
plt.title("Linear Regression: Driving Experience vs Insurance Premium")
plt.legend()
plt.show()
OUTPUT:
(5.)Computes correlation coefficient (r) and R²
# Compute correlation coefficient (r) and coefficient of determination (R²)
r = SSxy / np.sqrt(SSxx * SSyy)
r_squared = r ** 2
print(f"Correlation Coefficient (r): {r:.4f}")
print(f"Coefficient of Determination (R²): {r_squared:.4f}")
OUTPUT:
(6.)Predicts insurance premium for 10 years of driving experience
# Predict insurance premium for 10 years of experience
x_pred = 10
y_pred = a + b * x_pred
print(f"Predicted insurance premium for 10 years of experience: ${y_pred:.2f}")
OUTPUT:
(7.)Computes standard deviation of errors
# Compute standard deviation of errors
y_predicted = a + b * X
residuals = Y - y_predicted
s_e = np.sqrt(np.sum(residuals ** 2) / (len(X) - 2))
print(f"Standard Deviation of Errors: {s_e:.2f}")
OUTPUT:
(8.)Constructs a 90% confidence interval for B (slope)
# Compute 90% confidence interval for B
n = len(X)
SE_b = s_e / np.sqrt(SSxx)           # Standard error of slope
t_value = stats.t.ppf(0.95, df=n-2)  # t critical value for a 90% confidence level
conf_interval = (b - t_value * SE_b, b + t_value * SE_b)
print(f"90% Confidence Interval for B: ({conf_interval[0]:.4f}, {conf_interval[1]:.4f})")
OUTPUT:
Practical Number - 03
Aim: Applying regression analysis to a given dataset and performing hypothesis
testing.
Description:
This task continues the regression work: calculating regression parameters, finding
correlation, plotting data, predicting cholesterol levels, calculating the standard
deviation of errors, constructing confidence intervals for the slope, and hypothesis
testing (both on the slope and the correlation coefficient) at the 5% and 2.5%
significance levels.
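For reference, the test statistics the code below computes are the standard ones for simple linear regression (stated here for clarity):

t for the slope:        t = b / SE_B,  where SE_B = s / √SS_xx
t for the correlation:  t = r / √((1 − r²) / (n − 2))

Both use n − 2 degrees of freedom; H0 is rejected when t exceeds the one-tailed critical value at the chosen significance level.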
Python Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
x = np.array([58, 69, 43, 39, 63, 52, 47, 31, 74, 36])             # Age (years)
y = np.array([189, 235, 193, 177, 154, 191, 213, 165, 198, 181])   # Cholesterol level
n = len(x)
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_x2 = np.sum(x**2)
sum_y2 = np.sum(y**2)
sum_xy = np.sum(x * y)
# Compute SS_xx, SS_yy, SS_xy
SS_xx = sum_x2 - (sum_x**2) / n
SS_yy = sum_y2 - (sum_y**2) / n
SS_xy = sum_xy - (sum_x * sum_y) / n
print(f"SS_xx = {SS_xx:.2f}, SS_yy = {SS_yy:.2f}, SS_xy = {SS_xy:.2f}")
# Find the regression line: y = a + b*x
b = SS_xy / SS_xx
a = (sum_y / n) - b * (sum_x / n)
print(f"Regression Equation: Y = {a:.2f} + ({b:.2f}X)")
# Calculate r and r^2
r = SS_xy / np.sqrt(SS_xx * SS_yy)
r_squared = r**2
print(f"Correlation Coefficient (r): {r:.3f}")
print(f"Coefficient of Determination (r^2): {r_squared:.3f}")
# Plot the scatter diagram and regression line
x_range = np.linspace(min(x), max(x), 100)
y_line = a + b * x_range
plt.scatter(x, y, color='blue', label='Data Points')
plt.plot(x_range, y_line, color='red', label='Regression Line')
plt.xlabel("Age") plt.ylabel("Cholesterol Level")
plt.title("Regression: Age vs Cholesterol") plt.legend()
plt.grid() plt.show()
# Predict cholesterol for a 60-year-old man
age_pred = 60
cholesterol_pred = a + b * age_pred
print(f"Predicted cholesterol level for a 60-year-old man: {cholesterol_pred:.2f}")

# Standard deviation of errors
y_hat = a + b * x
residuals = y - y_hat
sum_squared_errors = np.sum(residuals**2)
s = np.sqrt(sum_squared_errors / (n - 2))
print(f"Standard Deviation of Errors: {s:.3f}")
# 95% Confidence interval for B
SE_B = s / np.sqrt(SS_xx)
alpha = 0.05
t_critical_95 = t.ppf(1 - alpha/2, df=n-2)
lower_bound = b - t_critical_95 * SE_B
upper_bound = b + t_critical_95 * SE_B
print(f"95% Confidence Interval for B: ({lower_bound:.3f}, {upper_bound:.3f})")

# Test at 5% significance whether B is positive
t_stat_b = b / SE_B
t_critical_b = t.ppf(1 - 0.05, df=n-2)
print(f"t-statistic for B: {t_stat_b:.3f}")
print(f"Critical t-value (one-tailed, alpha=0.05): {t_critical_b:.3f}")
if t_stat_b > t_critical_b:
    print("Reject H0: There is sufficient evidence that B is positive.")
else:
    print("Fail to reject H0: Insufficient evidence that B is positive.")
# Test if correlation coefficient r is positive at α = 0.025
SE_r = np.sqrt((1 - r**2) / (n - 2))
t_stat_r = r / SE_r
t_critical_r = t.ppf(1 - 0.025, df=n-2)
print(f"t-statistic for r: {t_stat_r:.3f}")
print(f"Critical t-value (one-tailed, alpha=0.025): {t_critical_r:.3f}")
if t_stat_r > t_critical_r:
    print("Reject H0: There is sufficient evidence that the correlation coefficient is positive.")
else:
    print("Fail to reject H0: Insufficient evidence that the correlation coefficient is positive.")
OUTPUT:
Practical Number - 04
Aim: Perform regression on a random dataset using machine learning libraries.
Description:
A real-world dataset (Experience vs. Salary) is used. The dataset is split into
training and testing sets, a Linear Regression model is fitted, predictions are made,
and errors (Root Mean Square Error, Mean Absolute Error) are calculated.
Visualizations like histogram plots and Actual vs Predicted scatter plots are
generated to analyze model performance.
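For reference, the error metrics reported by the code below are (ŷ denotes a predicted value, y the actual value, n the number of test samples):

RMSE = √( (1/n) · Σ(yᵢ − ŷᵢ)² )
MAE  = (1/n) · Σ|yᵢ − ŷᵢ|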
Python Code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load your dataset
df = pd.read_csv("Experience-Salary.csv") # <-- your path
print(df.head())
# Columns in this dataset: "exp(in months)" (feature) and "salary(in thousands)" (target)
target = df['salary(in thousands)']
data = df['exp(in months)']
# Train-Test Split
X_train, X_test, Y_train, Y_test = train_test_split(
data.values.reshape(-1, 1), target, train_size=0.7, random_state=0)
print("X_train.shape, Y_train.shape", X_train.shape, Y_train.shape)
print("X_test.shape, Y_test.shape", X_test.shape, Y_test.shape)
# Model Training
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Parameters
print("Slope:", regressor.coef_)
print("y_Intercept:", regressor.intercept_)
# Predictions
y_test_pred = regressor.predict(X_test)
temp_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_test_pred})
print(temp_df)
# Histograms
sns.histplot(Y_test, color='red', alpha=0.5, bins=5)
sns.histplot(y_test_pred, color='blue', alpha=0.5, bins=5)
plt.show()
# Normalization (illustration only: the scaled values are printed but not used by the model above)
scaler = MinMaxScaler(feature_range=(0, 1))
normalized_data = scaler.fit_transform(df)
print("Normalized Data:\n", normalized_data)
# Error Metrics
print("Root Mean Square Error =", np.sqrt(metrics.mean_squared_error(Y_test, y_test_pred)))
print("Mean Absolute Error =", metrics.mean_absolute_error(Y_test, y_test_pred))
print("R2 Score:", metrics.r2_score(Y_test, y_test_pred))
# Scatter Plot (Actual vs Predicted)
plt.scatter(Y_test, y_test_pred, color='blue', marker='o', alpha=0.7, edgecolors='black')
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], '--', color='red')
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual VS Predicted Salary")
plt.show()
OUTPUT:
Practical Number - 05
Aim: Perform binary classification on a random dataset using SVM (Support
Vector Machine).
Description:
This practical covers the complete classification pipeline: loading a dataset related
to Breast cancer detection, preprocessing it (handling missing values and encoding
categorical variables), splitting into train/test sets, training an SVM classifier, and
evaluating its performance using metrics like accuracy, precision, recall, F1 score,
and ROC-AUC curve.
DATASET:
Python Code:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, f1_score, roc_curve, auc)
# Load dataset
file_path = "data.csv" # Change to your actual file
df = pd.read_csv(file_path)
# Drop 'id' and unnecessary unnamed columns
df_cleaned = df.drop(columns=["id"])
df_cleaned = df_cleaned.loc[:, ~df_cleaned.columns.str.contains('^Unnamed')]
# Handle missing values
print("Missing values before handling:\n", df_cleaned.isnull().sum())
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
df_cleaned[numeric_cols] = df_cleaned[numeric_cols].fillna(df_cleaned[numeric_cols].median())
# Encode 'diagnosis' column (M=1, B=0)
if 'diagnosis' in df_cleaned.columns:
    df_cleaned['diagnosis'] = df_cleaned['diagnosis'].map({'M': 1, 'B': 0})
# Confirm no missing values
print("Missing values after handling:\n", df_cleaned.isnull().sum())
# Separate features and target
X = df_cleaned.drop(columns=["diagnosis"])
y = df_cleaned["diagnosis"]
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train SVM model
model = SVC(kernel='linear', probability=True)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
y_probs = model.predict_proba(X_test)[:, 1]
# Evaluation metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall: {rec:.3f}")
print(f"F1 Score: {f1:.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, label=f"SVM (AUC = {roc_auc:.2f})", color='darkorange')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()
OUTPUT:
Practical Number - 06
Aim: Train SVM on raw, normalized, and balanced data. Analyze accuracy,
precision, recall, and F1-score.
Description:
Here, the impact of data preprocessing is studied. The SVM model is trained on
raw, normalized, and oversampled (balanced) data to handle class imbalance.
Model evaluation is done using performance metrics, and a comparison table
summarizes the effects of preprocessing on the model's effectiveness.
DATASET:
Python Code:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
import matplotlib.pyplot as plt
# Load dataset
data = pd.read_csv('data.csv')  # <-- set your correct file path here
# Drop 'id' and unnamed columns if exist
data = data.drop(columns=['id'], errors='ignore')
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]
# Handle missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].median())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Confirm clean dataset
print("Dataset after cleaning:\n", data.head())
# Separate features and target
X = data.drop(columns=['diagnosis']) # Change 'diagnosis' to your target column if different
y = data['diagnosis']
# Dictionary to store results
results = {}
# Function to train and evaluate SVM
def train_evaluate_svm(X, y, description):
    print(f"\nTraining SVM on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = SVC(kernel='linear', probability=True)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# 1. SVM on Raw Data
train_evaluate_svm(X, y, "Raw Data")
# 2. Normalize the data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
# SVM on Normalized Data
train_evaluate_svm(X_normalized, y, "Normalized Data")
# 3. Balance the raw data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
# SVM on Balanced Data
train_evaluate_svm(X_balanced, y_balanced, "Balanced Data")
# Final results table
final_results = pd.DataFrame(results).T
print("\nFinal Comparison Table:")
print(final_results)
# Optional: Plot Accuracy, Precision, Recall, F1-Score
final_results.plot(kind='bar', figsize=(10,6))
plt.title('Comparison of SVM Models')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.grid()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
Practical Number - 07
Aim: Train a KNN classifier under three conditions (raw, normalized, balanced
data) and compare results with SVM.
DESCRIPTION:
This practical applies both algorithms (KNN and SVM) to the same dataset and
compares how their performance varies across the different preprocessing conditions.
DATASET:
Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
# Load the dataset
data = pd.read_csv("diabetes.csv")
# Fill missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].mean())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Dictionaries to store results
knn_results = {}
svm_results = {}
# Function to train and evaluate model
def train_evaluate(model, X, y, description, results_dict):
    print(f"\nTraining {model.__class__.__name__} on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results_dict[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# ------------------- Apply SVM -------------------
# Raw Data
train_evaluate(SVC(), X, y, "Raw Data (SVM)", svm_results)
# Normalized Data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
train_evaluate(SVC(), X_normalized, y, "Normalized Data (SVM)", svm_results)
# Balanced Data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
train_evaluate(SVC(), X_balanced, y_balanced, "Balanced Data (SVM)", svm_results)
# ------------------- Apply KNN -------------------
# Raw Data
train_evaluate(KNeighborsClassifier(), X, y, "Raw Data (KNN)", knn_results)
# Normalized Data
train_evaluate(KNeighborsClassifier(), X_normalized, y, "Normalized Data (KNN)", knn_results)
# Balanced and Normalized Data
X_balanced_normalized = scaler.fit_transform(X_balanced) # Normalize balanced data
train_evaluate(KNeighborsClassifier(), X_balanced_normalized, y_balanced, "Balanced Data (KNN)",
knn_results)
# ------------------- Results -------------------
# KNN Results
print("\nKNN Results Table:")
print(pd.DataFrame(knn_results).T)
# SVM Results
print("\nSVM Results Table:")
print(pd.DataFrame(svm_results).T)
# Comparison Table: SVM vs KNN (Accuracy only)
comparison_df = pd.DataFrame({
'SVM Accuracy': [
svm_results['Raw Data (SVM)']["Accuracy"],
svm_results['Normalized Data (SVM)']["Accuracy"],
svm_results['Balanced Data (SVM)']["Accuracy"]
],
'KNN Accuracy': [
knn_results['Raw Data (KNN)']["Accuracy"],
knn_results['Normalized Data (KNN)']["Accuracy"],
knn_results['Balanced Data (KNN)']["Accuracy"]
]
}, index=["Raw Data", "Normalized Data", "Balanced Data"])
print("\nComparison between SVM and KNN (Accuracy only):")
print(comparison_df)
OUTPUT:
Practical Number - 08
AIM: Train Random Forest on raw, normalised and balanced data on a random
dataset and compare it with SVM. Analyse accuracy, precision, recall and F1-score.
DESCRIPTION:
This practical trains a Random Forest classifier on raw, normalized, and balanced
versions of the dataset, evaluates accuracy, precision, recall, and F1-score, and
compares the results with SVM, showing how an ensemble method can outperform
SVM in some cases.
Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler
# Load the dataset
data = pd.read_csv("diabetes.csv")
# Fill missing values
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].mean())
# Encode categorical columns
for col in data.columns:
    if data[col].dtype == 'object':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Dictionaries to store results
rf_results = {}
svm_results = {}
# Function to train and evaluate model
def train_evaluate(model, X, y, description, results_dict):
    print(f"\nTraining {model.__class__.__name__} on {description}...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results_dict[description] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
# ------------------- Apply SVM -------------------
# Raw Data
train_evaluate(SVC(), X, y, "Raw Data (SVM)", svm_results)
# Normalized Data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
train_evaluate(SVC(), X_normalized, y, "Normalized Data (SVM)", svm_results)
# Balanced Data
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)
train_evaluate(SVC(), X_balanced, y_balanced, "Balanced Data (SVM)", svm_results)
# ------------------- Apply Random Forest -------------------
# Raw Data
train_evaluate(RandomForestClassifier(random_state=42), X, y, "Raw Data (Random Forest)",
rf_results)
# Normalized Data
train_evaluate(RandomForestClassifier(random_state=42), X_normalized, y, "Normalized Data
(Random Forest)", rf_results)
# Balanced and Normalized Data
X_balanced_normalized = scaler.fit_transform(X_balanced)
train_evaluate(RandomForestClassifier(random_state=42), X_balanced_normalized, y_balanced,
"Balanced Data (Random Forest)", rf_results)
# ------------------- Final Results -------------------
# Random Forest Results
print("\nRandom Forest Results Table:")
print(pd.DataFrame(rf_results).T)
# SVM Results
print("\nSVM Results Table:")
print(pd.DataFrame(svm_results).T)
# Comparison Table: SVM vs Random Forest (Accuracy only)
comparison_df = pd.DataFrame({
'SVM Accuracy': [
svm_results['Raw Data (SVM)']["Accuracy"],
svm_results['Normalized Data (SVM)']["Accuracy"],
svm_results['Balanced Data (SVM)']["Accuracy"]
],
'Random Forest Accuracy': [
rf_results['Raw Data (Random Forest)']["Accuracy"],
rf_results['Normalized Data (Random Forest)']["Accuracy"],
rf_results['Balanced Data (Random Forest)']["Accuracy"]
]
}, index=["Raw Data", "Normalized Data", "Balanced Data"])
print("\nComparison between SVM and Random Forest (Accuracy only):")
print(comparison_df)
OUTPUT: