Lab Manual ML

This manual contains a series of programming assignments for a Machine Learning lab, covering algorithms and techniques such as histograms, correlation matrices, PCA, the Find-S algorithm, k-Nearest Neighbours, Locally Weighted Regression, Linear and Polynomial Regression, Decision Trees, and k-means clustering. Each program includes code for implementing the respective algorithm on datasets such as California Housing, Iris, Boston Housing, Auto MPG, and Breast Cancer. The assignments aim to provide practical experience in data analysis and machine learning model implementation.


Machine Learning lab (BCSL606)

PGM 1:

Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load California Housing dataset
data = pd.read_csv("California Housing\\housing.csv")
print(data.head(10))
print(data.describe())

# Histograms of all numerical features
data.hist(bins=30, figsize=(12, 8), edgecolor='black')
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()

# Generate box plots for all numerical features to identify outliers
plt.figure(figsize=(12, 8))
sns.boxplot(data=data.select_dtypes(include='number'), orient='h')
plt.title("Box Plots of Numerical Features", fontsize=16)
plt.show()
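
Box plots show outliers visually; they can also be counted numerically. Below is a minimal sketch using the standard 1.5 * IQR rule (the exact column names depend on the copy of housing.csv used):

# Count outliers per numeric column with the 1.5 * IQR rule
numeric = data.select_dtypes(include='number')
for col in numeric.columns:
    q1, q3 = numeric[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((numeric[col] < lower) | (numeric[col] > upper)).sum()
    print(f"{col}: {n_outliers} outliers outside [{lower:.2f}, {upper:.2f}]")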

OUTPUT:

PGM 2

Develop a program to compute the correlation matrix to understand the relationships between pairs
of features. Visualize the correlation matrix using a heatmap to see which variables have strong
positive/negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use the California Housing dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\California Housing\\housing.csv")

# Display data types to check for non-numeric columns
print(df.dtypes)

# Select only numeric columns
df_numeric = df.select_dtypes(include=['number'])

# Compute the correlation matrix
corr_matrix = df_numeric.corr()

# Plot the heatmap for the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of California Housing Features")
plt.show()

# Pair plot to visualize pairwise relationships
sns.pairplot(df_numeric, diag_kind="kde", markers="+")
plt.show()
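
To read the strongest relationships off the matrix programmatically rather than by eye, a short sketch such as the following ranks feature pairs by absolute correlation (it reuses corr_matrix from above and assumes numpy is available):

import numpy as np

# Keep only the upper triangle so each pair appears once, then rank by |correlation|
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
ranked = upper.stack().sort_values(key=abs, ascending=False)
print(ranked.head(5))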

Program 3:

Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

pip install scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset from CSV file (Replace 'iris.csv' with your actual filename)
df = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\iris\\Iris.csv")

# Display first few rows to inspect the data
print(df.head())

# Check for non-numeric columns
print(df.dtypes)

# The first column (Id) is just a row index, not a feature, so drop it.
# Of the remaining columns, all but the last are features; the last is the target.
df = df.drop(columns=['Id'])
feature_columns = df.columns[:-1]  # all columns except the last
target_column = df.columns[-1]     # last column is the target (Species)

X = df[feature_columns]  # extract features
y = df[target_column]    # extract target labels

# Standardize the data (PCA is sensitive to scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Convert PCA output to a DataFrame
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['Target'] = y

# Plot the PCA result
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Target', palette="viridis", data=df_pca, legend=True)
plt.title("PCA on Iris Dataset (2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Species")
plt.show()

Output:

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
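
How much of the original variance the two components retain can be checked with pca.explained_variance_ratio_. A short addition (exact values depend on the CSV used):

# Fraction of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained: {:.1%}".format(pca.explained_variance_ratio_.sum()))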

Pgm 4:

For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Find-S algorithm to output a description of the set of all hypotheses consistent with the
training examples.

import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)

    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    hypothesis = None  # most specific hypothesis, set from the first positive example

    for _, row in data.iterrows():
        if row[class_label] == 'Yes':
            values = list(row[attributes])
            if hypothesis is None:
                # Initialize with the first positive example
                hypothesis = values
            else:
                # Generalize: keep matching values, replace mismatches with '?'
                hypothesis = [h if h == v else '?' for h, v in zip(hypothesis, values)]

    return hypothesis

file_path = 'C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Play Tennis\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
Training data:
        Sky Temperature Humidity    Wind PlayTennis
0     Sunny         Hot     High    Weak        Yes
1     Sunny         Hot     High  Strong        Yes
2  Overcast         Hot     High    Weak        Yes
3     Sunny        Mild     High    Weak        Yes
4     Rainy        Mild     High  Strong         No
5  Overcast        Mild     High  Strong        Yes
6     Rainy        Cool   Normal    Weak         No
7     Sunny        Cool   Normal    Weak        Yes

The final hypothesis is: ['?', '?', '?', '?']
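
To watch the hypothesis generalize one positive example at a time, a small illustrative variant of the loop can print the intermediate hypotheses. This sketch assumes the training CSV has been loaded into a DataFrame called data, as inside find_s_algorithm:

# Trace Find-S: print the hypothesis after every positive example
hypothesis = None
attributes, class_label = data.columns[:-1], data.columns[-1]
for _, row in data.iterrows():
    if row[class_label] == 'Yes':
        values = list(row[attributes])
        if hypothesis is None:
            hypothesis = values  # initialize with the first positive example
        else:
            hypothesis = [h if h == v else '?' for h, v in zip(hypothesis, values)]
        print(values, '->', hypothesis)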

Pgm 5:

Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0,1]. Perform the following based on the dataset
generated. a. Label the first 50 points {x1,...,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1,
else xi ∊ Class2. b. Classify the remaining points, x51,...,x100, using KNN. Perform this for
k = 1, 2, 3, 4, 5, 20, 30.

Steps:

1. Generate 100 random values in [0,1].
2. Assign class labels to the first 50 values:
    Class 1 if x ≤ 0.5
    Class 2 otherwise
3. Train a KNN classifier using the first 50 points.
4. Classify the remaining 50 points using KNN for k = 1, 2, 3, 4, 5, 20, 30.
5. Plot the results to visualize the classification.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Generate 100 random values in range [0,1]
np.random.seed(42)  # for reproducibility
X = np.random.rand(100, 1)

# Step 2: Label the first 50 points
y = np.array(["Class 1" if x[0] <= 0.5 else "Class 2" for x in X[:50]])

# Steps 3-5: train, classify, and plot for each k
knn_results = {}

for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X[:50], y)  # train on the first 50 points

    # Step 4: Classify the remaining 50 points (x51 - x100)
    y_pred = knn.predict(X[50:])
    knn_results[k] = y_pred

    # Plot results
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:50], np.zeros(50),
                c=['blue' if label == "Class 1" else 'red' for label in y],
                label="Training Data")
    plt.scatter(X[50:], np.zeros(50),
                c=['blue' if label == "Class 1" else 'red' for label in y_pred],
                marker='x', label="Predicted Class")
    plt.axvline(0.5, color='black', linestyle="dashed", label="Decision Boundary")
    plt.title(f"KNN Classification (k={k})")
    plt.legend()
    plt.show()

# Print classification results
for k, predictions in knn_results.items():
    print(f"\nResults for k={k}:")
    print(predictions)
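
Because the true class of every point follows directly from the x ≤ 0.5 rule, the predictions can also be scored numerically. A small sketch reusing X and knn_results from above:

# Accuracy of each k against the true labels implied by the x <= 0.5 rule
y_true = np.array(["Class 1" if x[0] <= 0.5 else "Class 2" for x in X[50:]])
for k, predictions in knn_results.items():
    accuracy = np.mean(predictions == y_true)
    print(f"k={k}: accuracy = {accuracy:.2f}")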

Pgm 6:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()

return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

This line implements the Gaussian kernel weight w = exp(-||x - xi||^2 / (2 * tau^2)):

 (x - xi) ** 2 computes the squared difference between a query point x and a training point xi.
 If x and xi are vectors, np.sum((x - xi) ** 2) computes the squared Euclidean distance.
 tau is the bandwidth parameter that controls the weighting.
 The denominator 2 * tau^2 controls how quickly the weight decays with distance.

m = X.shape[0]

 X.shape[0] returns the number of rows in the matrix X.
 The variable m stores this value, which represents the number of training data points.

The @ operator in Python is used for matrix multiplication.

X_transpose_W = X.T @ W
theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
return x @ theta

These lines solve the weighted least-squares normal equations, theta = (X^T W X)^(-1) X^T W y,
and return the prediction x @ theta for the query point x.

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

The driver code generates 100 noisy samples of a sine curve, prepends a column of ones to X
so the model can learn an intercept, builds 200 test points the same way, and evaluates the
locally weighted fit at each test point with bandwidth tau = 0.5.
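
One caveat worth noting: np.linalg.inv can be numerically unstable when X^T W X is nearly singular, for example when tau is very small and almost all weights vanish. A minimal alternative sketch (same inputs as above) that uses np.linalg.solve instead of forming an explicit inverse:

import numpy as np

def locally_weighted_regression_solve(x, X, y, tau):
    # Gaussian weights for every training point, computed in one vectorized step
    weights = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(weights)
    # Solve (X^T W X) theta = X^T W y directly
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x @ theta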

Pgm 7:
Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Load Boston Housing dataset for Linear Regression
df_boston = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Boston Housing Dataset\\HousingData.csv")  # ensure correct file path
X_boston = df_boston[['RM']].values  # number of rooms as feature
Y_boston = df_boston[['MEDV']].values  # median house value as target

# Train Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_boston, Y_boston)
Y_pred_boston = lin_reg.predict(X_boston)

# Plot Linear Regression
plt.figure(figsize=(10, 6))
plt.scatter(X_boston, Y_boston, label='Original Data', alpha=0.6)
plt.plot(X_boston, Y_pred_boston, color='red', label='Linear Regression')
plt.xlabel('Average Number of Rooms')
plt.ylabel('Median House Value')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.grid(True)
plt.show()

# Load Auto MPG dataset for Polynomial Regression
df_auto = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Auto MPG Dataset\\auto-mpg.csv")  # ensure correct file path
df_auto = df_auto[df_auto['horsepower'] != '?']  # remove missing values
df_auto['horsepower'] = df_auto['horsepower'].astype(float)
X_auto = df_auto[['horsepower']].values
Y_auto = df_auto[['mpg']].values

# Polynomial Regression (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_auto)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, Y_auto)
Y_pred_auto = poly_reg.predict(X_poly)

# Plot Polynomial Regression
plt.figure(figsize=(10, 6))
plt.scatter(X_auto, Y_auto, label='Original Data', alpha=0.6)
plt.scatter(X_auto, Y_pred_auto, color='red', label='Polynomial Regression', s=10)
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression on Auto MPG Dataset')
plt.legend()
plt.grid(True)
plt.show()
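
Note that train_test_split is imported above but never used, so both models are fitted and plotted on the full data. A quick check of in-sample fit quality with sklearn's metrics (these scores will overstate performance on unseen data):

from sklearn.metrics import mean_squared_error, r2_score

# In-sample fit quality for both models, reusing the arrays defined above
print("Linear (RM -> MEDV):    R^2 = {:.3f}, MSE = {:.2f}".format(
    r2_score(Y_boston, Y_pred_boston), mean_squared_error(Y_boston, Y_pred_boston)))
print("Poly deg 2 (hp -> mpg): R^2 = {:.3f}, MSE = {:.2f}".format(
    r2_score(Y_auto, Y_pred_auto), mean_squared_error(Y_auto, Y_pred_auto)))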

Pgm 8:
Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer
Data set for building the decision tree and apply this knowledge to classify a new sample.

# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

# Load Breast Cancer dataset
cancer = load_breast_cancer()

# Convert the dataset into a pandas DataFrame for easier handling
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Print the first few rows of the dataset
print("Dataset Preview:\n", df.head())

# Split the data into features (X) and target (y)
X = df.drop(columns='target')
y = df['target']

# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy: {:.2f}%".format(accuracy * 100))
# New sample to classify (must have 30 features, matching the dataset)
new_sample = [[13.34, 15.4, 83.5, 508.2, 0.08, 0.052,
               0.049, 0.065, 0.071, 0.029, 0.06, 0.039, 0.094, 0.082,
               0.023, 0.048, 0.031, 0.014, 0.022, 0.021, 0.009, 0.029,
               0.023, 0.006, 0.016, 0.014, 0.125, 0.133, 0.043, 0.020]]

# Predict the class for the new sample
prediction = clf.predict(new_sample)
print("\nPrediction for the new sample (0 = Malignant, 1 = Benign):", prediction)

# Visualize the decision tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
tree.plot_tree(clf, feature_names=cancer.feature_names,
               class_names=cancer.target_names, filled=True, rounded=True)
plt.show()
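
For deep trees the plot becomes hard to read; the learned rules can also be dumped as indented text with sklearn's export_text. A short sketch:

from sklearn.tree import export_text

# Print the tree's decision rules as indented text (truncated: the full tree is long)
rules = export_text(clf, feature_names=list(cancer.feature_names))
print(rules[:1000])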

Pgm 10:
Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
                palette='Set1', s=100, edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label',
                palette='coolwarm', s=100, edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster',
                palette='Set1', s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red',
            marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()

Output:

Confusion Matrix:
[[175  37]
 [ 13 344]]

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       212
           1       0.90      0.96      0.93       357

    accuracy                           0.91       569
   macro avg       0.92      0.89      0.90       569
weighted avg       0.91      0.91      0.91       569
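
One caveat about the report above: k-means assigns arbitrary cluster ids, and they happen to line up with the true labels for this random_state; with a different seed the 0/1 ids could be swapped. A minimal sketch that aligns each cluster to its majority true label before scoring:

# Map each cluster id to the most common true label inside that cluster,
# so the metrics do not depend on k-means' arbitrary label order
aligned = np.zeros_like(y_kmeans)
for cluster_id in np.unique(y_kmeans):
    mask = (y_kmeans == cluster_id)
    aligned[mask] = np.bincount(y[mask]).argmax()

print("Confusion Matrix (aligned):")
print(confusion_matrix(y, aligned))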