Machine Learning lab (BCSL606)
Program 1:
Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load California Housing dataset
data = pd.read_csv("California Housing\\housing.csv")
print(data.head(10))
print(data.describe())
# data.hist() creates its own figure, so no separate plt.figure call is needed
data.hist(bins=30, figsize=(12, 8), edgecolor='black')
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()
# Generate box plots for all numerical features to identify outliers
plt.figure(figsize=(12, 8))
# Use only numeric columns (housing.csv also contains the categorical ocean_proximity)
sns.boxplot(data=data.select_dtypes(include='number'), orient='h')
plt.title("Box Plots of Numerical Features", fontsize=16)
plt.show()
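The box plots flag outliers visually; as a numeric follow-up, a minimal sketch (assuming the same data DataFrame loaded above) counts outliers per feature with the common 1.5 × IQR rule:

# Count outliers per numeric feature using the 1.5 * IQR rule
# (assumes `data` is the DataFrame loaded above)
numeric = data.select_dtypes(include='number')
for col in numeric.columns:
    q1, q3 = numeric[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    count = ((numeric[col] < lower) | (numeric[col] > upper)).sum()
    print(f"{col}: {count} outliers")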
Output: a grid of histograms and horizontal box plots for the numerical features (figures displayed on screen).
Program 2:
Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\California Housing\\housing.csv")
# Display data types to check for non-numeric columns
print(df.dtypes)
# Select only numeric columns
df_numeric = df.select_dtypes(include=['number'])
# Compute the correlation matrix
corr_matrix = df_numeric.corr()
# Plot the heatmap for correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of California Housing Features")
plt.show()
# Pairplot to visualize pairwise relationships
sns.pairplot(df_numeric, diag_kind="kde", markers="+")
plt.show()
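To read the heatmap numerically, a small sketch (reusing the corr_matrix computed above) lists the most strongly correlated feature pairs; the np.triu mask keeps each pair only once:

import numpy as np
# Keep only the upper triangle so each feature pair appears once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack()
# Strongest correlations by absolute value
print(pairs.sort_values(key=abs, ascending=False).head(5))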
Program 3:
Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
# Requires scikit-learn (install with: pip install scikit-learn)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load dataset from CSV file (Replace 'iris.csv' with your actual filename)
df = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\iris\\Iris.csv")
# Display first few rows to inspect data
print(df.head())
# Check for non-numeric columns
print(df.dtypes)
# The last column is the target (Species); the Id column is not a measurement, so drop it
df = df.drop(columns=['Id'])
feature_columns = df.columns[:-1]  # The four measurement columns
target_column = df.columns[-1]  # The Species column is the target
X = df[feature_columns]  # Extract features
y = df[target_column]  # Extract target labels
# Standardize the data (PCA is affected by scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Convert PCA output to DataFrame
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['Target'] = y
# Plot the PCA result
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue=df_pca['Target'], palette="viridis", data=df_pca, legend=True)
plt.title("PCA on Iris Dataset (2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Species")
plt.show()
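Before trusting the 2-D picture, it is worth checking how much of the original variance the two components retain; with the fitted pca object this is a one-line check:

# Proportion of total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained: {:.2%}".format(pca.explained_variance_ratio_.sum()))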
Output:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
Program 4:
For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
import pandas as pd

def find_s_algorithm(file_path):
    # Load the training examples from the CSV file
    data = pd.read_csv(file_path)
    print("Training data:")
    print(data)

    # All columns except the last are attributes; the last column is the class label
    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    # Initialize the hypothesis with '?' placeholders
    hypothesis = ['?' for _ in attributes]

    # Update the hypothesis using only the positive examples
    for index, row in data.iterrows():
        if row[class_label] == 'Yes':
            for i, value in enumerate(row[attributes]):
                if hypothesis[i] == '?' or hypothesis[i] == value:
                    hypothesis[i] = value
                else:
                    hypothesis[i] = '?'
    return hypothesis

file_path = 'C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Play Tennis\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)
Training data:
        Sky Temperature Humidity    Wind PlayTennis
0     Sunny         Hot     High    Weak        Yes
1     Sunny         Hot     High  Strong        Yes
2  Overcast         Hot     High    Weak        Yes
3     Sunny        Mild     High    Weak        Yes
4     Rainy        Mild     High  Strong         No
5  Overcast        Mild     High  Strong        Yes
6     Rainy        Cool   Normal    Weak         No
7     Sunny        Cool   Normal    Weak        Yes

The final hypothesis is: ['Sunny', '?', '?', 'Weak']
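To watch the hypothesis being built up, a minimal sketch (using the same update rule as find_s_algorithm, printing after every positive example) can be appended:

# Trace the hypothesis after each positive training example
# (re-reads the same CSV via file_path defined above)
data = pd.read_csv(file_path)
attributes = data.columns[:-1]
hypothesis = ['?' for _ in attributes]
for _, row in data.iterrows():
    if row[data.columns[-1]] == 'Yes':
        for i, value in enumerate(row[attributes]):
            hypothesis[i] = value if hypothesis[i] in ('?', value) else '?'
        print(list(row[attributes]), "->", hypothesis)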
Program 5:
Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the generated dataset:
a. Label the first 50 points {x1, ..., x50} as follows: if xi ≤ 0.5, then xi ∊ Class1; else xi ∊ Class2.
b. Classify the remaining points x51, ..., x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
Steps:
1. Generate 100 random values in the range [0,1].
2. Assign class labels to the first 50 values: Class 1 if x ≤ 0.5, Class 2 otherwise.
3. Train a KNN classifier using the first 50 points.
4. Classify the remaining 50 points using KNN for k = 1, 2, 3, 4, 5, 20, 30.
5. Plot the results to visualize the classification.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Step 1: Generate 100 random values in range [0,1]
np.random.seed(42) # For reproducibility
X = np.random.rand(100, 1) # Generate 100 random values in range [0,1]
# Step 2: Label first 50 points
y = np.array(["Class 1" if x <= 0.5 else "Class 2" for x in X[:50]])
# Step 3: Train KNN classifier using the first 50 points
knn_results = {}
for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X[:50], y)  # Train on the first 50 points

    # Step 4: Classify the remaining 50 points (x51 - x100)
    y_pred = knn.predict(X[50:])
    knn_results[k] = y_pred  # Store results

    # Plot the results for this value of k
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:50], np.zeros(50),
                c=['blue' if label == "Class 1" else 'red' for label in y],
                label="Training Data")
    plt.scatter(X[50:], np.zeros(50),
                c=['blue' if label == "Class 1" else 'red' for label in y_pred],
                marker='x', label="Predicted Class")
    plt.axvline(0.5, color='black', linestyle="dashed", label="Decision Boundary")
    plt.title(f"KNN Classification (k={k})")
    plt.legend()
    plt.show()

# Print classification results
for k, predictions in knn_results.items():
    print(f"\nResults for k={k}:")
    print(predictions)
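Because the labelling rule is known, the predictions can also be scored directly; a short sketch (assuming X and knn_results from the program above):

# Score each k against the true rule x <= 0.5
true_labels = np.array(["Class 1" if x <= 0.5 else "Class 2" for x in X[50:]])
for k, predictions in knn_results.items():
    accuracy = np.mean(predictions == true_labels)
    print(f"k={k}: accuracy = {accuracy:.2f}")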
Program 6:
Implement the non-parametric Locally Weighted Regression algorithm to fit data points. Select an appropriate data set for your experiment and draw graphs.
import numpy as np
import matplotlib.pyplot as plt
def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta
np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]
x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
Code walkthrough:

return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))
This line computes the Gaussian weight of a training point xi with respect to the query point x. np.sum((x - xi) ** 2) is the squared Euclidean distance between x and xi (it works for vectors as well as scalars), and τ (tau) is the bandwidth parameter that controls the weighting: the denominator 2τ² scales the exponent so that points within roughly τ of the query receive large weights, while distant points receive weights near zero.

m = X.shape[0]
X.shape[0] returns the number of rows in the matrix X, so m is the number of training data points.

X_transpose_W = X.T @ W
theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
return x @ theta
The @ operator performs matrix multiplication. With W the diagonal matrix of Gaussian weights, these lines solve the weighted least squares problem θ = (XᵀWX)⁻¹XᵀWy and return the prediction x·θ for the query point.
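The bandwidth τ controls how local each fit is; a small sketch (reusing locally_weighted_regression and the arrays defined above) overlays fits for a few values to make this visible:

# Smaller tau -> more local, wigglier fits; larger tau -> smoother fits
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='gray', alpha=0.5, label='Training Data')
for t in [0.1, 0.5, 2.0]:
    preds = np.array([locally_weighted_regression(xi, X_bias, y, t) for xi in x_test_bias])
    plt.plot(x_test, preds, linewidth=2, label=f'tau={t}')
plt.title('LWR Fits for Different Bandwidths')
plt.legend()
plt.show()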
Program 7:
Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
# Load Boston Housing dataset for Linear Regression
df_boston = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Boston Housing Dataset\\HousingData.csv")  # Ensure correct file path
X_boston = df_boston[['RM']].values # Number of rooms as feature
Y_boston = df_boston[['MEDV']].values # Median house value as target
# Train Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(X_boston, Y_boston)
Y_pred_boston = lin_reg.predict(X_boston)
# Plot Linear Regression
plt.figure(figsize=(10, 6))
plt.scatter(X_boston, Y_boston, label='Original Data', alpha=0.6)
plt.plot(X_boston, Y_pred_boston, color='red', label='Linear Regression')
plt.xlabel('Average Number of Rooms')
plt.ylabel('Median House Value')
plt.title('Linear Regression on Boston Housing Dataset')
plt.legend()
plt.grid(True)
plt.show()
# Load Auto MPG dataset for Polynomial Regression
df_auto = pd.read_csv("C:\\Users\\Admin\\Desktop\\Machie learig\\dataset\\Auto MPG Dataset\\auto-mpg.csv")  # Ensure correct file path
df_auto = df_auto[df_auto['horsepower'] != '?'] # Remove missing values
df_auto['horsepower'] = df_auto['horsepower'].astype(float)
X_auto = df_auto[['horsepower']].values
Y_auto = df_auto[['mpg']].values
# Polynomial Regression (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_auto)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, Y_auto)
Y_pred_auto = poly_reg.predict(X_poly)
# Plot Polynomial Regression
plt.figure(figsize=(10, 6))
plt.scatter(X_auto, Y_auto, label='Original Data', alpha=0.6)
plt.scatter(X_auto, Y_pred_auto, color='red', label='Polynomial Regression', s=10)
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression on Auto MPG Dataset')
plt.legend()
plt.grid(True)
plt.show()
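The plots can be complemented with numeric scores; a minimal sketch using sklearn.metrics on the in-sample predictions computed above:

from sklearn.metrics import r2_score, mean_squared_error
# In-sample fit quality for both models
print("Linear (Boston):   R^2 = {:.3f}, MSE = {:.2f}".format(
    r2_score(Y_boston, Y_pred_boston), mean_squared_error(Y_boston, Y_pred_boston)))
print("Polynomial (Auto): R^2 = {:.3f}, MSE = {:.2f}".format(
    r2_score(Y_auto, Y_pred_auto), mean_squared_error(Y_auto, Y_pred_auto)))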
Program 8:
Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer data set for building the decision tree and apply this knowledge to classify a new sample.
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
# Load Breast Cancer dataset
cancer = load_breast_cancer()
# Convert the dataset into a pandas DataFrame for easier understanding
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Print the first few rows of the dataset
print("Dataset Preview:\n", df.head())
# Split the data into features (X) and target (y)
X = df.drop(columns='target')
y = df['target']
# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)
# Train the model
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy: {:.2f}%".format(accuracy * 100))
# New sample values (must have 30 features, matching the dataset)
new_sample = [[13.34, 15.4, 83.5, 508.2, 0.08, 0.052, 0.049, 0.065, 0.071, 0.029,
               0.06, 0.039, 0.094, 0.082, 0.023, 0.048, 0.031, 0.014, 0.022, 0.021,
               0.009, 0.029, 0.023, 0.006, 0.016, 0.014, 0.125, 0.133, 0.043, 0.020]]
# Predict the class for the new sample
prediction = clf.predict(new_sample)
print("\nPrediction for the new sample (0 = Malignant, 1 = Benign):", prediction)
# Visualize the decision tree
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
tree.plot_tree(clf, feature_names=cancer.feature_names, class_names=cancer.target_names,
               filled=True, rounded=True)
plt.show()
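For a plain-text view of the learned rules, sklearn's export_text can be used with the fitted clf (a small optional addition):

from sklearn.tree import export_text
# Print the tree as nested if/else rules
print(export_text(clf, feature_names=list(cancer.feature_names)))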
Program 10:
Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report
data = load_breast_cancer()
X = data.data
y = data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm', s=100,
                edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
                edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
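One caveat when comparing clusters with true labels: k-means assigns arbitrary cluster ids, so the confusion matrix and classification report printed above are only meaningful when cluster 1 happens to coincide with class 1. A minimal sketch that remaps each cluster to its majority true label before scoring (assuming y and y_kmeans from above):

# Map each cluster id to the majority true label inside that cluster
y_aligned = np.zeros_like(y_kmeans)
for cluster in np.unique(y_kmeans):
    mask = y_kmeans == cluster
    y_aligned[mask] = np.bincount(y[mask]).argmax()
print(classification_report(y, y_aligned))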
Output:
Confusion Matrix:
[[175  37]
 [ 13 344]]

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       212
           1       0.90      0.96      0.93       357

    accuracy                           0.91       569
   macro avg       0.92      0.89      0.90       569
weighted avg       0.91      0.91      0.91       569