ML Lab Manual
Enrolment No:
Name of Student:_
________________ ______________________
Head of Department Faculty
(Dr. B. J. Makwana)
Index
Sr. No. | Name of Experiment | Page No. | Date | Sign
1 | Introduction to Python Programming for Machine Learning
2 | Generate a synthetic data set and use the linear regression technique to develop a model, and evaluate it on test samples.
3 | Write a program for Logistic Regression to classify IRIS data for two features (sepal length and width).
4 | Write a program for the concept of decision tree to develop a piecewise linear model and test it as well.
5 | Write a program for the kNN algorithm for classification of the IRIS dataset.
6 | Write a program using the PCA algorithm for dimensionality reduction on the Olivetti dataset, and follow it with the kNN algorithm for face recognition.
7 | Write a program using the Bayes algorithm for email classification (spam or non-spam) for the open-sourced data set from the UC Irvine Machine Learning Repository.
8 | Write a program using SVM on the IRIS dataset and carry out classification.
9 | Write a program using the SVM algorithm on the Boston house price prediction dataset to predict the price of houses from certain features.
Experiment No. 1
Theory :
Important Libraries
1. Pandas
Pandas is an open-source Python Library used for high-performance data manipulation and
data analysis using its powerful data structures.
With the help of Pandas, we can accomplish the following five steps in data processing:
• Load
• Prepare
• Manipulate
• Model
• Analyze
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group-by functionality for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Pandas provides two primary data structures:
• Series
• DataFrame
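A small illustrative snippet covering a few of the features listed above (the column names are made up for demonstration):
import pandas as pd

# A tiny DataFrame with one missing value
df = pd.DataFrame({'student': ['Asha', 'Ravi', 'Meera'],
                   'marks': [85, None, 92]})
print(df['marks'].mean())                             # missing value (NaN) is skipped automatically
df['marks'] = df['marks'].fillna(df['marks'].mean())  # integrated handling of missing data
print(df.sort_values('marks'))                        # label-based manipulation of the data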
The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install and are free of charge. NumPy and SciPy are easy to use, yet powerful enough to be relied on by some of the world's leading scientists and engineers.
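A short sketch of the NumPy/SciPy workflow described above, using scipy.integrate.quad for numerical integration:
import numpy as np
from scipy import integrate

x = np.linspace(0, np.pi, 100)                      # NumPy array of sample points
area, abs_error = integrate.quad(np.sin, 0, np.pi)  # numerical integration of sin(x) over [0, pi]
print(area)                                         # approximately 2.0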
2. Scikit-learn
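Scikit-learn provides ready-made implementations of the machine-learning algorithms used in the later experiments (linear and logistic regression, decision trees, kNN, naive Bayes, SVM, PCA) behind a uniform fit/predict interface. A minimal sketch, using the Iris data loaded through the library itself:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)  # any estimator exposes the same fit/predict interface
model.fit(X, y)
print(model.predict(X[:5]))                  # predicted classes for the first five samples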
3. Matplotlib
• Matplotlib is a Python library used to create 2D graphs and plots by using Python scripts.
• It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.
• It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, etc.
• It is used along with NumPy to provide an environment that is an effective open-source alternative to MATLAB. It can also be used with graphics toolkits like PyQt and wxPython.
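A short pyplot sketch of the kind of plot used throughout the experiments:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)
plt.plot(x, np.sin(x), color='black', linestyle='--')  # line-style control via pyplot
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('A simple pyplot example')
plt.show()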
Hands on :
Output :
Conclusion:
Experiment No. 2
Problem Statement:
Generate a synthetic data set using the following function, and split it into training, validation, and testing sample points. Use the linear regression technique to develop a model, and evaluate it on test samples. ℵ is Gaussian noise.
y = x/2 + sin(x) + ℵ
Steps:
1. Import libraries
2. Prepare data: First we will prepare some data for demonstrating linear regression. To keep things simple we will assume we have a single input feature. Let us use the following function to generate our data:
y = x/2 + sin(x) + ℵ
3. Split the dataset into training, validation and test sets.
4. Evaluate the model.
Program:
1. Import libraries
import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline
2. Prepare data: First we will prepare some data for demonstrating linear regression. To keep things simple we will assume we have a single input feature. Let us use the following function to generate our data:
y = x/2 + sin(x) + ℵ
number_of_samples = 100
x = np.linspace(-np.pi, np.pi, number_of_samples)
y = 0.5*x+np.sin(x)+np.random.random(x.shape) #noise term (np.random.randn would give the Gaussian noise ℵ of the problem statement)
plt.scatter(x,y,color='black') #Plot y-vs-x in dots
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 1: Data for linear regression')
plt.show()
3. Split the dataset into training, validation and test sets: It is always encouraged in machine learning to split the available data into training, validation and test sets. The training set is used to train the model. The model is evaluated on the validation set after every episode of training. The performance on the validation set gives a measure of how well the model generalizes. Various hyperparameters of the model are tuned to improve performance on the validation set. Finally, when the model is completely optimized and ready for deployment, it is evaluated on the test data and the performance is reported in the final description of the model.
In this example we do a 70%-15%-15% random split of the data between the training, validation and test sets respectively.
random_indices = np.random.permutation(number_of_samples)
#Training set
x_train = x[random_indices[:70]]
y_train = y[random_indices[:70]]
#Validation set
x_val = x[random_indices[70:85]]
y_val = y[random_indices[70:85]]
#Test set
x_test = x[random_indices[85:]]
y_test = y[random_indices[85:]]
Fit a line to the data
Linear regression learns to fit a hyperplane to our data in the feature space. For one-dimensional data, the hyperplane reduces to a straight line. We will fit a line to our data using sklearn.linear_model.LinearRegression.
#sklearn takes the inputs as matrices. Hence we reshape the arrays into column matrices
x_train_for_line_fitting = np.matrix(x_train.reshape(len(x_train),1))
y_train_for_line_fitting = np.matrix(y_train.reshape(len(y_train),1))
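The fitting step itself is not shown in the listing; a minimal sketch using sklearn.linear_model.LinearRegression, followed by a plot of the fitted line:
model = linear_model.LinearRegression()
model.fit(x_train_for_line_fitting, y_train_for_line_fitting)
plt.scatter(x_train, y_train, color='black')
plt.plot(x, np.asarray(model.predict(np.matrix(x.reshape(len(x),1)))).ravel(), color='blue')
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Line fitted to the training data')
plt.show()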
Now that we have our model ready, we must evaluate it. In a linear regression scenario, it is common to evaluate the model in terms of the mean squared error on the validation and test sets.
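A minimal sketch of that evaluation:
val_predictions = np.asarray(model.predict(np.matrix(x_val.reshape(len(x_val),1)))).ravel()
test_predictions = np.asarray(model.predict(np.matrix(x_test.reshape(len(x_test),1)))).ravel()
mean_val_error = np.mean((y_val - val_predictions)**2)
mean_test_error = np.mean((y_test - test_predictions)**2)
print('Validation MSE:', mean_val_error)
print('Test MSE:', mean_test_error)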
Output :
Conclusion :
Experiment No. 3
Problem Statement:
Write a program for Logistic Regression to classify IRIS data for two features
(sepal length and width).
Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).
Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.
Steps:
1. Import libraries
2. Prepare data
3. Evaluate the model.
Program:
1. Import libraries
import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline
2. Prepare data: The data has 4 input-features and 3 output-classes. For simplicity we will
use only two features: sepal-length and sepal-width (both in cm) and two output
classes: Iris Setosa and Iris Versicolour.
iris = datasets.load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target
#The first 50 samples are class 0 and the next 50 samples are class 1
X = X[:100]
Y = Y[:100]
number_of_samples = len(Y)
#Splitting into training, validation and test sets
random_indices = np.random.permutation(number_of_samples)
#Training set
num_training_samples = int(number_of_samples*0.7)
x_train = X[random_indices[:num_training_samples]]
y_train = Y[random_indices[:num_training_samples]]
#Validation set
num_validation_samples = int(number_of_samples*0.15)
x_val = X[random_indices[num_training_samples : num_training_samples+num_validation_samples]]
y_val = Y[random_indices[num_training_samples : num_training_samples+num_validation_samples]]
#Test set
num_test_samples = int(number_of_samples*0.15)
x_test = X[random_indices[-num_test_samples:]]
y_test = Y[random_indices[-num_test_samples:]]
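The class-wise arrays used for the scatter plot below are not defined in the listing; a minimal sketch, assuming they are taken from the training data:
X_class0 = x_train[y_train == 0]  # Iris Setosa training points
X_class1 = x_train[y_train == 1]  # Iris Versicolour training points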
plt.scatter([X_class0[:,0]],[X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1'])
plt.title('Fig 3: Visualization of training data')
plt.show()
Now we fit a linear decision boundary through the feature space that separates the two classes well.
We use sklearn.linear_model.LogisticRegression.
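The fitting of the classifier and the mesh grid used to visualise the decision boundary are not shown in the listing; a minimal sketch (the 0.02 step size is an illustrative choice):
model = linear_model.LogisticRegression()
model.fit(x_train, y_train)
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)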
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.show()
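The predictions evaluated below are assumed to be obtained from the fitted model on the validation and test sets:
validation_set_predictions = model.predict(x_val)
test_set_predictions = model.predict(x_test)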
validation_misclassification_percentage = 0
for i in range(len(validation_set_predictions)):
    if validation_set_predictions[i] != y_val[i]:
        validation_misclassification_percentage += 1
validation_misclassification_percentage *= 100/len(y_val)
print ('validation misclassification percentage =', validation_misclassification_percentage, '%')
test_misclassification_percentage = 0
for i in range(len(test_set_predictions)):
    if test_set_predictions[i] != y_test[i]:
        test_misclassification_percentage += 1
test_misclassification_percentage *= 100/len(y_test)
print ('test misclassification percentage =', test_misclassification_percentage, '%')
Output :
Conclusion :
Experiment No. 4
Problem Statement:
For the synthetic dataset used in experiment 2, write a program for the concept
of decision tree to develop a piecewise linear model and test it as well.
Steps:
1. Import libraries
2. Prepare data
3. Split the data into training, validation and test sets
4. Fit model
5. Evaluate the model.
Program:
1. Import libraries
import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline
2. Prepare data:
number_of_samples = 100
x = np.linspace(-np.pi, np.pi, number_of_samples)
y = 0.5*x+np.sin(x)+np.random.random(x.shape)
plt.scatter(x,y,color='black') #Plot y-vs-x in dots
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 5: Data for linear regression')
plt.show()
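The split into training, validation and test sets is not repeated in the listing; it is assumed to be the same 70%-15%-15% random split used in Experiment 2:
random_indices = np.random.permutation(number_of_samples)
x_train, y_train = x[random_indices[:70]], y[random_indices[:70]]
x_val, y_val = x[random_indices[70:85]], y[random_indices[70:85]]
x_test, y_test = x[random_indices[85:]], y[random_indices[85:]]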
maximum_depth_of_tree = np.arange(10)+1
train_err_arr = []
val_err_arr = []
test_err_arr = []
#sklearn takes the inputs as matrices. Hence we reshape the arrays into column matrices
x_train_for_line_fitting = np.matrix(x_train.reshape(len(x_train),1))
y_train_for_line_fitting = np.matrix(y_train.reshape(len(y_train),1))
for depth in maximum_depth_of_tree:
    model = tree.DecisionTreeRegressor(max_depth=depth)
    model.fit(x_train_for_line_fitting, y_train_for_line_fitting)
    mean_train_error = np.mean((y_train - model.predict(x_train.reshape(len(x_train),1)).ravel())**2)
    mean_val_error = np.mean((y_val - model.predict(x_val.reshape(len(x_val),1)).ravel())**2)
    mean_test_error = np.mean((y_test - model.predict(x_test.reshape(len(x_test),1)).ravel())**2)
    train_err_arr.append(mean_train_error)
    val_err_arr.append(mean_val_error)
    test_err_arr.append(mean_test_error)
    print('Maximum depth:', depth, '\nTraining MSE:', mean_train_error, '\nValidation MSE:', mean_val_error, '\nTest MSE:', mean_test_error)
plt.figure()
plt.plot(train_err_arr,c='red')
plt.plot(val_err_arr,c='blue')
plt.plot(test_err_arr,c='green')
plt.legend(['Training error', 'Validation error', 'Test error'])
plt.title('Variation of error with maximum depth of tree')
plt.show()
Output :
Conclusion :
Experiment No. 5
Problem Statement:
Write a program for the kNN algorithm for classification of the IRIS dataset.
Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).
Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.
Program:
1. Import Libraries
import numpy as np
from sklearn import datasets, neighbors, linear_model, tree
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris, fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from time import time
%matplotlib inline
2. Prepare dataset
First we will prepare the dataset. The dataset we choose is a modified version of the Iris dataset. We
choose only the first two input feature dimensions viz sepal-length and sepal-width (both in cm) for
ease of visualization.
iris = load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target
number_of_samples = len(Y)
print(number_of_samples)
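The random split into training and test sets is not shown in the listing; a minimal sketch, assuming a 70%-30% split:
random_indices = np.random.permutation(number_of_samples)
num_training_samples = int(number_of_samples*0.7)
#Training set
x_train = X[random_indices[:num_training_samples]]
y_train = Y[random_indices[:num_training_samples]]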
#Test set
x_test = X[random_indices[num_training_samples:]]
y_test = Y[random_indices[num_training_samples:]]
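The class-wise arrays used for the scatter plots are not defined in the listing; a minimal sketch, assuming they are built from the training data:
X_class0 = x_train[y_train == 0]
X_class1 = x_train[y_train == 1]
X_class2 = x_train[y_train == 2]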
plt.scatter([X_class0[:,0]],[ X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]],[ X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 1: Visualization of training data')
plt.show()
Note that the first class is linearly separable from the other two classes but the second and
third classes are not linearly separable from each other.
Now that our training data is ready we will jump right into the classification task. Just to remind you, the K-nearest neighbor algorithm is a non-parametric learning algorithm and does not learn a parameterized function that maps the input to the output. Rather, it looks up the training set every time it is asked to classify a point and finds the K nearest neighbors of the query point. The class of the majority of those neighbors is output as the class of the query point.
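The kNN model used for the prediction below is not created in the listing; a minimal sketch, assuming K = 5 (an illustrative choice) and scikit-learn's KNeighborsClassifier:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)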
query_point = np.array([5.9,2.9])
true_class_of_query_point = 1
predicted_class_for_query_point = model.predict([query_point])
print("Query point: {}".format(query_point))
print("True class of query point: {}".format(true_class_of_query_point))
query_point.shape
neighbors_object = neighbors.NearestNeighbors(n_neighbors=10)
neighbors_object.fit(x_train)
distances_of_nearest_neighbors, indices_of_nearest_neighbors_of_query_point = neighbors_object.kneighbors([query_point])
nearest_neighbors_of_query_point = x_train[indices_of_nearest_neighbors_of_query_point[0]]
print("The query point is: {}\n".format(query_point))
print("The nearest neighbors of the query point are:\n{}\n".format(nearest_neighbors_of_query_point))
print("The classes of the nearest neighbors are: {}\n".format(y_train[indices_of_nearest_neighbors_of_query_point[0]]))
print("Predicted class for query point: {}".format(predicted_class_for_query_point[0]))
plt.scatter([X_class0[:,0]], [X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.scatter(query_point[0], query_point[1],marker='^',s=75,color='black')
plt.scatter(nearest_neighbors_of_query_point[:,0], nearest_neighbors_of_query_point[:,1], marker='s', s=150, color='yellow', alpha=0.30)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 3: Working of the K-NN classification algorithm')
plt.show()
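The helper evaluate_performance computes the misclassification percentage on the test set, one test point at a time:
def evaluate_performance(model, x_test, y_test):
    test_set_predictions = [model.predict(x_test[i].reshape((1, len(x_test[i]))))[0] for i in range(x_test.shape[0])]
    test_misclassification_percentage = 0
    for i in range(len(test_set_predictions)):
        if test_set_predictions[i] != y_test[i]:
            test_misclassification_percentage += 1
    test_misclassification_percentage *= 100/len(y_test)
    return test_misclassification_percentage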
print("Evaluating K-NN classifier:")
test_err = evaluate_performance(model, x_test, y_test)
print('test misclassification percentage = {}%'.format(test_err))
Output :
Conclusion :
Experiment No. 6
Problem Statement:
Write a program using the PCA algorithm for dimensionality reduction on the Olivetti dataset, and follow it with the kNN algorithm for face recognition.
Program:
1. Import Libraries
import numpy as np
from sklearn import datasets, neighbors, linear_model, tree
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris, fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from time import time
%matplotlib inline
2. Prepare dataset
First we will prepare the dataset. The dataset we choose is a modified version of the Iris dataset. We
choose only the first two input feature dimensions viz sepal-length and sepal-width (both in cm) for
ease of visualization.
iris = load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target
number_of_samples = len(Y)
print(number_of_samples)
#Test set
x_test = X[random_indices[num_training_samples:]]
y_test = Y[random_indices[num_training_samples:]]
plt.scatter([X_class0[:,0]],[ X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]],[ X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 1: Visualization of training data')
plt.show()
Note that the first class is linearly separable from the other two classes but the second and
third classes are not linearly separable from each other.
Now that our training data is ready we will jump right into the classification task. Just to remind you, the K-nearest neighbor algorithm is a non-parametric learning algorithm and does not learn a parameterized function that maps the input to the output. Rather, it looks up the training set every time it is asked to classify a point and finds the K nearest neighbors of the query point. The class of the majority of those neighbors is output as the class of the query point.
Let's see how the algorithm works. We choose the first point in the test set as our query point.
query_point = np.array([5.9,2.9])
true_class_of_query_point = 1
predicted_class_for_query_point = model.predict([query_point])
print("Query point: {}".format(query_point))
print("True class of query point: {}".format(true_class_of_query_point))
query_point.shape
neighbors_object = neighbors.NearestNeighbors(n_neighbors=10)
neighbors_object.fit(x_train)
distances_of_nearest_neighbors, indices_of_nearest_neighbors_of_query_point = neighbors_object.kneighbors([query_point])
nearest_neighbors_of_query_point = x_train[indices_of_nearest_neighbors_of_query_point[0]]
print("The query point is: {}\n".format(query_point))
print("The nearest neighbors of the query point are:\n{}\n".format(nearest_neighbors_of_query_point))
print("The classes of the nearest neighbors are: {}\n".format(y_train[indices_of_nearest_neighbors_of_query_point[0]]))
print("Predicted class for query point: {}".format(predicted_class_for_query_point[0]))
plt.scatter([X_class0[:,0]], [X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.scatter(query_point[0], query_point[1],marker='^',s=75,color='black')
plt.scatter(nearest_neighbors_of_query_point[:,0], nearest_neighbors_of_query_point[:,1], marker='s', s=150, color='yellow', alpha=0.30)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 3: Working of the K-NN classification algorithm')
plt.show()
def evaluate_performance(model, x_test, y_test):
    test_set_predictions = [model.predict(x_test[i].reshape((1, len(x_test[i]))))[0] for i in range(x_test.shape[0])]
    test_misclassification_percentage = 0
    for i in range(len(test_set_predictions)):
        if test_set_predictions[i] != y_test[i]:
            test_misclassification_percentage += 1
    test_misclassification_percentage *= 100/len(y_test)
    return test_misclassification_percentage
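A minimal sketch of the PCA + kNN face-recognition pipeline on the Olivetti faces described in the problem statement, assuming 150 principal components and K = 5 (both illustrative choices) and reusing the imports listed above:
faces = fetch_olivetti_faces()
X_faces = faces.data    # 400 face images of 40 subjects, each flattened to 4096 pixel features
y_faces = faces.target  # subject labels 0-39
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_faces, y_faces, test_size=0.25, random_state=42)
pca = PCA(n_components=150, whiten=True)   # dimensionality reduction: 4096 -> 150 features
X_train_pca = pca.fit_transform(X_train_f)
X_test_pca = pca.transform(X_test_f)
knn = KNeighborsClassifier(n_neighbors=5)  # kNN classification on the reduced features
knn.fit(X_train_pca, y_train_f)
y_pred = knn.predict(X_test_pca)
print(classification_report(y_test_f, y_pred))
print(confusion_matrix(y_test_f, y_pred))
Whitening the principal components is a common choice before distance-based classifiers such as kNN; the number of components and K can be tuned for better accuracy.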
Output :
Conclusion :
Experiment No. 7
Problem Statement:
Write a program using the Bayes algorithm for email classification (spam or non-spam) for the open-sourced data set from the UC Irvine Machine Learning Repository.
Program:
import numpy as np
from sklearn.model_selection import train_test_split
datafile = open('C:/Users/AntennaPC/Desktop/spambase.data','r')
data = []
for line in datafile:
    line = [float(element) for element in line.rstrip('\n').split(',')]
    data.append(np.asarray(line))
num_features = 48
X = [data[i][:num_features] for i in range(len(data))]
y = [int(data[i][-1]) for i in range(len(data))]
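The train/test split and the per-class partition of the training emails are not shown in the listing; a minimal sketch, assuming an 80%-20% split:
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X), np.asarray(y), test_size=0.2, random_state=0)
X_train_class_0 = X_train[y_train == 0]  # non-spam training emails
X_train_class_1 = X_train[y_train == 1]  # spam training emails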
num_class_0 = float(len(X_train_class_0))
num_class_1 = float(len(X_train_class_1))
prior_probability_class_0 = num_class_0/(num_class_0 + num_class_1)
prior_probability_class_1 = num_class_1/(num_class_0 + num_class_1)
log_prior_class_0 = np.log10(prior_probability_class_0)
log_prior_class_1 = np.log10(prior_probability_class_1)
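The likelihood function whose return statement appears next is not defined in the listing; a minimal sketch, assuming a per-feature Gaussian likelihood estimated from the class-wise training data (base-10 logarithms, to match the priors above):
def calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=0):
    # Per-class feature means and variances; a small constant avoids division by zero
    X_class = X_train_class_0 if Class == 0 else X_train_class_1
    mu = np.mean(X_class, axis=0)
    var = np.var(X_class, axis=0) + 1e-6
    # Naive independence assumption: sum the per-feature Gaussian log10-likelihoods
    log_likelihood = np.sum(-0.5*np.log10(2*np.pi*var) - ((feature_vector - mu)**2/(2*var))*np.log10(np.e))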
    return log_likelihood
def calculate_class_posteriors(feature_vector):
    log_likelihood_class_0 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=0)
    log_likelihood_class_1 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=1)
    log_posterior_class_0 = log_likelihood_class_0 + log_prior_class_0
    log_posterior_class_1 = log_likelihood_class_1 + log_prior_class_1
    return log_posterior_class_0, log_posterior_class_1
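The decision rule classify_spam used below is not in the listing; a minimal sketch that labels an email as spam (class 1) when its log-posterior is larger:
def classify_spam(feature_vector):
    log_posterior_class_0, log_posterior_class_1 = calculate_class_posteriors(feature_vector)
    return 1 if log_posterior_class_1 > log_posterior_class_0 else 0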
#Predict spam or not on the test set
predictions = []
for email in X_test:
    predictions.append(classify_spam(email))
for i in range(100):
    print(predictions[i], y_test[i])
Output :
Conclusion :
Experiment No. 8
Problem Statement:
Write a program using SVM on IRIS dataset and carry out classification.
Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).
Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.
Program:
1. Import Libraries
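The import statements are not reproduced in the listing; the code below assumes:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split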
2. Prepare dataset
iris = datasets.load_iris()
X = iris.data[:,:2]
y = iris.target
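The split into training and test sets is not shown; a minimal sketch, assuming a 75%-25% random split using the train_test_split imported above:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)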
3. Use Support Vector Machine with different kinds of kernels and evaluate
performance
def evaluate_on_test_data(model=None):
    predictions = model.predict(X_test)
    correct_classifications = 0
    for i in range(len(y_test)):
        if predictions[i] == y_test[i]:
            correct_classifications += 1
    accuracy = 100*correct_classifications/len(y_test)  # Accuracy as a percentage
    return accuracy
kernels = ('linear', 'poly', 'rbf')
accuracies = []
for index, kernel in enumerate(kernels):
    model = svm.SVC(kernel=kernel)
    model.fit(X_train, y_train)
    acc = evaluate_on_test_data(model)
    accuracies.append(acc)
    print("{} % accuracy obtained with kernel = {}".format(acc, kernel))
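The decision-boundary plot below relies on a fitted classifier clf and a mesh grid xx, yy that the listing does not define; a minimal sketch, assuming the linear-kernel model and a 0.02 grid step:
clf = svm.SVC(kernel='linear').fit(X_train, y_train)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))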
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.show()
5. Check the support vectors
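A minimal way to inspect them, assuming the fitted linear-kernel model clf from the sketch above:
print("Number of support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)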
Output :
Conclusion :
Experiment No. 9
Problem Statement:
Write a program using SVM algorithm for Boston house price prediction dataset to predict
price of houses from certain features.
Program:
1. Import Libraries
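The import and data-preparation steps are not reproduced in the listing; a minimal sketch, assuming an older scikit-learn release in which load_boston is still available (it was removed in scikit-learn 1.2) and a 75%-25% split:
import numpy as np
from sklearn import svm
from sklearn.datasets import load_boston              # available in scikit-learn < 1.2
from sklearn.model_selection import train_test_split

boston = load_boston()
X = boston.data    # 13 features describing each house/neighbourhood
y = boston.target  # median house price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)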
3. Use Support Vector Machine with different kinds of kernels and evaluate
performance
def evaluate_on_test_data(model=None):
    predictions = model.predict(X_test)
    sum_of_squared_error = 0
    for i in range(len(y_test)):
        err = (predictions[i] - y_test[i])**2
        sum_of_squared_error += err
    mean_squared_error = sum_of_squared_error/len(y_test)
    RMSE = np.sqrt(mean_squared_error)
    return RMSE
kernels = ('linear', 'rbf')
RMSE_vec = []
for index, kernel in enumerate(kernels):
    model = svm.SVR(kernel=kernel)
    model.fit(X_train, y_train)
    RMSE = evaluate_on_test_data(model)
    RMSE_vec.append(RMSE)
    print("RMSE={} obtained with kernel = {}".format(RMSE, kernel))
Output :
Conclusion :