
Lab Manual for

Introduction to Machine Learning


(Subject Code: 3171114)

B.E. 7th Semester

Name of Faculty: Dr. B. J. Makwana

Enrolment No:

Name of Student:_

Government Engineering College, Rajkot


Electronics and Communication Engineering
Department
CERTIFICATE

This is to certify that Miss / Mr. ______________________________________,


of Semester 7th Enrolment number _______________________________ has
satisfactorily completed his/her laboratory work for the subject “Introduction to
Machine Learning (3171114)” as per G.T.U. guidelines in the academic year
__________________ .

Date of submission: _______________

________________                              ______________________
Head of Department                            Faculty
                                              (Dr. B. J. Makwana)
Index

Sr. No.  Name of Experiment  (Page No. / Date / Sign)

1. Introduction to Python Programming for Machine Learning
2. Generate a synthetic data set and use linear regression technique to develop a model, and evaluate on test samples.
3. Write a program for Logistic Regression to classify IRIS data for two features (sepal length and width).
4. Write a program for the concept of decision tree to develop a piecewise linear model and test it as well.
5. Write a program for kNN algorithm for classification of IRIS dataset.
6. Write a program using PCA algorithm for dimensionality reduction in case of Olivetti dataset, and follow it with KNN algorithm for face recognition.
7. Write a program using Bayes algorithm for email classification (spam or non-spam) for the open-source data set from the UC Irvine Machine Learning Repository.
8. Write a program using SVM on IRIS dataset and carry out classification.
9. Write a program using SVM algorithm for Boston house price prediction dataset to predict price of houses from certain features.
Experiment No. 1

Introduction to Python Programming for Machine Learning

Problem Statement: Run the commands using Anaconda- Jupyter notebook

Theory :

Important Libraries

1. NumPy : Numerical Python


• It is a core component that makes Python one of the favourite languages for Data Science.
• It basically stands for Numerical Python and consists of multidimensional array
objects.
• By using NumPy, we can perform the following important operations −
▪ Mathematical and logical operations on arrays.
▪ Fourier transformation
▪ Operations associated with linear algebra.
We can also see NumPy as an open-source replacement for MATLAB, because it is mostly used along with SciPy (Scientific Python) and Matplotlib (plotting library).
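A small illustrative example of these operations (the array names are arbitrary):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]]) # 2-D array
b = np.array([10.0, 20.0])             # 1-D array

print(a + b)          # broadcasting: b is added to every row of a
print(a.T @ a)        # linear algebra: matrix product
print(np.fft.fft(b))  # Fourier transform of a 1-D signal
print(a > 2.0)        # element-wise logical operation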

2. Pandas

Pandas is an open-source Python Library used for high-performance data manipulation and
data analysis using its powerful data structures.

With the help of Pandas, data processing can be organized into the following five steps −

• Load
• Prepare
• Manipulate
• Model
• Analyze

Key Features of Pandas

• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.

Pandas deals with the following data structures −

• Series
• DataFrame
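A short illustrative example of the two data structures (the column names are arbitrary):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # 1-D labelled data
df = pd.DataFrame({'length': [5.1, 4.9, 6.2],
                   'width': [3.5, 3.0, 2.9]})       # 2-D labelled table

print(s['b'])                # label-based indexing
print(df.describe())         # quick summary statistics
print(df[df['length'] > 5])  # boolean row selection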

3. SciPy : Scientific Python

The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install and are free of charge. NumPy and SciPy are easy to use, yet powerful enough to be relied on by some of the world's leading scientists and engineers.

4. Scikit-learn

The following are some features of Scikit-learn that make it so useful −


• It is built on NumPy, SciPy, and Matplotlib.
• It is open source.
• It provides a wide range of machine learning algorithms covering the major areas of ML, such as classification, clustering, regression, dimensionality reduction and model selection.

5. Matplotlib
• Matplotlib is a Python library used to create 2D graphs and plots by using Python scripts.
• It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting etc.
• It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts etc.
• It is used along with NumPy to provide an environment that is an effective open-source alternative to MATLAB. It can also be used with graphics toolkits like PyQt and wxPython.
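A minimal pyplot example of the kind of plot used throughout this manual (the data values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 50)
plt.plot(x, np.sin(x), color='blue', linestyle='--', label='sin(x)')  # line plot
plt.scatter(x, np.cos(x), color='red', s=10, label='cos(x)')          # scatter plot
plt.xlabel('x')
plt.ylabel('value')
plt.legend()
plt.title('Example Matplotlib figure')
plt.show()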

Hands on :

Output :

Conclusion:

Experiment No. 2
Problem Statement:

Generate a synthetic data set using following function, and split it into training,
validation, and testing sample points. Use linear regression technique to develop a model,
and evaluate on test samples. ℵ is Gaussian noise.
y = x/2 + sin(x) + ℵ

Steps:
1. Import libraries
2. Prepare data: First we will prepare some data for demonstrating linear regression. To keep things simple we will assume we have a single input feature. Let us use the following function to generate our data:

   y = x/2 + sin(x) + ℵ
3. Split the dataset into training, validation and test sets.
4. Evaluate the model.

Program:
1. Import libraries

import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline
2. Prepare data: First we will prepare some data for demonstrating linear regression. To keep things simple we will assume we have a single input feature. Let us use the following function to generate our data:

   y = x/2 + sin(x) + ℵ

number_of_samples = 100
x = np.linspace(-np.pi, np.pi, number_of_samples)
y = 0.5*x+np.sin(x)+np.random.random(x.shape)
plt.scatter(x,y,color='black') #Plot y-vs-x in dots
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 1: Data for linear regression')
plt.show()

3. Split the dataset into training, validation and test sets: It is always encouraged in
machine learning to split the available data into training, validation and test sets. The
training set is used to train the model. The model is evaluated on the validation set
after every episode of training. The performance on the validation set gives a measure
of how well the model generalizes. Various hyperparameters of the model are tuned
to improve performance on the validation set. Finally, when the model is completely
optimized and ready for deployment, it is evaluated on the test data and the
performance is reported in the final description of the model.
In this example we do a 70%-15%-15% random split of the data between the
training, validation and test sets respectively.

random_indices = np.random.permutation(number_of_samples)
#Training set
x_train = x[random_indices[:70]]
y_train = y[random_indices[:70]]
#Validation set

x_val = x[random_indices[70:85]]
y_val = y[random_indices[70:85]]
#Test set
x_test = x[random_indices[85:]]
y_test = y[random_indices[85:]]

Fit a line to the data

Linear regression learns to fit a hyperplane to our data in the feature space. For one-dimensional data, the hyperplane reduces to a straight line. We will fit a line to our data using sklearn.linear_model.LinearRegression.

model = linear_model.LinearRegression() #Create a least squares error linear regression object

#sklearn takes the inputs as matrices. Hence we reshape the arrays into column matrices
x_train_for_line_fitting = np.matrix(x_train.reshape(len(x_train),1))
y_train_for_line_fitting = np.matrix(y_train.reshape(len(y_train),1))

#Fit the line to the training data


model.fit(x_train_for_line_fitting, y_train_for_line_fitting)

#Plot the line


plt.scatter(x_train, y_train, color='black')
plt.plot(x.reshape((len(x),1)),model.predict(x.reshape((len(x),1))),color='blue')
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 2: Line fit to training data')
plt.show()

4. Evaluate the model

Now that we have our model ready, we must evaluate it. In a linear regression scenario, it's common to evaluate the model in terms of the mean squared error on the validation and test sets.

mean_val_error = np.mean( (y_val - model.predict(x_val.reshape(len(x_val),1)))**2 )


mean_test_error = np.mean( (y_test - model.predict(x_test.reshape(len(x_test),1)))**2 )
print ('Validation MSE: ', mean_val_error, '\nTest MSE: ', mean_test_error)
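The same errors can also be obtained from scikit-learn's metrics module; a minimal cross-check, assuming the arrays defined above:

from sklearn.metrics import mean_squared_error

val_mse = mean_squared_error(y_val, model.predict(x_val.reshape(len(x_val),1)).ravel())
test_mse = mean_squared_error(y_test, model.predict(x_test.reshape(len(x_test),1)).ravel())
print('Validation MSE:', val_mse, '\nTest MSE:', test_mse)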

Output :

Conclusion :

Experiment No. 3

Problem Statement:

Write a program for Logistic Regression to classify IRIS data for two features
(sepal length and width).

Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).

Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.

Steps:
1. Import libraries
2. Prepare data
3. Fit logistic regression model
4. Evaluate the model.

Program:
1. Import libraries

import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline

2. Prepare data: The data has 4 input-features and 3 output-classes. For simplicity we will
use only two features: sepal-length and sepal-width (both in cm) and two output
classes: Iris Setosa and Iris Versicolour.
iris = datasets.load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target
#The first 50 samples are class 0 and the next 50 samples are class 1
X = X[:100]
Y = Y[:100]

number_of_samples = len(Y)
#Splitting into training, validation and test sets
random_indices = np.random.permutation(number_of_samples)
#Training set
num_training_samples = int(number_of_samples*0.7)
x_train = X[random_indices[:num_training_samples]]
y_train = Y[random_indices[:num_training_samples]]
#Validation set
num_validation_samples = int(number_of_samples*0.15)
x_val = X[random_indices[num_training_samples : num_training_samples+num_validation_sample
s]]
y_val = Y[random_indices[num_training_samples: num_training_samples+num_validation_samples
]]
#Test set
num_test_samples = int(number_of_samples*0.15)
x_test = X[random_indices[-num_test_samples:]]
y_test = Y[random_indices[-num_test_samples:]]

#Visualizing the training data


X_class0 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==0]) #Picking only the first two classes
Y_class0 = np.zeros((X_class0.shape[0]),dtype=int)
X_class1 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==1])
Y_class1 = np.ones((X_class1.shape[0]),dtype=int)

plt.scatter([X_class0[:,0]],[X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1'])
plt.title('Fig 3: Visualization of training data')
plt.show()

3. Fit logistic regression model

Now we fit a linear decision boundary through the feature space that separates the two classes well.
We use sklearn.linear_model.LogisticRegression.

model = linear_model.LogisticRegression(C=1e5)#C is the inverse of the regularization factor


full_X = np.concatenate((X_class0,X_class1),axis=0)
full_Y = np.concatenate((Y_class0,Y_class1),axis=0)
model.fit(full_X,full_Y)

# Display the decision boundary


#(Visualization code taken from: http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].

h = .02 # step size in the mesh


x_min, x_max = full_X[:, 0].min() - .5, full_X[:, 0].max() + .5
y_min, y_max = full_X[:, 1].min() - .5, full_X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) #predict for the entire mesh to find the regions for each class in the feature space

# Put the result into a color plot


Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points


plt.scatter([X_class0[:, 0]], [X_class0[:, 1]], c='red', edgecolors='k', cmap=plt.cm.Paired)
plt.scatter([X_class1[:, 0]], [X_class1[:, 1]], c='blue', edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Fig 4: Visualization of decision boundary')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.show()
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

4. Evaluate the model


We calculate the validation and test misclassification errors.

validation_set_predictions = [model.predict(x_val[i].reshape((1,2)))[0] for i in range(x_val.shape[0])]

validation_misclassification_percentage = 0

for i in range(len(validation_set_predictions)):
    if validation_set_predictions[i]!=y_val[i]:
        validation_misclassification_percentage += 1
validation_misclassification_percentage *= 100/len(y_val)
print ('validation misclassification percentage =', validation_misclassification_percentage, '%')

test_set_predictions = [model.predict(x_test[i].reshape((1,2)))[0] for i in range(x_test.shape[0])]

test_misclassification_percentage = 0
for i in range(len(test_set_predictions)):
    if test_set_predictions[i]!=y_test[i]:
        test_misclassification_percentage += 1
test_misclassification_percentage *= 100/len(y_test)
print ('test misclassification percentage =', test_misclassification_percentage, '%')
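Equivalently, scikit-learn can report the accuracy directly; a minimal cross-check, assuming x_val, y_val, x_test and y_test are as prepared above:

from sklearn.metrics import accuracy_score

val_accuracy = accuracy_score(y_val, model.predict(x_val))
test_accuracy = accuracy_score(y_test, model.predict(x_test))
print('validation accuracy =', 100*val_accuracy, '%')
print('test accuracy =', 100*test_accuracy, '%')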

Output :

Conclusion :

Experiment No. 4

Problem Statement:

For the synthetic dataset used in experiment 2, write a program for the concept
of decision tree to develop a piecewise linear model and test it as well.

Steps:
1. Import libraries
2. Prepare data
3. Split the data into training, validation and test sets
4. Fit model
5. Evaluate the model.

Program:
1. Import libraries

import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline

2. Prepare data:

number_of_samples = 100
x = np.linspace(-np.pi, np.pi, number_of_samples)
y = 0.5*x+np.sin(x)+np.random.random(x.shape)
plt.scatter(x,y,color='black') #Plot y-vs-x in dots
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 5: Data for linear regression')
plt.show()

3. Split the data into training, validation and test sets


random_indices = np.random.permutation(number_of_samples)
#Training set
x_train = x[random_indices[:70]]
y_train = y[random_indices[:70]]
#Validation set
x_val = x[random_indices[70:85]]
y_val = y[random_indices[70:85]]
#Test set
x_test = x[random_indices[85:]]
y_test = y[random_indices[85:]]

4. Fit a decision tree regressor to the data

maximum_depth_of_tree = np.arange(10)+1
train_err_arr = []
val_err_arr = []
test_err_arr = []

for depth in maximum_depth_of_tree:

    model = tree.DecisionTreeRegressor(max_depth=depth)
    #sklearn takes the inputs as matrices. Hence we reshape the arrays into column matrices
    x_train_for_line_fitting = np.matrix(x_train.reshape(len(x_train),1))
    y_train_for_line_fitting = np.matrix(y_train.reshape(len(y_train),1))

    #Fit the model to the training data
    model.fit(x_train_for_line_fitting, y_train_for_line_fitting)

    #Plot the piecewise fit
    plt.figure()
    plt.scatter(x_train, y_train, color='black')
    plt.plot(x.reshape((len(x),1)),model.predict(x.reshape((len(x),1))),color='blue')
    plt.xlabel('x-input feature')
    plt.ylabel('y-target values')
    plt.title('Line fit to training data with max_depth='+str(depth))
    plt.show()
5. Evaluate the model (these computations continue inside the loop over tree depths).

    mean_train_error = np.mean( (y_train - model.predict(x_train.reshape(len(x_train),1)))**2 )
    mean_val_error = np.mean( (y_val - model.predict(x_val.reshape(len(x_val),1)))**2 )
    mean_test_error = np.mean( (y_test - model.predict(x_test.reshape(len(x_test),1)))**2 )

    train_err_arr.append(mean_train_error)
    val_err_arr.append(mean_val_error)
    test_err_arr.append(mean_test_error)

    print ('Training MSE: ', mean_train_error, '\nValidation MSE: ', mean_val_error, '\nTest MSE: ', mean_test_error)

plt.figure()
plt.plot(train_err_arr,c='red')
plt.plot(val_err_arr,c='blue')
plt.plot(test_err_arr,c='green')
plt.legend(['Training error', 'Validation error', 'Test error'])
plt.title('Variation of error with maximum depth of tree')
plt.show()

Output :

Conclusion :

Experiment No. 5

Problem Statement:

Write a program for kNN algorithm for classification of IRIS dataset.

Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).

Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.

Program:
1. Import Libraries

from __future__ import print_function

import numpy as np
from sklearn import datasets, neighbors, linear_model, tree
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris, fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from time import time
%matplotlib inline

2. Prepare dataset

First we will prepare the dataset. The dataset we choose is a modified version of the Iris dataset. We
choose only the first two input feature dimensions viz sepal-length and sepal-width (both in cm) for
ease of visualization.
iris = load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target
number_of_samples = len(Y)

print(number_of_samples)

#Splitting into training and test sets


random_indices = np.random.permutation(number_of_samples)
#Training set
num_training_samples = int(number_of_samples*0.75)
x_train = X[random_indices[:num_training_samples]]
y_train = Y[random_indices[:num_training_samples]]

#Test set
x_test = X[random_indices[num_training_samples:]]
y_test = Y[random_indices[num_training_samples:]]

#Visualizing the training data


X_class0 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==0]) #Samples of class 0
Y_class0 = np.zeros((X_class0.shape[0]),dtype=int)
X_class1 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==1]) #Samples of class 1
Y_class1 = np.ones((X_class1.shape[0]),dtype=int)
X_class2 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==2]) #Samples of class 2
Y_class2 = np.full((X_class2.shape[0]),fill_value=2,dtype=int)

plt.scatter([X_class0[:,0]],[ X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]],[ X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 1: Visualization of training data')
plt.show()

Note that the first class is linearly separable from the other two classes but the second and
third classes are not linearly separable from each other.

3. K-nearest neighbour classifier algorithm

Now that our training data is ready we will jump right into the classification task. Just to remind you, the K-nearest neighbour classifier is a non-parametric learning algorithm and does not learn a parameterized function that maps the input to the output. Rather, it looks up the training set every time it is asked to classify a point and finds the K nearest neighbours of the query point. The class of the majority of those neighbours is output as the class of the query point.

model = neighbors.KNeighborsClassifier(n_neighbors = 10) # K = 10


model.fit(x_train, y_train)
4. Visualize the working of the algorithm
Let's see how the algorithm works. We choose the point (5.9, 2.9) as our query point.

query_point = np.array([5.9,2.9])
true_class_of_query_point = 1
predicted_class_for_query_point = model.predict([query_point])
print("Query point: {}".format(query_point))
print("True class of query point: {}".format(true_class_of_query_point))
query_point.shape

Let's visualize the point and its K=10 nearest neighbors.

neighbors_object = neighbors.NearestNeighbors(n_neighbors=10)
neighbors_object.fit(x_train)
distances_of_nearest_neighbors, indices_of_nearest_neighbors_of_query_point = neighbors_object.kneighbors([query_point])
nearest_neighbors_of_query_point = x_train[indices_of_nearest_neighbors_of_query_point[0]]
print("The query point is: {}\n".format(query_point))
print("The nearest neighbors of the query point are:\n{}\n".format(nearest_neighbors_of_query_point))
print("The classes of the nearest neighbors are: {}\n".format(y_train[indices_of_nearest_neighbors_of_query_point[0]]))
print("Predicted class for query point: {}".format(predicted_class_for_query_point[0]))

plt.scatter([X_class0[:,0]], [X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.scatter(query_point[0], query_point[1],marker='^',s=75,color='black')
plt.scatter(nearest_neighbors_of_query_point[:,0], nearest_neighbors_of_query_point[:,1],marker='s',s=150,color='yellow',alpha=0.30)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 3: Working of the K-NN classification algorithm')
plt.show()

def evaluate_performance(model, x_test, y_test):
    test_set_predictions = [model.predict(x_test[i].reshape((1,len(x_test[i]))))[0] for i in range(x_test.shape[0])]
    test_misclassification_percentage = 0
    for i in range(len(test_set_predictions)):
        if test_set_predictions[i]!=y_test[i]:
            test_misclassification_percentage += 1
    test_misclassification_percentage *= 100/len(y_test)
    return test_misclassification_percentage
5. Evaluate the performance on the test set

print("Evaluating K-NN classifier:")
test_err = evaluate_performance(model, x_test, y_test)
print('test misclassification percentage = {}%'.format(test_err))
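The choice of K = 10 above is arbitrary; a small sketch that repeats the evaluation for a few other values of K, reusing x_train, y_train, x_test, y_test and evaluate_performance from above:

for k in (1, 3, 5, 10, 15):
    knn_model = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(x_train, y_train)
    err = evaluate_performance(knn_model, x_test, y_test)
    print('K = {}: test misclassification = {}%'.format(k, err))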

Output :

Conclusion :

Experiment No. 6

Problem Statement:

Write a program using PCA algorithm for dimensionality reduction in case of


Olivetti dataset, and follow it with KNN algorithm for face recognition.

Background : The Olivetti faces dataset

Brief information about Olivetti Dataset:

• Face images taken between April 1992 and April 1994.


• There are ten different images of each of 40 distinct people
• There are 400 face images in the dataset
• Face images were taken at different times, with varying lighting, facial expressions and facial details
• All face images have a black background
• The images are grey level
• Size of each image is 64x64
• Image pixel values were scaled to the [0, 1] interval
• Names of 40 people were encoded to an integer from 0 to 39

Program:
1. Import Libraries

from __future__ import print_function

import numpy as np
from sklearn import datasets, neighbors, linear_model, tree
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris, fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from time import time
%matplotlib inline

2. Prepare dataset
First we will prepare the dataset. The dataset we choose is a modified version of the Iris dataset. We
choose only the first two input feature dimensions viz sepal-length and sepal-width (both in cm) for
ease of visualization.
iris = load_iris()
X = iris.data[:,:2] #Choosing only the first two input-features
Y = iris.target

number_of_samples = len(Y)

print(number_of_samples)

#Splitting into training and test sets


random_indices = np.random.permutation(number_of_samples)
#Training set
num_training_samples = int(number_of_samples*0.75)
x_train = X[random_indices[:num_training_samples]]
y_train = Y[random_indices[:num_training_samples]]

#Test set
x_test = X[random_indices[num_training_samples:]]
y_test = Y[random_indices[num_training_samples:]]

#Visualizing the training data


X_class0 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==0]) #Samples of class 0
Y_class0 = np.zeros((X_class0.shape[0]),dtype=int)
X_class1 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==1]) #Samples of class 1
Y_class1 = np.ones((X_class1.shape[0]),dtype=int)
X_class2 = np.asmatrix([x_train[i] for i in range(len(x_train)) if y_train[i]==2]) #Samples of class 2
Y_class2 = np.full((X_class2.shape[0]),fill_value=2,dtype=int)

plt.scatter([X_class0[:,0]],[ X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]],[ X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 1: Visualization of training data')
plt.show()

Note that the first class is linearly separable from the other two classes but the second and
third classes are not linearly separable from each other.

3. K-nearest neighbour classifier algorithm

Now that our training data is ready we will jump right into the classification task. Just to remind you, the K-nearest neighbour classifier is a non-parametric learning algorithm and does not learn a parameterized function that maps the input to the output. Rather, it looks up the training set every time it is asked to classify a point and finds the K nearest neighbours of the query point. The class of the majority of those neighbours is output as the class of the query point.

model = neighbors.KNeighborsClassifier(n_neighbors = 10) # K = 10


model.fit(x_train, y_train)

4. Visualize the working of the algorithm

Let's see how the algorithm works. We choose the point (5.9, 2.9) as our query point.

query_point = np.array([5.9,2.9])
true_class_of_query_point = 1
predicted_class_for_query_point = model.predict([query_point])
print("Query point: {}".format(query_point))
print("True class of query point: {}".format(true_class_of_query_point))
query_point.shape

Let's visualize the point and its K=10 nearest neighbors.

neighbors_object = neighbors.NearestNeighbors(n_neighbors=10)
neighbors_object.fit(x_train)
distances_of_nearest_neighbors, indices_of_nearest_neighbors_of_query_point = neighbors_object.kneighbors([query_point])
nearest_neighbors_of_query_point = x_train[indices_of_nearest_neighbors_of_query_point[0]]
print("The query point is: {}\n".format(query_point))
print("The nearest neighbors of the query point are:\n{}\n".format(nearest_neighbors_of_query_point))
print("The classes of the nearest neighbors are: {}\n".format(y_train[indices_of_nearest_neighbors_of_query_point[0]]))
print("Predicted class for query point: {}".format(predicted_class_for_query_point[0]))

plt.scatter([X_class0[:,0]], [X_class0[:,1]],color='red')
plt.scatter([X_class1[:,0]], [X_class1[:,1]],color='blue')
plt.scatter([X_class2[:,0]], [X_class2[:,1]],color='green')
plt.scatter(query_point[0], query_point[1],marker='^',s=75,color='black')
plt.scatter(nearest_neighbors_of_query_point[:,0], nearest_neighbors_of_query_point[:,1],marker='s',s=150,color='yellow',alpha=0.30)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(['class 0','class 1','class 2'])
plt.title('Fig 3: Working of the K-NN classification algorithm')
plt.show()

def evaluate_performance(model, x_test, y_test):
    test_set_predictions = [model.predict(x_test[i].reshape((1,len(x_test[i]))))[0] for i in range(x_test.shape[0])]
    test_misclassification_percentage = 0
    for i in range(len(test_set_predictions)):
        if test_set_predictions[i]!=y_test[i]:
            test_misclassification_percentage += 1
    test_misclassification_percentage *= 100/len(y_test)
    return test_misclassification_percentage

5. Evaluate the performance on the test set

print("Evaluating K-NN classifier:")


test_err = evaluate_performance(model, x_test, y_test)
print('test misclassification percentage = {}%'.format(test_err))
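The steps above walk through the kNN classifier on the Iris data. For the Olivetti face-recognition task stated in the problem, a minimal sketch using the imports from step 1 could look like the following (the number of principal components and the value of K are arbitrary choices; fetch_olivetti_faces downloads the dataset on first use):

faces = fetch_olivetti_faces()                 # 400 images of 40 people, 64x64 pixels
X_faces = faces.data                           # each row is a flattened 4096-dimensional image
y_faces = faces.target                         # person id, 0..39

X_tr, X_te, y_tr, y_te = train_test_split(X_faces, y_faces, test_size=0.25, random_state=0, stratify=y_faces)

pca = PCA(n_components=50, whiten=True, random_state=0)  # project 4096 dims onto 50 principal components
X_tr_pca = pca.fit_transform(X_tr)
X_te_pca = pca.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=3)                # kNN face recognition in the reduced space
knn.fit(X_tr_pca, y_tr)
y_pred = knn.predict(X_te_pca)

print(classification_report(y_te, y_pred))
print("Test accuracy: {:.2f}%".format(100*knn.score(X_te_pca, y_te)))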

Output :

Conclusion :

Experiment No. 7

Problem Statement:

Write a program using Bayes algorithm for email classification (spam or non-spam) for the open-source data set from the UC Irvine Machine Learning Repository.

Program:
import numpy as np
from sklearn.model_selection import train_test_split

datafile = open('C:/Users/AntennaPC/Desktop/spambase.data','r')

# Download spambase.data from the MSTeam of this course, save it and give the file path from your pc

data = []
for line in datafile:
    line = [float(element) for element in line.rstrip('\n').split(',')]
    data.append(np.asarray(line))

num_features = 48
X = [data[i][:num_features] for i in range(len(data))]
y = [int(data[i][-1]) for i in range(len(data))]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#Making likelihood estimations

#Find the two classes

X_train_class_0 = [X_train[i] for i in range(len(X_train)) if y_train[i]==0]


X_train_class_1 = [X_train[i] for i in range(len(X_train)) if y_train[i]==1]
#Find the class specific likelihoods of each feature

likelihoods_class_0 = np.mean(X_train_class_0, axis=0)/100.0


likelihoods_class_1 = np.mean(X_train_class_1, axis=0)/100.0

#Calculate the class priors

num_class_0 = float(len(X_train_class_0))

num_class_1 = float(len(X_train_class_1))

prior_probability_class_0 = num_class_0 / (num_class_0 + num_class_1)


prior_probability_class_1 = num_class_1 / (num_class_0 + num_class_1)

log_prior_class_0 = np.log10(prior_probability_class_0)
log_prior_class_1 = np.log10(prior_probability_class_1)

def calculate_log_likelihoods_with_naive_bayes(feature_vector, Class):
    assert len(feature_vector) == num_features
    log_likelihood = 0.0 #using log-likelihood to avoid underflow
    if Class==0:
        for feature_index in range(len(feature_vector)):
            if feature_vector[feature_index] == 1: #feature present
                log_likelihood += np.log10(likelihoods_class_0[feature_index])
            elif feature_vector[feature_index] == 0: #feature absent
                log_likelihood += np.log10(1.0 - likelihoods_class_0[feature_index])
    elif Class==1:
        for feature_index in range(len(feature_vector)):
            if feature_vector[feature_index] == 1: #feature present
                log_likelihood += np.log10(likelihoods_class_1[feature_index])
            elif feature_vector[feature_index] == 0: #feature absent
                log_likelihood += np.log10(1.0 - likelihoods_class_1[feature_index])
    else:
        raise ValueError("Class takes integer values 0 or 1")

    return log_likelihood

def calculate_class_posteriors(feature_vector):
    log_likelihood_class_0 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=0)
    log_likelihood_class_1 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=1)

    log_posterior_class_0 = log_likelihood_class_0 + log_prior_class_0
    log_posterior_class_1 = log_likelihood_class_1 + log_prior_class_1

    return log_posterior_class_0, log_posterior_class_1


def classify_spam(document_vector):
    feature_vector = [int(element>0.0) for element in document_vector]
    log_posterior_class_0, log_posterior_class_1 = calculate_class_posteriors(feature_vector)
    if log_posterior_class_0 > log_posterior_class_1:
        return 0
    else:
        return 1

#Predict spam or not on the test set

predictions = []
for email in X_test:
    predictions.append(classify_spam(email))

def evaluate_performance(predictions, ground_truth_labels):
    correct_count = 0.0
    for item_index in range(len(predictions)):
        if predictions[item_index] == ground_truth_labels[item_index]:
            correct_count += 1.0
    accuracy = correct_count/len(predictions)
    return accuracy
accuracy_of_naive_bayes = evaluate_performance(predictions, y_test)
print(accuracy_of_naive_bayes)

for i in range(100):
    print(predictions[i], y_test[i])
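As a cross-check, scikit-learn's BernoulliNB implements the same Bernoulli Naive Bayes idea on binarized features; a minimal sketch, assuming X_train, y_train, X_test and y_test are as prepared above:

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB(binarize=0.0)   # features > 0 are treated as present, as in classify_spam above
nb.fit(X_train, y_train)
print('BernoulliNB test accuracy:', nb.score(X_test, y_test))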

Output :

Conclusion :

Experiment No. 8

Problem Statement:

Write a program using SVM on IRIS dataset and carry out classification.

Iris Dataset:
• The Iris flower data set consists of 50 samples from each of three species of Iris Flowers — Iris Setosa,
Iris Virginica and Iris Versicolor .
• The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936
paper “The use of multiple measurements in taxonomic problems”.
• Iris data is a multivariate data set.
• Four features measured from each sample are —sepal length, sepal width, petal length and petal width, in
centimeters.
• It consists of a set of 150 records under 5 attributes — Sepal length, Sepal width, Petal length, Petal
width and Class-Labels(Species).

Objective:
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the
three species — Setosa, Virginica and Versicolor.

Program:
1. Import Libraries

from __future__ import division, print_function


import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
# from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

2. Prepare dataset

iris = datasets.load_iris()
X = iris.data[:,:2]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

3. Use Support Vector Machine with different kinds of kernels and evaluate
performance

def evaluate_on_test_data(model=None):
    predictions = model.predict(X_test)
    correct_classifications = 0
    for i in range(len(y_test)):
        if predictions[i] == y_test[i]:
            correct_classifications += 1
    accuracy = 100*correct_classifications/len(y_test) #Accuracy as a percentage
    return accuracy

kernels = ('linear','poly','rbf')
accuracies = []
for index, kernel in enumerate(kernels):
    model = svm.SVC(kernel=kernel)
    model.fit(X_train, y_train)
    acc = evaluate_on_test_data(model)
    accuracies.append(acc)
    print("{} % accuracy obtained with kernel = {}".format(acc, kernel))

4. Visualize the decision boundaries

#Train SVMs with different kernels


svc = svm.SVC(kernel='linear').fit(X_train, y_train)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7).fit(X_train, y_train)
poly_svc = svm.SVC(kernel='poly', degree=3).fit(X_train, y_train)

#Create a mesh to plot in


h = .02 # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

#Define title for the plots


titles = ['SVC with linear kernel',
'SVC with RBF kernel',
'SVC with polynomial (degree 3) kernel']

for i, clf in enumerate((svc, rbf_svc, poly_svc)):

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.figure(i)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.ocean)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()
5. Check the support vectors

#Checking the support vectors of the polynomial kernel (for example)

print("The support vectors are:\n", poly_svc.support_vectors_)


Output :

Conclusion :

Experiment No. 9

Problem Statement:

Write a program using SVM algorithm for Boston house price prediction dataset to predict
price of houses from certain features.

Program:
1. Import Libraries

from __future__ import division, print_function


import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
# from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

2. Load data from the Boston dataset


boston = datasets.load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
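Note that load_boston was removed from scikit-learn 1.2 onward; if that applies to your installation, a hedged alternative is to fetch the same Boston housing data from OpenML (this assumes internet access and that the OpenML dataset named 'boston' is available) and then make the same split as above:

from sklearn.datasets import fetch_openml

boston = fetch_openml(name="boston", version=1, as_frame=False)
X = boston.data.astype(float)    # 13 input features
y = boston.target.astype(float)  # median house price in $1000s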

3. Use Support Vector Machine with different kinds of kernels and evaluate
performance

def evaluate_on_test_data(model=None):
    predictions = model.predict(X_test)
    sum_of_squared_error = 0
    for i in range(len(y_test)):
        err = (predictions[i]-y_test[i])**2
        sum_of_squared_error += err
    mean_squared_error = sum_of_squared_error/len(y_test)
    RMSE = np.sqrt(mean_squared_error)
    return RMSE

kernels = ('linear','rbf')
RMSE_vec = []
for index, kernel in enumerate(kernels):
    model = svm.SVR(kernel=kernel)
    model.fit(X_train, y_train)
    RMSE = evaluate_on_test_data(model)
    RMSE_vec.append(RMSE)
    print("RMSE={} obtained with kernel = {}".format(RMSE, kernel))

Output :

Conclusion :

