Birla Institute of Technology and Science, Pilani
Department of Computer Science & Information Systems
BITS F464 - Machine Learning
I Semester 2020-21
3-Sep-20 Lab Sheet-03 – Principal Component Analysis
Singular-Value Decomposition
The best-known and most widely used matrix decomposition method is the Singular-Value
Decomposition, or SVD. Every matrix has an SVD, which makes it more stable than other
methods, such as the eigendecomposition, which exists only for certain square matrices.
The SVD is widely used both in the calculation of other matrix operations, such as the matrix
inverse, and as a data reduction, compression, and denoising method in machine learning.
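For an m x n matrix A, the decomposition has the form A = U . Sigma . V^T, where U is an m x m
orthogonal matrix, Sigma is an m x n diagonal matrix whose entries are the singular values, and
V^T is the transpose of an n x n orthogonal matrix V.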
Calculate Singular-Value Decomposition
The SVD can be calculated by calling the svd() function from scipy.linalg.
The function takes a matrix and returns the U, Sigma, and V^T elements. The Sigma diagonal
matrix is returned as a vector of singular values. The V matrix is returned in transposed form,
i.e. as VT.
The example below defines a 3×2 matrix and calculates the Singular-value decomposition.
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD
U, s, VT = svd(A)
print(U)
print(s)
print(VT)
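For this 3×2 matrix, svd() returns U as a 3×3 matrix, s as a vector of two singular values, and VT
as a 2×2 matrix.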
Reconstruct Matrix from SVD
The original matrix can be reconstructed from the U, Sigma, and V^T elements.
The U, s, and V^T elements returned from svd() cannot be multiplied together directly.
The s vector must first be converted into a diagonal matrix using the diag() function. By default,
this creates a square n x n matrix, which does not fit the rules of matrix multiplication, where the
number of columns in one matrix must match the number of rows in the next. The fix, shown
below, is to create an m x n matrix of zeros and place diag(s) in its top n x n block.
# Reconstruct SVD
from numpy import array
from numpy import diag
from numpy import dot
from numpy import zeros
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# Singular-value decomposition
U, s, VT = svd(A)
# create m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[1], :A.shape[1]] = diag(s)
# reconstruct matrix
B = U.dot(Sigma.dot(VT))
print(B)
SVD for Pseudoinverse
The pseudoinverse is the generalization of the matrix inverse from square matrices to rectangular
matrices, where the numbers of rows and columns are not equal.
It is also called the Moore-Penrose Inverse, after two independent discoverers of the method,
or the Generalized Inverse.
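Given the SVD A = U . Sigma . V^T, the pseudoinverse can be computed as A^+ = V . D^+ . U^T,
where D^+ is obtained from Sigma by taking the reciprocal of each non-zero singular value and
transposing the result. The listing below follows this recipe.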
# Pseudoinverse via SVD
from numpy import array
from numpy.linalg import svd
from numpy import zeros
from numpy import diag
# define matrix
A = array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8]])
print(A)
# calculate svd
U, s, VT = svd(A)
# reciprocals of s
d = 1.0 / s
# create m x n D matrix
D = zeros(A.shape)
# populate D with n x n diagonal matrix
D[:A.shape[1], :A.shape[1]] = diag(d)
# calculate pseudoinverse
B = VT.T.dot(D.T).dot(U.T)
print(B)
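As a quick sanity check (an addition to the original listing), the result can be compared with
NumPy's built-in pseudoinverse:
# compare with NumPy's built-in pseudoinverse
from numpy.linalg import pinv
print(pinv(A))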
Principal Component Analysis
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to
a new coordinate system such that the greatest variance by some projection of the data comes
to lie on the first coordinate (called the first principal component), the second greatest variance
on the second coordinate, and so on.
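Before turning to scikit-learn, it may help to see how PCA connects to the SVD above. The
following is a minimal sketch (the small matrix X is a made-up example, not part of the wine
dataset): the data is centered, the SVD of the centered matrix is taken, and the data is projected
onto the principal component directions.
# Minimal PCA sketch using the SVD (X is an illustrative, made-up data matrix)
import numpy as np
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
# center the data by subtracting the column means
X_centered = X - X.mean(axis=0)
# rows of VT are the principal component directions
U, s, VT = np.linalg.svd(X_centered, full_matrices=False)
# project the centered data onto the principal components
scores = X_centered.dot(VT.T)
# variance explained by each component
explained_variance = (s ** 2) / (X.shape[0] - 1)
print(scores)
print(explained_variance)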
Import packages and download the wine dataset from
“https://archive.ics.uci.edu/ml/datasets/wine”
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Read in the data and perform basic exploratory analysis
df = pd.read_csv('./Datasets/wine.data.csv')
df.head(10)
Basic statistics
df.iloc[:,1:].describe()
Boxplots by output labels/classes
for c in df.columns[1:]:
    df.boxplot(c,by='Class',figsize=(7,4),fontsize=14)
    plt.title("{}\n".format(c),fontsize=16)
    plt.xlabel("Wine Class", fontsize=16)
It can be seen that some features separate the wine classes quite clearly. For example,
Alcalinity, Total Phenols, or Flavanoids produce boxplots with well-separated medians, which are
clearly indicative of the wine class.
Below is an example of class separation using two variables.
plt.figure(figsize=(10,6))
plt.scatter(df['OD280/OD315 of diluted wines'],df['Flavanoids'],c=df['Class'],edgecolors='k',alpha=0.75,s=150)
plt.grid(True)
plt.title("Scatter plot of two features showing the \ncorrelation and class separation",fontsize=15)
plt.xlabel("diluted wines",fontsize=15)
plt.ylabel("Flavanoids",fontsize=15)
plt.show()
Are the features independent? Plot the correlation matrix
It can be seen that there is a fair amount of correlation between the features, i.e. they are not
independent of each other.
def correlation_matrix(df):
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm
    fig = plt.figure(figsize=(16,12))
    ax1 = fig.add_subplot(111)
    cmap = cm.get_cmap('jet', 30)
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title('Wine data set features correlation\n',fontsize=15)
    labels = df.columns
    # set the tick positions explicitly so each label lines up with its row/column
    ax1.set_xticks(range(len(labels)))
    ax1.set_yticks(range(len(labels)))
    ax1.set_xticklabels(labels,fontsize=9,rotation=90)
    ax1.set_yticklabels(labels,fontsize=9)
    # Add colorbar, make sure to specify tick locations to match desired ticklabels
    fig.colorbar(cax, ticks=[0.1*i for i in range(-11,11)])
    plt.show()
correlation_matrix(df)
correlation_matrix(df)
Principal Component Analysis
Data scaling
PCA requires the data to be scaled/normalized in order to work properly.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = df.drop('Class',axis=1)
y = df['Class']
X = scaler.fit_transform(X)
dfx = pd.DataFrame(data=X,columns=df.columns[1:])
dfx.head(10)
dfx.describe()
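After scaling, describe() should show that every feature now has a mean of approximately 0 and a
standard deviation of approximately 1.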
PCA class import and analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=None)
dfx_pca = pca.fit(dfx)
Plot the explained variance ratio
plt.figure(figsize=(10,6))
plt.scatter(x=[i+1 for i in range(len(dfx_pca.explained_variance_ratio_))],y=dfx_pca.explained_variance_ratio_, s=200, alpha=0.75,c='orange',edgecolor='k')
plt.grid(True)
plt.title("Explained variance ratio of the \nfitted principal component vector\n",fontsize=25)
plt.xlabel("Principal components",fontsize=15)
plt.xticks([i+1 for i in range(len(dfx_pca.explained_variance_ratio_))],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel("Explained variance ratio",fontsize=15)
plt.show()
The above plot means that the 1st principal component explains about 36% of the total variance
in the data and the 2nd component explains a further 20%. Therefore, if we consider just the first
two components, they together explain about 56% of the total variance.
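To read these numbers off directly (a small addition to the sheet), the cumulative explained
variance can be printed with NumPy:
# cumulative explained variance across the principal components
import numpy as np
print(np.cumsum(dfx_pca.explained_variance_ratio_))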
Showing better class separation using principal components
Transform the scaled data set using the fitted PCA object
dfx_trans = pca.transform(dfx)
Put it in a data frame
dfx_trans = pd.DataFrame(data=dfx_trans)
dfx_trans.head(10)
Plot the first two columns of this transformed data set with the color set to the original ground-truth
class label
plt.figure(figsize=(10,6))
plt.scatter(dfx_trans[0],dfx_trans[1],c=df['Class'],edgecolors='k',alpha=0.75,s=150)
plt.grid(True)
plt.title("Class separation using first two principal components\n",fontsize=20)
plt.xlabel("Principal component-1",fontsize=15)
plt.ylabel("Principal component-2",fontsize=15)
plt.show()
Lab 03 Exercise (Submit the code in the given time):
Download any dataset with integer attribute types (you can also use the wine dataset),
split the data into training and testing sets, then perform linear regression using any of
the methods introduced in the previous lab. Then compare the prediction accuracy
with and without applying PCA to the training data.
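One possible outline is sketched below; the dataset path, the choice of 'Class' as the regression
target, and n_components=5 are assumptions that you should adapt to your own dataset.
# Outline sketch for the exercise (dataset path, target column, and n_components are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv('./Datasets/wine.data.csv')   # or any other integer-attribute dataset
X, y = df.drop('Class', axis=1), df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)

# linear regression without PCA
reg = LinearRegression().fit(scaler.transform(X_train), y_train)
print("R^2 without PCA:", reg.score(scaler.transform(X_test), y_test))

# linear regression with PCA applied to the scaled training data
pca = PCA(n_components=5).fit(scaler.transform(X_train))
reg_pca = LinearRegression().fit(pca.transform(scaler.transform(X_train)), y_train)
print("R^2 with PCA:", reg_pca.score(pca.transform(scaler.transform(X_test)), y_test))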