
Principal Component Analysis with Python


Last Updated : 11 Jul, 2025

Principal Component Analysis (PCA) is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Each principal component is chosen so that it describes as much of the still-available variance as possible, and all principal components are orthogonal to each other. Among all the principal components, the first principal component has the maximum variance.

Uses of PCA
1. It is used to find interrelations between variables in the data.
2. It is used to interpret and visualize data.
3. It reduces the number of variables, which makes further analysis simpler.
4. It's often used to visualize genetic distance and relatedness between populations.

PCA is performed on a square symmetric matrix. This can be a pure sums of squares and cross-products (SSCP) matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used if the individual variances differ greatly.
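
As a rough illustration of these two matrices (using a small hypothetical NumPy array rather than the wine data used later), note that the correlation matrix is just the covariance matrix of the standardized variables:

import numpy as np

# hypothetical data: 100 observations of 4 possibly correlated variables
rng = np.random.default_rng(42)
data = rng.normal(size=(100, 4))

cov_matrix = np.cov(data, rowvar=False)        # covariance matrix (4 x 4, symmetric)
corr_matrix = np.corrcoef(data, rowvar=False)  # correlation matrix (unit diagonal)

# the correlation matrix equals the covariance matrix of the standardized data
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(np.cov(standardized, rowvar=False, ddof=0), corr_matrix))  # True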

Objectives of PCA
1. It is a non-dependent procedure: it reduces the attribute space from a large number of variables to a smaller number of factors.
2. PCA is a dimension-reduction process, but there is no guarantee that the resulting dimensions are interpretable.
3. The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
4. Identifying patterns: PCA can help identify patterns or relationships between variables
that may not be apparent in the original data. By reducing the dimensionality of the
data, PCA can reveal underlying structures that can be useful in understanding and
interpreting the data.
5. Feature extraction: PCA can be used to extract features from a set of variables that are
more informative or relevant than the original variables. These features can then be
used in modeling or other analysis tasks.
6. Data compression: PCA can be used to compress large datasets by reducing the
number of variables needed to represent the data, while retaining as much information
as possible.
7. Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and
removing the principal components that correspond to the noisy parts of the data (a
short code sketch of points 6 and 7 follows after this list).
8. Visualization: PCA can be used to visualize high-dimensional data in a lower-
dimensional space, making it easier to interpret and understand. By projecting the data
onto the principal components, patterns and relationships between variables can be
more easily visualized.
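
A minimal sketch of the data compression and noise reduction ideas (points 6 and 7), using hypothetical synthetic data rather than the wine dataset used later: keeping only the leading components compresses the data, and reconstructing from them discards the directions that mostly carry noise.

import numpy as np
from sklearn.decomposition import PCA

# hypothetical example: 10-dimensional data that truly lives in 2 dimensions, plus noise
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

pca = PCA(n_components=2)
compressed = pca.fit_transform(noisy)          # compression: 10 columns -> 2 columns
denoised = pca.inverse_transform(compressed)   # reconstruction drops the discarded (noisy) directions

print(noisy.shape, compressed.shape, denoised.shape)  # (200, 10) (200, 2) (200, 10)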

Principal Axis Method


PCA searches for a linear combination of variables from which we can extract the maximum
variance. Once this combination is found, PCA removes the variance it explains and searches
for another linear combination that accounts for the maximum proportion of the remaining
variance, which leads to orthogonal factors. In this method, we analyze the total variance.
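
A minimal sketch of this idea (assuming a small random NumPy matrix, not the wine data used later): find the direction of maximum variance, subtract the variance it explains from the covariance matrix, and repeat; the extracted directions come out orthogonal.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
Ac = A - A.mean(axis=0)          # center the variables
C = Ac.T @ Ac / (len(Ac) - 1)    # sample covariance matrix (5 x 5)

def leading_direction(M, iters=1000):
    # power iteration: in practice converges to the eigenvector with the largest eigenvalue
    v = np.ones(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

M = C.copy()
components = []
for _ in range(2):
    v = leading_direction(M)
    components.append(v)
    var_explained = v @ M @ v                 # variance captured by this direction
    M = M - var_explained * np.outer(v, v)    # deflation: remove that variance and repeat

print(np.dot(components[0], components[1]))  # close to 0: the two components are orthogonal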

Eigenvector
An eigenvector is a non-zero vector that stays parallel to itself after matrix multiplication.
Suppose x is an eigenvector of dimension r of a matrix M of dimension r*r; then Mx and x are
parallel. To obtain the eigenvectors and eigenvalues we solve Mx = λx, where both x and λ are
unknown.

In terms of eigenvectors, the principal components show both the common and the unique
variance of the variables. PCA is a variance-focused approach that seeks to reproduce the
total variance and the correlation with all components. The principal components are linear
combinations of the original variables, weighted by their contribution to explaining the
variance in a particular orthogonal dimension.

Eigen Values
Eigenvalues are also known as characteristic roots. An eigenvalue measures the variance in
all the variables that is accounted for by a given factor. The ratio of eigenvalues is the
ratio of the explanatory importance of the factors with respect to the variables. If a
factor's eigenvalue is low, it contributes little to the explanation of the variables. In
simple words, an eigenvalue measures the amount of variance in the whole dataset accounted
for by the factor. A factor's eigenvalue can be calculated as the sum of its squared factor
loadings over all the variables.
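
A minimal numerical sketch of both ideas (on small hypothetical standardized data, not the wine dataset used below): the eigenvalues of the covariance matrix are the variances along the principal component directions (the eigenvectors), and each eigenvalue divided by the sum of all eigenvalues is that component's share of the total variance.

import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 3))
data = (data - data.mean(axis=0)) / data.std(axis=0)   # standardize the variables

C = np.cov(data, rowvar=False)                  # covariance matrix of the standardized data
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is meant for symmetric matrices

# sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues / eigenvalues.sum())          # proportion of variance explained by each component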

Now, let's understand Principal Component Analysis with Python.

To get the dataset used in the implementation, click here.

Step 1: Importing the libraries

# importing required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Importing the dataset

Import the dataset and distribute it into X and y components for data analysis.

# importing or loading the dataset
dataset = pd.read_csv('wine.csv')

# distributing the dataset into two components X and Y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

Step 3: Splitting the dataset into the Training set and Test set

# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Step 4: Feature Scaling

Perform pre-processing on the training and test sets, such as fitting the StandardScaler.

# performing preprocessing part
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
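
As an optional sanity check (not part of the original steps), each scaled training feature should now have mean close to 0 and standard deviation close to 1:

# optional check on the scaled training data
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))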

Step 5: Applying PCA function

Apply the PCA function to the training and testing sets for analysis.

# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_
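
The explained_variance array captured above can be inspected directly; the exact numbers depend on the train/test split, but together the two retained components account for a substantial share of the total variance:

# proportion of variance explained by each of the two retained components
print(explained_variance)

# total fraction of the original variance kept by the 2-component projection
print(explained_variance.sum())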

Step 6: Fitting Logistic Regression To the training set

# Fitting Logistic Regression To the training set
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

Step 7: Predicting the test set result

# Predicting the test set result using
# predict function under LogisticRegression
y_pred = classifier.predict(X_test)

Step 8: Making the confusion matrix

# making confusion matrix between
# test set of Y and predicted value.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
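
Optionally (a small addition, not part of the original steps), the confusion matrix can be printed and summarized with an accuracy score:

from sklearn.metrics import accuracy_score

print(cm)                              # rows: actual classes, columns: predicted classes
print(accuracy_score(y_test, y_pred))  # fraction of test samples classified correctly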

Step 9: Visualizing the Training set results

# Visualising the Training set results through scatter plot
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

# build a fine grid covering the 2-D principal component space
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

# colour each grid point by the class the classifier predicts there
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75,
             cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# overlay the actual training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()       # to show legend

# show scatter plot
plt.show()

Output:

Logistic Regression Training Set

Step 10: Visualizing the Test set results

# Visualising the Test set results through scatter plot
X_set, y_set = X_test, y_test

X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75,
             cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=j)

# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()

# show scatter plot
plt.show()

Output:
Logistic Regression Test Set

We can visualize the data in the new principal component space:

# plot the first two principal components with labels
colors = ["r", "g", "b"]
labels = ["Class 1", "Class 2", "Class 3"]

for i, color, label in zip(np.unique(y), colors, labels):
    plt.scatter(X_train[y_train == i, 0], X_train[y_train == i, 1],
                color=color, label=label)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.show()

Output:

PCA Visualize

This is a simple example of how to perform PCA using Python. The output of this code is a
scatter plot of the data projected onto the first two principal components, and the explained
variance ratio from Step 5 tells us how much of the original variance those components retain.
By selecting an appropriate number of principal components (see the sketch below), we can
reduce the dimensionality of the dataset and improve our understanding of the data.
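
One common way to choose that number (a sketch, not part of the original tutorial) is to fit PCA with all components on the full scaled feature matrix and look at the cumulative explained variance; the 95% threshold used here is just a rule of thumb:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# re-scale the full 13-feature matrix X (X_train was overwritten by the 2-component projection above)
X_scaled = StandardScaler().fit_transform(X)

pca_full = PCA().fit(X_scaled)   # keep all 13 components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components that retains at least 95% of the total variance
n_components = int(np.argmax(cumulative >= 0.95) + 1)
print(cumulative.round(3))
print(n_components)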

Get the complete notebook and dataset link here:

Notebook link : click here.


Dataset Link: click here