UT-1-Machine Learning Lecture Notes-1
Definition: A computer program is said to learn from experience E (data) with respect to some class
of tasks T (prediction, classification, etc.) and performance measure P if its performance on tasks in T,
as measured by P, improves with experience E.
ML vs AI vs Data Science
Data Science:
● Based on strict analytical evidence
● Deals with structured & unstructured data
● Includes various data operations
Artificial Intelligence:
● Imparts human intellect to machines
● Uses logic and decision trees
● Includes Machine Learning
Machine Learning:
● Subset of AI
● Uses statistical models
● Machines improve with experience
● Data science deals with pre-processing, analyzing, visualizing, and predicting the data, whereas AI implements a predictive model used for forecasting future events.
● Data science banks on statistical techniques while AI leverages computer algorithms.
● The tools used in data science are far more numerous than the ones used in AI, because analyzing data and extracting insights from it involves multiple steps.
● In data science, the focus remains on building models that use statistical insights, whereas, for AI,
the aim is to build models that can emulate human intelligence.
● Data science strives to find hidden patterns in the raw and unstructured data while AI is about
assigning autonomy to data models.
AI vs. Machine Learning
● Artificial intelligence essentially makes machines act out human intelligence while ML deals with
learning from past data without being explicitly programmed.
● AI focuses on making systems that can solve complex problems, while ML aims to make machines learn from data so that their performance on a given task improves with experience.
Types of Learning
● As with any method, there are different ways to train machine learning algorithms, each with
their own advantages and disadvantages.
● In ML, there are two kinds of data — labeled data and unlabeled data.
● Labeled data has both the input and output parameters in a completely machine-readable
pattern, but requires a lot of human labour to label the data, to begin with.
● Unlabeled data only has one or none of the parameters in a machine-readable form.
● This negates the need for human labour but requires more complex solutions.
● There are also some types of machine learning algorithms that are used in very specific use-
cases, but three main methods are used today.
1. Supervised Learning
a. Regression:
● Regression trains on and predicts a continuous-valued response, for example predicting real
estate prices.
● When output Y is discrete valued, it is classification and when Y is continuous, then it is Regression.
vi) Logistic Regression
b. Classification:
● Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment, male and female persons, secure and unsecure loans, etc.
● Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-False, etc.
● Common examples of supervised learning include classifying emails into spam and not-spam categories, labeling web pages based on their content, and voice recognition.
● Common classification algorithms:
i) Decision Trees
ii) Random Forest
iii) Support Vector Machines
iv) Neural Networks
2. Unsupervised Learning
● In unsupervised learning, the algorithm is given unlabeled data and must find structure in it on its own to determine how to describe the data. This kind of learning data is called unlabeled data.
b. Association:
● An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database.
● It determines the set of items that occur together in the dataset.
● Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam).
● A typical example of Association rule is Market Basket Analysis.
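As a rough, hypothetical illustration of the idea (made-up transactions, not a full Apriori implementation), the sketch below computes the support and confidence of the rule bread → butter:

```python
# Minimal market-basket sketch: support and confidence for the rule "bread -> butter".
# The transactions below are made-up illustrative data.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = both / n           # fraction of all baskets containing bread AND butter
confidence = both / bread    # of the baskets with bread, how many also have butter

print(f"support(bread, butter)    = {support:.2f}")     # 0.60
print(f"confidence(bread->butter) = {confidence:.2f}")  # 0.75
```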
Common unsupervised learning algorithms:
● K-means clustering
● KNN (K-Nearest Neighbors)
● Hierarchical clustering
Supervised vs. Unsupervised Learning
1. Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
2. A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
3. A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
4. The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
3. Reinforcement Learning
● Reinforcement learning directly takes inspiration from how human beings learn from data in their
lives.
● It features an algorithm that improves upon itself and learns from new situations using a trial-and-error
method.
● Favourable outputs are encouraged or 'reinforced', and non-favourable outputs are discouraged or 'punished'.
● Based on the psychological concept of conditioning, reinforcement learning works by putting the
algorithm in a work environment with an interpreter and a reward system.
● In every iteration of the algorithm, the output result is given to the interpreter, which decides whether
the outcome is favourable or not.
● In case the program finds the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm.
● If the outcome is not favourable, the algorithm is forced to reiterate until it finds a better result.
● In most cases, the reward system is directly tied to the effectiveness of the result.
● In typical reinforcement learning use-cases, such as finding the shortest route between two points on a map, the solution is not an absolute value.
● Instead, it takes on a score of effectiveness, expressed in a percentage value.
● The higher this percentage value is, the more reward is given to the algorithm.
● Thus, the program is trained to give the best possible solution for the best possible reward.
● Here learning data gives feedback so that the system adjusts to dynamic conditions in order to achieve
a certain objective.
● The system evaluates its performance based on the feedback responses and reacts accordingly.
● The best-known instances include self-driving cars and the Go-playing program AlphaGo.
● There are two important learning models in reinforcement learning: Markov Decision Process & Q learning
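A minimal sketch of tabular Q-learning on a made-up one-dimensional "corridor" environment (the environment, reward, and all parameter values are illustrative assumptions, not from the notes) shows the trial-and-error loop and the reward-driven update described above:

```python
import numpy as np

# Tabular Q-learning on a made-up 1-D corridor: states 0..4, reward only at state 4.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0   # favourable outcome is "reinforced"
    done = nxt == n_states - 1
    return nxt, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit, sometimes explore (trial and error)
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
        state = nxt

print(Q.round(2))   # the "move right" column should dominate in every state
```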
Models of Machine Learning
1. Geometric model,
2. Probabilistic Models,
3. Logical Models,
4. Grouping and grading models,
5. Parametric and non-parametric models.
Models of Machine Learning
1. Geometric Models
● In Geometric models, features could be described as points in two dimensions (x- and y-axis) or
a three-dimensional space (x, y, and z).
● Even when features are not by nature geometric, they could be modeled in a geometric
manner (for example, temperature as a function of time can be modeled in two axes).
● In geometric models, there are two ways we could impose similarity.
● First, we could use geometric concepts like lines or planes to segment (classify) the instance space.
● Second, we could use the geometric notion of distance to represent similarity (this is the idea behind the distance models described below).
● Linear models are parametric, which means that they have a fixed form with a small number of numeric
parameters that need to be learned from data.
● For example, in f (x) = mx + c, m and c are the parameters that we are trying to learn from the data.
● This technique is different from tree or rule models, where the structure of the model (e.g., which
features to use in the tree, and where) is not fixed in advance.
● Linear models are stable, i.e., small variations in the training data have only a limited impact on the
learned model.
● In contrast, tree models tend to vary more with the training data, as the choice of a different split at the
root of the tree typically means that the rest of the tree is different as well.
● As a result of having relatively few parameters, Linear models have low variance and high bias.
a. Linear Model
● This implies that Linear models are less likely to overfit the training data than some other
models. However, they are more likely to underfit.
● For example, if we want to learn the boundaries between countries based on labeled data, then
linear models are not likely to give a good approximation.
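A minimal sketch of learning the two parameters m and c of f(x) = mx + c from noisy, made-up data (NumPy's least-squares polynomial fit is used here purely for illustration):

```python
import numpy as np

# Learn the two parameters m and c of f(x) = m*x + c from noisy, made-up data.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.shape)   # "true" m=3, c=2

m, c = np.polyfit(x, y, deg=1)   # ordinary least squares for a degree-1 polynomial
print(f"learned m = {m:.2f}, c = {c:.2f}")
```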
b. Distance Model
● In the context of machine learning, the concept of distance is not based merely on the physical distance between two points.
● Instead, we could think of the distance between two points considering the mode of transport between the two points.
● Travelling between two cities by plane covers less distance physically than by train, because the plane's route is unrestricted.
● Similarly, in chess, the concept of distance depends on the piece used – for example, a Bishop
can move diagonally.
● Thus, depending on the entity and the mode of travel, the concept of distance can be experienced
differently.
● The distance metrics commonly used are Euclidean, Minkowski, Manhattan, and Mahalanobis.
● Distance is applied through the concept of neighbors and exemplars.
● Neighbors are points in proximity with respect to the distance measure expressed through exemplars.
● Exemplars are either centroids that find a centre of mass according to a chosen distance metric or
medoids that find the most centrally located data point.
● The most commonly used centroid is the arithmetic mean, which minimizes squared Euclidean
distance to all other points.
● The algorithms under Geometric Model: KNN, Linear Regression, SVM, Logistic Regression etc
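A small sketch comparing the distance metrics named above on two made-up points (SciPy's distance functions are used only for illustration; the Mahalanobis call is shown with an identity inverse-covariance matrix just to demonstrate the signature):

```python
import numpy as np
from scipy.spatial import distance

# Compare the common distance metrics mentioned above on two made-up points.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print("Euclidean:", distance.euclidean(a, b))      # straight-line distance
print("Manhattan:", distance.cityblock(a, b))      # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(a, b, p=3))
# Mahalanobis needs the inverse covariance matrix of the data; an identity matrix is
# passed here only to illustrate the call, in which case it reduces to Euclidean.
print("Mahalanobis:", distance.mahalanobis(a, b, np.eye(3)))
```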
Models of Machine Learning
2. Probabilistic Models
● Generative models estimate the joint distribution P (Y, X). Once we know the joint
distribution for the generative models, we can derive any conditional or marginal
distribution involving the same variables.
● Thus, the generative model is capable of creating new data points and their labels, knowing
the joint probability distribution.
● The joint distribution looks for a relationship between two variables.
● Once this relationship is inferred, it is possible to infer new data points.
● The algorithms under Probabilistic Models: Naïve Bayes , Gaussian Process Regression etc
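As one hedged example of a probabilistic model from the list above, the sketch below fits scikit-learn's Gaussian Naïve Bayes on made-up two-class data; the model estimates P(Y) and P(X|Y), which together determine the joint distribution P(X, Y):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Naive Bayes as a probabilistic (generative) model on made-up 2-D data.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 0 samples
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # class 1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
print(model.predict([[2.5, 2.0]]))        # most likely class for a new point
print(model.predict_proba([[2.5, 2.0]]))  # P(Y | X), derived from the learned distributions
```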
Models of Machine Learning
3. Logical Models
● Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models.
● A logical expression is an expression that returns a Boolean value, i.e., a True or False outcome.
● Once the data is grouped using a logical expression, the data is divided into homogeneous
groupings for the problem we are trying to solve.
● For example, for a classification problem, all the instances in the group belong to one class.
● There are mainly two kinds of logical models: Tree models and Rule models.
● Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the 'if-part' defines a segment and the 'then-part' defines the behaviour of the model for this segment.
● Tree models can be seen as a particular type of rule model where the if-parts of the rules are organized in a tree structure.
● Both Tree models and Rule models use the same approach to supervised learning. The
approach can be summarized in two strategies:
a) we could first find the body of the rule (the concept) that covers a sufficiently homogeneous
set of examples and then find a label to represent the body.
b) Alternately, we could approach it from the other direction, i.e., first select a class we want
to learn and then find rules that cover examples of the class.
● To understand logical models further, we need to understand the idea of Concept Learning. Concept
Learning involves learning logical expressions or concepts from examples.
● The idea of Concept Learning fits in well with the idea of Machine learning, i.e., inferring a general
function from specific training examples.
● Concept learning forms the basis of both tree-based and rule-based models.
● More formally, Concept Learning involves acquiring the definition of a general category
from a given set of positive and negative training examples of the category.
● A formal definition of Concept Learning is: "The inferring of a Boolean-valued function from training examples of its input and output."
● In concept learning, we only learn a description for the positive class and label everything that doesn't satisfy that description as negative.
● The algorithms under Logical Models: Decision Tree, Random Forest etc.
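A short sketch of a tree model on the standard iris dataset (scikit-learn is assumed to be available); the printed output is the learned IF-THEN structure, where each root-to-leaf path is a segment of the instance space:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A tree model: each if-then path from the root to a leaf defines a segment of the
# instance space, and the leaf's class is the "then-part" for that segment.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # prints the learned IF-THEN structure
```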
Models of Machine Learning
4. Grouping and Grading Models
The key difference between Grouping and Grading is the way they handle the instance space.
a) Grouping Model:
● Grouping models break up the instance space into groups or segments, the number of which is determined at training time.
● They have a fixed resolution, that is, they cannot distinguish instances beyond that resolution.
● At the finest resolution, grouping models assign the majority class to all instances that fall into a segment.
● They determine the right segments and label all the objects in each segment.
● Example: a tree model splits the instance space into smaller subsets. Trees are usually of limited depth and don't contain all the available features. The subsets at the leaves of the tree partition the instance space with some finite resolution. Instances filtered into the same leaf of the tree are treated the same, regardless of any features not in the tree that might be able to distinguish them.
b) Grading Model:
● Grading models do not divide the instance space into segments of fixed resolution; instead they form a single, global model over the whole instance space and can, in principle, distinguish between arbitrary instances (geometric classifiers such as linear models and support vector machines work this way).
Models of Machine Learning
5. Parametric and Non-Parametric Models
a) Parametric Model:
● An easy to understand functional form for the mapping function is a line, as is used in linear
regression:b0 + b1*x1 + b2*x2 = 0
● Where b0, b1 and b2 are the coefficients of the line that control the intercept and slope, and x1
and x2 are two input variables.
● Assuming the functional form of a line greatly simplifies the learning process.
● Now, all we need to do is estimate the coefficients of the line equation and we have a predictive model for the problem. Parametric machine learning algorithms are often also called "linear machine learning algorithms".
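A minimal sketch of estimating the parameters b0, b1, b2 from made-up data; logistic regression is used here only as one convenient parametric learner, and the "true" line used to label the points is an assumption of the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Estimate the three parameters b0, b1, b2 of a separating line from made-up data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 > 0).astype(int)   # "true" line used to label

model = LogisticRegression().fit(X, y)
b0 = model.intercept_[0]
b1, b2 = model.coef_[0]
print(f"b0={b0:.2f}, b1={b1:.2f}, b2={b2:.2f}")   # a fixed, small set of learned numbers
```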
● The problem is, the actual unknown underlying function may not be a linear function like a line.
● It could be almost a line and require some minor transformation of the input data to work right.
● Or it could be nothing like a line in which case the assumption is wrong and the approach will
produce poor results.
● Some more examples of parametric machine learning algorithms include:
1. Logistic Regression
2. Linear Discriminant Analysis
3. Perceptron
4. Naive Bayes
5. Simple Neural Networks
a) Parametric Model:
Benefits
● Simpler: These methods are easier to understand and interpret results.
● Speed: Parametric models are very fast to learn from data.
● Less Data: They do not require as much training data and can work well even if the fit to the data is
not perfect.
Limitations
● Constrained: By choosing a functional form these methods are highly constrained to the specified
form.
b) Non-Parametric Model:
● Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms.
● By not making assumptions, they are free to learn any functional form from the training data.
● Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features.
● Nonparametric methods seek to best fit the training data in constructing the mapping function,
though maintaining some ability to generalize to unseen data.
● As such, they are able to fit a large number of functional forms.
● An easy to understand nonparametric model is the k-nearest neighbors algorithm that makes
predictions based on the k most similar training patterns for a new data instance.
● The method does not assume anything about the form of the mapping function other than
patterns that are close are likely to have a similar output variable.
● Overfitting: Nonparametric methods carry a greater risk of overfitting the training data, and it is harder to explain why specific predictions are made.
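A minimal k-nearest-neighbors sketch on made-up data (k = 3 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# k-nearest neighbors: no fixed functional form is assumed; predictions come from
# the k most similar training patterns (here k=3) for each new instance.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5], [-1.0, -1.0]]))   # nearby training points decide the labels
```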
Important Elements of Machine Learning
Data Formats in Machine Learning
● Each data format represents how the input data is represented in memory.
● This is important as each machine learning application performs well for a particular data
format and worse for others.
● Interchanging between various data formats and choosing the correct format is a major
optimization technique.
Each letter in the formats denotes a particular aspect/ dimension of the data:
● N: Batch size : is the number of images passed together as a group for inference
● C: Channel : is the number of data components that make a data point for the input data. It
is 3 for opaque images and 4 for transparent images.
● H: Height : is the height/ measurement in y axis of the input data
● W: Width : is the width/ measurement in x axis of the input data
● D: Depth : is the depth of the input data
1) NHWC
NHWC denotes (Batch size, Height, Width, Channel). This means there is a 4D array where the
dimensions, in order, represent batch size, height, width, and channel. This 4D array is laid out in
memory in row-major order. Hence, you can visualize the memory layout to see which operations will
access consecutive memory (fast) or memory separated by other data (slow).
2) NCHW
NCHW denotes (Batch size, Channel, Height, Width). This means there is a 4D array where the
dimensions, in order, represent batch size, channel, height, and width. This 4D array is laid out in
memory in row-major order.
3) NCDHW
NCDHW denotes (Batch size, Channel, Depth, Height, Width). This means there is a 5D array where
the dimensions, in order, represent batch size, channel, depth, height, and width. This 5D array is laid
out in memory in row-major order.
4) NDHWC
NDHWC denotes (Batch size, Depth, Height, Width, Channel). This means there is a 5D array where
the dimensions, in order, represent batch size, depth, height, width, and channel. This 5D array is laid
out in memory in row-major order.
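A small sketch of converting between the two 4D layouts with NumPy (the batch of zero-filled 32×32 RGB images is made up purely for illustration):

```python
import numpy as np

# A batch of 8 RGB images, 32x32, stored as NHWC (batch, height, width, channel).
nhwc = np.zeros((8, 32, 32, 3), dtype=np.float32)

# Reorder the axes to NCHW (batch, channel, height, width); the data itself is
# unchanged, only its layout/indexing changes.
nchw = nhwc.transpose(0, 3, 1, 2)
print(nhwc.shape, "->", nchw.shape)   # (8, 32, 32, 3) -> (8, 3, 32, 32)
```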
Important Elements of Machine Learning
Learnability in Machine Learning
● Learnability is a quality of products and interfaces that allows users to quickly become familiar with them and make good use of all their features and capabilities.
● Learnability is one component of usability and is often heard in the context of user interface or user experience design, as well as usability and user acceptance testing.
● A very learnable interface or product is sometimes said to be intuitive because the user can immediately
grasp how to interact with the system.
● First-time learnability refers to the degree of ease with which a user can learn a newly-encountered
system without referring to documentation, such as manuals, user guides or FAQ (frequently-asked
questions) lists.
● One element of first-time learn ability is discoverability, which is the degree of ease with which the user
can find all the elements and features of a new system when they first encounter it.
● Learnability over time, on the other hand, is the capacity of a user to gain expertise in working with a given system through repeated interaction.
● There are three different aspects of learnability.
● Relatively simple systems with good learnability are said to have short or steep learning curves,
meaning that most learning associated with the system happens very quickly, after which the rate of
learning levels off or plateaus.
Important Elements of Machine Learning
Statistical Learning Approaches
● Statistical learning theory is a framework for machine learning that draws on statistics and functional analysis.
● It deals with finding a predictive function based on the data presented.
● The main idea in statistical learning theory is to build a model that can draw conclusions from data and make predictions.
5. Statistics in Prediction:
● Statistical methods are required when making a prediction with a finalized model on new data.
6. Problem Framing:
● Requires the use of exploratory data analysis and data mining.
7. Data Cleaning.
Principal Component Analysis (PCA)
PCA is a statistical process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of an orthogonal transformation.
It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
It is a technique to draw strong patterns from the given dataset by reducing the dimensionality.
PCA generally tries to find the lower-dimensional surface to project the high-dimensional
data.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it helps reduce the dimensionality.
• Mean: The mean, also known as the average, is a measure of the central tendency of a dataset. It is
calculated by summing up all the values in the dataset and dividing them by the number of values.
For a dataset with n values x1, x2, x3, ……., xn the mean μ is given by:
μ = (1/n) ∑i=1:n xi
Variance Formula
For a dataset with n values x1, x2, x3, ……., xn the variance σ² is given by:
σ² = (1/n) ∑i=1:n (xi − μ)²
Covariance provides insight into how two variables are related to one another.
More precisely, covariance refers to the measure of how two random variables in a data set will change
together.
A positive covariance means the variables at hand are positively related. Meaning they move in the
same direction.
A negative covariance means the variables are inversely related, or that they move in opposite
directions.
Standard deviation: σ = √[(1/N) ∑i=1:n (xi − μ)²]
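A quick numerical illustration of these formulas on a small made-up sample (NumPy's var/std divide by n here, matching the 1/N form above; np.cov with ddof=0 does the same for the covariance):

```python
import numpy as np

# Mean, variance, standard deviation and covariance for a small made-up sample.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print("mean      :", x.mean())                     # (1/n) * sum(x_i)
print("variance  :", x.var())                      # population variance, divides by n
print("std dev   :", x.std())                      # square root of the variance
print("cov(x, y) :", np.cov(x, y, ddof=0)[0, 1])   # positive -> x and y move in the same direction
```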
Eigenvalue
The specific set of scalars associated with a system of linear equations is known as the eigenvalues.
An eigenvalue is also referred to as a characteristic value, a characteristic root, a proper value, or a latent root.
Ax = λx
The eigenvalue of A in this equation is the scalar value λ.
Eigenvector
When a linear transformation is applied, eigenvectors are non-zero vectors that do not change direction; they are only scaled by a scalar quantity.
A vector x is an eigenvector of A if Ax is a scalar multiple of x. The set of all eigenvectors sharing the same eigenvalue, together with the zero vector, makes up the eigenspace for that eigenvalue.
The zero vector itself, however, is not an eigenvector. If A is an n×n matrix and λ is an eigenvalue of A, then a non-zero vector x is called an eigenvector if it fulfils the following expression:
Ax = λx
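A short sketch verifying Ax = λx numerically for a small example matrix (the matrix itself is made up):

```python
import numpy as np

# Verify A @ x == lambda * x for a small example matrix.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):   # eigenvectors are the columns
    print("lambda =", round(lam.real, 3),
          "| A@v == lambda*v:", np.allclose(A @ v, lam * v))
```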
Dimensionality:
It is the number of features or variables present in the given dataset. More easily, it is the number of
columns present in the dataset.
Correlation:
It signifies how strongly two variables are related to each other: if one changes, the other variable also changes.
The correlation value ranges from -1 to +1.
Here, -1 occurs if variables are inversely proportional to each other, and +1 indicates that
variables are directly proportional to each other.
Orthogonal:
It defines that variables are not correlated to each other, and hence the correlation between the pair of
variables is zero.
Eigenvectors:
If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
Covariance Matrix:
A matrix containing the covariance between the pair of variables is called the Covariance Matrix.
Covariance Matrix
The variance-covariance matrix is a square matrix with diagonal elements that represent the variance
and the non-diagonal components that express covariance.
The covariance of a pair of variables can take any real value: positive, negative, or zero.
A positive covariance suggests that the two variables have a positive relationship, whereas a negative covariance indicates an inverse relationship.
How to Find Covariance Matrix?
The dimensions of a covariance matrix are determined by the number of variables in a given data set.
If there are only two variables in a set, then the covariance matrix would have two rows and two
columns.
Similarly, if a data set has three variables, then its covariance matrix would have three rows and three
columns.
The data pertains to marks scored by Anna, Caroline, and Laura in Psychology and History. Make a covariance matrix.

Student      Psychology (X)   History (Y)
Anna              80              70
Caroline          63              20
Laura            100              50
The following steps have to be followed:
Step 1: Find the mean of variable X. Sum up all the observations in variable X and divide the sum obtained with
the number of terms. Thus, (80 + 63 + 100)/3 = 81.
Step 2: Subtract the mean from all observations: (80 – 81), (63 – 81), (100 – 81).
Step 3: Take the squares of the differences obtained above and then add them up: (80 – 81)² + (63 – 81)² + (100 – 81)².
Step 4: Find the variance of X by dividing the value obtained in Step 3 by 1 less than the total number of observations.
var(X) = [(80 – 81)² + (63 – 81)² + (100 – 81)²] / (3 – 1) = 343.
Step 5: Similarly, repeat steps 1 to 4 to calculate the variance of Y.
Var(Y) = 633.333
Step 6: Choose a pair of variables.
Step 7: Subtract the mean of the first variable (X) from all observations: (80 – 81), (63 – 81), (100 – 81).
Step 8: Repeat the same for variable Y: (70 – 47), (20 – 47), (50 – 47).
Step 9: Multiply the corresponding terms: (80 – 81)(70 – 47), (63 – 81)(20 – 47), (100 – 81)(50 – 47).
Step 10: Find the covariance by adding these values and dividing them by (n – 1).
Cov(X, Y) = [(80 – 81)(70 – 47) + (63 – 81)(20 – 47) + (100 – 81)(50 – 47)] / (3 – 1) = 260.
Step 11: Use the general formula for the covariance matrix to arrange the terms:
C = [ var(X)    Cov(X, Y) ] = [ 343    260    ]
    [ Cov(X, Y) var(Y)    ]   [ 260    633.33 ]
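The worked example above can be checked numerically; the sketch below reproduces var(X) = 343, Var(Y) ≈ 633.33, Cov(X, Y) = 260 and the full covariance matrix with NumPy:

```python
import numpy as np

# Reproduce the worked example: Psychology (X) and History (Y) marks for three students.
X = np.array([80.0, 63.0, 100.0])
Y = np.array([70.0, 20.0, 50.0])

var_x = X.var(ddof=1)                   # divide by n-1, as in Step 4  -> 343.0
var_y = Y.var(ddof=1)                   # -> 633.33...
cov_xy = np.cov(X, Y, ddof=1)[0, 1]     # -> 260.0

C = np.cov(np.vstack([X, Y]), ddof=1)   # full 2x2 covariance matrix
print(var_x, var_y, cov_xy)
print(C)
```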
Example applications:
• Face recognition
• Image compression
• Gene expression analysis
The main goal of Principal Component Analysis
(PCA) is to reduce the dimensionality of a dataset
while preserving the most important patterns or
relationships between the variables without any prior
knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of
a data set by finding a new set of variables, smaller than the original set of
variables, retaining most of the sample's information, and useful for
the regression and classification of data.
Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies
a set of orthogonal axes, called principal components, that capture the maximum variance in the
data.
The principal components (PC) are linear combinations of the original variables in the dataset
and are ordered in decreasing order of importance.
The total variance captured by all the principal components is equal to the total variance in the
original dataset.
The first principal component captures the most variation in the data, while the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on.
In Principal Component Analysis, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries.
Applications of PCA
In feature selection, PCA can be used to identify the most important variables in
a dataset.
In data compression, PCA can be used to reduce the size of a dataset without
losing important information.
PCA (Principal Component Analysis)
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean
of 0 and a standard deviation of 1.
Z=(X−μ)/σ
Here,
μ is the mean of independent features μ={μ1,μ2,⋯,μm}
σ is the standard deviation of independent features
σ={σ1,σ2,⋯,σm}
Step 2: Covariance Matrix Computation
Compute the covariance matrix of the standardized data; it captures how every pair of features varies together.
PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models.
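A compact sketch of the whole PCA recipe (standardize, covariance matrix, eigen-decomposition, sort, project) on made-up correlated data; the mixing matrix and the choice of keeping two components are illustrative assumptions:

```python
import numpy as np

# Minimal PCA from scratch on made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])   # correlated features

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardization
C = np.cov(Z, rowvar=False)                # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # Step 3: eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]          # Step 4: sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = 100 * eigvals / eigvals.sum()
Y = Z @ eigvecs[:, :2]                     # Step 5: project onto the top 2 PCs
print("variance explained (%):", explained.round(1))
print("transformed shape:", Y.shape)
```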
[Figure: scatter plot of the data with its principal axes. Y1 is the first eigenvector, the direction along which the variance is largest (the key observation); Y2 is the second eigenvector and is ignorable.]
Principal Component Analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
Question: how much spread is in the data along the axis? (distance to the mean)
Variance = (Standard deviation)²
s² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1)
Now consider two dimensions: X = Temperature, Y = Humidity

X=Temperature   Y=Humidity
40              90
40              90
40              90
30              90
15              70
15              70
15              70
30              90
15              70
30              70
30              70
30              90
40              70
30              90

Covariance measures the correlation between X and Y:
• cov(X, Y) = 0: independent
• cov(X, Y) > 0: move in the same direction
• cov(X, Y) < 0: move in opposite directions

cov(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
More than two attributes: covariance matrix
Contains covariance values between all possible pairs of dimensions (= attributes):
C(n×n) = (c_ij), where c_ij = cov(Dim_i, Dim_j)
Steps of PCA
1. Adjust the original data by the mean: X′ = X − X̄.
2. Compute the covariance matrix C of the adjusted X.
3. Find the eigenvectors and eigenvalues of C: an eigenvector of C is a vector e such that Ce = λe, where λ is called an eigenvalue of C; equivalently, (C − λI)e = 0. Most data mining packages do this for you.
4. The percentage of variance captured by the first j principal components is V_j = 100 · (Σx=1..j λx) / (Σx=1..n λx).
[Figure: scree plot showing the variance (%) explained by each principal component PC1–PC10.]
Transformed Data
• Eigenvalue λj corresponds to the variance on component j.
• Thus, sort the eigenvectors by λj.
• Take the first p eigenvectors ei, where p is the number of top eigenvalues kept.
• These are the directions with the largest variances.
The transformed coordinates of instance i are its projections onto those eigenvectors:
[y_i1, y_i2, …, y_ip]ᵀ = [e1, e2, …, ep]ᵀ (x_i − x̄), i.e. y_ij = ej · (x_i − x̄)
An Example
Mean1 = 24.1, Mean2 = 53.8 (X1' = X1 − Mean1, X2' = X2 − Mean2)

X1    X2    X1'     X2'
19    63    -5.1     9.25
39    74    14.9    20.25
30    87     5.9    33.25
30    23     5.9   -30.75
15    35    -9.1   -18.75
15    43    -9.1   -10.75
30    73     5.9    19.25

[Figure: scatter plots of the original data (X1 vs X2) and of the mean-adjusted data (X1' vs X2').]
Covariance Matrix
C = [  75   106 ]
    [ 106   482 ]
Eigenvectors:
e1 = (-0.98, -0.21), λ1 = 51.8
e2 = ( 0.21, -0.98), λ2 = 560.2
If we only keep one dimension: e2
We keep the dimension of e2 = (0.21, -0.98) and obtain the final data as:
y_i = (0.21  -0.98) · (x_i1', x_i2')ᵀ = 0.21·x_i1' − 0.98·x_i2'
Projected values y_i: -10.14, -16.72, -31.35, 31.37, 16.46, 8.62, 19.40, -17.63
[Figure: the projected one-dimensional data plotted along the e2 axis.]
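A quick numerical check of the projection, using the seven mean-adjusted rows recovered in the table above and e2 = (0.21, −0.98):

```python
import numpy as np

# Project the mean-adjusted rows (X1', X2') from the table onto e2 = (0.21, -0.98).
X_adj = np.array([[-5.1,   9.25],
                  [14.9,  20.25],
                  [ 5.9,  33.25],
                  [ 5.9, -30.75],
                  [-9.1, -18.75],
                  [-9.1, -10.75],
                  [ 5.9,  19.25]])
e2 = np.array([0.21, -0.98])

y = X_adj @ e2          # y_i = 0.21 * x_i1' - 0.98 * x_i2'
print(y.round(2))       # [-10.14 -16.72 -31.35  31.37  16.46   8.62 -17.63]
```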
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis
or Discriminant Function Analysis, is a dimensionality reduction technique primarily
utilized in supervised classification problems.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes
impossible for LDA to find a new axis that makes both classes linearly separable.
1. Compute the mean vectors for the different classes from the dataset.
2. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
3. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the
largest eigenvalues.
4. Use this eigenvector matrix to transform the samples onto the new subspace.
Summarizing the LDA approach in 5 steps
1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix W (where every column represents an eigenvector).
5. Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication Y = X × W (where X is an n×d-dimensional matrix representing the n samples, and Y are the transformed n×k-dimensional samples in the new subspace).
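A hedged sketch of LDA as a dimensionality-reduction step using scikit-learn on the iris dataset (the dataset and the choice of two components are assumptions of the example; scikit-learn computes the scatter matrices and eigenvectors internally):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA as supervised dimensionality reduction: project the 4-D iris data onto
# at most (number of classes - 1) = 2 discriminant axes.
X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_new = lda.fit_transform(X, y)   # Y = X x W, computed internally

print(X_new.shape)                     # (150, 2)
print(lda.explained_variance_ratio_)   # how much class separation each axis captures
```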