MACHINE LEARNING AND ITS ALGORITHMS

The document provides an overview of various machine learning algorithms, focusing on Linear Regression, Ridge Regression, Logistic Regression, Naive Bayes, Gaussian Naive Bayes, and Decision Trees. It explains concepts such as residuals, R-squared, bias, variance, and maximum likelihood estimation, along with practical examples for each algorithm. Additionally, it discusses the significance of model evaluation metrics and the implications of overfitting and underfitting in machine learning models.

Linear Regression
->Let there be a data set as shown below:

Our goal is to predict Mouse Size from Mouse Weight. To do this:
1) Draw a line through the data and measure the vertical distance of each sample from the line. Keep rotating the line.
2) The vertical distances of the samples from the line are called RESIDUALS.
3) Sum the squares of the residuals and record this sum for each candidate line.
4) Identify the line for which the sum of squared residuals is least and find its equation. Let the equation of this line be y = ax + b.
5) If the slope of the line is not 0, then knowing the Mouse Weight helps in guessing the Mouse Size.
The value of R² determines how good the guess really is.
SS(mean) = ∑(data − mean)², the sum of squared residuals when the fitted line is the horizontal line through the mean (a line perpendicular to the y-axis).
Variation around the mean: Var(mean) = SS(mean)/n
SS(fit) = ∑(data − line)², the sum of squared residuals around the line for which ∑(residuals)² is minimum.
Var(fit) = SS(fit)/n
R² = (Var(mean) − Var(fit)) / Var(mean)
If R² = 0.6, it means there is a 60% reduction in variance when mouse weight is taken into account to calculate mouse size.
The above equations apply to data of any dimension, whether 2D, 3D or 4D. For example, suppose we want to know how well mouse body weight and mouse tail length predict mouse body length. A 3D plot is used, the plane with the least sum of squared residuals is found, and its equation can then be used to predict body length. Using the same method, SS(mean) and SS(fit) can be found and used to compute R².
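A minimal sketch in Python (numpy is assumed, since the notes name no library; the data points are made up) showing how SS(mean), SS(fit) and R² relate:

import numpy as np

# toy data: mouse weight (x) and mouse size (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# least-squares fit y = a*x + b
a, b = np.polyfit(x, y, 1)
y_fit = a * x + b

ss_mean = np.sum((y - y.mean()) ** 2)   # SS(mean): residuals around the mean line
ss_fit = np.sum((y - y_fit) ** 2)       # SS(fit): residuals around the best-fit line
r_squared = (ss_mean - ss_fit) / ss_mean
print(a, b, r_squared)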

If any parameter is useless and does not make SS(fit) smaller, its coefficient can be set to zero and the parameter neglected. For example, if tail length does not make SS(fit) smaller, the z coefficient can be set to zero and the equation reduces to y = 0.1 + 0.7x.
IMPORTANT CASE (n=2)
If the number of samples is 2, SS(fit) can always be made 0, as any two points can be connected by a straight line. This makes SS(fit) = 0 in every such case, so R² = 1, or 100%.
But because this happens for every dataset with n = 2, the result is not significant by itself; significance is assessed with a p-value, computed from the F statistic:
F = [ (SS(mean) − SS(fit)) / (p_fit − p_mean) ] / [ SS(fit) / (n − p_fit) ]
p_fit = number of parameters in the fit line
p_mean = number of parameters in the mean line
n = number of samples
Ridge Regression
If we have 2 samples and linear regression is used, the sum of squared residuals can always be driven to 0. If Ridge Regression is used on the same data, it instead minimizes ∑(residuals)² + λ(slope)².
The equation of the least-squares fit line is Size = 0.4 + 1.3(weight), so slope = 1.3 and the sum of squared residuals for the least-squares line is 0. For the least-squares line we therefore have ∑(residuals)² + λ(slope)² = λ(1.69).
For the Ridge Regression line, Size = 0.9 + 0.8(weight), so:
∑(residuals)² + λ(slope)² = (0.1)² + (0.3)² + λ(0.8)² = 0.1 + λ(0.64)
The λ(slope)² term is called the Ridge Regression Penalty.
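A small Python sketch of the comparison above. The two data points are not given explicitly in the notes, so the values below are reconstructed to lie on the least-squares line and to reproduce the 0.1 and 0.3 residuals of the ridge line:

import numpy as np

# reconstructed training samples (weight, size) lying on Size = 0.4 + 1.3*weight
weights = np.array([1.2, 1.6])
sizes = np.array([1.96, 2.48])

def ridge_cost(intercept, slope, lam):
    """Sum of squared residuals plus the ridge penalty lambda * slope^2."""
    residuals = sizes - (intercept + slope * weights)
    return np.sum(residuals ** 2) + lam * slope ** 2

lam = 1.0
print(ridge_cost(0.4, 1.3, lam))   # least-squares line: 0 + 1.69*lambda
print(ridge_cost(0.9, 0.8, lam))   # ridge line: 0.1 + 0.64*lambda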
Bias and Variance
Suppose we measure the weight and height of a bunch of mice and want to predict the height of a mouse from its weight. The data is split into a training set and a testing set, and the true relationship between weight and height follows an arc.
If Linear Regression is used, a straight line is drawn through the plot, and it can never align with the arc. The straight line will never capture the true relationship between height and weight for any dataset.
The inability of a machine learning method to capture the true relationship is called Bias. Because a straight line cannot curve, it has large Bias.
In contrast, a more flexible machine learning method can produce a squiggly line that hugs the training set along the arc, so it has low Bias.
In machine learning, the difference in fits between datasets is called Variance.
Because the squiggly line fits the training set well but not the testing set, it is Overfit.
Logistic Regression
Logistic Regression is different from Linear Regression. Let the Y-axis denote the probability that a mouse is obese. A curve is fit to the data to predict whether a mouse is obese from its weight. Because the curve is not a straight line, the fit does not involve residuals and R². In Linear Regression the values on the Y-axis can, in theory, be any number.
If in Logistic Regression the Y-axis values are transformed to log(p/(1 − p)), the Y-axis extends from −∞ to +∞: for p = 0.5, log(1) = 0; as p → 0, log(p/(1 − p)) → −∞; and as p → 1, log(p/(1 − p)) → +∞. With this transformation the graph becomes similar to that of linear regression, with an equation of the form y = ax + b plus an error term.
LOGISTIC REGRESSION FOR DISCRETE VARIABLES
Let there be some obese and some non-obese mice, of which some have the normal gene and some have the mutated gene. Logistic regression is applied to this data using log(odds).
The number of obese mice with the normal gene is 2 and the number of non-obese mice with the normal gene is 9, so log(2/9) ≈ −1.5; a line is drawn at Y = −1.5 and called log(odds)normal.
The number of obese mice with the mutated gene is 7 and the number of non-obese mice with the mutated gene is 3, so log(7/3) ≈ 0.85; a line is drawn at Y = 0.85 and called log(odds)mutated.
These two lines provide the coefficients in the equation below:
log(odds of obesity) = (log(odds)normal)·B1 + (log(odds)mutated − log(odds)normal)·B2
log(odds of obesity) = −1.5·B1 + 2.35·B2
MAXIMUM LIKELIHOOD
In Logistic Regression the log(odds) can be plotted on the Y-axis, which allows a candidate best-fitting line to be drawn. However, the log(odds) transformation pushes the data out to +∞ and −∞, so residuals cannot be used and maximum likelihood is used instead.
In maximum likelihood, the data points are projected onto the candidate line, giving each sample a log(odds) value. Each candidate log(odds) is then converted back to a probability using:
p = e^log(odds) / (1 + e^log(odds)), e.g. if log(odds) = −2.1 then p ≈ 0.1
Once the data is transformed, a squiggly curve is drawn through it.

The likelihood is then calculated by multiplying the predicted probability (the Y value on the curve) for all the obese mice with (1 − Y) for all the non-obese mice, and the log(likelihood) is taken.
The target is to achieve the maximum value of the log(likelihood), so the candidate line is rotated; this changes the projected positions of the data and shifts the squiggly curve, which changes the value of the log(likelihood).
The line giving the maximum value of the log(likelihood) is the best-fit line.
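A minimal Python sketch of this procedure (the data and the candidate lines are made up; only the math module is used):

import math

# made-up training data: mouse weight and whether the mouse is obese (1) or not (0)
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
obese   = [0,   0,   0,   1,   1,   1]

def log_likelihood(intercept, slope):
    """Project each mouse onto the candidate line (log-odds scale), convert back to a
    probability with p = e^log(odds) / (1 + e^log(odds)), and sum the log(likelihood)."""
    total = 0.0
    for w, y in zip(weights, obese):
        log_odds = intercept + slope * w
        p = math.exp(log_odds) / (1 + math.exp(log_odds))
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

# "rotate" the candidate line by varying the intercept and keep the best one
candidates = [(-7 + 0.5 * i, 2.0) for i in range(10)]
best = max(candidates, key=lambda ab: log_likelihood(*ab))
print(best, log_likelihood(*best))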
McFadden's Pseudo R²
If the data is projected onto the best-fitting line, the log(likelihood) obtained for the best-fitting line is −3.77. This value is called LL(fit); in this case LL(fit) = −3.77.
Then the log(odds of obesity) is calculated, which is the log of the ratio of the total number of obese mice to non-obese mice:
log(odds of obesity) = log(obese mice / non-obese mice)
If obese mice = 5 and non-obese mice = 4, the log(odds of obesity) = 0.22, so a horizontal line is drawn at Y = 0.22, the data is projected onto it, and its log(likelihood) is calculated.
The log(likelihood) obtained for this line is called LL(overall probability). McFadden's pseudo R² is then
R² = (LL(overall probability) − LL(fit)) / LL(overall probability)
NAIVE BAYES
Imagine we want to filter spam from normal messages. We take the normal messages and find the probability of occurrence of each word in the normal messages. Similarly, we find the probability of each word in the spam messages.
Eg-> If the normal messages contain a total of 17 words, consisting of 8 'Dear', 5 'Friend', 3 'Lunch' and 1 'Money', then the probability of 'Dear' = 0.47, 'Friend' = 0.29, 'Lunch' = 0.18 and 'Money' = 0.06.
Similarly, probabilities are calculated for the words present in the spam messages.

Now suppose a new message 'Dear Friend' arrives and we need to decide whether it is spam or a normal message. As 8 of the 12 messages in the training data are normal, we might guess that any message, regardless of what it says, is a normal message.
Prior Probability: this initial guess that we observe a normal message is called the Prior Probability.
The prior probability of a new message being normal is p(N) = 8/(8 + 4) = 0.67, as there are 8 normal messages and 4 spams in the training data.
The new message contains 'Dear' and 'Friend', and the probabilities of 'Dear' and 'Friend' in normal messages are p(Dear|N) = 0.47 and p(Friend|N) = 0.29, so:
p(N) × p(Dear|N) × p(Friend|N) = 0.09
Similarly, assuming the new message is spam, p(S) = 0.33, p(Dear|S) = 0.29 and p(Friend|S) = 0.14, so:
p(S) × p(Dear|S) × p(Friend|S) = 0.0133
As the score obtained for normal is greater than the score for spam, the new message is classified as a normal message.
Now suppose a new message 'Lunch Money Money Money Money' arrives.
Assuming the new message is normal:
p(N) × p(Lunch|N) × p(Money|N)⁴ = 1.56 × 10⁻⁶
Assuming the new message is spam:
p(S) × p(Lunch|S) × p(Money|S)⁴ = 0
As the score for normal is greater, the message is classified as normal, even though it is dominated by 'Money'.
Because p(Lunch|S) = 0, any message containing the word 'Lunch' will be treated as normal no matter how many times words like 'Money' are repeated. To avoid this problem, an extra count (a pseudocount, usually 1, drawn as a black box beside each word) is added to every word so that no word has probability 0 in either the spam or the normal messages.
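A small Python sketch of the scoring above. The normal-message counts come from the example; the spam counts for 'Lunch' and 'Money' are not given explicitly in the notes, so the values below (0 and 4 out of 7 words) are assumptions chosen to match p(Dear|S) = 0.29 and p(Friend|S) = 0.14. Setting alpha = 1 applies the pseudocount described above.

normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}   # 17 words in total
spam_counts   = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}   # 7 words in total (assumed)
prior_normal, prior_spam = 8 / 12, 4 / 12

def score(message, counts, prior, alpha=0):
    """prior times the product of per-word probabilities; alpha is the pseudocount per word."""
    total = sum(counts.values()) + alpha * len(counts)
    s = prior
    for word in message.split():
        s *= (counts[word] + alpha) / total
    return s

msg = "Lunch Money Money Money Money"
print(score(msg, normal_counts, prior_normal))        # ~1.4e-06 (the notes get 1.56e-06 with rounded probabilities)
print(score(msg, spam_counts, prior_spam))            # 0.0, because p(Lunch|Spam) = 0
print(score(msg, spam_counts, prior_spam, alpha=1))   # no longer zero once the pseudocount is added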
GAUSSIAN NAIVE BAYES
Imagine we want to predict whether someone loves the 1990 movie Troll 2. Data is collected from people who love Troll 2 and from people who do not love Troll 2; this forms the training data.
From the measured data we compute the mean and standard deviation of the amount of popcorn they ate, the amount of soda they drank and the amount of candy they ate, separately for each group.
On the basis of these means and standard deviations, normal (Gaussian) distributions are drawn for both groups.
The red curves represent the people who do not love Troll 2 and the green curves represent the people who love Troll 2.
Now someone new shows up and tells us they drank 500 ml of soda and ate 25 g of candy and 20 g of popcorn, and we need to decide whether they love Troll 2 or not.
1) An initial guess is made that the person loves Troll 2. As 8 of the 16 people in the training data love Troll 2, p(Loves Troll 2) = 0.5 and p(Does not love Troll 2) = 0.5.
2) These initial guesses are called Prior Probabilities.
3) The prior probability is multiplied by the likelihoods for popcorn, soda and candy, read off the corresponding normal distributions.
4) So we get:
p(Loves Troll 2) × L(popcorn|Loves) × L(soda|Loves) × L(candy|Loves) = x
Note: if this number turns out to be really small, the log of the product is taken to prevent underflow. In this case L(candy|Loves) is a very small number, so the log is taken.
5) So we get log(x) = −124.
6) A similar process is performed assuming the new person does not love Troll 2.
7) So we get:
p(No Love) × L(popcorn|No Love) × L(soda|No Love) × L(candy|No Love) = y
8) As this is also a really small number, log(y) is calculated to prevent underflow, giving −48.
9) Since the score for No Love is much greater than the score for Love, the person does not love Troll 2.
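A minimal Python sketch of the Gaussian Naive Bayes scoring. The group means and standard deviations only appear in a figure in the notes, so the numbers below are assumptions; they are chosen so that, as in the notes, the No Love score comes out larger:

import math

def log_gaussian(x, mean, std):
    """Log of the normal likelihood; working in logs prevents underflow."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# assumed (mean, std) per group for popcorn (g), soda (ml) and candy (g)
loves   = {"popcorn": (24, 4), "soda": (750, 75), "candy": (1, 2)}
no_love = {"popcorn": (4, 2),  "soda": (120, 45), "candy": (20, 5)}
prior_loves = prior_no_love = 0.5            # 8 of the 16 people love Troll 2

new_person = {"popcorn": 20, "soda": 500, "candy": 25}

def log_score(stats, prior):
    s = math.log(prior)
    for feature, value in new_person.items():
        s += log_gaussian(value, *stats[feature])
    return s

# the larger (less negative) log score wins
print(log_score(loves, prior_loves), log_score(no_love, prior_no_love))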
DECISION and CLASSIFICATION TREE
When a Decision Tree classifies things into categories, it is called a Classification Tree.
When a Decision Tree predicts numerical values, it is called a Regression Tree.
Root Node: the top of the tree is called the Root Node.
Internal Nodes: the nodes between the root and the leaves, which have arrows pointing to them and arrows pointing away from them, are called Internal Nodes.
Leaf Nodes: the nodes at the ends of the tree are called Leaf Nodes. Leaf Nodes have arrows pointing to them, but no arrows pointing away from them.
Designing the Decision Tree
Suppose we have data about several people recording whether they love Popcorn, whether they love Soda, their Age, and whether they love the movie 'Cool As Ice'.
1) To design the tree, we need to decide whether 'Loves Popcorn', 'Loves Soda' or Age is the question in the root node.
2) Thus we need to know how well 'Loves Popcorn' predicts whether someone loves the movie 'Cool As Ice'.
3) So 'Loves Popcorn' is placed in the root node of a small tree, and the number of people who love Popcorn and do/do not love 'Cool As Ice' is counted, along with the number of people who do not love Popcorn and do/do not love 'Cool As Ice'.
4) In the same way another simple tree is built with 'Loves Soda' in the root node, counting the people who love Soda and do/do not love 'Cool As Ice', and the people who do not love Soda and do/do not love 'Cool As Ice'.
5) These counts give a clear idea of whether 'Loves Popcorn' or 'Loves Soda' predicts loving 'Cool As Ice' better.
6) Looking at the two trees, neither does a perfect job of predicting whether someone loves 'Cool As Ice'; the leaves contain mixtures of people and are therefore called Impure.
7) Thus the Gini Impurity is calculated for the 'Loves Popcorn' split (a code sketch of this calculation appears at the end of this section):
Gini(left) = 1 − p(Yes|left)² − p(No|left)² = 0.375
Gini(right) = 1 − p(Yes|right)² − p(No|right)² = 0.444
Gini(popcorn) = Gini(left)·p(left) + Gini(right)·p(right) = 0.405
8) The Gini Impurity for Soda is:
Gini(left) = 1 − p(Yes|left)² − p(No|left)² = 0.375
Gini(right) = 1 − p(Yes|right)² − p(No|right)² = 0
Gini(soda) = Gini(left)·p(left) + Gini(right)·p(right) = 0.214
9) The Gini Impurity for Age < 15 is 0.343.
10) Calculation of Gini Impurity for numeric data: the Gini Impurity for a numeric column is calculated by taking the average of each pair of adjacent values and then computing the Gini Impurity for a split at each of these averages.
Eg->
Age   Loves Candy   Average of adjacent ages
7     Yes           (7 + 12)/2 = 9.5
12    Yes           (12 + 15)/2 = 13.5
15    No            (15 + 19)/2 = 17
19    No            (19 + 26)/2 = 22.5
26    No            (26 + 11)/2 = 18.5
11    Yes           (11 + 18)/2 = 14.5
18    No            (18 + 34)/2 = 26
34    No
So we get a new list of candidate thresholds: 9.5, 13.5, 17, 22.5, 18.5, 14.5, 26.
Only one person has an age less than 9.5; the remaining 7 people have ages greater than 9.5, and of those 7 people only 2 love candy. Thus the Gini Impurity for the threshold 9.5 can be calculated as follows:
Gini(left) = 1 − p(Yes)² − p(No)² = 0
Gini(right) = 1 − p(Yes)² − p(No)² = 1 − (2/7)² − (5/7)² = 0.408
Gini(9.5) = Gini(left)·p(left) + Gini(right)·p(right) = 0.357
Here Gini(left) covers the people with age < 9.5 and Gini(right) covers the people with age > 9.5.
Similarly, the Gini Impurity for 22.5 can be calculated: 2 people have age > 22.5 and 6 people have age < 22.5. Neither of the 2 people with age > 22.5 loves candy, while of the 6 people with age < 22.5, 3 love candy and 3 do not. So we have:
Gini(left) = 1 − p(Yes)² − p(No)² = 0.5
Gini(right) = 1 − p(Yes)² − p(No)² = 0
Gini(22.5) = Gini(left)·p(left) + Gini(right)·p(right) = 0.375
12) In the given data, Age < 15 gives the least Gini Impurity among the numeric thresholds, so it is chosen as the candidate from the numeric column.
13) From the calculations we have Gini Impurity of Soda = 0.214, Gini Impurity of Age < 15 = 0.343 and Gini Impurity of Popcorn = 0.405, so 'Loves Soda' is chosen as the root node because it has the least Gini Impurity.
14) Now, to select the next nodes, the data is split on 'Loves Soda': the people who love Soda go to the left and the people who do not love Soda go to the right.
So we get:

15) To extend the tree on the left we need to decide whether 'Loves Popcorn' or 'Age < 12.5' is the better question.
16) We start by asking whether the people in the left node love Popcorn and find the Gini Impurity of that split:
Gini(left) = 1 − p(Yes)² − p(No)² = 0.5
Gini(right) = 1 − p(Yes)² − p(No)² = 0
Gini(popcorn) = Gini(left)·p(left) + Gini(right)·p(right) = 0.25
17) A similar process is performed replacing 'Loves Popcorn' with 'Age < 12.5'; this time only the ages of the people who love Soda are considered, and we get:
Gini(left) = 0, Gini(right) = 0, Gini = 0
18) Thus Age < 12.5 is chosen as the next node.
19) As a result we get the finished tree. The nodes shown in green are leaf nodes, as there is no reason to keep splitting them into new groups; the same applies to the right node. The tree is now built and can be used to make predictions on future data.
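A minimal Python sketch of the Gini Impurity calculation referred to above. The leaf counts are not listed explicitly in the notes, so the (Yes, No) counts below are chosen to reproduce the 0.375, 0.444 and 0.405 values of the popcorn split:

def gini(yes, no):
    """Gini impurity of a single leaf given its Yes/No counts."""
    total = yes + no
    if total == 0:
        return 0.0
    p_yes, p_no = yes / total, no / total
    return 1 - p_yes ** 2 - p_no ** 2

def weighted_gini(left, right):
    """Total Gini impurity of a split: leaf impurities weighted by leaf size."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(*left) + (n_right / n) * gini(*right)

# 'Loves Popcorn' split: left leaf = loves popcorn (1 Yes, 3 No), right leaf = does not (2 Yes, 1 No)
print(gini(1, 3), gini(2, 1))         # 0.375, 0.444
print(weighted_gini((1, 3), (2, 1)))  # ~0.405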
REGRESSION TREE
Let us assume that we are given data about a drug dosage and its effectiveness, as shown below. A Regression Tree gives a simple, fast way to predict effectiveness from dosage.
The two leftmost points have an average dosage of 3, so 3 is a candidate threshold: the point with dosage < 3 has effectiveness 0, whereas the remaining points with dosage > 3 have an average effectiveness of 38.8, so we can write:

To find the best threshold for the Root Node, we keep computing average dosages between adjacent points and use them as candidate thresholds. Each candidate gives new averages in the two child nodes, and the quality of the split is measured by calculating the SUM of SQUARED RESIDUALS over all the points (see the sketch at the end of this section).
The split for which the sum of squared residuals is minimum is the most effective.
Eg-> The 4 points at the bottom left are selected; their average effectiveness is 0 and the candidate threshold is the average dosage A. The average effectiveness of the remaining points is B, so a line y = B is drawn for them and the sum of squared residuals is calculated. If this turns out to be the minimum, then we have:
Root Node: Dosage < A

1) For the given data the Root Node comes out to be Dosage < 14.5.
2) To find the left node we focus on the data with dosage less than 14.5 and perform the same process as before to complete the left node.
3) Using the given data the left node comes out to be Dosage < 11.5. As there is no other data with 11.5 < Dosage < 14.5, the left node has only one child on the right side, with Average = 20.
4) Using steps 2 and 3 the left side of the tree is completed.
5) But a tree grown this far predicts the training data exactly, so there is a chance it will overfit. To avoid overfitting we step back and stop splitting, and the output on the left becomes a single node whose Drug Effectiveness is the average of all the points with dosage < 14.5, so we get:
6) The same steps are performed to complete the right node, and if the right side fits the training data too accurately, step 5 is applied to avoid overfitting, this time using the points with Dosage > 14.5.
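A minimal Python sketch of choosing the root threshold by minimizing the sum of squared residuals (numpy assumed; the dosage/effectiveness values are made up):

import numpy as np

# toy (dosage, effectiveness) data
dosage = np.array([1, 3, 6, 10, 13, 16, 20, 25], dtype=float)
effect = np.array([0, 0, 5, 20, 38, 100, 100, 40], dtype=float)

def ssr_for_threshold(t):
    """Sum of squared residuals when splitting at dosage < t and predicting each side's mean."""
    ssr = 0.0
    for side in (effect[dosage < t], effect[dosage >= t]):
        if len(side):
            ssr += np.sum((side - side.mean()) ** 2)
    return ssr

# candidate thresholds: averages of adjacent sorted dosages
sorted_dosage = np.sort(dosage)
thresholds = (sorted_dosage[:-1] + sorted_dosage[1:]) / 2
best = min(thresholds, key=ssr_for_threshold)
print(best, ssr_for_threshold(best))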
COST COMPLEXITY PRUNING
The main idea behind pruning a Regression Tree is to avoid overfitting the training data, so that the tree does a better job with the testing data.
Cost Complexity Pruning helps in selecting which nodes to prune.
1) The first step in Cost Complexity Pruning is calculating ∑(residuals)² for each tree.

To find ∑(residuals)² for the full tree we add ∑(residuals)² over its leaves (clusters). There are 4 clusters with ∑(residuals)² of 320, 75, 148.8 and 0, whose sum is 543.8, so ∑(residuals)² for the full tree is 543.8.
Similarly, we can find ∑(residuals)² for a subtree: the first subtree has 3 clusters with ∑(residuals)² of 320, 75 and 5099.8, so ∑(residuals)² for that subtree is 5494.8.
In the same way, the remaining subtrees have ∑(residuals)² of 19243.7 and 28897.2.
So the values of ∑(residuals)² are 543.8, 5494.8, 19243.7 and 28897.2.
2) The trees are compared using their Tree Scores: Tree Score = ∑(residuals)² + βT, where βT is the Tree Complexity Penalty and T is the number of leaf nodes.
3) If β = 10000, the Tree Score for the original tree (4 leaf nodes) is 543.8 + 40000 = 40543.8. Similarly, the Tree Scores for the other trees are 35494.8, 39243.7 and 38897.2.
4) The tree with the lowest Tree Score is chosen; thus the selected tree is the subtree with Tree Score 35494.8.
Selecting the Best Value of β
1) The value of β affects the Tree Score of each tree, and this can change which tree is selected. To find the best value of β, the following steps are used:
2) Build a Regression Tree that is fit to all of the data, not just the training data, starting with β = 0. Prune the tree and increase β to λ1; prune again and increase β to λ2; keep going until only a single leaf is left.
3) Now return to the full data and divide it into training and testing datasets. Using just the training data and the β values estimated above, build a full tree and the sequence of subtrees that minimize the Tree Score. Here the values of β are 0, 10000, 15000 and 22000; calculate the Tree Score for each of these trees.
4) Now perform the same steps using the testing dataset and calculate the Tree Scores there as well.
5) Now create a new training and testing split and build a new tree and new subtrees; all other steps stay the same, and the values of β should also stay the same.
Repeat the above process 10 times; the value of β for which the tree and subtrees achieve the minimum Tree Score on the testing data is the best value of β.
CONFUSION MATRIX
The rows in a Confusion Matrix correspond to what the Machine Learning algorithm predicted.
The columns in a Confusion Matrix correspond to the known truth.
True Positives (TP): patients who have heart disease and are correctly identified by the algorithm.
True Negatives (TN): patients who do not have heart disease and are correctly identified by the algorithm.
False Positives (FP): patients who do not have heart disease but the algorithm said they do.
False Negatives (FN): patients who have heart disease but the algorithm said they do not.
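A small sketch that builds a confusion matrix from hypothetical labels (scikit-learn is an assumption; the notes name no library):

from sklearn.metrics import confusion_matrix

# 1 = has heart disease, 0 = does not (made-up truth and predictions)
truth     = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn puts the truth on the rows and predictions on the columns;
# the notes use the transposed layout (predictions on the rows)
cm = confusion_matrix(truth, predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)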
CROSS VALIDATION
Cross validation is usually used to select which Machine Learning model to use. The steps are:
1) Combine the entire data; the entire data includes both the testing data and the training data.
2) The entire data is divided into equal blocks of 25% each.
3) A Machine Learning model, say Logistic Regression, is chosen; 75% of the data is used to train the model and 25% of the data is used to test it.
4) The result is collected and stored.
5) The same model (Logistic Regression) is trained again with a different 25% block held out for testing; this is repeated until every block has been used for testing (4 times here), and the final result is the average of all 4 results.
Once a model like Logistic Regression has been evaluated, the same procedure is applied to the other Machine Learning models, and the model giving the best result is selected as the model to be used.
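A minimal sketch of 4-fold cross validation with Logistic Regression (scikit-learn and a synthetic stand-in dataset are assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data, since the notes do not provide a dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=4: each 25% block takes a turn as the test set; the mean is the final result
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4)
print(scores, scores.mean())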
RANDOM FOREST
A Random Forest consists of a large number of Decision Trees built from bootstrapped samples of a dataset. These decision trees are not all the same, as they have different internal nodes and leaf nodes; usually the number of Decision Trees in a Random Forest is around 100.
As a result, when the model is tested, the outputs of the various decision trees are collected and the majority result is the final output.
The individual Decision Trees are created by considering only a random subset of the variables from the dataset at each split.
Bagging: bootstrapping the data plus using the aggregate (the majority vote) to make a decision is called 'Bagging'.
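A minimal sketch of a 100-tree Random Forest (scikit-learn and synthetic stand-in data are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data, since the notes do not provide a dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# 100 trees, each grown on a bootstrapped sample using a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # each prediction is the majority vote of the 100 trees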
Determining the Missing Data
1) If we have some missing data, as shown below, the missing values are estimated using the following steps.
2) The original dataset has missing values in its last row. As that row belongs to a person who does not have heart disease, and the other people without heart disease do not have blocked arteries, there is a high chance that the person in the final row also does not have blocked arteries.

3) Similarly, the other people without heart disease have weights of 125 and 210, so a reasonable initial guess is that the person in the final row has weight = median of 125 and 210 = 167.5.
Thus the initially imputed data is obtained.
However, to improve the accuracy of these guesses we use a Random Forest, so several decision trees are constructed using the imputed data.
1) The imputed data is run down all of the trees, and for each tree the samples that end up in the same leaf as the sample in the final row are recorded. Eg-> For the first tree, the samples in the third row and the final row end up in the same leaf.

2) Track of the similar samples is kept using a proximity matrix, which has one row and one column for each sample in the dataset. As samples 3 and 4 ended up together for the first tree, a 1 is entered at positions (3,4) and (4,3) of the proximity matrix.

3) Similarly, all the data is run down the second tree, and samples 2 and 3 end up in the same leaf as sample 4.
4) To keep track of this, 1 is added at positions (2,3), (3,2), (2,4), (4,2), (3,4) and (4,3) in the proximity matrix.
5) Similarly, all the data is run down the third tree and the proximity matrix is updated again.
6) Once the data has been run down all the trees and the proximity matrix is fully updated, the values in the proximity matrix are divided by the total number of trees.
7) Let us assume that the proximity matrix after this step is as shown below.
8) Once the proximity values have been estimated, they are used to estimate the missing values more accurately.

Estimating the result for Blocked Arteries:
p(Yes) = 1/3, p(No) = 2/3
Wp(Yes) = p(Yes) × (proximity of the 'Yes' samples) / (all proximities)
All proximities = proximities for sample 4 = 0.1 + 0.1 + 0.8 = 1
Proximity of 'Yes' = proximity of sample 2 = 0.1
Wp(Yes) = (1/3) × 0.1 ≈ 0.033
Wp(No) = p(No) × (proximity of the 'No' samples) / (all proximities)
Proximity of 'No' = proximity of samples 1 and 3 = 0.1 + 0.8 = 0.9
Wp(No) = (2/3) × 0.9 = 0.6
As Wp(No) >> Wp(Yes), the result is No.
Estimating the result for Weight:
SW(1) = 125, SW(2) = 180, SW(3) = 210
Weighted Average = ∑ SW(i) × Proximity(i)
Proximity(1) = 0.1/1 = 0.1, Proximity(2) = 0.1/1 = 0.1, Proximity(3) = 0.8/1 = 0.8
Weighted Average = 12.5 + 18 + 168 = 198.5 = SW(4)
So the final dataset is obtained.
Once the final dataset has been created, we build the random forest again, run the data through the trees, recalculate the proximities and re-estimate the missing values, and this process is repeated several times.
ADA BOOST
In a forest of trees made with AdaBoost, the trees are usually just one node and two leaves.
STUMP: a tree with just a single node and two leaves is called a Stump.
In a Random Forest each tree has an equal vote on the final decision, whereas in a forest of stumps some stumps get more say in the final classification than others.
In the example above, the larger stumps get a larger say in the final classification than the smaller stumps.
Moreover, in a forest of stumps the order of the stumps matters: the errors made by the first stump influence the second stump, the errors made by the second stump influence the third stump, and so on.
CREATING ADA-BOOST
Let the given dataset be as shown.
1) At the start, each sample is given an identical sample weight = 1/(number of samples) = 1/8.
2) We need to predict whether a person has heart disease based on Chest Pain, Blocked Arteries and Patient Weight.
3) To create the first stump, we need to figure out which of Chest Pain, Blocked Arteries and Patient Weight does the best job at classification.
4) Thus 3 separate stumps are constructed, one each for Chest Pain, Blocked Arteries and Patient Weight, the Gini Impurity of each stump is calculated, and the stump with the least Gini Impurity is selected.
Eg-> The Gini Impurity with Chest Pain in the node is 0.466.
Eg-> The Gini Impurity with Blocked Arteries in the node is 0.5.
Eg-> The Gini Impurity with Weight > 176 in the node is 0.2.
So the first stump in the forest is:

As different stumps in AdaBoost may have different say in the final classification, we need to determine the amount of say this stump has:
Amount of Say = ½ ln[(1 − Total Error) / Total Error]
The above stump made one error, so the Total Error is 1/8 and the Amount of Say = ½ ln(7) = 0.97.
Similarly, for the next stump the Total Error = 3/8, so the Amount of Say ≈ 0.25.
For the third stump the Total Error = 0, so the Amount of Say = ∞ (in practice a small error term is added to the Total Error so that the amount of say does not blow up).

As each stump has a different amount of say, the sample weight of each sample must be updated. This is done by decreasing the sample weights of the samples that were correctly classified and increasing the sample weights of the samples that were incorrectly classified:
New sample weight (incorrectly classified) = (sample weight) × e^(Amount of Say)
New sample weight (correctly classified) = (sample weight) × e^(−Amount of Say)
Eg-> For the stump above the Amount of Say is 0.97, so e^(Amount of Say) = 2.63 and e^(−Amount of Say) = 0.38.
So the new sample weight for the incorrectly classified sample is (1/8) × 2.63 = 0.32875, and the new sample weight for each correctly classified sample is (1/8) × 0.38 = 0.0475.
Finally, we normalize all the sample weights so that they sum to 1; the normalized sample weights are then used to build the next stump.
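A minimal Python sketch of the weight update above (which sample was misclassified is a made-up assumption; a single mistake matches the 1/8 total error):

import math

n_samples = 8
weights = [1 / n_samples] * n_samples                                 # every sample starts at 1/8
incorrect = [False, False, True, False, False, False, False, False]   # hypothetical single error

total_error = sum(w for w, bad in zip(weights, incorrect) if bad)     # 1/8
amount_of_say = 0.5 * math.log((1 - total_error) / total_error)       # 0.5*ln(7) ~ 0.97

# increase the weights of misclassified samples, decrease the rest, then normalize to sum to 1
new_weights = [w * math.exp(amount_of_say if bad else -amount_of_say)
               for w, bad in zip(weights, incorrect)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]
print(round(amount_of_say, 2), [round(w, 3) for w in new_weights])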

K-Means Clustering
Imagine we have some data that we can plot on a line and we need to group it into clusters.
Step 1: Select the number of clusters we want to identify in the data. This is the K in K-Means Clustering.
Step 2: Randomly select K distinct data points from the given data. In this example K = 3, so 3 distinct data points are selected.
Step 3: The selected points are the initial clusters.
Step 4: Measure the distance from the first point to each initial cluster.
Step 5: Assign the first point to the nearest cluster. As the nearest cluster to the first point is the blue one, the first point is assigned to the blue cluster. Repeat the same process for all the other points, assigning each point to its nearest cluster.
Step 6: Calculate the mean of each cluster (the lines in the figure indicate the position of each cluster's mean).
Step 7: Re-assign each point to the cluster whose mean is closest to it, and repeat from Step 6.
Step 8: If the result does not change, the process is complete.
Step 9: To get the best result, perform all the above steps again with different initial clusters and keep the clustering with the least total variation within the clusters.
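A minimal Python sketch of these steps on 1-D data (the data values are made up):

import random

# 1-D toy data; we want K = 3 clusters
data = [1.0, 1.5, 2.1, 7.8, 8.2, 8.9, 15.0, 15.5, 16.2]
k = 3

centers = random.sample(data, k)          # Step 2: K random distinct points as initial clusters
for _ in range(100):                      # repeat until the assignments stop changing
    clusters = [[] for _ in range(k)]
    for x in data:                        # Steps 4-5: assign each point to its nearest center
        nearest = min(range(k), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    new_centers = [sum(c) / len(c) if c else centers[i]    # Step 6: recompute the cluster means
                   for i, c in enumerate(clusters)]
    if new_centers == centers:            # Step 8: stop when nothing changes
        break
    centers = new_centers

print(centers, clusters)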
GRADIENT BOOSTING
Gradient Boosting starts by making a single leaf instead of a tree or a stump; this leaf represents an initial guess for the Weights of all of the samples.
Then a tree is built. Like AdaBoost, this tree is based on the errors made by the previous predictor, but unlike AdaBoost the tree is usually larger than a stump.
Once the tree is built, its contribution is scaled; all the trees are scaled by the same amount.
Then another tree is built based on the errors made by the previous trees, and this process continues.
For the training data given below, we need to predict Weight using Gradient Boosting:
1) Calculate the average Weight: Average Weight = 71.2.
2) Build a tree based on the errors made, where Error = (Observed Weight − Average Weight); save these errors, called Pseudo Residuals, in a column.
3) Now a tree is built using Height, Favorite Color and Gender to predict the Pseudo Residuals. Residuals landing in the same leaf are replaced by their mean: 1.8 and 5.8 are replaced by their mean 3.8, and −14.2 and −15.2 are replaced by their mean −14.7, so we get:

Now the above tree can be used to make predictions for an individual from the training data based on Gender, Height and Color.
For one individual the tree gives Weight = 71.2 + 16.8 = 88, which matches the training data exactly: the model fits the training data too well, so we have low Bias but high Variance.
Gradient Boosting deals with this by using a Learning Rate, which is a value between 0 and 1:
Predicted Weight = Average Weight + (Learning Rate)(Prediction)
Average Weight = 71.2, Prediction = 16.8, 0 < Learning Rate ≤ 1
If Learning Rate = 0.1 we get:
Predicted Weight = 71.2 + (0.1)(16.8) = 72.88
4) To build the next tree, new Pseudo Residuals are calculated by subtracting the predicted values of the model so far from the observed values:
Pseudo Residual = Observed − Predicted Weight = Observed − 71.2 − (0.1)(Prediction)
where 71.2 is the Average Weight and 0.1 is the Learning Rate.
5) The new tree is built on these Pseudo Residuals: 1.4 and 5.4 are replaced by their mean 3.4, and −12.7 and −13.7 are replaced by their mean −13.2, so we get:

6) Now the previous tree and the new tree are used together to make predictions on the same dataset:
Predicted Weight = Average Weight + ∑ (Learning Rate)(P_i)
where Average Weight = 71.2, Learning Rate = 0.1 and P_i is the prediction made by the i-th tree; here P1 = 16.8 and P2 = 15.1.
Predicted Weight = 71.2 + (0.1)(16.8) + (0.1)(15.1) = 74.39
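A tiny Python sketch reproducing the prediction formula with the numbers from the notes:

# Predicted Weight = Average Weight + sum(learning_rate * tree prediction)
average_weight = 71.2
learning_rate = 0.1
tree_predictions = [16.8, 15.1]   # P1 and P2 for this individual

predicted_weight = average_weight + sum(learning_rate * p for p in tree_predictions)
print(predicted_weight)           # 74.39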
PCA (Principal Component Analysis)
Imagine we have measured the transcription of two genes, Gene1 and Gene2, in several mice.
As we have measured two genes, the data can be plotted on a two-dimensional graph: Mice 1, 2 and 3 cluster on the upper right side and Mice 4, 5 and 6 cluster on the lower left side.
After plotting the 2D graph:
1) We take the average measurement for Gene1 and the average measurement for Gene2; with these average values we can calculate the center of the data.
2) The data is then shifted so that the center of the data is at the origin of the graph.
Shifting the data does not change the relative positions of the points on the graph.
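A minimal Python sketch of the centering step (numpy assumed; the transcription values are made up):

import numpy as np

# made-up transcription measurements: rows = mice, columns = (Gene1, Gene2)
data = np.array([[10.0, 6.0],
                 [11.0, 4.0],
                 [ 8.0, 5.0],
                 [ 3.0, 3.0],
                 [ 2.0, 2.8],
                 [ 1.0, 1.0]])

center = data.mean(axis=0)        # averages of Gene1 and Gene2: the center of the data
centered = data - center          # shift so the center sits at the origin
print(center)
print(centered.mean(axis=0))      # ~[0, 0]; relative positions are unchanged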
