Interview Prep:
Why learn ML? Because not every problem is deterministic, so we need to learn the randomness, and that is where machine learning comes in, e.g. in handwriting recognition not everybody writes a 2 in the same manner.
There are two parts to it, training and prediction. Training is the process of learning from historical data; it is basically a mapping from the inputs to the outputs we want to predict, and it returns a model to us.
In prediction we predict on the test data.
Basics:
1. Supervised Learning:-
Talking about types of supervision, there are two types: classification (in this our output Y is categorical) and regression (in which our output Y is numeric).
Classification -> Logistic Regression, Decision Tree, Random Forest, Support Vector Machines (SVM), Naive Bayes, etc.
Regression -> Linear Regression, Regression Trees, and Kernel Regression.
A loss function tells us how good or bad a model is on test data.
We don't want to fit to the noise of the data, we want to fit to the pattern of the data.
Both overfitting and underfitting are bad for our model, so how do we solve that? For that we take the help of regularization.
L1 regularization tries to push small and insignificant weight values to exactly zero.
L2 regularization does the same but it is not as aggressive as L1; it applies a more uniform reduction across all the coefficients involved.
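A minimal sketch of that difference using scikit-learn's Lasso (L1) and Ridge (L2); the synthetic data and alpha values here are just illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)  # only feature 0 matters

    l1 = Lasso(alpha=0.1).fit(X, y)
    l2 = Ridge(alpha=0.1).fit(X, y)
    print(l1.coef_)  # most irrelevant coefficients driven exactly to 0
    print(l2.coef_)  # all coefficients shrunk, but rarely exactly 0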
But the question comes, how do we quantify a fit, whether a particular fit is a good fit or a bad fit?
There comes the bias-variance trade-off to measure this.
Overfitting is the case where we fit a model with higher complexity than required by the data, and in underfitting we fit a model with lower complexity than is required by the data.
But in practice how would you determine whether a particular fit is a good fit or a bad fit?
Given any data set we split it into a train sample and a test sample, train on the train data and test on the test data, and from that we try to be somewhere in the middle: pick the model for which the error on the test sample is lowest.
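A minimal sketch of that model-selection loop with scikit-learn; the dataset and the candidate complexities (here, regularization strengths) are just placeholder assumptions:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # try a few candidate models and keep the one with the lowest test error
    best = min(
        (mean_squared_error(y_test, Ridge(alpha=a).fit(X_train, y_train).predict(X_test)), a)
        for a in [0.01, 0.1, 1.0, 10.0]
    )
    print("best test MSE, alpha:", best)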
Ridge regression is the same as linear regression, only in this we add some regularization to the diagonal elements (of X^T X in the closed-form solution).
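A minimal NumPy sketch of that closed form, w = (X^T X + lambda*I)^(-1) X^T y; the data and lambda value are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

    lam = 0.5
    # ordinary least squares plus lambda on the diagonal
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(w)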
The likelihood is the product of all the probabilities; in theory that works, but in practice it can lead to numerical underflow issues, so we take the sum of the logs of the probabilities instead, which is equivalent, and this is called the conditional log-likelihood. The problem is that we cannot solve this type of regression analytically, that is, we cannot minimize the function in closed form, so we use iterative solvers.
The trick we use is that we start with some initial value of W and then move in the direction in which the value of J goes down: we compute the gradient and move in its opposite direction, where eta is the learning rate.
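A minimal gradient-descent sketch for logistic regression's negative conditional log-likelihood; the synthetic data, eta, and iteration count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(3)    # initial value of W
    eta = 0.1          # learning rate
    for _ in range(500):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the negative log-likelihood
        w -= eta * grad                 # move opposite to the gradient
    print(w)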
When the size of the data is very large, like millions of samples, computing the full gradient is too expensive, so we ask whether we can get an estimate of the direction in which we need to move. Why not plug in one sample at a time: this is stochastic gradient descent. But the problem is too much variance, so we do something in between and go for mini-batch gradient descent.
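A minimal mini-batch variant of the same logistic-regression loop (kept standalone so it runs on its own); the batch size and step count are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w, eta, batch_size = np.zeros(3), 0.1, 32
    for _ in range(2000):
        idx = rng.choice(len(y), size=batch_size, replace=False)   # random mini-batch
        p = sigmoid(X[idx] @ w)
        grad = X[idx].T @ (p - y[idx]) / batch_size   # noisy but cheap gradient estimate
        w -= eta * grad
    print(w)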
Now we will move on to the next topic, tree models.
A decision tree is more like an if-else statement.
The number of leaves corresponds to the number of different regions the feature space is split into.
Let's talk about how we go about building a decision tree. It is basically a greedy divide-and-conquer algorithm: at the current node we check which feature gives the most satisfying split, and we split the tree there based on that feature.
We try to choose the split which results in the greatest decrease in the impurity of the nodes.
Before talking about the criteria for measuring impurity, let's take an example.
We have used the Gini impurity value, so we see 0.34 < 0.49, so the second is the better split. Remember, high Gini impurity is bad, so we don't want that.
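A minimal sketch of computing the weighted Gini impurity of a split; the class counts below are made-up numbers, not the example above:

    import numpy as np

    def gini(counts):
        # Gini impurity = 1 - sum(p_k^2) over the class proportions p_k
        p = np.asarray(counts, dtype=float)
        p /= p.sum()
        return 1.0 - np.sum(p ** 2)

    def split_gini(left_counts, right_counts):
        # weighted average of the child-node impurities
        n_left, n_right = sum(left_counts), sum(right_counts)
        n = n_left + n_right
        return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

    print(split_gini([8, 2], [1, 9]))   # purer children -> lower impurity (better split)
    print(split_gini([5, 5], [6, 4]))   # mixed children -> higher impurity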
The deeper the tree is, the higher the chance of overfitting, so we need to stop growing it. But when to stop?
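One common answer is to pre-prune by capping the depth or the minimum samples per leaf; a minimal scikit-learn sketch (the limits 4 and 20 and the dataset are arbitrary assumptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # pre-pruning: stop splitting when the tree gets too deep or leaves get too small
    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)
    tree.fit(X_train, y_train)
    print(tree.score(X_train, y_train), tree.score(X_test, y_test))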
Let's see how decision trees are used in real models: what we do is use an ensemble of trees.
We do ensemble learning through bagging and boosting.
In bagging we build multiple decision trees with low bias and high variance, then we predict with each and take the average, and this reduces the overall variance, giving low bias and low variance.
In boosting what we do is train one model and predict from it, and then for the next model we use the error of the first model in the learning of the second model; it is an iterative method.
Each resampled data matrix is called a bootstrap sample, and we just take the average of the predictions of the models.
Why it works is because the aggregation of high-variance models gives low variance.
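A minimal sketch of bagging by hand: bootstrap-resample the training set, fit one deep (low-bias, high-variance) tree per sample, and average the predictions; the dataset and number of trees are illustrative assumptions:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X_train), size=len(X_train))    # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])  # deep tree: low bias, high variance
        preds.append(tree.predict(X_test))

    bagged = (np.mean(preds, axis=0) >= 0.5).astype(int)   # average the predictions, then threshold
    print("bagged accuracy:", (bagged == y_test).mean())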
The idea in AdaBoost is to start from weak learners and then go on building up the model complexity; each weak learner is a very high-bias model, and the most famous boosting algorithm is AdaBoost.
In each iteration we get a weak learner and an error-based weight, and the final model is a linear combination of the weights and the learners.
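A minimal scikit-learn sketch; AdaBoostClassifier's default weak learner is a depth-1 decision stump, and the dataset and n_estimators here are just illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # the default weak learner is a depth-1 stump (a very high-bias model);
    # the fitted ensemble is a weighted linear combination of such stumps
    ada = AdaBoostClassifier(n_estimators=100, random_state=0)
    ada.fit(X_train, y_train)
    print(ada.score(X_test, y_test))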
Gradient boosting is called "gradient" because of the way the loss is used: each new learner is fit to the negative gradient of the loss function of the current ensemble.
Balanced accuracy = (TPR + TNR) / 2.
Receiver Operating Characteristic (ROC): the curve of TPR against FPR as the decision threshold varies.
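A minimal sketch computing both metrics with scikit-learn; the labels and scores are made up:

    from sklearn.metrics import balanced_accuracy_score, roc_auc_score

    y_true  = [0, 0, 0, 1, 1, 1, 1, 0]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.2]   # predicted probabilities
    y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

    print(balanced_accuracy_score(y_true, y_pred))  # (TPR + TNR) / 2
    print(roc_auc_score(y_true, y_score))           # area under the ROC curve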
Naive Bayes assumes the independence of the features from one another (given the class).
There is not much training; it just estimates the conditional probabilities (and class priors) from the data, so it is very fast.
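A minimal scikit-learn sketch with GaussianNB; the dataset choice is just illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # fitting just estimates per-class feature means/variances and class priors
    nb = GaussianNB().fit(X_train, y_train)
    print(nb.score(X_test, y_test))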
k-nearest neighbours suffers from the curse of dimensionality: as we go to higher dimensions, the number of samples falling inside a fixed cubic region goes on reducing.
As the dimension increases, more and more samples end up at nearly the same distance, where nearest-neighbour search would fail, so it is better to work in low dimensions.
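A small numerical sketch of that distance-concentration effect; the dimensions and sample count are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        X = rng.uniform(size=(500, d))
        dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from one point to all others
        # as d grows, the nearest and farthest distances become nearly equal,
        # so "nearest" neighbour loses its meaning
        print(d, dists.min() / dists.max())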
In the SVM picture, the dotted lines mark the margin; the points lying on them are the support vectors.
If in the current dimension the data is not separable, we use a kernel to transform the data and separate it in that higher dimension.
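A minimal sketch: data that is not linearly separable in 2-D (concentric circles) handled by an RBF-kernel SVM; the dataset and kernel choice are illustrative assumptions:

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear = SVC(kernel="linear").fit(X_train, y_train)  # struggles: no separating line exists in 2-D
    rbf = SVC(kernel="rbf").fit(X_train, y_train)        # kernel implicitly maps to a higher dimension
    print(linear.score(X_test, y_test), rbf.score(X_test, y_test))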
Max pooling helps prevent overfitting and makes the model invariant to small features or shifts.
Inductive biases are the assumptions we have taken into consideration in training our model; for sequence data we basically feed the data sequentially to the RNN.
An RNN has memory to remember the past.
One-hot encoding: each word/category is represented as a vector with a 1 at its index and 0s elsewhere.
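A tiny sketch of one-hot encoding; the vocabulary here is made up:

    import numpy as np

    vocab = ["cat", "dog", "bird"]                 # made-up vocabulary
    index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0                       # 1 at the word's position, 0 elsewhere
        return v

    print(one_hot("dog"))   # [0. 1. 0.]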
RNNs suffer from forgetfulness, that is, they lose the information that came very far back in time.
LSTMs come with more parameters and handle longer-range dependencies, but they are more expensive to train.
Bidirectional RNNs say that not only the past is important but the future is also important.
In attention, all the hidden states are kept, unlike a simple network that keeps only the last one, and it is in some way like referring back to the whole input.
Without positional encoding, self-attention will not work, because self-attention by itself has no notion of word order.
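A minimal sketch of the sinusoidal positional encoding from the original Transformer; the sequence length and model dimension are arbitrary:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    print(positional_encoding(seq_len=6, d_model=8).shape)   # (6, 8), added to the token embeddings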
BERT works on masking: mask one or two words and then have the model predict the masked words. It is very successful, and we use BERT for transfer learning.
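A minimal sketch of this masked-word prediction with the Hugging Face transformers library, assuming it and the bert-base-uncased checkpoint are available:

    from transformers import pipeline

    # downloads the pretrained BERT checkpoint on first use
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    # BERT predicts the [MASK] token from its bidirectional context
    for guess in unmasker("Machine learning models learn patterns from [MASK]."):
        print(guess["token_str"], round(guess["score"], 3))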