
MACHINE LEARNING MODELS, REGULARIZATION AND WHEN TO CHOOSE ONE OVER OTHERS


Purpose of this document: This document aims to explain available machine
learning algorithms: how they work, what their inputs and outputs are, and
popular methods for regularizing them to avoid overfitting. Furthermore, since
there may be several alternatives for solving a machine learning problem, this
document also gives hints on when to choose one over the others. Finally, this
document tries to explain all of this without diving too deeply into the
mathematics behind each algorithm/regularization method.
Disclaimer: This document focuses only on the machine learning algorithms
themselves, how to select among various alternatives, and how to tweak model
hyperparameters to obtain optimal results. It does not cover data
pre-processing or feature engineering techniques. Also, although this document
can be a shortcut to understanding and effectively using ML models, the
mathematics (calculus, statistics, linear algebra) should not be skipped if you
want a sustainable learning curve.
Target audience: data science learners who have just basic knowledge of
calculus, linear algebra and statistics (and whose heads might have exploded,
like mine, when trying to find explanations for the mathematics behind each
algorithm/regularization method :D)
AND who already have a basic idea of the topics below:
# Topic
1 What is machine learning?
2 Categories of machine learning (Supervised, Unsupervised, Reinforcement)
3 Bias-variance tradeoff, overfitting and underfitting
4 Train/test techniques, cross-validation techniques
In case you are not familiar with the topics above, you can consult the
reference sources below. They are great sources for gaining further insight.
Reference sources: Special thanks to the creators/authors of the sources below:
- Dataquest Inc.
- Analytics Vidhya
- Khan Academy
- An Introduction to Statistical Learning with Applications in R (Gareth James,
Daniela Witten, et al.)
- Practical Statistics for Data Scientists (Peter Bruce, Andrew Bruce, et al.)
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts,
Tools, and Techniques to Build Intelligent Systems, 2nd edition
(Aurélien Géron)
- Data Mining: Concepts and Techniques (Jiawei Han, Micheline Kamber)

Version tracking:
Version number | Version description
001 | First draft of this document; contains only Linear Regression

Creator: Bao Quach


A. LINEAR REGRESSION
I. What is linear regression?
Linear regression is a supervised machine learning model which tries to predict a
quantitative response by taking several independent variables/predictors as input.
These independent variables can be either numerical or categorical. The idea of
linear regression is that there is a linear correlation between the independent
variables and the quantitative response. The final output of linear regression is
the formula below:

ŷ = θ0 + θ1·x1 + θ2·x2 + … + θn·xn

o ŷ: the predicted quantitative response/outcome
o x1…xn: the values of the independent variables
o θ1…θn: the coefficients, i.e. the magnitude of the correlation of each
independent variable with the outcome
o θ0: the intercept, i.e. the default value of ŷ when all independent variables
equal 0
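
To make this concrete, below is a minimal sketch (assuming NumPy and
scikit-learn are available; the toy data is made up for illustration) of
fitting a linear regression and reading off the intercept θ0 and the
coefficients θ1…θn:

    # Fit a linear regression on toy data and inspect the estimated thetas.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 2))                 # two predictors x1, x2
    y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.1, size=100)

    model = LinearRegression().fit(X, y)
    print(model.intercept_)                       # ~3  (theta_0)
    print(model.coef_)                            # ~[2, -1]  (theta_1, theta_2)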

II. How to form the above linear regression formula?


- The key to the above formula is finding the optimal set of θ (both the intercept
and the coefficients). So how do we define 'optimal'? The optimal outcome is
achieved when we minimize the difference between the outcomes predicted by our
estimated model and the real outcomes.
- Common metrics for calculating the difference between predicted outcomes and
real outcomes include Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):

MSE = (1/m) · Σ (ŷi − yi)²   where m is the number of training instances

RMSE is just the square root of MSE.

MSE is also called the Cost Function of the Linear Regression model.
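
A minimal sketch (plain NumPy; the numbers are made up) of computing both
metrics:

    import numpy as np

    y_true = np.array([3.0, 5.0, 7.5, 10.0])
    y_pred = np.array([2.8, 5.3, 7.0, 10.4])

    mse = np.mean((y_pred - y_true) ** 2)   # Mean Squared Error
    rmse = np.sqrt(mse)                     # Root Mean Squared Error
    print(mse, rmse)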
- There are 2 common approaches to finding the optimal θ that minimizes the Cost
Function:
+ Ordinary Least Squares (OLS): based on some calculus (how to find minima) and
linear algebra (to transform vectors and matrices), the optimal θ vector can be
computed directly with the Normal Equation (a NumPy sketch follows below):

θ̂ = (Xᵀ·X)⁻¹ · Xᵀ · y   where X is the matrix of predictor values (with a
leading column of 1s for the intercept) and y is the vector of real outcomes
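
A minimal sketch of the Normal Equation in NumPy (the toy data is made up, with
true θ = [4, 3]):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=(100, 1))
    y = 4 + 3 * x[:, 0] + rng.normal(size=100)    # true theta = [4, 3]

    X = np.c_[np.ones(len(x)), x]                 # add the bias column of 1s
    theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y  # (X^T X)^-1 X^T y
    print(theta_hat)                              # ~[4, 3]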

+ Gradient Descent (GD): an iterative approach which randomly selects an initial
θ vector and then iteratively subtracts from it the gradient vector of the cost
function multiplied by a pre-defined learning rate. The update rule below
generalizes the idea of Gradient Descent:

θ(next step) = θ − η · ∇θ MSE(θ)

- η: the pre-defined learning rate
- ∇θ MSE(θ) (the upside-down triangle): the gradient vector of the cost function.
In natural language, the gradient vector of a function can be explained as the
direction of steepest ascent at any point. Imagine you are standing at the foot
of a hill, or at any point on it, and you want to get to the top: the gradient
vector guides you by pointing in the direction of steepest ascent from where you
stand.
- Since the gradient vector directs you upward, the negative of the gradient
vector directs you downward toward the foot of the hill (in other words, toward
the minimum of the cost function). That explains why the update rule subtracts
the gradient vector from θ instead of adding it.
- Gradient descent has several execution variants, including Batch Gradient
Descent, Mini-batch Gradient Descent and Stochastic Gradient Descent. The main
difference between these variants is the number of instances/observations used to
calculate the gradient vector of the cost function: Batch Gradient Descent uses
the whole training set, Stochastic Gradient Descent randomly selects a single
instance, and Mini-batch Gradient Descent randomly selects a subset of the
training set. This difference directly impacts the training speed of each
variant; a sketch of the batch version is shown below.
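
A minimal sketch of Batch Gradient Descent for linear regression (assuming X
already carries a bias column, as in the Normal Equation sketch; eta and
n_iterations are illustrative values):

    import numpy as np

    def batch_gradient_descent(X, y, eta=0.1, n_iterations=1000):
        m, n = X.shape
        theta = np.random.randn(n)                        # random initial theta
        for _ in range(n_iterations):
            gradients = (2 / m) * X.T @ (X @ theta - y)   # gradient of MSE
            theta = theta - eta * gradients               # the update rule
        return theta

Swapping the full X and y for one randomly drawn row per iteration turns this
into Stochastic Gradient Descent; drawing a small random subset per iteration
gives the mini-batch variant.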
- When to use Ordinary Least Squares and when to use Gradient Descent?
There are 2 criteria to consider when selecting:
+ Accuracy: OLS computes the exact minimizer in closed form, so it is at least as
accurate as Gradient Descent, which only approximates it.
+ Computational complexity: the OLS formula is quite expensive to compute
(especially when the number of predictors is extensively large). A computational
complexity estimation is not included here.
=> When there are too many predictors, OLS may consume an extensively long time
to train the model. Thus, in such situations, Gradient Descent is preferred,
since the θ estimated by GD is fairly close to the optimal one obtained from OLS.

III. Regularized linear models:


The purpose of regularizing linear models lies in the bias-variance tradeoff.
Linear models built without regularization tend to overfit by estimating a θ so
specific to the training set that the model fails to predict new observations. A
symptom of overfitting is a training-set RMSE that is lower than the
validation/test-set RMSE. The sources of overfitting in non-regularized linear
models include:
- Multicollinearity
- Non-regularized models treating each predictor with equal importance
The general approach to regularizing linear models is to add a regularization
term to the cost function, which forces the model to minimize not only the
ordinary cost function but also the newly added regularization term. Through
regularization, these sources of overfitting are minimized.
1. Ridge regression:
Ridge regression adds the squared l2 norm of the θ vector as a regularization
term to the cost function, and then estimates the θ vector using OLS or Gradient
Descent as usual:

J(θ) = MSE(θ) + α · Σ θi²

- α (alpha): a hyperparameter controlling the regularization scale. When α = 0,
this is just the non-regularized linear model; as α moves toward infinity, the θ
vector shrinks toward 0 (but can never equal 0).
- Σ θi²: the squared l2 norm of the θ vector
=> Unlike the non-regularized linear model, which has only 1 estimated θ vector,
ridge regression has multiple estimates depending on the pre-defined α. It is
crucial to find an optimal α that minimizes the gap between training RMSE and
validation/test RMSE by iteratively trying different values of α.
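
A minimal sketch (scikit-learn; the toy data and alpha grid are made up) of
trying several α values for Ridge with built-in cross-validation:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.5, 0.0, -2.0, 0.3, 0.0]) + rng.normal(size=200)

    ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])  # iteratively tries each alpha
    ridge.fit(X, y)
    print(ridge.alpha_)   # the alpha with the best cross-validated score
    print(ridge.coef_)    # coefficients shrink, but never to exactly 0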
2. Lasso regression:
Lasso regression adds the l1 norm of the θ vector as a regularization term to
the cost function:

J(θ) = MSE(θ) + α · Σ |θi|

=> The difference between Ridge and Lasso is that Lasso can shrink a coefficient
all the way to 0 (while Ridge can only shrink it close to 0). This also means
that Lasso regression can fully eliminate predictors which seem to be less useful.
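
A minimal sketch (scikit-learn, reusing the same made-up data as the Ridge
sketch; alpha = 0.1 is illustrative) showing Lasso zeroing out the useless
predictors:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.5, 0.0, -2.0, 0.3, 0.0]) + rng.normal(size=200)

    lasso = Lasso(alpha=0.1)
    lasso.fit(X, y)
    print(lasso.coef_)   # the useless predictors tend to land at exactly 0.0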
3. Elastic net:
Elastic Net is a combination of Ridge and Lasso: it adds both the l1 norm and
the squared l2 norm to the cost function, with a mix ratio r controlling how
much of each (r = 1 is pure Lasso, r = 0 is pure Ridge):

J(θ) = MSE(θ) + α · ( r · Σ |θi| + (1 − r) · Σ θi² )

- When to use non-regularized linear models, and when to use Ridge
regression/Lasso/Elastic Net?
It is suggested to always apply at least a bit of regularization to linear
models; by default, Ridge regression. However, if you think that only a few
predictors are useful, it is better to use Lasso or Elastic Net, since they can
totally eliminate less useful predictors from the equation.
When the number of predictors is greater than the number of observations, or
when there is multicollinearity between predictors, Elastic Net is the better
choice.
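
A minimal sketch (scikit-learn; l1_ratio plays the role of the mix ratio r
above, and the data is made up so that predictors outnumber observations):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 80))                # more predictors than observations
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=50)

    enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # r = 0.5: half Lasso, half Ridge
    enet.fit(X, y)
    print((enet.coef_ != 0).sum(), "non-zero coefficients out of", X.shape[1])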
4. Early stopping in gradient descent:
Early stopping implements a different approach to regularization: it stops the
gradient descent iterations as soon as the validation RMSE reaches its minimum,
before the model starts to overfit the training set. A sketch follows below.
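
A minimal sketch of early stopping (scikit-learn's SGDRegressor trained one
epoch at a time via warm_start; the data, learning rate and epoch count are
illustrative):

    import numpy as np
    from copy import deepcopy
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    sgd = SGDRegressor(max_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.005, tol=None)
    best_rmse, best_model = float("inf"), None
    for _ in range(500):
        sgd.fit(X_train, y_train)   # warm_start: continues from the previous epoch
        val_rmse = np.sqrt(mean_squared_error(y_val, sgd.predict(X_val)))
        if val_rmse < best_rmse:    # keep the model at the validation minimum
            best_rmse, best_model = val_rmse, deepcopy(sgd)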
