
Important Learning Algorithms in ML

Learning goals
General idea of important ML algorithms
Overview of strengths and weaknesses
CONTENTS
1 k -Nearest Neighbors (k -NN)

2 Generalized Linear Models (GLM)

3 Generalized Additive Models (GAM)

4 Classification & Regression Trees (CART)

5 Random Forests

6 Gradient Boosting

7 Linear Support Vector Machines (SVM)

8 Nonlinear Support Vector Machines

9 Gaussian Processes (GP)

10 Neural Networks (NN)

LMU SLDS © Important Learning Algorithms in ML – 1 / 48


K -NN – METHOD SUMMARY
REGRESSION CLASSIFICATION NONPARAMETRIC WHITE-BOX

General idea
similarity in feature space (w.r.t. a certain distance metric $d(x^{(i)}, x)$) ⇒ similarity in target space
Prediction for x: construct the k-neighborhood $N_k(x)$ from the k points closest to x in $\mathcal{X}$, then predict
(weighted) mean target for regression: $\hat{y} = \frac{1}{\sum_{i: x^{(i)} \in N_k(x)} w_i} \sum_{i: x^{(i)} \in N_k(x)} w_i y^{(i)}$
→ optional: higher weights $w_i = \frac{1}{d(x^{(i)}, x)}$ for close neighbors
most frequent class for classification: $\hat{y} = \arg\max_{\ell \in \{1, \dots, g\}} \sum_{i: x^{(i)} \in N_k(x)} \mathbb{I}(y^{(i)} = \ell)$
⇒ Estimating posterior probabilities as $\hat{\pi}_\ell(x) = \frac{1}{k} \sum_{i: x^{(i)} \in N_k(x)} \mathbb{I}(y^{(i)} = \ell)$

Nonparametric behavior: parameters = training data; no compression of information


Not immediately interpretable, but inspection of neighborhoods can be revealing
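
A minimal NumPy sketch of this prediction rule for regression (helper name, Euclidean distance, and inverse-distance weights are illustrative assumptions):

import numpy as np

def knn_predict(X_train, y_train, x, k=5, weighted=True):
    """(Weighted) mean of the k nearest targets, as described above."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    idx = np.argsort(dists)[:k]                   # indices of the k-neighborhood N_k(x)
    w = 1.0 / (dists[idx] + 1e-12) if weighted else np.ones(k)
    return np.sum(w * y_train[idx]) / np.sum(w)   # weighted mean target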

LMU SLDS © Important Learning Algorithms in ML – 2 / 48


K -NN – METHOD SUMMARY
Hyperparameters Neighborhood size k (locality), distance metric (next page)
Figures:
Classification – Left: neighborhood of an exemplary iris observation for k = 50, with estimated posterior π̂(x) = (0, 0.44, 0.56); Middle: prediction surface for k = 1; Right: prediction surface for k = 50
Regression – Left: prediction surface for k = 3; Middle: prediction surface for k = 7; Right: prediction surface for k = 15

Small k ⇒ very local, "wiggly" decision boundaries


Large k ⇒ rather global, smooth decision boundaries

LMU SLDS © Important Learning Algorithms in ML – 3 / 48


K -NN – METHOD SUMMARY
Popular distance metrics
Numerical feature space:
⇒ Typically, Minkowski distances: $d(x, \tilde{x}) = \|x - \tilde{x}\|_q = \left( \sum_j |x_j - \tilde{x}_j|^q \right)^{1/q}$
q = 1: Manhattan distance → $d(x, \tilde{x}) = \sum_j |x_j - \tilde{x}_j|$
q = 2: Euclidean distance → $d(x, \tilde{x}) = \sqrt{\sum_j (x_j - \tilde{x}_j)^2}$

Visualization: Manhattan (red, blue, yellow) vs. Euclidean (green)

Mixed feature space:


Gower distance can handle numerical and categorical features, and missing data:
- numerical: $d(x_i, x_j) = \frac{|x_i - x_j|}{\max(x) - \min(x)}$
- categorical: $d(x_i, x_j) = 1$ if $x_i \neq x_j$, and $0$ if $x_i = x_j$
- Gower distance as average over the individual per-feature scores
Optional weighting to account for beliefs about varying feature importance
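
A small NumPy sketch of this mixed-feature idea (function name, index arguments, and the example values are illustrative, not from a specific package):

import numpy as np

def gower_distance(x, z, num_idx, cat_idx, ranges):
    """Average of per-feature scores: range-scaled absolute difference for
    numerical features, 0/1 mismatch for categorical features."""
    scores = []
    for j in num_idx:
        scores.append(abs(x[j] - z[j]) / ranges[j])   # |x_j - z_j| / (max - min)
    for j in cat_idx:
        scores.append(0.0 if x[j] == z[j] else 1.0)   # categorical mismatch
    return float(np.mean(scores))

print(gower_distance(["red", 0.5], ["blue", 0.7], num_idx=[1], cat_idx=[0], ranges={1: 1.0}))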

Figure Source: [Link]

LMU SLDS © Important Learning Algorithms in ML – 4 / 48


K -NN – IMPLEMENTATION & PRACTICAL HINTS
Preprocessing Features should be standardized or normalized

Implementation
R: mlr3 learners (calling kknn::kknn())
Classification:
- LearnerClassifKKNN
- FNN::knn()
Regression:
- LearnerRegrKKNN
- FNN::knn.reg()
Nearest Neighbour Search in O(N log N ): RANN::nn2()
Python: From package sklearn.neighbors
Classification:
- KNeighborsClassifier()
- RadiusNeighborsClassifier() as alternative if data not uniformly sampled
Regression:
- KNeighborsRegressor()
- RadiusNeighborsRegressor() as alternative if data not uniformly sampled
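
A minimal scikit-learn usage sketch combining the recommended standardization with a distance-weighted k-NN classifier (dataset, k, and weighting are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Standardize features, then apply k-NN with inverse-distance weights
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=15, weights="distance"))
print(cross_val_score(knn, X, y, cv=5).mean())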

LMU SLDS © Important Learning Algorithms in ML – 5 / 48


K -NN – PROS & CONS
Advantages
+ Algorithm easy to explain and implement
+ No distributional or functional assumptions → able to model data of arbitrary complexity
+ No training or optimization required
+ Local model → nonlinear decision boundaries
+ Easy to tune (few hyperparameters) → number of neighbors k, distance metric
+ Custom distance metrics can often be easily designed to incorporate domain knowledge

Disadvantages
− Sensitivity w.r.t. noisy or irrelevant features and outliers due to dependency on distance measure
− Heavily affected by curse of dimensionality
− Bad performance when feature scales are not consistent with feature relevance
− Poor handling of data imbalances (worse for more global model, i.e., large k)

LMU SLDS © Important Learning Algorithms in ML – 6 / 48


GENERALIZED LINEAR MODELS – METHOD SUMMARY
REGRESSION CLASSIFICATION PARAMETRIC WHITE-BOX FEATURE SELECTION

General idea Represent the target as a function of the linear predictor $\theta^\top x$ (weighted sum of features)
→ Interpretation: if feature $x_j$ increases by 1 unit, the linear predictor changes by $\theta_j$ units
Hypothesis space $\mathcal{H} = \left\{ f: \mathcal{X} \to \mathbb{R} \mid f(x) = \phi(\theta^\top x) \right\}$, with suitable transformation $\phi(\cdot)$, e.g.,
Linear regression: $\mathcal{Y} = \mathbb{R}$, $\phi$ = identity
Logistic regression: $\mathcal{Y} = \{0, 1\}$, logistic sigmoid $\phi(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)} =: \pi(x \mid \theta)$
⇒ Decision rule: linear hyperplane

Figures: linear regression hyperplane; logistic sigmoid function (for bivariate input and loss-minimal θ); corresponding separating hyperplane.

LMU SLDS © Important Learning Algorithms in ML – 7 / 48


GENERALIZED LINEAR MODELS – METHOD SUMMARY
Loss functions
Lin. Regr.: Typically based on quadratic loss (OLS estimation): $L(y, f) = (y - f)^2$
Log. Regr.: Based on Bernoulli / log / cross-entropy loss
Loss based on scores:
$L(y, f) = \ln(1 + \exp(-y \cdot f))$ for $y \in \{-1, +1\}$
$L(y, f) = -y \cdot f + \log(1 + \exp(f))$ for $y \in \{0, 1\}$
Loss based on probabilities:
$L(y, \pi) = \ln\left(1 + \exp\left(-y \cdot \log\frac{\pi}{1 - \pi}\right)\right)$ for $y \in \{-1, +1\}$
$L(y, \pi) = -y \log(\pi) - (1 - y) \log(1 - \pi)$ for $y \in \{0, 1\}$

LMU SLDS © Important Learning Algorithms in ML – 8 / 48


GENERALIZED LINEAR MODELS – METHOD SUMMARY
Optimization
Minimization of the empirical risk
For OLS: analytical solution $\hat{\theta} = (X^\top X)^{-1} X^\top y$
For other loss functions:
Log. Regr.: Convex problem, solvable via second-order optimization methods (e.g. BFGS)
Else: Numerical optimization

Multi-class extension of logistic regression


Estimate class-wise scoring functions: $\pi: \mathcal{X} \to [0, 1]^g$, $\pi(x) = (\pi_1(x), \dots, \pi_g(x))$ with $\sum_{k=1}^{g} \pi_k(x) = 1$
Achieved through softmax transformation: $\pi_k(x \mid \theta) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^{g} \exp(\theta_j^\top x)}$
Multi-class log-loss: $L(y, \pi(x)) = -\sum_{k=1}^{g} \mathbb{I}\{y = k\} \log(\pi_k(x))$
Predict class with maximum score (or use a thresholding variant)
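
A small NumPy sketch of the softmax transformation and the multi-class log-loss (function names and example scores are illustrative):

import numpy as np

def softmax(scores):
    """Map class scores theta_k^T x to probabilities that sum to 1."""
    z = np.exp(scores - scores.max())      # shift for numerical stability
    return z / z.sum()

def multiclass_log_loss(y, probs):
    """Negative log-probability of the true class y (0-indexed)."""
    return -np.log(probs[y])

scores = np.array([2.0, 0.5, -1.0])        # e.g., theta_k^T x for g = 3 classes
probs = softmax(scores)
print(probs, multiclass_log_loss(1, probs))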

LMU SLDS © Important Learning Algorithms in ML – 9 / 48


GENERALIZED LINEAR MODELS – REGULARIZATION
General idea
Unregularized LM: risk of overfitting in high-dimensional space with only few observations
Goal: avoidance of overfitting by adding penalty term

Regularized empirical risk


Empirical risk function plus complexity penalty J (θ),
controlled by shrinkage parameter λ > 0:
$\mathcal{R}_{reg}(\theta) := \mathcal{R}_{emp}(\theta) + \lambda \cdot J(\theta)$
Ridge regression: L2 penalty $J(\theta) = \|\theta\|_2^2$
LASSO regression: L1 penalty $J(\theta) = \|\theta\|_1$

Optimization under regularization


Ridge: analytically with $\hat{\theta}_{\text{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$
LASSO: numerically with, e.g., (sub-)gradient descent
Choice of regularization parameter
Standard hyperparameter optimization problem
E.g., choose λ with minimum mean cross-validated error
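
A hedged scikit-learn sketch of this selection step, searching a log-spaced grid of regularization strengths by cross-validation (dataset and grid bounds are placeholders):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Cs are inverse regularization strengths (C = 1/lambda), searched on a log scale
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-3, 3, 10), cv=5, penalty="l2", max_iter=5000),
)
model.fit(X, y)
print(model.named_steps["logisticregressioncv"].C_)   # selected C (per class)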

LMU SLDS © Important Learning Algorithms in ML – 10 / 48


GENERALIZED LINEAR MODELS – REGULARIZATION
Ridge vs. LASSO
Ridge
Global shrinkage ⇒ overall smaller but still dense θ
Applicable with a large number of influential features; handles correlated variables by shrinking their coefficients by an equal amount
Performs better for correlated features, e.g., $\beta = (2, 2, 2, \underbrace{0, \dots, 0}_{7})$ with $\operatorname{cor}(X_i, X_j) = 0.8^{|i-j|} \ \forall i, j$
LASSO
Actual variable selection by shrinking coefficients for irrelevant features all the way to zero
Suitable for sparse problems; ineffective with correlated features (randomly selecting one)
Performs better for uncorrelated features, e.g., $\beta = (\underbrace{2, \dots, 2}_{5}, \underbrace{0, \dots, 0}_{5})$ with $\operatorname{cor}(X_i, X_j) = 0 \ \forall i \neq j$
Neither is overall better ⇒ compromise: elastic net
Weighted combination of Ridge and LASSO
Introducing an additional penalization coefficient:
$\mathcal{R}_{reg}(\theta) = \mathcal{R}_{emp}(\theta) + \lambda \cdot P_\alpha(\theta)$, with $P_\alpha(\theta) = \alpha \cdot \|\theta\|_1 + (1 - \alpha) \cdot \frac{1}{2} \|\theta\|_2^2$

LMU SLDS © Important Learning Algorithms in ML – 11 / 48


GENERALIZED LINEAR MODELS – IMPLEMENTATION
Implementation
R:
Unregularized: mlr3 learner LearnerRegrLM, calling stats::lm() / mlr3 learner
LearnerClassifLogReg, calling stats::glm()
Regularized / ElasticNet: mlr3 learners LearnerClassifGlmnet / LearnerRegrGlmnet, calling
glmnet::glmnet()
For large classification data: mlr3 learner LearnerClassifLiblineaR, calling
LiblineaR::LiblineaR() uses fast coordinate descent
Python: From package sklearn.linear_model
Unregularized:
- LinearRegression()
- LogisticRegression(penalty = None)
Regularized:
- Linear regression: Lasso(),Ridge(),ElasticNet()
- Logistic regression: LogisticRegression(penalty = {‘l1’, ‘l2’, ‘elasticnet’})
Package for advanced statistical models: [Link]

LMU SLDS © Important Learning Algorithms in ML – 12 / 48


GENERALIZED LINEAR MODELS – PROS & CONS
Advantages
+ Simple and fast implementation
+ Analytical solution for L2 loss
+ Applicable for any dataset size, as long as number of observations ≫ number of features
+ Flexibility beyond linearity with polynomials, trigonometric transformations, interaction terms, etc.
+ Intuitive interpretability via feature effects
+ Statistical hypothesis tests for effects available

Disadvantages
− Nonlinearity of many real-world problems
− Further restrictive assumptions: linearly independent features, homoskedastic residuals, normality of conditional response
− Sensitivity w.r.t. outliers and noisy data (especially with L2 loss)
− Also a LM can overfit (e.g., many features and few observations)
− Feature interactions must be handcrafted → practically infeasible for higher orders

LMU SLDS © Important Learning Algorithms in ML – 13 / 48


GENERALIZED ADDITIVE MODELS – METHOD SUMMARY
REGRESSION CLASSIFICATION (NON)PARAMETRIC WHITE-BOX FEATURE SELECTION

General idea
Same as GLM, but introduce flexibility through nonlinear (smooth) effects fj (xj )
Typically, combination of linear & smooth effects
Smooth effects also conceivable for feature interactions
Hypothesis space $\mathcal{H} = \left\{ f: \mathcal{X} \to \mathbb{R} \mid f(x) = \phi\left(\theta_0 + \sum_{j=1}^{p} f_j(x_j)\right) \right\}$, with suitable transformation $\phi(\cdot)$,

intercept term $\theta_0$, and smooth functions $f_j(\cdot)$


Prediction of bike rentals from smooth term of humidity (left: partial effect) and linear term of temperature (right: bivariate prediction).

LMU SLDS © Important Learning Algorithms in ML – 14 / 48


GENERALIZED ADDITIVE MODELS – METHOD SUMMARY
Smooth functions
Nonparametric/semiparametric/parametric approaches conceivable
Frequently: express $f_j$ as a weighted sum of basis functions ⇒ the model is linear in the weight coefficients again
Use a fixed basis of functions $b_1, \dots, b_{K_j}$ and estimate the associated coefficients $\gamma_{j,1}, \dots, \gamma_{j,K_j}$:
$f_j(x_j) = \sum_{k=1}^{K_j} \gamma_{j,k} b_k(x_j)$

Popular types of basis functions

Polynomial: smoothing / TP- / B-splines
Radial: Kriging
Trigonometric: Fourier / wavelet forms
Alternatives: local regression (LOESS), other kernel-smoothing approaches, ...

Left: B-spline basis with 9 basis functions. Middle: BFs weighted with coefficients estimated for humidity. Right: sum of weighted BFs in black (= partial effect).
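
As an illustration of the basis-function idea (not the mgcv implementation), a scikit-learn sketch that expands one feature into a B-spline basis and shrinks the coefficients with a ridge penalty; the data and all hyperparameters are placeholders:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))                       # e.g., humidity
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.3, 300)  # nonlinear smooth effect + noise

# B-spline basis expansion of x; coefficients penalized via L2 (smoothness control)
gam_like = make_pipeline(SplineTransformer(n_knots=9, degree=3), Ridge(alpha=1.0))
gam_like.fit(x, y)
print(gam_like.predict(np.array([[0.5]])))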

LMU SLDS © Important Learning Algorithms in ML – 15 / 48


GENERALIZED ADDITIVE MODELS – METHOD SUMMARY
Regularization
Smooth functions possibly very flexible ⇒ regularization vital to prevent overfitting
Control smoothness
Basis-function approaches: control number; impose penalty on coefficients (e.g., magnitude or
differences between coefficients of neighboring components) & control associated hyperparameter
Local smoothers: control width of smoothing window (larger ⇒ smoother)

Figures: prediction surfaces for bike rentals with 9 (left) and 500 (right) basis functions in the smooth humidity term. A higher number of basis functions yields a more local, less smooth model.

Loss functions Essentially the same as in GLM: use the negative log-likelihood

Optimization
Coefficients (of smooth + linear terms): penalized MLE, Bayesian inference
Smoothing hyperparameters: typically, generalized cross-validation

LMU SLDS © Important Learning Algorithms in ML – 16 / 48


GENERALIZED ADDITIVE MODELS – IMPLEMENTATION
Implementation
R: mlr3 learner LearnerRegrGam, calling mgcv::gam()
Smooth terms: s(..., bs="<basis>") or te(...) for multivariate (tensorproduct) effects
Link functions: family={Gamma, Binomial, ...}
Python: GLMGam from package statsmodels; package pygam

Advantages
+ Simple and fast implementation
+ Applicable for any dataset size, as long as number of observations ≫ number of features
+ High flexibility via smooth effects
+ Easy to combine linear & nonlinear effects
+ Intuitive interpretability via feature effects (though not quite as straightforward as in GLM)
+ Statistical hypothesis tests for effects available

Disadvantages
− Sensitivity w.r.t. outliers and noisy data
− Feature interactions must be handcrafted → practically infeasible for higher orders
− Harder to optimize than GLM
− Additional hyperparameters (type of smooth functions, smoothness degree, ...)

LMU SLDS © Important Learning Algorithms in ML – 17 / 48


CART – METHOD SUMMARY
REGRESSION CLASSIFICATION NONPARAMETRIC WHITE-BOX FEATURE SELECTION

General idea (CART – Classification and Regression Trees)
Start at root node containing all data
Perform repeated axis-parallel binary splits in feature space to obtain rectangular partitions $Q_1, \dots, Q_M$ at the terminal nodes
Splits based on reduction of node impurity → empirical risk minimization (ERM)
In each step:
Find optimal split (feature–threshold combination) → greedy search
Assign constant prediction $c_m$ to all obs. in $Q_m$
→ Regression: $c_m$ is the average of y
→ Classif.: $c_m$ is the majority class (or class proportions)
Stop when a pre-defined criterion is reached → see complexity control
Hypothesis space $\mathcal{H} = \left\{ f(x): f(x) = \sum_{m=1}^{M} c_m \mathbb{I}(x \in Q_m) \right\}$

Figures: decision tree fitted to the iris data and the corresponding rectangular partitioning of the feature space into the classes setosa, versicolor, and virginica.

LMU SLDS © Important Learning Algorithms in ML – 18 / 48


CART – METHOD SUMMARY
Empirical risk
Splitting feature $x_j$ at split point $t$ divides a parent node $\mathcal{N}$ into two child nodes:
$\mathcal{N}_1 = \{(x, y) \in \mathcal{N}: x_j \leq t\}$ and $\mathcal{N}_2 = \{(x, y) \in \mathcal{N}: x_j > t\}$
Figure: parent node N split at threshold t into child nodes N1 and N2 with constant predictions c1 and c2.
Compute empirical risks in the child nodes and minimize their sum to find the best split (impurity reduction):
$\arg\min_{j,t} \mathcal{R}(\mathcal{N}, j, t) = \arg\min_{j,t} \mathcal{R}(\mathcal{N}_1) + \mathcal{R}(\mathcal{N}_2)$
Note: If $\mathcal{R}$ is the average instead of the sum of loss values, we need to reweight: $\frac{|\mathcal{N}_t|}{|\mathcal{N}|} \mathcal{R}(\mathcal{N}_t)$
In general, compatible with arbitrary losses – typical choices:
g-way classification:
Brier score → Gini impurity: $\mathcal{R}(\mathcal{N}) = \sum_{(x,y) \in \mathcal{N}} \sum_{k=1}^{g} \hat{\pi}_k(\mathcal{N})\left(1 - \hat{\pi}_k(\mathcal{N})\right)$
Bernoulli loss → entropy impurity: $\mathcal{R}(\mathcal{N}) = -\sum_{(x,y) \in \mathcal{N}} \sum_{k=1}^{g} \hat{\pi}_k(\mathcal{N}) \log \hat{\pi}_k(\mathcal{N})$
Regression (quadratic loss): $\mathcal{R}(\mathcal{N}) = \sum_{(x,y) \in \mathcal{N}} (y - c)^2$ with $c = \frac{1}{|\mathcal{N}|} \sum_{(x,y) \in \mathcal{N}} y$
Optimization
Exhaustive search over all split candidates, choice of risk-minimal split
In practice: reduce number of split candidates (e.g., using quantiles instead of all observed values)
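
A compact NumPy sketch of this exhaustive search for a single numerical feature, using Gini impurity (all names and the candidate set are illustrative):

import numpy as np

def gini(y):
    """Gini impurity of a node: sum_k pi_k * (1 - pi_k)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(x, y):
    """Exhaustive search over thresholds of one feature, minimizing weighted child impurity."""
    best_t, best_risk = None, np.inf
    for t in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= t], y[x > t]
        risk = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk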
LMU SLDS © Important Learning Algorithms in ML – 19 / 48
CART – IMPLEMENTATION & PRACTICAL HINTS
Hyperparameters and complexity control
Unless interrupted, splitting continues until we have pure leaf nodes (costly + overfitting)
Hyperparameters: Complexity (i.e., number of terminal nodes) controlled via tree depth, minimum number
of observations per node, maximum number of leaves, minimum risk reduction per split, ...
Limit tree growth / complexity via
Early stopping: stop growth prematurely
→ hard to determine good stopping point before actually trying all combinations
Pruning: grow deep trees and cut back in risk-optimal manner afterwards

Implementations
R:
CART: mlr3 learners LearnerClassifRpart / LearnerRegrRpart, calling rpart::rpart()
Conditional inference trees: partykit::ctree()
mitigates overfitting by controlling tree size via p-value-based splitting
Model-based recursive partitioning: partykit::mob()
fits a linear model within each terminal node of the decision tree
Rule-based models: Cubist::cubist() for regression and C50::C5.0() for classification; more
flexible frameworks for fitting various types of models (e.g., GLMs) within a tree’s terminal nodes
Python: DecisionTreeClassifier / DecisionTreeRegressor from package scikit-learn
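
A minimal scikit-learn usage sketch with explicit complexity control (depth, leaf size, and cost-complexity pruning are placeholder settings):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# Limit tree growth via depth / leaf size and prune via ccp_alpha
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01, random_state=0)
print(cross_val_score(tree, X, y, cv=5).mean())
print(export_text(tree.fit(X, y)))   # textual view of the learned splits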

LMU SLDS © Important Learning Algorithms in ML – 20 / 48


CART – PROS & CONS
Dual purpose of CART
Exploration purpose to obtain interpretable decision rules (here: performance/tuning is secondary)
Prediction model: CART as base learner in ensembles (bagging, random forest, boosting) can improve
stability and performance (if tuned properly), but becomes less interpretable

Advantages
+ Easy to understand & visualize (interpretable)
+ Built-in feature selection → e.g., when features are not used for splitting
+ Applicable to categorical features → e.g., 2^(m−1) − 1 possible binary splits for m categories → trick for regression with L2 loss and binary classification: categories can be sorted ⇒ m − 1 binary splits
+ Handling of missings possible via surrogate splits
+ Models interactions, even of higher order
+ Fast computation and good scalability
+ High flexibility with custom split criteria or leaf-node prediction rules

Disadvantages
− Rather poor generalization
− High variance/instability: model can change a lot when training data is minimally changed
− Can overfit if tree is grown too deep
− Not well-suited to model linear relationships
− Bias toward features with many unique values or categories

LMU SLDS © Important Learning Algorithms in ML – 21 / 48


RANDOM FORESTS – METHOD SUMMARY
REGRESSION CLASSIFICATION NONPARAMETRIC BLACK-BOX FEATURE SELECTION

General idea
Bagging ensemble of M tree base learners fitted on bootstrap data samples
⇒ Reduce variance by ensembling while slightly increasing bias by bootstrapping
Use unstable, high-variance base learners by letting trees grow to full size
Promoting decorrelation by random subset of candidate features for each split
Predict via averaging (regression) or majority vote (classification) of base learners
Hypothesis space $\mathcal{H} = \left\{ f(x): f(x) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T^{[m]}} c_t^{[m]} \mathbb{I}\left(x \in Q_t^{[m]}\right) \right\}$

Figures: schematic depiction of the bagging process; prediction surface for the iris data with a 500-tree ensemble.

LMU SLDS © Important Learning Algorithms in ML – 22 / 48


RANDOM FORESTS – METHOD SUMMARY
Empirical risk & Optimization Just like tree base learners

Out-of-bag (OOB) error


Ensemble prediction for obs. outside individual trees’ bootstrap training sample ⇒ unseen test sample
Use resulting loss as unbiased estimate of generalization error
Mainly useful for tuning and less for model comparison as we usually compare all models uniformly by CV

Feature importance
Based on improvement in split criterion: aggregate improvements by all splits using j-th feature
Based on permutation: permute j-th feature in OOB observations and compute impact on OOB error

Hyperparameters
Ensemble size, i.e., number of trees
Complexity of base learners, e.g., tree depth, min-split, min-leaf-size
Number of split candidates, i.e., number of features to be considered at each split

⇒ frequently used heuristics with a total of p features: ⌊√p⌋ for classification, ⌊p/3⌋ for regression
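
A scikit-learn sketch illustrating the OOB error and permutation-based feature importance (dataset and settings are placeholders; the permutation importance here is computed on a held-out set rather than on the OOB samples):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features="sqrt" mirrors the floor(sqrt(p)) heuristic for classification
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
print("OOB accuracy:", rf.oob_score_)

imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(imp.importances_mean[:5])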

LMU SLDS © Important Learning Algorithms in ML – 23 / 48


RANDOM FORESTS – IMPLEMENTATION & PRACTICAL HINTS
Extremely Randomized Trees
Variance of trees can be further increased by randomizing split points instead of using the optimal one
Alternatively consider k random splits and pick the best one according to impurity

Tuning
Ensemble size should not be tuned, as more trees only decrease variance → choose a sufficiently large ensemble
While the default value for the number of split candidates is often good, tuning it can still improve performance
Tuning the minimum samples per leaf and the minimum samples for splitting can be beneficial, but no huge performance increases are to be expected

Implementation
R: mlr3 learners LearnerClassifRanger / LearnerRegrRanger, calling ranger::ranger() as a
highly efficient and flexible implementation
Python: RandomForestClassifier / RandomForestRegressor from package scikit-learn

LMU SLDS © Important Learning Algorithms in ML – 24 / 48


RANDOM FORESTS – PROS & CONS
Advantages
+ Retains most of trees’ advantages (e.g., feature selection, feature interactions)
+ Fairly good predictor: mitigating base learners’ variance through bagging
+ Quite robust w.r.t. small changes in data
+ Good with high-dimensional data, even in presence of noisy features
+ Easy to parallelize
+ Robust to its hyperparameter configuration
+ Intuitive measures of feature importance

Disadvantages
− Loss of individual trees’ interpretability
− Can be suboptimal for regression when extrapolation is needed
− Bias toward selecting features with many categories (same as CART)
− Rather large model size and slow inference time for large ensembles
− Typically inferior in performance to tuned gradient tree boosting

LMU SLDS © Important Learning Algorithms in ML – 25 / 48


GRADIENT BOOSTING – METHOD SUMMARY
REGRESSION CLASSIFICATION (NON)PARAMETRIC BLACK-BOX FEATURE SELECTION

General idea
Sequential ensemble of M base learners by greedy forward stagewise additive modeling
In each iteration a base learner is fitted to current pseudo residuals ⇒ one boosting iteration is
one approximate gradient step in function space
Base learners are typically trees, linear regressions or splines
Predict via (weighted) sum of base learners
Hypothesis space $\mathcal{H} = \left\{ f(x): f(x) = \sum_{m=1}^{M} \beta^{[m]} b\left(x, \theta^{[m]}\right) \right\}$

Boosting prediction function with GAM base learners for univariate Boosting prediction surface with tree base learners for iris data after 100
regression problem after 10 iterations iterations (right: contour lines of discriminant functions)

LMU SLDS © Important Learning Algorithms in ML – 26 / 48


GRADIENT BOOSTING – METHOD SUMMARY
Empirical risk
In general, compatible with any differentiable loss
Base learner in iteration m is fitted on the pseudo-residuals $\tilde{r}^{(i)} = -\frac{\partial L\left(y^{(i)}, f\left(x^{(i)}\right)\right)}{\partial f\left(x^{(i)}\right)}$ by minimizing the L2 loss $\sum_{i=1}^{n} \left(\tilde{r}^{(i)} - b\left(x^{(i)}, \theta\right)\right)^2$

Optimization
Same optimization procedure as base learner, while keeping the current ensemble f̂ [m−1] fixed
⇒ Efficient and generally applicable since inner loss is always L2
$\beta^{[m]}$ is found via line search or fixed to a small constant value; for tree base learners it is combined with the leaf values $c_t^{[m]}$: $\tilde{c}_t^{[m]} = \beta^{[m]} \cdot c_t^{[m]}$

Hyperparameters
Ensemble size, i.e., number of base learners
Complexity of base learners (depending on type used)
Learning rate β , i.e., impact of next base learner
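
A minimal sketch of these ideas for regression with L2 loss, where the pseudo-residuals are simply y − f(x) and shallow scikit-learn trees act as base learners (all settings and names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_l2(X, y, M=100, learning_rate=0.1, max_depth=2):
    """Forward stagewise boosting: fit each tree to the current pseudo-residuals."""
    f = np.full(len(y), y.mean())             # initial constant model
    trees = []
    for _ in range(M):
        residuals = y - f                     # negative gradient of the L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f += learning_rate * tree.predict(X)  # damped update (learning rate = beta)
        trees.append(tree)
    return y.mean(), trees

def boosted_predict(const, trees, X, learning_rate=0.1):
    return const + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)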

LMU SLDS © Important Learning Algorithms in ML – 27 / 48


GRADIENT BOOSTING – PRACTICAL HINTS
Scalable Gradient Boosting
Feature and data subsampling for each base learner fit
Parallelization and approximate split finding for tree base learners
GPU acceleration

Explainable / Componentwise Gradient Boosting


Base learners of simple linear regression models or splines, selecting a single feature in each iteration
Allows feature selection and creates an interpretable model since uni- and bivariate effects can be
visualized directly.
Feature interactions can be learned via ranking techniques (e.g., FAST, as used in GA²M)

Tuning
Use early-stopping to determine ensemble size
Various regularization parameters, e.g., L1/L2, number of leaves, ... that need to be carefully tuned
Tune learning rate and base learner complexity hyperparameters on log-scale
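
A hedged scikit-learn sketch of early stopping to determine the ensemble size (validation fraction, patience, and learning rate are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
# Stop adding trees once the validation score no longer improves
gbm = HistGradientBoostingClassifier(max_iter=1000, learning_rate=0.05,
                                     early_stopping=True, validation_fraction=0.1,
                                     n_iter_no_change=20, random_state=0)
gbm.fit(X, y)
print("boosting iterations actually used:", gbm.n_iter_)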

LMU SLDS © Important Learning Algorithms in ML – 28 / 48


GRADIENT BOOSTING – IMPLEMENTATION
Gradient Tree Boosting
R: mlr3 learners LearnerClassifXgboost / LearnerRegrXgboost, LearnerClassifLightGBM /
LearnerRegrLightGBM
Python: GradientBoostingClassifier / GradientBoostingRegressor from package scikit-learn, XGBClassifier / XGBRegressor from package xgboost, LGBMClassifier / LGBMRegressor from package lightgbm
⇒ LightGBM current state-of-the-art but slightly more complicated to use than xgboost
Componentwise Gradient Boosting
R: mboost from package mboost, boostLinear / boostSplines from package compboost
Python: /
⇒ mboost very flexible but slow while compboost is much faster with limited features

LMU SLDS © Important Learning Algorithms in ML – 29 / 48


GRADIENT BOOSTING – PROS & CONS
Advantages
+ Retains most of base learners’ advantages
+ Very good predictor due to aggressive loss minimization, typically only outperformed by heterogeneous stacking ensembles
+ High flexibility via custom loss functions and choice of base learner
+ Highly efficient implementations exist (lightgbm / xgboost) that work well on large (distributed) data sets
+ Componentwise boosting: good combination of (a) high performance, (b) interpretable model, and (c) feature selection

Disadvantages
− Loss of base learners’ potential interpretability
− Many hyperparameters to be carefully tuned
− Hard to parallelize (partly solved by efficient implementations)

LMU SLDS © Important Learning Algorithms in ML – 30 / 48


LINEAR SVM – METHOD SUMMARY
CLASSIFICATION REGRESSION PARAMETRIC WHITE-BOX

General idea (Soft-margin SVM)


Find linear decision boundary (separating hyperplane) that
maximizes distance (margin γ ) to closest points (support
vectors, SVs) on each side of decision boundary
while minimizing margin violations (points either on wrong side of
hyperplane or between dashed margin line and hyperplane)
3 types of training points
non-SVs with no impact on decision boundary
SVs that are margin violators and affect decision boundary
SVs located exactly on dashed margin lines and affect decision
boundary

Hypothesis space (primal) $\mathcal{H} = \left\{ f(x): f(x) = \theta^\top x + \theta_0 \right\}$

Figure: soft-margin SVM with margin violations.

LMU SLDS © Important Learning Algorithms in ML – 31 / 48


LINEAR SVM – METHOD SUMMARY
Empirical risk Soft-margin SVM as L2-regularized ERM:
$\frac{1}{2}\|\theta\|_2^2 + C \sum_{i=1}^{n} L\left(y^{(i)}, f\left(x^{(i)}\right)\right)$
$\|\theta\| = 1/\gamma$ (≙ maximizing the margin)
$C > 0$: penalization for margin violations
Loss aims at minimizing margin violations
→ Classif. (hinge loss): $L(y, f) = \max(1 - yf, 0)$
→ Regr. (ε-insensitive loss): $L(y, f) = \max(|y - f| - \varepsilon, 0)$

Dual problem SVMs as a constrained optimization (primal) problem (maximize the margin s.t. constraints on observations to limit margin violations) can be formulated as a Lagrangian dual problem with Lagrange multipliers $\alpha_i \geq 0$:
$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y^{(i)} y^{(j)} \left\langle x^{(i)}, x^{(j)} \right\rangle \quad \text{s.t.} \;\; 0 \leq \alpha_i \leq C \;\; \forall i \in \{1, \dots, n\} \;\; \text{and} \;\; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0$

Solution Non-SVs have $\alpha_i = 0$ as they do not affect the hyperplane:
$f(x) = \sum_{i=1}^{n} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + \theta_0$

LMU SLDS © Important Learning Algorithms in ML – 32 / 48


LINEAR SVM – METHOD SUMMARY
Optimization
Typically, tackling dual problem (though feasible in corresponding primal) via quadratic programming
Popular: sequential minimal optimization ⇒ iterative algorithm based on breaking down objective into
bivariate quadratic problems with analytical solutions

Hyperparameters Cost parameter C to control maximization of the margin vs. minimizing margin violations

Figures – Hard-margin SVM: the margin is maximized by the boundary on the right. Soft-margin SVM: large margin and few margin violations on the right (best trade-off).

LMU SLDS © Important Learning Algorithms in ML – 33 / 48


LINEAR SVM – IMPLEMENTATION & PRACTICAL HINTS
Preprocessing Features should be scaled before applying SVMs (applies generally to regularized models)

Tuning
Tuning of cost parameter C advisable ⇒ strong influence on the resulting hyperplane
C is often tuned on a log-scale grid for an optimal and space-filling search space
Figure: grid for C spaced evenly on the log scale (0 to log(100)), shown against the corresponding untransformed values between 1 and 100.
Implementation
R: mlr3 learners LearnerClassifSVM / LearnerRegrSVM, calling e1071::svm() with linear kernel
(libSVM interface). Further implementations in mlr3extralearners based on
kernlab::ksvm() allowing custom kernels
LiblineaR::LiblineaR() for a fast implementation with linear kernel
Python: [Link] from package scikit-learn / package libSVM
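
A scikit-learn sketch of such a log-scale search over C (dataset and grid bounds are placeholders):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
# Search C on a log-spaced grid, as recommended above
grid = GridSearchCV(pipe, {"linearsvc__C": np.logspace(-3, 3, 13)}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)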

LMU SLDS © Important Learning Algorithms in ML – 34 / 48


NONLINEAR SVM – METHOD SUMMARY
CLASSIFICATION REGRESSION NONPARAMETRIC BLACK-BOX

General idea
Move beyond linearity by mapping data to transformed space where they are linearly separable
Kernel trick
No need for explicit construction of feature maps
Replace the inner product of the feature map $\phi: \mathcal{X} \to \Phi$ by a kernel: $\langle \phi(x), \phi(\tilde{x}) \rangle = k(x, \tilde{x})$

Hypothesis space
$\mathcal{H} = \left\{ f(x): f(x) = \operatorname{sign}\left(\theta^\top \phi(x) + \theta_0\right) \right\}$ (primal)
$\mathcal{H} = \left\{ f(x): f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y^{(i)} k\left(x^{(i)}, x\right) + \theta_0\right) \;\middle|\; \alpha_i \geq 0, \; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0 \right\}$ (dual)

Figures: nonlinear problem in the original space; mapping to 3D space and subsequent linear separation – implicitly handled by the kernel in the nonlinear SVM.

LMU SLDS © Important Learning Algorithms in ML – 35 / 48


NONLINEAR SVM – METHOD SUMMARY
Dual problem Kernelize the dual (soft-margin) SVM problem, replacing all inner products by kernels:
$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y^{(i)} y^{(j)} k\left(x^{(i)}, x^{(j)}\right), \quad \text{s.t.} \;\; 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0.$

Hyperparameters Cost C of margin violations, kernel hyperparameters (e.g., width of RBF kernel)

Interpretation as basis function approach

Representer theorem: the solution of the dual soft-margin SVM problem is $\theta = \sum_{j=1}^{n} \beta_j \phi\left(x^{(j)}\right)$
Sparse, weighted sum of basis functions → $\beta_j = 0$ for non-SVs
Result: local model with smoothness depending on the kernel

Figure: RBF kernel as a mixture of Gaussian basis functions, forming a bumpy, nonlinear decision surface to discern red and green points.

LMU SLDS © Important Learning Algorithms in ML – 36 / 48


NONLINEAR SVM – IMPLEMENTATION & PRACTICAL HINTS
Common kernels
Linear kernel: dot product of the given observations ⇒ $k(x, \tilde{x}) = x^\top \tilde{x}$ ⇒ linear SVM
Polynomial kernel of degree $d \in \mathbb{N}$: monomials (i.e., feature interactions) up to d-th order ⇒ $k(x, \tilde{x}) = \left(x^\top \tilde{x} + b\right)^d$, $b \geq 0$
Radial basis function (RBF) kernel: infinite-dimensional feature space, allowing perfect separation of all finite datasets ⇒ $k(x, \tilde{x}) = \exp\left(-\gamma \|x - \tilde{x}\|_2^2\right)$ with bandwidth parameter $\gamma > 0$


Tuning
High sensitivity w.r.t. hyperparameters, especially those of kernel ⇒ tuning very important
For RBF kernels, use RBF sigma heuristic to determine bandwidth
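
A scikit-learn sketch combining a simple median-distance bandwidth heuristic (assumed here as one common variant of such sigma heuristics) with a log-scale search over C and gamma around it:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)   # scaled copy, only used to compute the heuristic

# Median heuristic: gamma ~ 1 / (2 * median squared pairwise distance)
gamma0 = 1.0 / (2.0 * np.median(pairwise_distances(Xs) ** 2))

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": np.logspace(-2, 2, 5),
              "svc__gamma": gamma0 * np.logspace(-1, 1, 5)}   # search around the heuristic
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)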

Implementation
R: mlr3 learners LearnerClassifSVM / LearnerRegrSVM, calling e1071::svm() with nonlinear kernel
(libSVM interface), kernlab::ksvm() allowing custom kernels
Python: [Link] from package scikit-learn / package libSVM

LMU SLDS © Important Learning Algorithms in ML – 37 / 48


SVM – PROS & CONS
Advantages
+ Often sparse solution (w.r.t. observations)
+ Robust against overfitting (regularized), especially in high-dimensional space
+ Stable solutions (w.r.t. changes in training data) → non-SVs do not affect the decision boundary
+ Convex optimization problem → local minimum ≙ global minimum

Disadvantages
− Long training times → O(n²p + n³)
− Confined to linear model
− Restricted to continuous features
− Optimization can also fail or get stuck

Advantages (nonlinear SVM)
+ Can learn nonlinear decision boundaries
+ Very flexible due to custom kernels → RBF kernel yields local model → kernels for time series, strings, etc.

Disadvantages (nonlinear SVM)
− Poor interpretability due to complex kernel
− Not easily tunable, as it is highly important to choose the right kernel (which also introduces further hyperparameters)

LMU SLDS © Important Learning Algorithms in ML – 38 / 48


GAUSSIAN PROCESSES (GP) – METHOD SUMMARY
REGRESSION CLASSIFICATION NONPARAMETRIC PROBABILISTIC

General idea
GPs model a distribution over potential functions f that fit the observed data
Assumptions:
The n observations follow an n-dimensional normal distribution
The closer two observations are, the more strongly they are correlated
A kernel function k (x(i ) , x(j ) ) quantifies the similarity between two observations and induces the
covariance matrix of the distribution.
Predict via the maximum a-posteriori (MAP) estimate.
Hypothesis space $\mathcal{H} = \left\{ f = \left[ f\left(x^{(1)}\right), \dots, f\left(x^{(n)}\right) \right] \sim \mathcal{N}(m, K) \;\middle|\; m \in \mathbb{R}^n, K \in \mathbb{R}^{n \times n} \right\}$
Figures: functions drawn from a Gaussian process prior (left); posterior process after 4 observations (right).

LMU SLDS © Important Learning Algorithms in ML – 39 / 48


GAUSSIAN PROCESSES (GP) – METHOD SUMMARY
Empirical risk
The risk is estimated by using the posterior of a conditional Normal distribution
Most kernels have length scale parameters that need to be estimated
Optimization
The kernel parameters can be learned using maximum likelihood estimation
This requires inverting the n × n -covariance matrix
Hyperparameters
The most important hyperparameter is the choice of the kernel function k (x(i ) , x(j ) )
Common kernel choices for "standard" data are:
Linear or polynomial
Squared-exponential (infinitely differentiable)
Matérn (further generalization of the squared-exponential kernel)
Special kernels for all kinds of data situations exist, e.g., an Exp-Sine-Squared kernel for periodic data
Kernels can be composed by multiplication or addition to create more expressive structures

LMU SLDS © Important Learning Algorithms in ML – 40 / 48


GP – IMPLEMENTATION & PRACTICAL HINTS
Scalable GPs for larger data
Low-rank approximations of the covariance by using only a representative subset of inducing points
Using a kernel that creates a sparse covariance matrix

Noisy GPs
Having an interpolator might not be suitable if the data is noisy
A noisy GP adds a nugget effect to the kernel, $k\left(x^{(i)}, x^{(j)}\right) + \sigma \delta_{ij}$, creating a Gaussian process regression model

Implementation
R: mlr3 learners LearnerClassifGausspr / LearnerRegrGausspr, calling kernlab::gausspr()
Python: GaussianProcessClassifier / GaussianProcessRegressor from package
scikit-learn, gpytorch for a modular, scalable, efficient and GPU accelerated implementation built on
torch
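
A scikit-learn sketch of such a noisy GP regression, using an RBF (squared-exponential) kernel plus a WhiteKernel as the nugget term (data and kernel settings are placeholders):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 30)

# Length scale and noise level are optimized by maximum likelihood during fit()
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

X_new = np.linspace(-2, 2, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)   # predictive mean and uncertainty
print(mean, std)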

LMU SLDS © Important Learning Algorithms in ML – 41 / 48


GAUSSIAN PROCESSES (GP) – PROS & CONS
Advantages
+ GPs allow quantification of prediction uncertainty induced by both intrinsic noise in the problem and errors in the parameter estimation process
+ A GP is a function interpolator and will predict the exact value of a training point
+ The choice of kernel function allows considerable flexibility for problem-specific characteristics
+ Automatic relevance determination (ARD) determines the importance of features

Disadvantages
− GPs are not sparse, i.e., they require the full training data for prediction
− GP training requires O(n³), i.e., it scales cubically in the number of observations
− GPs cannot handle categorical features
− GPs are not particularly easy to understand conceptually

LMU SLDS © Important Learning Algorithms in ML – 42 / 48


NEURAL NETWORKS – METHOD SUMMARY
REGRESSION CLASSIFICATION (NON)PARAMETRIC BLACK-BOX

General idea
Learn composite function through series of nonlinear feature transformations, represented as neurons,
organized hierarchically in layers
Basic neuron operation: 1) affine transformation φ (weighted sum of inputs), 2) nonlinear
activation σ
Combinations of simple building blocks to create a complex model
Optimize via mini-batch stochastic gradient descent (SGD) variants:
Gradient of each weight can be inferred from the computational graph of the network
→ Automatic Differentiation (AutoDiff)
Algorithm to compute weight updates based on the loss is called Backpropagation
Hypothesis space $\mathcal{H} = \left\{ f(x): f(x) = \tau \circ \phi \circ \sigma^{(h)} \circ \phi^{(h)} \circ \sigma^{(h-1)} \circ \phi^{(h-1)} \circ \dots \circ \sigma^{(1)} \circ \phi^{(1)}(x) \right\}$

LMU SLDS © Important Learning Algorithms in ML – 43 / 48


NEURAL NETWORKS – METHOD SUMMARY
Architecture
Input layer: original features x
Hidden layers: nonlinear transformation of previous layer φ(h) = σ (h−1) (φ(h−1) )
Output layer: number of output neurons and activation depends on problem τ (φ)
Regression: one output neuron, τ = identity
Binary classification: one output neuron, $\tau(f) = \frac{1}{1 + \exp(-f)}$ (logistic sigmoid)
Multiclass classification: g output neurons, $\tau_j(f) = \frac{\exp(f_j)}{\sum_{k=1}^{g} \exp(f_k)}$ (softmax)

Empirical risk In general, compatible with any differentiable loss


Optimization
Variety of different optimizers, mostly based on some form of stochastic gradient descent (SGD)
Improvements:
(1) Accumulation of previous gradients → Momentum
(2) Weight-specific scaling based on previous squared gradients → RMSProp
⇒ ADAM combines (1) and (2)
(3) Learning rate schedules, e.g., decaying or cyclical learning rates
Training progress is measured in full passes over the full training data, called epochs
Batch size is a hyperparameter and limited by input data dimension
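
A minimal PyTorch sketch of an MLP trained with mini-batch Adam (toy data; architecture and hyperparameters are placeholders):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 100 observations, 20 features, binary target
X = torch.randn(100, 20)
y = torch.randint(0, 2, (100,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

# Two hidden layers (affine transformation + nonlinear activation), one output neuron
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()                 # logistic sigmoid + Bernoulli loss
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                          # one epoch = full pass over the data
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb).squeeze(1), yb) # forward pass
        loss.backward()                          # backpropagation via autodiff
        opt.step()                               # Adam weight update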

LMU SLDS © Important Learning Algorithms in ML – 44 / 48


NEURAL NETWORKS – METHOD SUMMARY
Network types Large variety of architectures for different data modalities
Feedforward NNs / multi-layer perceptrons (MLPs): sequence of fully-connected layers ⇒ tabular
data
Convolutional NNs (CNNs): sequence of feature map extractors with spatial awareness ⇒ images, time
series
Recurrent NNs (RNNs): handling of sequential, variable-length information ⇒ times series, text, audio
Transformers: Learning invariances from data, handling multiple/any data modalities

Figures: convolutional network architecture; recurrent network architecture; transformer network architecture.

LMU SLDS © Important Learning Algorithms in ML – 45 / 48


NEURAL NETWORKS – METHOD SUMMARY
Hyperparameters
Architecture:
Lots of design choices ⇒ tuning problem of its own.
Typically: hierarchical optimization of components (cells) and the macro structure of the network
→ Neural Architecture Search (NAS)
Many predefined (well-working) architectures exist for standard tasks
Training:
Initial learning rate and various regularization parameters
Number of epochs is determined by early-stopping
Data-augmentation, e.g., applying random rotations to input images

Foundation models
Enormous models trained on vast amounts of (general) data, e.g., all of wikipedia, in self-supervised
fashion
Used as starting point (pre-trained) and fine-tuned via transfer or few-shot learning for other tasks
requiring little data
Examples: GPT-3 for language, CLIP for vision-language, . . .

LMU SLDS © Important Learning Algorithms in ML – 46 / 48


NEURAL NETWORKS – IMPLEMENTATION & PRACTICAL HINTS
General hints
Instead of NAS, use a standard architecture and tune training hyperparameters
Training pipeline (data-augmentation, training schedules, ...) is more crucial than the specific architecture
While NNets are state-of-the-art for computer vision (CV) and natural language processing (NLP), we
recommend not to use them for tabular data because alternatives perform better
Computational efforts for training (and inference) can be very high, requiring specific hardware.
→ Using a service (esp. for foundation models) can be more cost efficient

Implementation
R: Use python libraries (below) via reticulate, but not really recommended except for toy applications.
Python libraries:
keras for simple high level API
PyTorch for flexible design with a focus on research
TensorFlow for flexible design with a focus on deployment / industry
huggingface for pre-trained / foundation models

LMU SLDS © Important Learning Algorithms in ML – 47 / 48


NEURAL NETWORKS – PROS & CONS
Advantages
+ Applicable to complex, nonlinear problems
+ Very versatile w.r.t. architectures
+ State-of-the-art for CV and NLP
+ Strong performance if done right
+ Built-in feature extraction, obtained by intermediate representations
+ Easy handling of high-dimensional data
+ Parallelizable training

Disadvantages
− Typically high computational cost
− High demand for training data
− Strong tendency to overfit
− Requires lots of tuning expertise
− Black-box model – hard to interpret or explain

LMU SLDS © Important Learning Algorithms in ML – 48 / 48
