Slides Learners
Learning goals
General idea of important ML algorithms
Overview of strengths and weaknesses
CONTENTS
1 k-Nearest Neighbors (k-NN)
2 Linear & Logistic Regression (GLM)
3 Generalized Additive Models (GAM)
4 Classification & Regression Trees (CART)
5 Random Forests
6 Gradient Boosting
7 Support Vector Machines (linear)
8 Support Vector Machines (nonlinear / kernel)
9 Gaussian Processes
10 Neural Networks
General idea
Similarity in feature space (w.r.t. a certain distance metric d(x^(i), x)) ⇒ similarity in target space
Prediction for x: construct the k-neighborhood N_k(x) from the k points closest to x in X, then predict
(weighted) mean target for regression: ŷ = ( Σ_{i: x^(i) ∈ N_k(x)} w_i y^(i) ) / ( Σ_{i: x^(i) ∈ N_k(x)} w_i ), with w_i = 1 / d(x^(i), x)
most frequent class in N_k(x) for classification
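A minimal NumPy sketch of this distance-weighted regression prediction (function name and data are illustrative, not part of the slides):

import numpy as np

def knn_predict(X, y, x, k=5, eps=1e-12):
    """Distance-weighted k-NN regression prediction for a single query point x."""
    d = np.linalg.norm(X - x, axis=1)       # distances d(x^(i), x) to all training points
    idx = np.argsort(d)[:k]                 # indices of the k nearest neighbors N_k(x)
    w = 1.0 / (d[idx] + eps)                # inverse-distance weights w_i = 1 / d(x^(i), x)
    return np.sum(w * y[idx]) / np.sum(w)   # weighted mean of neighbor targets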
Classification
Left: Neighborhood for exemplary observation in iris, k = 50
Middle: Prediction surface for k = 1
Right: Prediction surface for k = 50
Regression
Left: Prediction surface for k = 3
Middle: Prediction surface for k = 7
Right: Prediction surface for k = 15
Implementation
R: mlr3 learners (calling kknn::kknn())
Classification:
- LearnerClassifKKNN
- FNN::knn()
Regression:
- LearnerRegrKKNN
- FNN::knn.reg()
Nearest Neighbour Search in O(N log N ): RANN::nn2()
Python: From package scikit-learn (module sklearn.neighbors)
Classification:
- KNeighborsClassifier()
- RadiusNeighborsClassifier() as alternative if data not uniformly sampled
Regression:
- KNeighborsRegressor()
- RadiusNeighborsRegressor() as alternative if data not uniformly sampled
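A minimal scikit-learn sketch (iris data as in the figures above; k = 50 and inverse-distance weighting are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k = 50 neighbors, inverse-distance weighting as in the formula above
knn = KNeighborsClassifier(n_neighbors=50, weights="distance")
knn.fit(X, y)
print(knn.predict(X[:5]), knn.score(X, y))   # training-set predictions and accuracy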
General idea Represent the target as a function of the linear predictor θ^⊤x (weighted sum of features)
→ Interpretation: if feature x_j increases by 1 unit, the linear predictor changes by θ_j units
Hypothesis space
H = { f : X → R | f(x) = φ(θ^⊤x) }, with suitable transformation φ(·), e.g.,
Linear regression: Y = R, φ the identity
Logistic regression: Y = {0, 1}, logistic sigmoid φ(θ^⊤x) = 1 / (1 + exp(−θ^⊤x)) =: π(x | θ)
⇒ Decision rule: linear hyperplane
Logistic function s(f) plotted over f, for bivariate input and loss-minimal θ
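A minimal scikit-learn sketch of logistic regression (two hypothetical features; the fitted coefficients play the role of θ and θ_0 above):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                            # two features
y = (X @ np.array([1.5, -2.0]) + 0.5 > 0).astype(int)    # labels from a linear rule

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)     # estimated θ (slopes) and θ_0 (intercept)
print(clf.predict_proba(X[:3]))      # π(x | θ) for the first three observations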
General idea
Same as GLM, but introduce flexibility through nonlinear (smooth) effects fj (xj )
Typically, combination of linear & smooth effects
Smooth effects also conceivable for feature interactions
Hypothesis space
H = { f : X → R | f(x) = φ( θ_0 + Σ_{j=1}^p f_j(x_j) ) }, with suitable transformation φ(·)
Prediction of bike rentals from smooth term of humidity (left: partial effect) and linear term of temperature (right: bivariate prediction).
Left: B-spline basis with 9 basis functions. Middle: basis functions weighted with coefficients estimated for humidity. Right: sum of weighted basis functions in black (= partial effect).
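A rough sketch of this basis-function idea in scikit-learn (SplineTransformer for the B-spline basis, ridge regression as a stand-in for the penalized fit; x and y are hypothetical, and a full GAM would additionally select the smoothing hyperparameters, e.g., via generalized cross-validation as noted below):

import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))                        # single feature, e.g., humidity
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.3, 300)   # hypothetical noisy smooth signal

# B-spline basis expansion of x, followed by an L2-penalized linear fit:
# the fitted curve is the sum of weighted basis functions (the "partial effect").
smooth = make_pipeline(SplineTransformer(n_knots=9, degree=3), Ridge(alpha=1.0))
smooth.fit(x, y)
print(smooth.predict(np.array([[0.25], [0.75]])))           # partial effect at two grid points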
Optimization
Coefficients (of smooth + linear terms): penalized MLE, Bayesian inference
Smoothing hyperparameters: typically, generalized cross-validation
Advantages
+ Simple and fast implementation
+ Applicable for any dataset size, as long as number of observations ≫ number of features
+ High flexibility via smooth effects
+ Easy to combine linear & nonlinear effects
+ Intuitive interpretability via feature effects (though not quite as straightforward as in GLM)
+ Statistical hypothesis tests for effects available
Disadvantages
− Sensitivity w.r.t. outliers and noisy data
− Feature interactions must be handcrafted → practically infeasible for higher orders
− Harder to optimize than GLM
− Additional hyperparameters (type of smooth functions, smoothness degree, ...)
General idea
Recursively partition the feature space into rectangular regions Q_1, . . . , Q_M
Find optimal split (feature-threshold combination) → greedy search
Assign constant prediction c_m to all obs. in Q_m
Illustration: iris data (setosa, versicolor, virginica) and constant predictions per region.
Compute empirical risks in child nodes and minimize their sum to find the best split (impurity reduction):
arg min_{j,t} R(N, j, t) = arg min_{j,t} R(N_1) + R(N_2)
Note: If R is the average instead of the sum of loss functions, we need to reweight: (|N_t| / |N|) R(N_t)
In general, compatible with arbitrary losses – typical choices:
g-way classification:
Brier score → Gini impurity: R(N) = Σ_{(x,y) ∈ N} Σ_{k=1}^g π̂_k(N) (1 − π̂_k(N))
Bernoulli loss → entropy impurity: R(N) = − Σ_{(x,y) ∈ N} Σ_{k=1}^g π̂_k(N) log π̂_k(N)
Regression (quadratic loss): R(N) = Σ_{(x,y) ∈ N} (y − c)^2 with c = (1/|N|) Σ_{(x,y) ∈ N} y
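A small NumPy sketch of these node risks, with the class proportions π̂_k(N) estimated from the labels in a hypothetical node:

import numpy as np

def node_risks(y_node):
    """Gini- and entropy-based node risks for the class labels in one node N."""
    n = len(y_node)
    _, counts = np.unique(y_node, return_counts=True)
    p = counts / n                             # class proportions pi_hat_k(N)
    gini = n * np.sum(p * (1 - p))             # each observation contributes the same inner sum
    entropy = -n * np.sum(p * np.log(p))       # Bernoulli loss / entropy impurity
    return gini, entropy

print(node_risks(np.array([0, 0, 1, 1, 1, 2])))   # hypothetical node with three classes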
Optimization
Exhaustive search over all split candidates, choice of risk-minimal split
In practice: reduce number of split candidates (e.g., using quantiles instead of all observed values)
CART – IMPLEMENTATION & PRACTICAL HINTS
Hyperparameters and complexity control
Unless interrupted, splitting continues until we have pure leaf nodes (costly + overfitting)
Hyperparameters: Complexity (i.e., number of terminal nodes) controlled via tree depth, minimum number
of observations per node, maximum number of leaves, minimum risk reduction per split, ...
Limit tree growth / complexity via
Early stopping: stop growth prematurely
→ hard to determine good stopping point before actually trying all combinations
Pruning: grow deep trees and cut back in risk-optimal manner afterwards
Implementations
R:
CART: mlr3 learners LearnerClassifRpart / LearnerRegrRpart, calling rpart::rpart()
Conditional inference trees: partykit::ctree()
mitigates overfitting by controlling tree size via p-value-based splitting
Model-based recursive partitioning: partykit::mob()
fits a linear model within each terminal node of the decision tree
Rule-based models: Cubist::cubist() for regression and C50::C5.0() for classification; more
flexible frameworks for fitting various types of models (e.g., GLMs) within a tree’s terminal nodes
Python: DecisionTreeClassifier / DecisionTreeRegressor from package scikit-learn
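A minimal scikit-learn sketch using a few of the complexity controls above (iris data; the hyperparameter values are illustrative, and ccp_alpha corresponds to cost-complexity pruning):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limit tree growth via depth, minimum leaf size, and cost-complexity pruning (ccp_alpha)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=5, ccp_alpha=0.01)
tree.fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))   # text view of the splits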
Advantages
+ Easy to understand & visualize (interpretable)
+ Built-in feature selection → e.g., when features are not used for splitting
+ Applicable to categorical features
→ e.g., 2^(m−1) − 1 possible binary splits for m categories
→ trick for regr. with L2-loss and binary classif.: categories can be sorted ⇒ m − 1 binary splits
+ Handling of missings possible via surrogate splits
+ Models interactions, even of higher order
+ Fast computation and good scalability
+ High flexibility with custom split criteria or leaf-node prediction rules
Disadvantages
− Rather poor generalization
− High variance/instability: model can change a lot when training data is minimally changed
− Can overfit if tree is grown too deep
− Not well-suited to model linear relationships
− Bias toward features with many unique values or categories
General idea
Bagging ensemble of M tree base learners fitted on bootstrap data samples
⇒ Reduce variance by ensembling while slightly increasing bias by bootstrapping
Use unstable, high-variance base learners by letting trees grow to full size
Promote decorrelation via a random subset of candidate features for each split
Predict via averaging (regression) or majority vote (classification) of base learners
Hypothesis space
H = { f(x) : f(x) = (1/M) Σ_{m=1}^M Σ_{t=1}^{T^[m]} c_t^[m] I(x ∈ Q_t^[m]) }
Left: Schematic depiction of bagging process. Right: Prediction surface for iris data with 500-tree ensemble.
Feature importance
Based on improvement in split criterion: aggregate improvements by all splits using j-th feature
Based on permutation: permute j-th feature in OOB observations and compute impact on OOB error
Hyperparameters
Ensemble size, i.e., number of trees
Complexity of base learners, e.g., tree depth, min-split, min-leaf-size
Number of split candidates, i.e., number of features to be considered at each split
⇒ frequently used heuristics with a total of p features: ⌊√p⌋ for classification, ⌊p/3⌋ for regression
Tuning
Ensemble size should not be tuned as it only decreases variance → choose a sufficiently large ensemble
While the default value for the number of split candidates is often good, tuning it can still improve performance
Tuning the minimum number of samples per leaf and the minimum number of samples required for splitting can be beneficial, but no huge performance increases are to be expected
Implementation
R: mlr3 learners LearnerClassifRanger / LearnerRegrRanger, calling ranger::ranger() as a
highly efficient and flexible implementation
Python: RandomForestClassifier / RandomForestRegressor from package scikit-learn
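A minimal scikit-learn sketch reflecting the hyperparameters and both feature-importance flavors above (iris data; the values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)

# 500 trees, sqrt(p) candidate features per split, OOB error as built-in generalization estimate
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)             # OOB accuracy
print(rf.feature_importances_)   # split-criterion (impurity-based) importance
print(permutation_importance(rf, X, y, n_repeats=10, random_state=0).importances_mean)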
General idea
Sequential ensemble of M base learners by greedy forward stagewise additive modeling
In each iteration a base learner is fitted to current pseudo residuals ⇒ one boosting iteration is
one approximate gradient step in function space
Base learners are typically trees, linear regressions or splines
Predict via (weighted) sum of base learners
Hypothesis space
H = { f(x) : f(x) = Σ_{m=1}^M β^[m] b(x, θ^[m]) }
Left: Boosting prediction function with GAM base learners for a univariate regression problem after 10 iterations. Right: Boosting prediction surface with tree base learners for iris data after 100 iterations (with contour lines of discriminant functions).
Optimization
Same optimization procedure as base learner, while keeping the current ensemble f̂ [m−1] fixed
⇒ Efficient and generally applicable since inner loss is always L2
β^[m] is found via line search or fixed to a small constant value and combined with the leaf values c_t^[m] for tree base learners: c̃_t^[m] = β^[m] · c_t^[m]
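A compact sketch of forward stagewise boosting with L2 loss, where the pseudo-residuals are just ordinary residuals (tree base learners via scikit-learn, fixed small β as described above; the data are hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)

M, beta = 100, 0.1                         # ensemble size and fixed learning rate
f_hat = np.full_like(y, y.mean())          # initial constant model
trees = []
for m in range(M):
    r = y - f_hat                          # pseudo-residuals (negative L2-loss gradient)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    f_hat += beta * tree.predict(X)        # one approximate gradient step in function space
    trees.append(tree)

print(np.mean((y - f_hat) ** 2))           # training MSE after M boosting iterations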
Hyperparameters
Ensemble size, i.e., number of base learners
Complexity of base learners (depending on type used)
Learning rate β , i.e., impact of next base learner
Tuning
Use early-stopping to determine ensemble size
Various regularization parameters, e.g., L1/L2, number of leaves, ... that need to be carefully tuned
Tune learning rate and base learner complexity hyperparameters on log-scale
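A minimal scikit-learn sketch with early stopping used to pick the ensemble size, as suggested above (hyperparameter values are illustrative; the data are hypothetical):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 500)

# Early stopping: hold out 20% as a validation set and stop once the validation
# loss has not improved for 10 consecutive boosting iterations.
gb = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, max_depth=2,
                               validation_fraction=0.2, n_iter_no_change=10, random_state=0)
gb.fit(X, y)
print(gb.n_estimators_)    # ensemble size actually used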
Dual problem SVMs as a constrained optimization (primal) problem (maximize margin s.t. constraints on obs. to limit margin violations) can be formulated as a Lagrangian dual problem with Lagrange multipliers α_i ≥ 0:
max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
s.t. 0 ≤ α_i ≤ C ∀i ∈ {1, . . . , n} and Σ_{i=1}^n α_i y^(i) = 0
Hyperparameters Cost parameter C to control maximization of the margin vs. minimizing margin violations
Left: Hard-margin SVM, margin is maximized by the boundary on the right. Right: Soft-margin SVM, large margin and few margin violations on the right (best trade-off).
Tuning
Tuning of cost parameter C advisable ⇒ strong influence on resulting hyperplane
C is often tuned on a log-scale grid for an optimal and space-filling search space
Implementation
R: mlr3 learners LearnerClassifSVM / LearnerRegrSVM, calling e1071::svm() with linear kernel
(libSVM interface). Further implementations in mlr3extralearners based on
kernlab::ksvm() allowing custom kernels
LiblineaR::LiblineaR() for a fast implementation with linear kernel
Python: SVC / SVR (module sklearn.svm) from package scikit-learn / package libSVM
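A minimal scikit-learn sketch of a linear SVM with C tuned on a log-scale grid, as recommended above (iris data; the grid values are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Log-scale grid for the cost parameter C
grid = {"C": np.logspace(-2, 2, 9)}
search = GridSearchCV(SVC(kernel="linear"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)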
General idea
Move beyond linearity by mapping data to transformed space where they are linearly separable
Kernel trick
No need for explicit construction of feature maps
Replace inner product of feature map φ : X → Φ by kernel: ⟨φ(x), φ(x̃)⟩ = k(x, x̃)
Hypothesis space
H = { f(x) : f(x) = sign(θ^⊤ φ(x) + θ_0) }   (primal)
H = { f(x) : f(x) = sign( Σ_{i=1}^n α_i y^(i) k(x^(i), x) + θ_0 ) | α_i ≥ 0, Σ_{i=1}^n α_i y^(i) = 0 }   (dual)
Left: Nonlinear problem in original space. Right: Mapping to 3D space and subsequent linear separation – implicitly handled by the kernel in nonlinear SVM.
Hyperparameters Cost C of margin violations, kernel hyperparameters (e.g., width of RBF kernel)
Tuning
High sensitivity w.r.t. hyperparameters, especially those of kernel ⇒ tuning very important
For RBF kernels, use RBF sigma heuristic to determine bandwidth
Implementation
R: mlr3 learners LearnerClassifSVM / LearnerRegrSVM, calling e1071::svm() with nonlinear kernel
(libSVM interface), kernlab::ksvm() allowing custom kernels
Python: SVC / SVR (module sklearn.svm) from package scikit-learn / package libSVM
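A minimal scikit-learn sketch of an RBF-kernel SVM with C and the kernel width tuned jointly on log-scale grids (iris data; the grids are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Joint log-scale grids for cost C and RBF kernel width gamma
grid = {"C": np.logspace(-2, 2, 5), "gamma": np.logspace(-3, 1, 5)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)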
General idea
GPs model a distribution over potential functions f that fit the observed data
Assumptions:
The n observations follow an n-dimensional normal distribution
The closer observations are, the more strongly they are correlated
A kernel function k (x(i ) , x(j ) ) quantifies the similarity between two observations and induces the
covariance matrix of the distribution.
Predict via the maximum a-posteriori (MAP) estimate.
Hypothesis space
H = { f = [ f(x^(1)), . . . , f(x^(n)) ] ∼ N(m, K) | m ∈ R^n, K ∈ R^{n×n} }
Left: Functions drawn from a Gaussian process prior. Right: Posterior process after 4 observations.
Noisy GPs
Having an interpolator might not be suitable if the data is noisy
A noisy GP adds a nugget effect to the kernel, k(x^(i), x^(j)) + σδ_ij, creating a Gaussian process regression model
Implementation
R: mlr3 learners LearnerClassifGausspr / LearnerRegrGausspr, calling kernlab::gausspr()
Python: GaussianProcessClassifier / GaussianProcessRegressor from package
scikit-learn, gpytorch for a modular, scalable, efficient and GPU accelerated implementation built on
torch
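A minimal scikit-learn sketch of a noisy GP regression, with a WhiteKernel term playing the role of the nugget effect described above (1-D hypothetical data):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 30)

# RBF kernel for the smooth correlation structure + WhiteKernel as nugget for noisy observations
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

X_new = np.linspace(-2, 2, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)   # posterior mean and pointwise uncertainty
print(mean, std)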
General idea
Learn composite function through series of nonlinear feature transformations, represented as neurons,
organized hierarchically in layers
Basic neuron operation: 1) affine transformation φ (weighted sum of inputs), 2) nonlinear
activation σ
Combinations of simple building blocks to create a complex model
Optimize via mini-batch stochastic gradient descent (SGD) variants:
The gradient of each weight can be inferred from the computational graph of the network
→ Automatic Differentiation (AutoDiff)
The algorithm that computes weight updates based on the loss is called Backpropagation
Hypothesis space
H = { f(x) : f(x) = τ ∘ φ ∘ σ^(h) ∘ φ^(h) ∘ σ^(h−1) ∘ φ^(h−1) ∘ ... ∘ σ^(1) ∘ φ^(1)(x) }
Foundation models
Enormous models trained on vast amounts of (general) data, e.g., all of Wikipedia, in a self-supervised fashion
Used as starting point (pre-trained) and fine-tuned via transfer or few-shot learning for other tasks
requiring little data
Examples: GPT-3 for language, CLIP for vision-language, . . .
Implementation
R: use the Python libraries below via reticulate; not really recommended except for toy applications.
Python libraries:
keras for simple high level API
PyTorch for flexible design with a focus on research
TensorFlow for flexible design with a focus on deployment / industry
huggingface for pre-trained / foundation models
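A minimal PyTorch sketch of the building blocks above, stacked affine transformations and nonlinear activations trained with mini-batch SGD and backpropagation (architecture and data are hypothetical):

import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 4)                    # hypothetical inputs with 4 features
y = (X.sum(dim=1) > 0).long()              # hypothetical binary labels

# Two hidden layers: affine transformation followed by nonlinear activation
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),                      # output layer (logits for 2 classes)
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)      # forward pass through the layers
        loss.backward()                    # backpropagation via AutoDiff
        optimizer.step()                   # mini-batch SGD update

print((model(X).argmax(dim=1) == y).float().mean())   # training accuracy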