Machine Learning 1 (CM50264)
Week 2, Lecture 2: Evaluating ML systems and linear regression
Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines and Xi Chen (for images and other content)
12 Oct. 2022
University of Bath, UK
Evaluating and diagnosing ML systems
How can your machine learning system fail?
• Underfitting
• Overfitting
• Bad data
• ···
How do you detect these failure modes?
Underfitting & overfitting
• Complicated (non-linear) decision boundary
• Classes overlap
• How would you divide them?
• Underfitting (over-simplistic classifier)
• Overfitting (too-complicated classifier)
• Balanced (fits the training data reasonably, generalizes well on test data)
Underfitting & overfitting
• Underfitting: logistic regression
• Balanced: tuned random forest (sklearn, min_impurity_decrease=0.008, n_estimators=512)
• Overfitting: badly tuned random forest (sklearn, with default parameters)
Underfitting and overfitting in regression
You can fit a polynomial that exactly fits the training data, but does not generalize.
On the other extreme, your model could be too weak to fit the training data.
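A minimal sketch of both extremes (the data below are synthetic and the polynomial degrees are illustrative assumptions, not the lecture's example):

```python
# Sketch: a degree-1 fit may underfit, while a very high-degree polynomial can
# interpolate the training points yet generalize poorly. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):                      # weak, reasonable, and exact-fit models
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```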
Underfitting causes
• Weak model
• Bad fitting (e.g., if the optimization was not good enough)
(Left: random forest again)
Overfitting causes
• Powerful model → can model training-specific noise
  +
• Insufficient regularization
  (regularization controls model complexity, so that the model does not fit noise)
• Maybe insufficient training data?
• How to detect?
Train and test set
• Model can’t overfit on data it doesn’t see!
• Split the data:
  • A train set, to fit the model
  • A test set, to verify performance
• A large gap between train/test accuracy indicates overfitting (usually)

Accuracy (random forest):
                Train    Test
  Underfitting  79.2%    79.2%
  Balanced      97.6%    95.0%
  Overfitting   99.6%    94.7%
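A sketch of how a table like this could be produced. The dataset is not specified on the slide, so scikit-learn's breast cancer data is assumed here, and the underfitting configuration (depth-1 trees) is an illustrative guess; the balanced and overfitting settings follow the earlier slide. Exact numbers will differ.

```python
# Sketch: train/test accuracy gap for under-, well-, and over-fitted random forests.
# Dataset and the "Underfitting" settings are assumptions; exact numbers will differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

configs = {
    "Underfitting": RandomForestClassifier(max_depth=1, n_estimators=8, random_state=0),
    "Balanced": RandomForestClassifier(min_impurity_decrease=0.008, n_estimators=512,
                                       random_state=0),
    "Overfitting": RandomForestClassifier(random_state=0),  # default parameters
}
print(f"{'Random Forest':<14}{'Train':>8}{'Test':>8}")
for name, model in configs.items():
    model.fit(X_tr, y_tr)
    print(f"{name:<14}{model.score(X_tr, y_tr):>8.1%}{model.score(X_te, y_te):>8.1%}")
```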
Hyperparameters I
• Parameters → fit to training data
• Hyperparameters → parameters that cannot be fit to training data
• Reasons for having hyperparameters:
• Avoiding overfitting, e.g. decision tree depth
• Controlling the amount of computation, e.g. ensemble size
• Hard to optimize (e.g., which degree of polynomial to fit?)
• Bayesian priors have hyperparameters associated with them
Hyperparameters II
• Can still fit hyperparameters . . .
(manually or by algorithm)
• . . . but not to the test set!
(this mistake can be found in countless research papers)
• Introduce a third set: validation set
• train – Give to algorithm
• validation – Objective of hyperparameter optimization
• test – To report final performance
Measuring performance
• How do we decide on split percentages?
• Train large → Algorithm performs well
• Validation large → Hyperparameter optimization performs well
• Test large → Accurate performance estimate
• Good default: make validation and test as small as possible while maintaining a
reliable estimate; use the rest to train
• . . . but you might still shrink the train set due to computational cost
• “as small as possible” is hard to judge
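A minimal sketch of a three-way split (the 60/20/20 proportions below are an illustrative assumption, not a recommendation from the slides):

```python
# Sketch: carve a dataset into train / validation / test sets.
# The 60/20/20 proportions are an illustrative assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# First split off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=0)  # 0.25 * 0.8 = 0.2
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```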
k-fold cross validation
• Divide the train/validation data into k folds (here k = 7):
  • train: six parts
  • validation: one part
• Train for all 7 combinations and report the average performance on the test set
• Effectively, all samples in the train+validation dataset get used for both training and
  validation: artificially increases the train and validation data size
• More robust estimate of model accuracy
• k-fold = k× slower! Typically 4 ≤ k ≤ 10
• Most extreme: jackknife resampling (leave-one-out cross-validation): validation sets
  of size 1; very slow
• In practice mostly not done: time = money
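A minimal sketch with scikit-learn's cross_val_score, assuming the breast cancer data and a random forest as before; cv=7 mirrors the 7 folds above:

```python
# Sketch: 7-fold cross-validation; each fold takes a turn as the validation part.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=7)      # 7 train/validation combinations
print(scores, scores.mean())                     # the average is the robust estimate
```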
Measure of performance: confusion matrix
• Makes sense for classification only
• Random forest on breast cancer:

                      Actual
                      False   True
  Predicted  False       49      6
             True        14    159

• On the diagonal means correct, off the diagonal means wrong
• Can see which classes are confused
• An empty row is a problem
• May want to color-code cells as a heat map!
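A sketch of how such a matrix can be computed (the split and forest settings are assumptions, so the counts will not match the slide exactly):

```python
# Sketch: confusion matrix for a random forest on the breast cancer data.
# Split and hyperparameters are assumptions; counts will differ from the slide.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Rows = actual class, columns = predicted class (scikit-learn's convention,
# which is transposed relative to the table on the slide).
print(confusion_matrix(y_te, model.predict(X_te)))
```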
Some terminology

                      Actual
                      False                   True
  Predicted  False    True Negative (TN)      False Negative (FN)
             True     False Positive (FP)     True Positive (TP)
Some more terminology
Loads of terms are used (you can ignore most of them):
• TP / (TP + FN): sensitivity, recall, hit rate, true positive rate
• TN / (TN + FP): specificity, true negative rate
• TP / (TP + FP): precision, positive predictive value
• (TP + TN) / (TP + TN + FP + FN): accuracy
• 2·TP / (2·TP + FP + FN): F1 score
(many more. . . )
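Plugging in the counts from the confusion matrix two slides back (TN = 49, FN = 6, FP = 14, TP = 159):

```python
# Compute the common metrics from the confusion-matrix counts on the earlier slide.
TN, FN, FP, TP = 49, 6, 14, 159

sensitivity = TP / (TP + FN)                    # recall, hit rate, true positive rate
specificity = TN / (TN + FP)                    # true negative rate
precision   = TP / (TP + FP)                    # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f} F1={f1:.3f}")
```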
Imbalanced data
• Imbalanced training set → makes training difficult
• Cancer detection in CT scans: ≈ 0.1% of scans have cancer
• 99.9% accuracy by predicting “no cancer” – meaningless!
• Often need to adjust training (e.g. oversampling)
• F1 score is better
• Balanced accuracy:
    (1/|C|) Σ_{c∈C} |{y_i = c ∧ f_θ(x_i) = c}| / |{y_i = c}|
  • C = set of classes, of size |C|
  • (y_i, x_i) = data points
  • f_θ(·) = your classifier model
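A small sketch of the balanced-accuracy formula on made-up toy labels; scikit-learn's balanced_accuracy_score computes the same quantity:

```python
# Sketch: balanced accuracy = mean over classes of per-class recall.
# The label arrays below are made-up toy data.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced: 8 vs 2
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

classes = np.unique(y_true)
per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in classes]
print(np.mean(per_class_recall))                      # (1/|C|) sum over classes
print(balanced_accuracy_score(y_true, y_pred))        # same value
```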
Linear regression
Linear regression – 1
• Training data: (x_i, y_i), i = 1, . . . , n, with x_i ∈ R^d, y_i ∈ R
• Objective: fit a function f(x) = Σ_{j=1}^{d} w_j x^(j) + b = w⊤x + b, where w ∈ R^d, b ∈ R
• We use the same trick that we used in Lecture 1:
    f(x) = w̄⊤x̄, where w̄ = (b, w_1, . . . , w_d)⊤ and x̄ = (1, x^(1), . . . , x^(d))⊤ ∈ R^(d+1)
• We will, however, write f(x) = w⊤x and understand x ∈ R^(d+1) as the augmented
  feature vector (with its first element being 1, and w encompassing both the weight
  vector and the bias term)
• Empirical risk minimization (ERM):  min_w (1/2) Σ_{i=1}^{n} (w⊤x_i − y_i)²
Linear regression – 2
• A more compact formulation of ERM:
• Let X ∈ R^(n×(d+1)) be the matrix whose i-th row is x_i⊤, let y ∈ R^n collect the targets,
  and let w ∈ R^(d+1) collect the weights (with w_0 = b):

    X = [x_1  x_2  · · ·  x_n]⊤ ∈ R^(n×(d+1)),   y = (y_1, y_2, . . . , y_n)⊤ ∈ R^n,
    w = (w_0, w_1, . . . , w_d)⊤ ∈ R^(d+1)

• ERM can be expressed using the following equivalent matrix-vector notation:

    min_w (1/2) ∥Xw − y∥₂²,  where ∥z∥₂² = Σ_i z_i²
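A numpy sketch of this matrix form, with made-up data: prepend a column of ones for the bias and evaluate J(w) = (1/2)∥Xw − y∥₂² both ways to confirm they agree.

```python
# Sketch: augmented design matrix and the ERM objective in matrix-vector form.
# The data are random; only the construction matters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d + 1)

X = np.hstack([np.ones((n, 1)), X_raw])              # first column of ones = bias term

J_sum = 0.5 * np.sum((X @ w - y) ** 2)               # (1/2) sum_i (w^T x_i - y_i)^2
J_mat = 0.5 * np.linalg.norm(X @ w - y) ** 2         # (1/2) ||Xw - y||_2^2
print(np.isclose(J_sum, J_mat))                      # True: same objective
```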
Linear regression: direct solution – 1
Let J(w) = (1/2) Σ_{i=1}^{n} (w⊤x_i − y_i)² = (1/2) ∥Xw − y∥₂²
Observations:
• For t = 0, 1, 2, · · · , d:
    ∂J(w)/∂w_t = Σ_{i=1}^{n} (w⊤x_i − y_i) ∂(w⊤x_i)/∂w_t = Σ_{i=1}^{n} (w⊤x_i − y_i) x_{it}
  i.e., ∂J(w)/∂w_t = [X⊤(Xw − y)]_t
• ∇J(w) = (∂J(w)/∂w_0, ∂J(w)/∂w_1, . . . , ∂J(w)/∂w_d)⊤ = X⊤(Xw − y) ∈ R^(d+1): called the
  gradient vector
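The expression ∇J(w) = X⊤(Xw − y) can be sanity-checked numerically against finite differences (random data assumed):

```python
# Sketch: verify the gradient formula X^T (Xw - y) by finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 4
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = rng.standard_normal(n)
w = rng.standard_normal(d + 1)

J = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad_analytic = X.T @ (X @ w - y)

eps = 1e-6
grad_numeric = np.array([
    (J(w + eps * e) - J(w - eps * e)) / (2 * eps)    # central difference along axis t
    for e in np.eye(d + 1)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be ~1e-6 or smaller
```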
Linear regression: direct solution – 2
For s, t = 0, 1, 2, · · · , d:
    ∂²J(w)/(∂w_s ∂w_t) = ∂/∂w_s [ Σ_{i=1}^{n} (w⊤x_i − y_i) x_{it} ] = Σ_{i=1}^{n} x_{it} x_{is} = [X⊤X]_{t,s}
• ∇²J(w) = X⊤X: called the Hessian matrix ((d + 1) × (d + 1))
Exercise: show that the Hessian matrix is positive semi-definite (PSD), i.e., for any
u ∈ R^(d+1), the quadratic form u⊤(∇²J(w))u ≥ 0
• A function that is twice differentiable and has a PSD Hessian is called a convex
  function.
• For a convex J(w), any solution to ∇J(w) = 0 minimizes the function.
Linear regression: direct solution – 3
Therefore, to find the optimal w (which we will denote by w*), we need to solve
    ∇J(w)|_{w=w*} = X⊤(Xw* − y) = 0.
• X⊤(Xw* − y) = 0: called the normal equation
    ⟹ (X⊤X) w* = X⊤y
This can be solved directly if X⊤X is invertible, and the solution is given by
    w* = (X⊤X)⁻¹ X⊤y = X†y,
where X† = (X⊤X)⁻¹ X⊤ is called the pseudo-inverse of X.
• For the normal matrix X⊤X ∈ R^((d+1)×(d+1)) to be invertible, we need n ≥ d + 1
• This condition is necessary for the existence of the inverse, not sufficient
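A sketch of the direct solution on synthetic data: solve the normal equation and compare with numpy's least-squares routine, which handles the same problem more robustly.

```python
# Sketch: solve the normal equation (X^T X) w = X^T y directly on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)          # assumes X^T X is invertible
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # SVD-based, more robust
print(w_normal)
print(np.allclose(w_normal, w_lstsq))                 # True
```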
Why ‘normal’ equation?
• Consider the range space of X, given by {v ∈ R^n : v = Xw}
• The algorithm seeks to project y onto this space: for any v = Xw in the range space,
    v⊤(Xw* − y) = (Xw)⊤(Xw* − y) = w⊤ X⊤(Xw* − y) = 0,  since X⊤(Xw* − y) = 0
• That is, the residual error e = Xw* − y is orthogonal (normal) to the range space of X,
  hence the name!
The direct solution may not always be available
• X⊤X might not always be invertible (possible reasons include n < d + 1 and/or
linearly dependent feature vectors)
• A direct solution may not even exist!
• Matrix inversion (or computing the pseudo-inverse) is generally an expensive
operation. Could be infeasible for large d and/or n
Can we approximate the solution iteratively?
Iterative solution
A very simple iterative algorithm is gradient descent (GD):
    w(k+1) = w(k) − η_k ∇J(w(k)) = w(k) − η_k X⊤(Xw(k) − y)
• For η_k small enough, GD converges to a global minimum for convex J (and to a
  stationary point for non-convex J), subject to some additional technical requirements
• You will learn more about GD in Xi’s lectures on optimization
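A minimal gradient-descent sketch for the same least-squares objective (synthetic data; the constant step size below is simply chosen small enough to converge here, not tuned):

```python
# Sketch: batch gradient descent w(k+1) = w(k) - eta * X^T (X w(k) - y).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d + 1)
eta = 1e-3                                            # small constant step size
for k in range(5000):
    w = w - eta * X.T @ (X @ w - y)

w_direct = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_direct, atol=1e-4))            # GD approaches the direct solution
```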
The intuition behind gradient-descent
[Figure: how GD works; image source: MathWorks]
The least mean square (LMS) algorithm
• Applies gradient descent, but only on one randomly chosen sample instead of the
  whole dataset

LMS algorithm (also known as the Widrow-Hoff algorithm, or simply stochastic gradient descent):
• init k ← 0, w ← w(0)
• while not converged:
    – sample a data point (x_k, y_k) at random
    – weight update: w ← w − η_k (x_k⊤w − y_k) x_k
• return w

Result: the LMS algorithm recovers a solution of the normal equation if the step-sizes
are chosen appropriately.
Advantage: the computational complexity of every update step is n times smaller than
that of the batch version.
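A sketch of the LMS update with a decaying step size (data are synthetic; the 1/(100 + k) schedule is one simple choice of "appropriate" step sizes, not the lecture's prescription):

```python
# Sketch: LMS / stochastic gradient descent, one randomly chosen sample per update.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d + 1)
for k in range(20000):
    i = rng.integers(n)                               # sample one data point at random
    eta_k = 1.0 / (100 + k)                           # decaying step size (one simple choice)
    w = w - eta_k * (X[i] @ w - y[i]) * X[i]          # w <- w - eta_k (x_i^T w - y_i) x_i

print(w)                                              # close to the normal-equation solution
```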
Generalized linear regression (GLR)
• Let ϕ : R^d → R^D be a feature transformation, where D > d
• Let our model be f(x) = Σ_{j=1}^{D} w_j ϕ^(j)(x) + b = w⊤ϕ(x) + b
• Using the augmentation trick, we will write f(x) = w⊤ϕ(x), where w, ϕ(x) ∈ R^(D+1)

    Φ = [ϕ(x_1)  ϕ(x_2)  · · ·  ϕ(x_n)]⊤ ∈ R^(n×(D+1)),   y = (y_1, y_2, . . . , y_n)⊤ ∈ R^n,
    w = (w_0, w_1, . . . , w_D)⊤ ∈ R^(D+1), with w_0 = b

• ERM:  min_w (1/2) ∥Φw − y∥₂²  ⟹  w* = (Φ⊤Φ)⁻¹ Φ⊤y
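A sketch of GLR with a degree-2 polynomial feature map on synthetic 1D data; scikit-learn's PolynomialFeatures builds Φ (including the constant column), and least squares then gives w*:

```python
# Sketch: generalized linear regression with a degree-2 polynomial feature map.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 1.0 - 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2 + 0.1 * rng.standard_normal(100)

Phi = PolynomialFeatures(degree=2).fit_transform(x)   # columns: 1, x, x^2
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # solves min_w ||Phi w - y||^2
print(w_star)                                         # approximately [1, -2, 3]
```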
GLR can learn powerful models – 1
• Consider binary classification in 2D, x = (x^(1), x^(2))⊤
• A simple linear model f(x) = w⊤x will only allow you to learn lines for fitting the data
• Using GLR, you can fit a polynomial, for instance (and more complicated functions!):

    ϕ : x ↦ ϕ(x) = (1, x^(1), x^(2), x^(1)·x^(2), (x^(1))², (x^(2))²)⊤
GLR can learn powerful models – 2
• Interestingly, the feature map ϕ can even transform the features into an
  infinite-dimensional space (without ever computing ϕ explicitly)
• This leads to the so-called kernel linear regression algorithm (more about it later in
  the course)
• The idea is to have a kernel function K such that K(x_1, x_2) = ϕ(x_1)⊤ϕ(x_2), for
  any x_1 and x_2
• K must be efficiently computable (bypassing the explicit inner product in a very
  high-dimensional space)
• More about kernels in the context of SVMs
• Even state-of-the-art deep neural networks can be approximated using kernel
  linear regression (neural tangent kernels)
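As a concrete toy instance of the kernel identity (an assumed example, not from the slides): the degree-2 polynomial kernel K(x, z) = (1 + x⊤z)² in 2D equals the inner product of explicit feature maps; note the √2 factors, a slight variant of the feature map on the previous slide.

```python
# Sketch: verify K(x, z) = phi(x)^T phi(z) for the degree-2 polynomial kernel in 2D.
# This feature map (with sqrt(2) factors) is a variant of the one on the previous slide.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

def K(x, z):
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(K(x, z), phi(x) @ phi(z)))           # True: kernel = feature inner product
```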
Easy extension to vector-valued targets
• Target variable y ∈ R^m, m > 1: fit m linear models, one per output dimension

    min_W (1/2) ∥XW − Y∥₂²  ⟹  W* = (X⊤X)⁻¹ X⊤Y

    X = [x_1  x_2  · · ·  x_n]⊤ ∈ R^(n×(d+1)),   Y = [y_1  y_2  · · ·  y_n]⊤ ∈ R^(n×m),
    W = [w^(1)  w^(2)  · · ·  w^(m)] ∈ R^((d+1)×m), where column w^(j) = (w_0^(j), w_1^(j), . . . , w_d^(j))⊤
    holds the weights for output dimension j
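A numpy sketch for m = 2 target dimensions (synthetic data): stacking the targets as columns of Y gives all m weight vectors in a single least-squares solve.

```python
# Sketch: linear regression with a vector-valued target (m = 2), solved in one go.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 3, 2
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])   # n x (d+1)
W_true = rng.standard_normal((d + 1, m))                        # (d+1) x m
Y = X @ W_true + 0.05 * rng.standard_normal((n, m))             # n x m

W_star, *_ = np.linalg.lstsq(X, Y, rcond=None)                  # same as (X^T X)^{-1} X^T Y
print(np.allclose(W_star, W_true, atol=0.05))                   # close to the true weights
```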