Machine Learning 1 (CM50264)
Week 2, Lecture 2: Evaluating ML systems and linear regression
Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines and Xi Chen (for images and other content)
12 Oct. 2022
University of Bath, UK
Evaluating and diagnosing ML systems
How can your machine learning system fail?
• Underfitting
• Overfitting
• Bad data
• ···
How do you detect these failure modes?
Underfitting & overfitting
• Complicated (non-linear) decision boundary
• Classes overlap
• How would you divide them?
• Underfitting (over-simplistic classifier)
• Overfitting (too-complicated classifier)
• Balanced (fits the training data reasonably, generalizes well on test data)
Underfitting & overfitting
• Underfitting: logistic regression
• Balanced: tuned random forest (sklearn, min_impurity_decrease=0.008, n_estimators=512)
• Overfitting: badly tuned random forest (sklearn, with default parameters)
Underfitting and overfitting in regression
You can fit a polynomial that exactly fits the training data, but does not generalize.
On the other extreme, your model could be too weak to fit the training data.
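A minimal sketch of both extremes (the data below are synthetic and the polynomial degrees are illustrative assumptions, not the lecture's example):

```python
# Sketch: a degree-1 fit may underfit, while a very high-degree polynomial can
# interpolate the training points yet generalize poorly. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):                      # weak, reasonable, and exact-fit models
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```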
Underfitting causes
• Weak model
• Bad fitting (e.g., if the optimization was not good enough)
(Left: random forest again)
Overfitting causes
• Powerful model → can model training-specific noise
  +
• Insufficient regularization
  (regularization controls model complexity, so that the model does not fit noise)
• Maybe insufficient training data?
• How to detect?
Train and test set
• Model can’t overfit on data it doesn’t see!
• Split the data:
  • A train set, to fit the model
  • A test set, to verify performance
• A large gap between train/test accuracy indicates overfitting (usually)

Accuracy (random forest):
                Train    Test
  Underfitting  79.2%    79.2%
  Balanced      97.6%    95.0%
  Overfitting   99.6%    94.7%
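A sketch of how a table like this could be produced. The dataset is not specified on the slide, so scikit-learn's breast cancer data is assumed here, and the underfitting configuration (depth-1 trees) is an illustrative guess; the balanced and overfitting settings follow the earlier slide. Exact numbers will differ.

```python
# Sketch: train/test accuracy gap for under-, well-, and over-fitted random forests.
# Dataset and the "Underfitting" settings are assumptions; exact numbers will differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

configs = {
    "Underfitting": RandomForestClassifier(max_depth=1, n_estimators=8, random_state=0),
    "Balanced": RandomForestClassifier(min_impurity_decrease=0.008, n_estimators=512,
                                       random_state=0),
    "Overfitting": RandomForestClassifier(random_state=0),  # default parameters
}
print(f"{'Random Forest':<14}{'Train':>8}{'Test':>8}")
for name, model in configs.items():
    model.fit(X_tr, y_tr)
    print(f"{name:<14}{model.score(X_tr, y_tr):>8.1%}{model.score(X_te, y_te):>8.1%}")
```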
Hyperparameters I
• Parameters → fit to training data
• Hyperparameters → parameters that cannot be fit to training data
• Reasons for having hyperparameters:
• Avoiding overfitting, e.g. decision tree depth
• Controlling the amount of computation, e.g. ensemble size
• Hard to optimize (e.g., which degree of polynomial to fit?)
• Bayesian priors have hyperparameters associated with them
Hyperparameters II
• Can still fit hyperparameters . . .
(manually or by algorithm)
• . . . but not to the test set!
(this mistake can be found in countless research papers)
• Introduce a third set: validation set
• train – Give to algorithm
• validation – Objective of hyperparameter optimization
• test – To report final performance
Measuring performance
• How do we decide on split percentages?
• Train large → Algorithm performs well
• Validation large → Hyperparameter optimization performs well
• Test large → Accurate performance estimate
• Good default: make validation and test as small as possible while maintaining a
reliable estimate; use the rest to train
• . . . but you might still shrink the train set due to computational cost
• “as small as possible” is hard to judge
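A minimal sketch of a three-way split (the 60/20/20 proportions below are an illustrative assumption, not a recommendation from the slides):

```python
# Sketch: carve a dataset into train / validation / test sets.
# The 60/20/20 proportions are an illustrative assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# First split off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=0)  # 0.25 * 0.8 = 0.2
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```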
k-fold cross validation
• Divide the train/validation data into k folds (here k = 7):
  • train: six parts
  • validation: one part
• Train for all 7 combinations and report the average performance on the test set
• Effectively, all samples in the train+validation dataset get used for both training and
  validation: artificially increases the train and validation data size
• More robust estimate of model accuracy
• k-fold = k× slower! Typically 4 ≤ k ≤ 10
• Most extreme: jackknife resampling (leave-one-out cross-validation): validation sets
  of size 1; very slow
• In practice mostly not done: time = money
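A minimal sketch with scikit-learn's cross_val_score, assuming the breast cancer data and a random forest as before; cv=7 mirrors the 7 folds above:

```python
# Sketch: 7-fold cross-validation; each fold takes a turn as the validation part.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=7)      # 7 train/validation combinations
print(scores, scores.mean())                     # the average is the robust estimate
```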
Measure of performance: confusion matrix
• Makes sense for classification only
• Random forest on breast cancer:

                      Actual
                      False   True
  Predicted  False       49      6
             True        14    159

• On the diagonal means correct, off the diagonal means wrong
• Can see which classes are confused
• An empty row is a problem
• May want to color-code cells as a heat map!
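A sketch of how such a matrix can be computed (the split and forest settings are assumptions, so the counts will not match the slide exactly):

```python
# Sketch: confusion matrix for a random forest on the breast cancer data.
# Split and hyperparameters are assumptions; counts will differ from the slide.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Rows = actual class, columns = predicted class (scikit-learn's convention,
# which is transposed relative to the table on the slide).
print(confusion_matrix(y_te, model.predict(X_te)))
```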
Some terminology

                      Actual
                      False                   True
  Predicted  False    True Negative (TN)      False Negative (FN)
             True     False Positive (FP)     True Positive (TP)
Some more terminology
Loads of terms are used (you can ignore most of them):
• TP / (TP + FN): sensitivity, recall, hit rate, true positive rate
• TN / (TN + FP): specificity, true negative rate
• TP / (TP + FP): precision, positive predictive value
• (TP + TN) / (TP + TN + FP + FN): accuracy
• 2·TP / (2·TP + FP + FN): F1 score
(many more. . . )
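Plugging in the counts from the confusion matrix two slides back (TN = 49, FN = 6, FP = 14, TP = 159):

```python
# Compute the common metrics from the confusion-matrix counts on the earlier slide.
TN, FN, FP, TP = 49, 6, 14, 159

sensitivity = TP / (TP + FN)                    # recall, hit rate, true positive rate
specificity = TN / (TN + FP)                    # true negative rate
precision   = TP / (TP + FP)                    # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f} F1={f1:.3f}")
```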
Imbalanced data
• Imbalanced training set → makes training difficult
• Cancer detection in CT scans: ≈ 0.1% of scans have cancer
• 99.9% accuracy by predicting “no cancer” – meaningless!
• Often need to adjust training (e.g. oversampling)
• F1 score is better
• Balanced accuracy:
    (1/|C|) Σ_{c∈C} |{y_i = c ∧ f_θ(x_i) = c}| / |{y_i = c}|
  • C = set of classes, of size |C|
  • (y_i, x_i) = data points
  • f_θ(·) = your classifier model
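A small sketch of the balanced-accuracy formula on made-up toy labels; scikit-learn's balanced_accuracy_score computes the same quantity:

```python
# Sketch: balanced accuracy = mean over classes of per-class recall.
# The label arrays below are made-up toy data.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced: 8 vs 2
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

classes = np.unique(y_true)
per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in classes]
print(np.mean(per_class_recall))                      # (1/|C|) sum over classes
print(balanced_accuracy_score(y_true, y_pred))        # same value
```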
Linear regression
Linear regression – 1
• Training data: (x_i, y_i), i = 1, . . . , n, with x_i ∈ R^d, y_i ∈ R
• Objective: fit a function f(x) = Σ_{j=1}^{d} w_j x^(j) + b = w⊤x + b, where w ∈ R^d, b ∈ R
• We use the same trick that we used in Lecture 1:
    f(x) = w̄⊤x̄, where w̄ = (b, w_1, . . . , w_d)⊤ and x̄ = (1, x^(1), . . . , x^(d))⊤ ∈ R^(d+1)
• We will, however, write f(x) = w⊤x and understand x ∈ R^(d+1) as the augmented
  feature vector (with its first element being 1, and w encompassing both the weight
  vector and the bias term)
• Empirical risk minimization (ERM):  min_w (1/2) Σ_{i=1}^{n} (w⊤x_i − y_i)²
Linear regression – 2
• A more compact formulation of ERM:
• Let X ∈ R^(n×(d+1)) be the matrix whose i-th row is x_i⊤, let y ∈ R^n collect the targets,
  and let w ∈ R^(d+1) collect the weights (with w_0 = b):

    X = [x_1  x_2  · · ·  x_n]⊤ ∈ R^(n×(d+1)),   y = (y_1, y_2, . . . , y_n)⊤ ∈ R^n,
    w = (w_0, w_1, . . . , w_d)⊤ ∈ R^(d+1)

• ERM can be expressed using the following equivalent matrix-vector notation:

    min_w (1/2) ∥Xw − y∥₂²,  where ∥z∥₂² = Σ_i z_i²
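A numpy sketch of this matrix form, with made-up data: prepend a column of ones for the bias and evaluate J(w) = (1/2)∥Xw − y∥₂² both ways to confirm they agree.

```python
# Sketch: augmented design matrix and the ERM objective in matrix-vector form.
# The data are random; only the construction matters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d + 1)

X = np.hstack([np.ones((n, 1)), X_raw])              # first column of ones = bias term

J_sum = 0.5 * np.sum((X @ w - y) ** 2)               # (1/2) sum_i (w^T x_i - y_i)^2
J_mat = 0.5 * np.linalg.norm(X @ w - y) ** 2         # (1/2) ||Xw - y||_2^2
print(np.isclose(J_sum, J_mat))                      # True: same objective
```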
Linear regression: direct solution – 1
Let J(w) = (1/2) Σ_{i=1}^{n} (w⊤x_i − y_i)² = (1/2) ∥Xw − y∥₂²
Observations:
• For t = 0, 1, 2, · · · , d:
    ∂J(w)/∂w_t = Σ_{i=1}^{n} (w⊤x_i − y_i) ∂(w⊤x_i)/∂w_t = Σ_{i=1}^{n} (w⊤x_i − y_i) x_{it}
  i.e., ∂J(w)/∂w_t = [X⊤(Xw − y)]_t
• ∇J(w) = (∂J(w)/∂w_0, ∂J(w)/∂w_1, . . . , ∂J(w)/∂w_d)⊤ = X⊤(Xw − y) ∈ R^(d+1): called the
  gradient vector
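The expression ∇J(w) = X⊤(Xw − y) can be sanity-checked numerically against finite differences (random data assumed):

```python
# Sketch: verify the gradient formula X^T (Xw - y) by finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 4
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = rng.standard_normal(n)
w = rng.standard_normal(d + 1)

J = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad_analytic = X.T @ (X @ w - y)

eps = 1e-6
grad_numeric = np.array([
    (J(w + eps * e) - J(w - eps * e)) / (2 * eps)    # central difference along axis t
    for e in np.eye(d + 1)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be ~1e-6 or smaller
```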
Linear regression: direct solution – 2
For s, t = 0, 1, 2, · · · , d:
    ∂²J(w)/(∂w_s ∂w_t) = ∂/∂w_s [ Σ_{i=1}^{n} (w⊤x_i − y_i) x_{it} ] = Σ_{i=1}^{n} x_{it} x_{is} = [X⊤X]_{t,s}
• ∇²J(w) = X⊤X: called the Hessian matrix ((d + 1) × (d + 1))
Exercise: show that the Hessian matrix is positive semi-definite (PSD), i.e., for any
u ∈ R^(d+1), the quadratic form u⊤(∇²J(w))u ≥ 0
• A function that is twice differentiable and has a PSD Hessian is called a convex
  function.
• For a convex J(w), any solution to ∇J(w) = 0 minimizes the function.
Linear regression: direct solution – 3
Therefore, to find the optimal w (which we will denote by w*), we need to solve
    ∇J(w)|_{w=w*} = X⊤(Xw* − y) = 0.
• X⊤(Xw* − y) = 0: called the normal equation
    ⟹ (X⊤X) w* = X⊤y
This can be solved directly if X⊤X is invertible, and the solution is given by
    w* = (X⊤X)⁻¹ X⊤y = X†y,
where X† = (X⊤X)⁻¹ X⊤ is called the pseudo-inverse of X.
• For the normal matrix X⊤X ∈ R^((d+1)×(d+1)) to be invertible, we need n ≥ d + 1
• This condition is necessary for the existence of the inverse, not sufficient
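A sketch of the direct solution on synthetic data: solve the normal equation and compare with numpy's least-squares routine, which handles the same problem more robustly.

```python
# Sketch: solve the normal equation (X^T X) w = X^T y directly on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)          # assumes X^T X is invertible
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # SVD-based, more robust
print(w_normal)
print(np.allclose(w_normal, w_lstsq))                 # True
```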
Why ‘normal’ equation?
• Consider the range space of X, given by {v ∈ R^n : v = Xw}
• The algorithm seeks to project y onto this space: for any v = Xw in the range space,
    v⊤(Xw* − y) = (Xw)⊤(Xw* − y) = w⊤ X⊤(Xw* − y) = 0,  since X⊤(Xw* − y) = 0
• That is, the residual error e = Xw* − y is orthogonal (normal) to the range space of X,
  hence the name!
The direct solution may not always be available
• X⊤X might not always be invertible (possible reasons include n < d + 1 and/or
linearly dependent feature vectors)
• A direct solution may not even exist!
• Matrix inversion (or computing the pseudo-inverse) is generally an expensive
operation. Could be infeasible for large d and/or n
Can we approximate the solution iteratively?
Iterative solution
A very simple iterative algorithm is gradient descent (GD):
    w(k+1) = w(k) − η_k ∇J(w(k)) = w(k) − η_k X⊤(Xw(k) − y)
• For η_k small enough, GD converges to a global minimum for convex J (and to a
  stationary point for non-convex J), subject to some additional technical requirements
• You will learn more about GD in Xi’s lectures on optimization
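A minimal gradient-descent sketch for the same least-squares objective (synthetic data; the constant step size below is simply chosen small enough to converge here, not tuned):

```python
# Sketch: batch gradient descent w(k+1) = w(k) - eta * X^T (X w(k) - y).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d + 1)
eta = 1e-3                                            # small constant step size
for k in range(5000):
    w = w - eta * X.T @ (X @ w - y)

w_direct = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_direct, atol=1e-4))            # GD approaches the direct solution
```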
The intuition behind gradient-descent
[Figure: how GD works; image source: MathWorks]
The least mean square (LMS) algorithm
• Applies gradient descent, but only on one randomly chosen sample instead of the
  whole dataset

LMS algorithm (also known as the Widrow-Hoff algorithm, or simply stochastic gradient descent):
• init k ← 0, w ← w(0)
• while not converged:
    – sample a data point (x_k, y_k) at random
    – weight update: w ← w − η_k (x_k⊤w − y_k) x_k
• return w

Result: the LMS algorithm recovers a solution of the normal equation if the step-sizes
are chosen appropriately.
Advantage: the computational complexity of every update step is n times smaller than
that of the batch version.
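A sketch of the LMS update with a decaying step size (data are synthetic; the 1/(100 + k) schedule is one simple choice of "appropriate" step sizes, not the lecture's prescription):

```python
# Sketch: LMS / stochastic gradient descent, one randomly chosen sample per update.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d + 1)
for k in range(20000):
    i = rng.integers(n)                               # sample one data point at random
    eta_k = 1.0 / (100 + k)                           # decaying step size (one simple choice)
    w = w - eta_k * (X[i] @ w - y[i]) * X[i]          # w <- w - eta_k (x_i^T w - y_i) x_i

print(w)                                              # close to the normal-equation solution
```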
Generalized linear regression (GLR)
• Let ϕ : R^d → R^D be a feature transformation, where D > d
• Let our model be f(x) = Σ_{j=1}^{D} w_j ϕ^(j)(x) + b = w⊤ϕ(x) + b
• Using the augmentation trick, we will write f(x) = w⊤ϕ(x), where w, ϕ(x) ∈ R^(D+1)

    Φ = [ϕ(x_1)  ϕ(x_2)  · · ·  ϕ(x_n)]⊤ ∈ R^(n×(D+1)),   y = (y_1, y_2, . . . , y_n)⊤ ∈ R^n,
    w = (w_0, w_1, . . . , w_D)⊤ ∈ R^(D+1), with w_0 = b

• ERM:  min_w (1/2) ∥Φw − y∥₂²  ⟹  w* = (Φ⊤Φ)⁻¹ Φ⊤y
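A sketch of GLR with a degree-2 polynomial feature map on synthetic 1D data; scikit-learn's PolynomialFeatures builds Φ (including the constant column), and least squares then gives w*:

```python
# Sketch: generalized linear regression with a degree-2 polynomial feature map.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 1.0 - 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2 + 0.1 * rng.standard_normal(100)

Phi = PolynomialFeatures(degree=2).fit_transform(x)   # columns: 1, x, x^2
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # solves min_w ||Phi w - y||^2
print(w_star)                                         # approximately [1, -2, 3]
```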
GLR can learn powerful models – 1
• Consider binary classification in 2D, x = (x^(1), x^(2))⊤
• A simple linear model f(x) = w⊤x will only allow you to learn lines for fitting the data
• Using GLR, you can fit a polynomial, for instance (and more complicated functions!):

    ϕ : x ↦ ϕ(x) = (1, x^(1), x^(2), x^(1)·x^(2), (x^(1))², (x^(2))²)⊤
GLR can learn powerful models – 2
• Interestingly, the feature map ϕ can even transform the features into an
  infinite-dimensional space (without ever computing ϕ explicitly)
• This leads to the so-called kernel linear regression algorithm (more about it later in
  the course)
• The idea is to have a kernel function K such that K(x_1, x_2) = ϕ(x_1)⊤ϕ(x_2), for
  any x_1 and x_2
• K must be efficiently computable (bypassing the explicit inner product in a very
  high-dimensional space)
• More about kernels in the context of SVMs
• Even state-of-the-art deep neural networks can be approximated using kernel
  linear regression (neural tangent kernels)
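As a concrete toy instance of the kernel identity (an assumed example, not from the slides): the degree-2 polynomial kernel K(x, z) = (1 + x⊤z)² in 2D equals the inner product of explicit feature maps; note the √2 factors, a slight variant of the feature map on the previous slide.

```python
# Sketch: verify K(x, z) = phi(x)^T phi(z) for the degree-2 polynomial kernel in 2D.
# This feature map (with sqrt(2) factors) is a variant of the one on the previous slide.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

def K(x, z):
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(K(x, z), phi(x) @ phi(z)))           # True: kernel = feature inner product
```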
Easy extension to vector-valued targets
• Target variable y ∈ R^m, m > 1: fit m linear models, one per output dimension

    min_W (1/2) ∥XW − Y∥₂²  ⟹  W* = (X⊤X)⁻¹ X⊤Y

    X = [x_1  x_2  · · ·  x_n]⊤ ∈ R^(n×(d+1)),   Y = [y_1  y_2  · · ·  y_n]⊤ ∈ R^(n×m),
    W = [w^(1)  w^(2)  · · ·  w^(m)] ∈ R^((d+1)×m), where column w^(j) = (w_0^(j), w_1^(j), . . . , w_d^(j))⊤
    holds the weights for output dimension j
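A numpy sketch for m = 2 target dimensions (synthetic data): stacking the targets as columns of Y gives all m weight vectors in a single least-squares solve.

```python
# Sketch: linear regression with a vector-valued target (m = 2), solved in one go.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 3, 2
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])   # n x (d+1)
W_true = rng.standard_normal((d + 1, m))                        # (d+1) x m
Y = X @ W_true + 0.05 * rng.standard_normal((n, m))             # n x m

W_star, *_ = np.linalg.lstsq(X, Y, rcond=None)                  # same as (X^T X)^{-1} X^T Y
print(np.allclose(W_star, W_true, atol=0.05))                   # close to the true weights
```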