ECE553/653 Neural Networks
Linear Regression
Previously …
• Artificial intelligence (AI) vs. machine learning (ML) vs. deep learning (DL)
(Figure: nested circles showing DL within ML within AI)
Artificial Intelligence
• AI sorts the fruits using decision logic within a rule-based engine
Machine Learning
• ML sorts the fruits using manually defined feature extraction and reports a degree of confidence
Deep Learning
• A DL-based method removes the need to manually define what each fruit looks like
Machine Learning vs. Deep Learning
• Data: ML performs well on small-to-medium datasets; DL performs well on large datasets
• Hardware: ML is able to run on a CPU; DL requires significant computing power (e.g., a GPU)
• Features: ML requires features to be manually identified; DL learns features automatically
• Training time: ML is quick to train; DL is computationally intensive
What Will We Cover in This Course?
• Supervised learning
– Linear regression
– Logistic regression
– Model ensembles
– K-nearest neighbors and decision trees
• Unsupervised learning
– Clustering and K-means
• Reinforcement learning
– Temporal difference learning
– Q-learning
– Deep reinforcement learning
– Multi-agent reinforcement learning
Supervised Learning
• In supervised learning problems, each training sample consists of
– A feature/input vector $x_n$
– A label/output vector $y_n$
(Figure: an image as the feature/input $x_n$, with "Cat" as the label/output $y_n$)
Linear Regression
• A basic example of a supervised learning algorithm
• Captures many fundamental machine learning concepts
– Function approximation
– Bias-variance tradeoff
– Regularization
– Training/validation/test split
– Optimization and gradient descent
Agenda
• Function approximation
– A modern strategy for designing machine learning algorithms
– By example: linear regression, a simple machine learning algorithm
• Bias-variance tradeoff
– Fundamental challenge in machine learning
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by
$$f_\beta(x) = \beta^T x$$
• $x \in \mathbb{R}^d$ is an input vector (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is a parameter vector
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by
$$f_\beta(x) = \beta^T x = \begin{pmatrix} \beta_1 & \cdots & \beta_d \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}$$
• $x \in \mathbb{R}^d$ is an input vector (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is a parameter vector
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by
$$f_\beta(x) = \beta^T x = \begin{pmatrix} \beta_1 & \cdots & \beta_d \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix} = \beta_1 x_1 + \cdots + \beta_d x_d$$
• $x \in \mathbb{R}^d$ is called an input (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is called the parameters (a.k.a. parameter vector)
• $y = f_\beta(x)$ is called the label (a.k.a. output)
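To make this concrete, here is a minimal NumPy sketch of evaluating a linear function (the parameter and feature values are made up for illustration):

```python
import numpy as np

# Parameter vector beta in R^d and input vector x in R^d (illustrative values)
beta = np.array([0.5, -1.2, 3.0])
x = np.array([2.0, 1.0, 0.5])

# f_beta(x) = beta^T x = beta_1*x_1 + ... + beta_d*x_d
y = beta @ x
print(y)  # 0.5*2.0 + (-1.2)*1.0 + 3.0*0.5 = 1.3
```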
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ such that $y_i \approx \beta^T x_i$
• Typical notation
– Use $i$ to index examples $(x_i, y_i)$ in data $Z$
– Use $j$ to index components $x_j$ of $x \in \mathbb{R}^d$
– $x_{ij}$ is component $j$ of input example $i$
• Goal: Estimate $\beta \in \mathbb{R}^d$
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ such that $y_i \approx \beta^T x_i$
• What does $y_i \approx \beta^T x_i$ mean?
(Figure: scatter plot of the data points $(x_i, y_i)$, with a new query input $x = 800$ whose label $y$ is unknown)
Choice of Loss Function
• $y_i \approx \beta^T x_i$ if $(y_i - \beta^T x_i)^2$ is small
• Mean squared error (MSE):
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$
• Computationally convenient and works well in practice
(Figure: scatter plot with the fitted line $f_\beta(x) = \beta^T x$; $L(\beta; Z)$ is the average of the squared vertical residuals between the points and the line)
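A small sketch of computing the MSE loss in NumPy (the dataset values are illustrative):

```python
import numpy as np

def mse_loss(beta, X, y):
    """L(beta; Z) = (1/n) * sum_i (y_i - beta^T x_i)^2."""
    residuals = y - X @ beta  # X has one example per row, so X @ beta gives all predictions
    return np.mean(residuals ** 2)

# Illustrative dataset: n = 4 examples, d = 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])
y = np.array([2.1, 1.9, 3.2, 4.8])
print(mse_loss(np.array([1.0, 0.1]), X, y))
```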
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
• Compute
$$\hat{\beta}(Z) = \arg\min_{\beta \in \mathbb{R}^d} L(\beta; Z) = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$$
• Output: $f_{\hat{\beta}(Z)}(x) = \hat{\beta}(Z)^T x$
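For the MSE loss, this argmin can be computed directly with a least-squares solver; a sketch using NumPy (reusing the illustrative dataset above):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return beta_hat(Z) = argmin_beta (1/n) * sum_i (y_i - beta^T x_i)^2.

    np.linalg.lstsq minimizes ||X beta - y||^2, which has the same argmin
    (the 1/n factor does not change the minimizer).
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])
y = np.array([2.1, 1.9, 3.2, 4.8])
beta_hat = fit_linear_regression(X, y)
y_pred = X @ beta_hat  # predictions f_beta_hat(x_i) = beta_hat^T x_i
```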
Intuition on Minimizing MSE Loss
• Consider $x \in \mathbb{R}$ and $\beta \in \mathbb{R}$
(Figures: left, the data $(x, y)$ together with the line $f_\beta(x) = \beta x$ for $\beta = 1$, $\beta = 0.5$, and $\beta = 0$; right, the loss $L(\beta; Z)$ as a function of $\beta$, with each choice of $\beta$ marked on the curve)
Intuition on Minimizing MSE Loss
• The MSE loss is convex in $\beta$: any local minimum is a global minimum
(Figure: a convex, bowl-shaped loss surface over parameters $\beta_1$ and $\beta_2$)
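One way to see this one-dimensional picture numerically is to sweep $\beta$ over a grid and evaluate the loss; a sketch with illustrative data:

```python
import numpy as np

# One-dimensional illustrative data: y is roughly proportional to x
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2])

# Evaluate L(beta; Z) on a grid of beta values
betas = np.linspace(0.0, 2.0, 201)
losses = np.array([np.mean((y - b * x) ** 2) for b in betas])

# The loss curve is convex in beta, with a single minimum
best = betas[np.argmin(losses)]
print(best)  # close to the closed-form solution sum(x*y)/sum(x*x) ≈ 1.036
```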
“Good” Mean Squared Error?
• Need to compare to a baseline!
– E.g., constant prediction (such as always predicting the mean label)
• Later: Training vs. test MSE
Alternative Loss Functions
• Mean absolute error (MAE):
$$\frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
• Mean relative error (MRE):
$$\frac{1}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{y_i}$$
• $R^2$ score: $1 - \frac{\text{MSE}}{\text{Variance}}$
– "Coefficient of determination"
– Higher is better; $R^2 = 1$ is perfect
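A sketch of these metrics in NumPy (function names are my own; $\hat{y}$ denotes the model's predictions):

```python
import numpy as np

def mae(y_pred, y_true):
    # Mean absolute error: same unit as the original data
    return np.mean(np.abs(y_pred - y_true))

def mre(y_pred, y_true):
    # Mean relative error: scale-invariant (assumes no zero labels)
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def r2_score(y_pred, y_true):
    # R^2 = 1 - MSE / Variance; 1 is perfect, 0 matches a mean-prediction baseline
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)

y_true = np.array([2.1, 1.9, 3.2, 4.8])
y_pred = np.array([2.0, 2.2, 3.0, 4.5])
print(mae(y_pred, y_true), mre(y_pred, y_true), r2_score(y_pred, y_true))
```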
Loss Function Comparisons
• MSE
– Key characteristics: sensitive to outliers; unit differs from the original data
– Application scenarios: when large errors are particularly undesirable
• MAE
– Key characteristics: less sensitive to outliers; unit is the same as the original data
– Application scenarios: when a robust model is required
• MRE
– Key characteristics: a relative, scale-invariant metric
– Application scenarios: comparing performance across different scales or datasets
Function Approximation
(Diagram: Data $Z$ → Machine learning algorithm → Model $f$)
• The ML algorithm outputs a model $f$ that best "approximates" the given data $Z$
Function Approximation
• Two design decisions
– What is the family of candidate models $f$? (E.g., linear functions)
– How to define "approximating"? (E.g., MSE loss)
True Function
• Input: Dataset 𝑍
– Presume there is an unknown function $f^*$ that generates $Z$
• Goal: Find an approximation $f_\beta \approx f^*$ in our model family $f_\beta \in F$
– Typically, $f^*$ is not in our model family $F$
Linear Regression
(Diagram: Data $Z = \{(x_i, y_i)\}_{i=1}^{n}$ → $\hat{\beta}(Z) = \arg\min_\beta L(\beta; Z)$ → Model $f_{\hat{\beta}(Z)}$)
• The MSE loss $L$ encodes $y_i \approx f_\beta(x_i)$
• The model is a linear function $f_\beta(x) = \beta^T x$
Linear Regression
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression strategy:
• Model family: linear functions $F = \{f_\beta(x) = \beta^T x\}$
• Loss function: MSE, $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$

Linear regression algorithm:
$$\hat{\beta}(Z) = \arg\min_\beta L(\beta; Z)$$
Motivation: Quadratic Function
(Figure: data generated from a quadratic function, with linear fits $f_\beta(x) = x/2$ and $f_\beta(x) = x$ overlaid)
• Can we get a better fit?
Feature Maps
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression with a feature map:
• Linear functions over a given feature map $\phi: x \to \mathbb{R}^d$
$$F = \{f_\beta(x) = \beta^T \phi(x)\}$$
• MSE:
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T \phi(x_i))^2$$
Quadratic Feature Map
• Consider the feature map $\phi: \mathbb{R} \to \mathbb{R}^2$ given by
$$\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$$
• Then, the model family is
$$f_\beta(x) = \beta_1 x + \beta_2 x^2$$
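A sketch of fitting with this quadratic feature map, reusing the least-squares fit from earlier (the data below are generated from $y = x^2$ for illustration):

```python
import numpy as np

def quadratic_feature_map(x):
    """phi(x) = (x, x^2): map scalar inputs to features in R^2."""
    x = np.asarray(x)
    return np.column_stack([x, x ** 2])

# Illustrative data from the quadratic function y = x^2
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = x ** 2

# Linear regression in feature space: f_beta(x) = beta_1*x + beta_2*x^2
Phi = quadratic_feature_map(x)
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta_hat)  # close to [0, 1], i.e., f(x) = 0*x + 1*x^2
```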
Quadratic Feature Map
$$f_\beta(x) = 0 \cdot x + 1 \cdot x^2$$
• In our family for $\beta = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$!
Question
• Why not throw in lots of features?
– Can fit any $n$ points using a polynomial of degree $n$
$$f_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \cdots$$
(Figure: a high-degree polynomial $f_\beta(x)$ passing exactly through every training point)
Training vs. Test Data
• Training data: Examples $Z = \{(x, y)\}$ used to fit our model
• Test data: New inputs $x$ whose labels $y$ we want to predict
Prediction
• Issue: The goal in machine learning is prediction
– Given a new input $x$, predict the label $\hat{y} = f_\beta(x)$
(Figure: the high-degree polynomial fit evaluated on new inputs; the errors on new inputs are very large!)
Prediction
• Issue: The goal in machine learning is prediction
– Given a new input $x$, predict the label $\hat{y} = f_\beta(x)$
(Figure: a linear fit $f_\beta(x)$ on the same data; vanilla linear regression actually works better!)
Overfitting vs. Underfitting
• Overfitting
– Fits the training data $Z$ well
– Fits new test data $(x, y)$ poorly
(Figure: a wiggly high-degree fit passing through every training point)
Overfitting vs. Underfitting
• Underfitting
– Fits the training data $Z$ poorly
– Fits new test data $(x, y)$ poorly
(Figure: a fitted line $f_\beta(x)$ that misses the trend in the data)
Why Does Overfitting Happen?
• Noise in the data
– Noise in labels $y$: the true data-generating process is complex
– Noise in features 𝑥
• Measurement error in the feature values
• Errors due to preprocessing
• Some features may be irrelevant to the function
• The training data size is too small
– The dataset does not contain enough samples to accurately represent all possible input values
• The model trains for too long on a single set of training data
Training/Test Split
• Issue: How to detect overfitting vs. underfitting?
• Solution: Use held-out test data to estimate the loss on new data
– Typically, randomly shuffle the data first
(Diagram: the given data $Z$ split into a training portion and a test portion)
Training/Test Split Algorithm
• Step 1: Split $Z$ into training data $Z_{\mathrm{train}}$ and test data $Z_{\mathrm{test}}$
• Step 2: Run linear regression on $Z_{\mathrm{train}}$ to obtain $\hat{\beta}(Z_{\mathrm{train}})$
• Step 3: Evaluate
– Training loss: $L_{\mathrm{train}} = L(\hat{\beta}(Z_{\mathrm{train}}); Z_{\mathrm{train}})$
– Test (or generalization) loss: $L_{\mathrm{test}} = L(\hat{\beta}(Z_{\mathrm{train}}); Z_{\mathrm{test}})$
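A sketch of the whole procedure on synthetic data (NumPy only; the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 2*x_1 - x_2 + noise
n, d = 100, 2
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n)

# Step 1: randomly shuffle, then split into Z_train and Z_test
perm = rng.permutation(n)
split = int(0.8 * n)
X_train, y_train = X[perm[:split]], y[perm[:split]]
X_test, y_test = X[perm[split:]], y[perm[split:]]

# Step 2: run linear regression on the training data
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Step 3: evaluate training and test (generalization) loss
L_train = np.mean((y_train - X_train @ beta_hat) ** 2)
L_test = np.mean((y_test - X_test @ beta_hat) ** 2)
print(L_train, L_test)
```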
Training/Test Split Algorithm
• Overfitting
– Fits the training data $Z$ well: $L_{\mathrm{train}}$ is small
– Fits new test data $(x, y)$ poorly: $L_{\mathrm{test}}$ is large
(Figure: the wiggly high-degree fit from before)
Training/Test Split Algorithm
• Underfitting
– Fits the training data $Z$ poorly: $L_{\mathrm{train}}$ is large
– Fits new test data $(x, y)$ poorly: $L_{\mathrm{test}}$ is also large
(Figure: the underfit linear model $f_\beta(x)$ from before)
How to Fix Underfitting/Overfitting?
• Choose the right model family!
Role of Capacity
• The capacity of a model family captures the "complexity" of the data it can fit
– Higher capacity → more likely to overfit (the model family has high variance)
– Lower capacity → more likely to underfit (the model family has high bias)
• For linear regression, capacity corresponds to the feature dimension
Bias-Variance Tradeoff
(Figure: loss vs. capacity; training loss decreases as capacity grows, while test loss first decreases and then increases, marking the underfitting, ideal, and overfitting regimes)
Example of Underfitting/Overfitting
• House price prediction with linear regression
• Data fields:
1. Total Area: the total area of the house, excluding the basement
2. Bedrooms: the number of bedrooms in the house
3. Living Area: the living area inside the house
4. Parking Area: the area of the parking lot
5. …
• Target value: the transaction price of the house
Example of Underfitting/Overfitting
• Exploratory Data Analysis
(Figures: exploratory plots of the dataset)
Example of Underfitting/Overfitting
• Using sklearn.preprocessing.MinMaxScaler for data normalization:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
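A short sketch of this normalization with scikit-learn (the feature values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature matrix: two columns on very different scales
X = np.array([[800.0, 2.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

# MinMaxScaler rescales each column to [0, 1] via (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # first column becomes [0.0, 0.333..., 1.0]
```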
Example of Underfitting/Overfitting
• Exploratory Data Analysis
(Figures: further exploratory plots of the dataset)
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
$$y = w^T x + b$$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
$$y = w^T x' + b$$
For example, the degree-2 polynomial features of $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$.
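scikit-learn's PolynomialFeatures generates exactly these degree-2 features; a quick check:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: x = (x_1, x_2) = (2, 3)
X = np.array([[2.0, 3.0]])

# Degree-2 polynomial features: [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]]
```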
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have
$$w^* = \arg\min_w \left( y - (w_1, w_2) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - b \right)^2$$
• In polynomial regression, if we use the degree-2 polynomial features of $x$, we have
$$w^* = \arg\min_w \left( y - (w_1, w_2, \ldots, w_6) \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix} - b \right)^2$$
Example of Underfitting/Overfitting
• Degree-1
• Degree-3
• Degree-6
(Figure: the fitted regression curves for polynomial degrees 1, 3, and 6)
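A sketch of this comparison on synthetic one-dimensional data (an illustrative setup, not the house-price dataset): fit polynomial regressions of degree 1, 3, and 6 and compare training and test MSE. The degree-1 model should underfit visibly; how strongly the degree-6 model overfits depends on the noise level and sample size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data from a cubic trend plus noise (illustrative only)
x = rng.uniform(-1, 1, size=(60, 1))
y = 2 * x[:, 0] ** 3 - x[:, 0] + 0.1 * rng.normal(size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

for degree in [1, 3, 6]:
    # Polynomial feature map followed by ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(x_train))
    mse_test = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree}: train MSE {mse_train:.4f}, test MSE {mse_test:.4f}")
```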