CS229 Machine Learning Notes: Day 2
Andrew Ng (Lecture Notes, Autumn 2018)
Notes by Hardik Mahajan
May 30, 2025
Contents
1 Introduction to Linear Regression
   1.1 Why Linear Regression?
   1.2 Context: Housing Price Example
2 Supervised Learning Framework
   2.1 Notation
3 Linear Regression Model
   3.1 Single Feature
   3.2 Multiple Features
   3.3 Vectorized Form
4 Cost Function
   4.1 Visualization
5 Gradient Descent
   5.1 Goal
   5.2 Algorithm
   5.3 Learning Rate (α)
   5.4 Convergence
6 Batch vs. Stochastic Gradient Descent
   6.1 Batch Gradient Descent
   6.2 Stochastic Gradient Descent
   6.3 Mini-Batch Gradient Descent
   6.4 Choosing Algorithms
7 Normal Equations
   7.1 Purpose
   7.2 Advantages
   7.3 Disadvantages
   7.4 Cost Function in Matrix Form
   7.5 Derivation Outline
   7.6 Visualization
8 Practical Tips
9 Additional Insights
10 Resources
1 Introduction to Linear Regression
Linear regression is a foundational supervised learning algorithm, ideal for introducing core
machine learning concepts such as cost functions, optimization, and model fitting. It predicts
continuous outputs, making it suitable for applications like forecasting house prices, stock
prices, or temperatures. This lecture, part of Stanford’s CS229 (Autumn 2018) by Andrew Ng,
covers linear regression, batch and stochastic gradient descent, and the normal equations,
laying the groundwork for advanced algorithms.
1.1 Why Linear Regression?
• Simplicity: Easiest algorithm to understand optimization and model design.
• Practicality: Widely used for continuous predictions.
• Contrast with Classification: Regression predicts continuous outputs (e.g., steering
angles), while classification predicts discrete outputs (e.g., spam/not spam).
1.2 Context: Housing Price Example
The lecture uses a dataset from Craigslist (Portland, Oregon) to motivate linear regression:
• Features: House size in square feet (𝑥).
• Output: Price in thousands of dollars (𝑦).
• Goal: Fit a straight line to predict price from size.
• Insight: Assumes a linear relationship, which may oversimplify real-world data but serves
as a starting point.
2 Supervised Learning Framework
Supervised learning involves learning from labeled data to map inputs (𝑥) to outputs (𝑦). The
process is:
1. Collect a training set: {(x^{(i)}, y^{(i)})}_{i=1}^{m}.
2. Feed to a learning algorithm.
3. Output a hypothesis ℎ(𝑥).
4. Predict 𝑦 for new 𝑥.
2.1 Notation
• 𝑚: Number of training examples.
• 𝑛: Number of features.
• x^{(i)}: Feature vector for the i-th example ((n + 1)-dimensional, with x_0 = 1).
• y^{(i)}: Output for the i-th example.
• θ: Parameter vector ((n + 1)-dimensional).
• Insight: Standardized notation scales to complex models like neural networks.
3 Linear Regression Model
3.1 Single Feature
The hypothesis for one feature is:
h(x) = \theta_0 + \theta_1 x
where 𝑥 is the house size, and ℎ(𝑥) is the predicted price.
3.2 Multiple Features
For multiple features (e.g., size x_1, bedrooms x_2):
h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
Compact form:
h(x) = \sum_{j=0}^{n} \theta_j x_j, \qquad x_0 = 1
3.3 Vectorized Form
h(x) = \theta^T x, \qquad \theta, x \in \mathbb{R}^{n+1}
Insight: Vectorization simplifies coding and computation.
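Example (a minimal NumPy sketch of the vectorized hypothesis; the variable names theta and x are illustrative, with x[0] = 1 as the intercept term):

import numpy as np

# Hypothetical parameters and feature vector for the housing example.
theta = np.array([100.0, 0.15])   # [theta_0, theta_1]: intercept and price per square foot
x = np.array([1.0, 2104.0])       # [x_0 = 1, size in square feet]

h = theta @ x                     # h(x) = theta^T x, a single dot product
print(h)                          # predicted price in thousands of dollars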
4 Cost Function
The cost function measures prediction error:
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
• The factor 1/2 simplifies derivatives.
• Squared error penalizes large deviations.
• Why Squared?: Convex (single minimum), assumes Gaussian noise (explained in later
lectures).
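Example (a sketch of J(θ) in vectorized form, anticipating the matrix notation of Section 7.4; the toy numbers come from the scatter-plot placeholder below and are illustrative):

import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/2) * sum of squared prediction errors."""
    residuals = X @ theta - y            # h_theta(x^(i)) - y^(i) for every example
    return 0.5 * residuals @ residuals   # half the sum of squared residuals

# Toy design matrix (first column is x_0 = 1) and prices in $1000s.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0]])
y = np.array([400.0, 300.0])
print(cost(np.array([100.0, 0.15]), X, y))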
4.1 Visualization
A scatter plot shows house sizes vs. prices with a fitted line. (Placeholder: Imagine a scatter
plot with points at (2104, 400), (1416, 300), etc., and a red line 𝑦 = 0.15𝑥 + 100.)
5 Gradient Descent
5.1 Goal
Minimize 𝐽 (𝜃) iteratively.
5.2 Algorithm
1. Initialize 𝜃 (e.g., zeros).
2. Update:
\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
3. Repeat until convergence.
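Example (a minimal NumPy sketch of the loop above; it assumes a feature-scaled design matrix X with an x_0 = 1 column, and the names alpha and num_iters are illustrative):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    theta = np.zeros(X.shape[1])          # step 1: initialize theta to zeros
    for _ in range(num_iters):            # step 3: repeat (here, a fixed number of iterations)
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
        gradient = X.T @ errors           # summed over all m examples, one entry per theta_j
        theta = theta - alpha * gradient  # step 2: simultaneous update of every theta_j
    return theta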
5.3 Learning Rate (𝛼)
• Choosing 𝛼: Start with 0.01 (features scaled to [−1, 1]). Try 0.01, 0.03, 0.1, 0.3.
• Monitoring: If 𝐽 (𝜃) increases, reduce 𝛼; if slow, increase 𝛼.
• Insight: Feature scaling ensures consistent 𝛼.
5.4 Convergence
Linear regression has a convex 𝐽 (𝜃), ensuring a global minimum. A contour plot shows the
path of 𝜃 converging to the minimum. (Placeholder: Imagine an elliptical contour plot with a
red path from (0.5, 0.5) to (0.1, 0.05).)
6 Batch vs. Stochastic Gradient Descent
6.1 Batch Gradient Descent
• Uses all 𝑚 examples per update.
• Pros: Precise, converges to global minimum.
• Cons: Slow for large 𝑚 (e.g., millions).
• Use Case: Small datasets (hundreds/thousands).
6.2 Stochastic Gradient Descent
• Updates per example (see the sketch after this list):
\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
• Pros: Fast for large datasets (terabytes).
• Cons: Noisy path, oscillates near minimum.
• Enhancement: Decrease α over time (e.g., α_t = α_0 / (1 + k t)).
• Stopping: Stop when 𝐽 (𝜃) plateaus.
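Example (one stochastic pass over the training set with the decaying learning rate above; alpha0, k, and the shuffling scheme are illustrative choices):

import numpy as np

def sgd_epoch(theta, X, y, alpha0=0.01, k=0.001):
    """One pass over the data, updating theta after each example."""
    for t, i in enumerate(np.random.permutation(X.shape[0])):
        alpha = alpha0 / (1.0 + k * t)        # alpha_t = alpha_0 / (1 + k t)
        error = X[i] @ theta - y[i]           # residual on example i only
        theta = theta - alpha * error * X[i]  # update using a single example
    return theta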
6.3 Mini-Batch Gradient Descent
Uses small batches (e.g., 100 examples). Balances speed and stability. Common in practice.
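Example (a mini-batch variant of the same loop, with batch size 100 as in the text; the shuffling-and-slicing scheme is one common choice):

import numpy as np

def minibatch_epoch(theta, X, y, alpha=0.01, batch_size=100):
    order = np.random.permutation(X.shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]            # indices of one mini-batch
        gradient = X[idx].T @ (X[idx] @ theta - y[idx])  # gradient over the batch only
        theta = theta - alpha * gradient
    return theta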
6.4 Choosing Algorithms
• Small datasets: Batch.
• Large datasets: Stochastic or mini-batch.
• Insight: Stochastic solutions are often "good enough."
7 Normal Equations
7.1 Purpose
Solve for optimal 𝜃 in one step:
\theta = (X^T X)^{-1} X^T y
where 𝑋 is the 𝑚 × (𝑛 + 1) design matrix, and 𝑦 is the 𝑚 × 1 label vector.
7.2 Advantages
• Single step, no iterations.
• Exact global minimum.
7.3 Disadvantages
• Expensive for large n (O(n^3) for the matrix inversion).
• Non-invertible 𝑋 𝑇 𝑋: Use pseudo-inverse or remove redundant features.
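Example (a sketch of the closed-form solve; np.linalg.solve avoids forming the inverse explicitly, and np.linalg.pinv is the pseudo-inverse fallback mentioned above):

import numpy as np

def normal_equation(X, y):
    A = X.T @ X
    b = X.T @ y
    try:
        return np.linalg.solve(A, b)   # solves X^T X theta = X^T y
    except np.linalg.LinAlgError:
        return np.linalg.pinv(X) @ y   # fallback when X^T X is singular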
7.4 Cost Function in Matrix Form
J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y)
7.5 Derivation Outline
1. Expand 𝐽 (𝜃).
2. Take derivative w.r.t. 𝜃 using matrix derivatives.
3. Set to zero: X^T X \theta = X^T y.
4. Solve: \theta = (X^T X)^{-1} X^T y.
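Filling in steps 2-4, a compact version of the derivation (using the matrix-calculus identities \nabla_\theta (\theta^T A \theta) = 2 A \theta for symmetric A and \nabla_\theta (b^T \theta) = b):

\begin{align*}
J(\theta) &= \tfrac{1}{2}\left(\theta^T X^T X \theta - 2 y^T X \theta + y^T y\right) \\
\nabla_\theta J(\theta) &= X^T X \theta - X^T y = 0 \\
\Rightarrow\quad X^T X \theta &= X^T y \quad\Rightarrow\quad \theta = (X^T X)^{-1} X^T y
\end{align*}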
7.6 Visualization
A line plot compares the convergence speed of batch gradient descent, stochastic gradient descent, and the normal equations. (Placeholder: Imagine a plot with batch decreasing gradually, stochastic oscillating, and the normal equations jumping to the minimum in one step.)
8 Practical Tips
• Feature Scaling: Normalize features to [−1, 1] (see the sketch after this list).
• Debugging: Plot 𝐽 (𝜃) vs. iterations.
• Implementation: Use Python/NumPy or scikit-learn.
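Example (one simple way to scale raw features to roughly [−1, 1] via mean normalization; apply it before adding the x_0 = 1 column, and note that scikit-learn's preprocessing module offers similar utilities):

import numpy as np

def scale_features(X_raw):
    """Mean-normalize each feature column so values fall roughly in [-1, 1]."""
    mean = X_raw.mean(axis=0)
    span = X_raw.max(axis=0) - X_raw.min(axis=0)  # assumes no constant columns
    return (X_raw - mean) / span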
9 Additional Insights
• Why Linear Regression Matters: Foundation for neural networks, generalized linear
models.
• Limitations: Assumes linearity, sensitive to outliers.
• Enhancement: Use robust regression for outliers.
• Future Topics: Generalized linear models, neural networks.
10 Resources
• CS229 lecture notes for derivations.
• Problem Set 1 for matrix derivative practice.
• Discussion sections for calculus refreshers.