Lecture Notes: Machine Learning
Fundamentals & Deep Learning
I. Learning Algorithms
1. What is a Learning Algorithm?
Definition: A method that improves performance (P) on a task (T) through
experience (E).
Components:
o Task (T): Classification, regression, clustering.
o Performance Metric (P): Accuracy, MSE, F1-score.
o Experience (E): Training data (labeled or unlabeled).
2. Model Capacity
Definition: A model’s ability to fit complex functions.
Low Capacity: Risk of underfitting (e.g., linear models).
High Capacity: Risk of overfitting (e.g., deep neural networks).
VC Dimension: Measures capacity for binary classification (the size of the largest point set the model can shatter).
II. Overfitting & Underfitting
1. Overfitting
Symptoms:
o Low training error, high test error.
o Model memorizes noise in training data.
Solutions:
o Regularization (L1/L2, dropout).
o Cross-validation.
o Simplify the model.
2. Underfitting
Symptoms:
o High training and test error.
o Model is too simple.
Solutions:
o Increase model capacity.
o Feature engineering.
III. Hyperparameters & Validation Sets
1. Hyperparameters vs. Parameters
Hyperparameters: Set before training (e.g., learning rate, batch size).
Parameters: Learned during training (e.g., weights in a neural network).
2. Validation Sets
Purpose: Tune hyperparameters without touching the test set.
Techniques:
o Holdout Validation: Split data into train/validation/test.
o k-Fold Cross-Validation: Average performance over k splits.
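A minimal NumPy sketch of k-fold cross-validation as described above; the `fit`/`score` callables and the toy regression data are illustrative assumptions, not part of the notes:

```python
import numpy as np

def k_fold_scores(X, y, k, fit, score):
    """Train on k-1 folds, evaluate on the held-out fold, average the scores."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return np.mean(scores)

# Usage with a least-squares fit and MSE score on toy data:
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
score = lambda w, X, y: np.mean((X @ w - y) ** 2)
print(k_fold_scores(X, y, k=5, fit=fit, score=score))  # ≈ 0.01 (noise variance)
```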
IV. Bias & Variance
1. Definitions
Bias: Error due to overly simplistic assumptions.
Variance: Error due to sensitivity to small fluctuations in training data.
2. Tradeoff
High Bias → Underfitting.
High Variance → Overfitting.
Balanced Model: Complexity matched to the data, often achieved via regularization.
V. Maximum Likelihood Estimation (MLE)
Goal: Find parameters that maximize the likelihood of observed data.
Formula:
θ_MLE = argmax_θ P(X ∣ θ)
Example: MLE for Gaussian distribution → Sample mean & variance.
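A quick sketch of that example: for a univariate Gaussian, the MLE is the sample mean and the (1/N, i.e. biased) sample variance. The toy data below is assumed for illustration:

```python
import numpy as np

x = np.random.normal(loc=2.0, scale=1.5, size=10_000)  # toy data
mu_mle = x.mean()
var_mle = x.var()         # NumPy's default ddof=0 is exactly the MLE
print(mu_mle, var_mle)    # ≈ 2.0 and ≈ 2.25
```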
VI. Bayesian Statistics
Bayes’ Theorem:
P(θ ∣ X) = P(X ∣ θ) P(θ) / P(X)
Key Concepts:
o Prior: Belief before seeing data.
o Posterior: Updated belief after observing data.
o Conjugate Priors: Simplify posterior computation (e.g., Beta for Bernoulli).
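A tiny sketch of the Beta–Bernoulli conjugate pair mentioned above; the prior pseudo-counts and the coin-flip data are assumed for illustration:

```python
# Beta(a, b) prior + Bernoulli likelihood → Beta(a + heads, b + tails)
# posterior, in closed form.
a, b = 2.0, 2.0                 # prior pseudo-counts
flips = [1, 0, 1, 1, 0, 1]      # observed Bernoulli outcomes
heads = sum(flips)
tails = len(flips) - heads
a_post, b_post = a + heads, b + tails
print(a_post, b_post, a_post / (a_post + b_post))  # posterior mean = 0.6
```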
VII. Supervised Learning Algorithms
1. Linear Regression
Objective: Minimize MSE.
Closed-form solution:
w = (XᵀX)⁻¹ Xᵀy
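A minimal NumPy sketch of the closed-form solution; solving the normal equations with `np.linalg.solve` is used here rather than forming the inverse explicitly (a standard numerical choice, not something the notes prescribe):

```python
import numpy as np

# Solve the normal equations (XᵀX) w = Xᵀy directly.
X = np.random.randn(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(200)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ≈ [2.0, -1.0, 0.5]
```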
2. Logistic Regression
Objective: Maximize log-likelihood (binary classification).
Sigmoid function:
P(y = 1 ∣ x) = 1 / (1 + e^(−wᵀx))
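A short sketch of logistic regression trained by gradient ascent on the log-likelihood; the toy data, learning rate, and iteration count are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient ascent on the log-likelihood; its gradient is Xᵀ(y − p).
X = np.random.randn(500, 2)
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # separable toy labels
w = np.zeros(2)
for _ in range(1000):
    p = sigmoid(X @ w)
    w += 0.1 * X.T @ (y - p) / len(y)
print(w)  # grows along a direction ∝ [1.5, -2.0] (data is separable)
```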
3. Decision Trees
Splitting Criteria: Gini impurity, information gain.
Pruning: Prevent overfitting.
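A small sketch of one splitting criterion, Gini impurity (the helper name and the example labels are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 − Σ_k p_k², with p_k the fraction of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # 0.0 — pure node
print(gini([0, 0, 1, 1]))   # 0.5 — maximally mixed binary node
```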
VIII. Unsupervised Learning
1. Clustering (k-Means)
Steps:
1. Initialize k centroids.
2. Assign points to nearest centroid.
3. Update centroids.
4. Repeat until convergence.
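A from-scratch NumPy sketch of the four steps above (no empty-cluster handling; the initialization and convergence test are simple illustrative choices):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. initialize
    for _ in range(n_iter):
        # 2. assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                        # 4. converged
            break
        centroids = new
    return centroids, labels
```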
2. PCA (Dimensionality Reduction)
Goal: Project data onto lower-dimensional space.
Eigenvalue decomposition:
XᵀX = V Λ Vᵀ
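A minimal sketch of PCA via eigendecomposition of the centered covariance matrix; the toy data and the choice of two components are assumptions:

```python
import numpy as np

# PCA: center the data, eigendecompose the covariance matrix,
# then project onto the top eigenvectors (principal components).
X = np.random.randn(300, 5)
Xc = X - X.mean(axis=0)                 # centering is essential
cov = Xc.T @ Xc / len(Xc)
eigvals, V = np.linalg.eigh(cov)        # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
Z = Xc @ V[:, order[:2]]                # project onto top-2 components
print(Z.shape)                          # (300, 2)
```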
IX. Stochastic Gradient Descent (SGD)
Update Rule:
w_{t+1} = w_t − η ∇L(w_t)
Variants:
o Momentum: Accumulate past gradients.
o Adam: Adaptive learning rates.
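A compact sketch of the update rule with the momentum variant listed above; the toy quadratic objective and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.01, beta=0.9, steps=1000):
    """w_{t+1} = w_t − η v_t, where v accumulates past gradients."""
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = beta * v + grad(w)    # momentum: running accumulation of gradients
        w -= lr * v
    return w

# Usage on a toy quadratic L(w) = ||w − 3||², so ∇L(w) = 2(w − 3):
print(sgd_momentum(lambda w: 2 * (w - 3.0), w0=np.zeros(2)))  # ≈ [3.0, 3.0]
```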
X. Building a Machine Learning Algorithm
1. Define the task (e.g., classification).
2. Choose a model (e.g., neural network).
3. Select a loss function (e.g., cross-entropy).
4. Optimize (e.g., SGD).
5. Evaluate (e.g., test set accuracy).
XI. Challenges Motivating Deep Learning
Limitations of Shallow Models:
o Cannot learn hierarchical features.
o Struggle with high-dimensional data (e.g., images).
Deep Learning Solutions:
o Automatic feature extraction.
o Scalability with data.
XII. Deep Feedforward Networks
1. Learning XOR
Problem: XOR is not linearly separable, so linear models fail.
Solution: 2-layer NN with non-linear activation (e.g., ReLU).
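A runnable sketch of that solution. Note this uses tanh hidden units rather than the ReLU mentioned above, since ReLU units can "die" on a four-point dataset; the layer sizes, learning rate, and seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer: 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer
for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                    # non-linear hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid output
    dz2 = p - y                                  # ∂(cross-entropy)/∂logits
    dh = (dz2 @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
    W2 -= 0.5 * h.T @ dz2;  b2 -= 0.5 * dz2.sum(axis=0)
    W1 -= 0.5 * X.T @ dh;   b1 -= 0.5 * dh.sum(axis=0)
print(p.round(2))  # should approach [[0], [1], [1], [0]]
```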
2. Gradient-Based Learning
Backpropagation: Chain rule for computing gradients.
Vanishing Gradients: Mitigated by ReLU activations and skip connections.
3. Hidden Units
ReLU: f(x) = max(0, x) (avoids vanishing gradients).
Sigmoid/Tanh: Still used in gating units (e.g., LSTMs).
4. Architecture Design
Wide vs. Deep: Tradeoff between parameters and abstraction.
Residual Networks (ResNet): Skip connections for deep networks.
XIII. Backpropagation & Optimization
1. Chain Rule
∂L/∂w = (∂L/∂y) · (∂y/∂w)
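A quick numeric check of the chain rule on a one-parameter example, L = (y − t)² with y = w·x (the values are assumed for illustration), compared against a finite-difference estimate:

```python
# Chain rule: ∂L/∂w = (∂L/∂y)(∂y/∂w) = 2(y − t) · x
w, x, t = 1.5, 2.0, 7.0
y = w * x
analytic = 2 * (y - t) * x
eps = 1e-6  # finite-difference sanity check of the analytic gradient
numeric = (((w + eps) * x - t) ** 2 - ((w - eps) * x - t) ** 2) / (2 * eps)
print(analytic, numeric)  # both ≈ -16.0
```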
2. Optimization Algorithms
SGD: Basic but noisy.
Adam: Combines momentum + adaptive learning rates.
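A sketch of the Adam update as described (first/second-moment estimates plus bias correction); the toy objective is the same quadratic as the SGD sketch, and the hyperparameter values follow common defaults but are assumptions here:

```python
import numpy as np

def adam(grad, w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    """Adam: momentum (m) + per-coordinate adaptive learning rates (v)."""
    w = w0.copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g ** 2   # second-moment estimate
        m_hat = m / (1 - b1 ** t)        # bias correction for zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy quadratic with minimum at w = 3, ∇L(w) = 2(w − 3):
print(adam(lambda w: 2 * (w - 3.0), w0=np.zeros(2)))  # ≈ [3.0, 3.0]
```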