0% found this document useful (0 votes)
52 views7 pages

Unit 1 Notes

The lecture notes cover fundamental concepts in machine learning and deep learning, including learning algorithms, model capacity, overfitting and underfitting, hyperparameters, and validation techniques. Key topics also include bias-variance tradeoff, maximum likelihood estimation, Bayesian statistics, supervised and unsupervised learning algorithms, and optimization methods like stochastic gradient descent. Additionally, the notes discuss the challenges of shallow models and the advantages of deep learning, including architecture design and backpropagation techniques.

Uploaded by

Suresh Kanigiri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
52 views7 pages

Unit 1 Notes

The lecture notes cover fundamental concepts in machine learning and deep learning, including learning algorithms, model capacity, overfitting and underfitting, hyperparameters, and validation techniques. Key topics also include bias-variance tradeoff, maximum likelihood estimation, Bayesian statistics, supervised and unsupervised learning algorithms, and optimization methods like stochastic gradient descent. Additionally, the notes discuss the challenges of shallow models and the advantages of deep learning, including architecture design and backpropagation techniques.

Uploaded by

Suresh Kanigiri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 7

Lecture Notes: Machine Learning

Fundamentals & Deep Learning


I. Learning Algorithms
1. What is a Learning Algorithm?

 Definition: A method that improves performance (P) on a task (T) through


experience (E).
 Components:

o Task (T): Classification, regression, clustering.


o Performance Metric (P): Accuracy, MSE, F1-score.
o Experience (E): Training data (labeled or unlabeled).

2. Model Capacity

 Definition: A model’s ability to fit complex functions.


 Low Capacity: Underfitting (e.g., linear models).
 High Capacity: Overfitting (e.g., deep neural networks).
 VC Dimension: Measures capacity for classification tasks.

II. Overfitting & Underfitting


1. Overfitting

 Symptoms:

o Low training error, high test error.


o Model memorizes noise in training data.
 Solutions:

o Regularization (L1/L2, dropout).


o Cross-validation.
o Simplify the model.

2. Underfitting

 Symptoms:

o High training and test error.


o Model is too simple.
 Solutions:

o Increase model capacity.


o Feature engineering.

III. Hyperparameters & Validation Sets


1. Hyperparameters vs. Parameters

 Hyperparameters: Set before training (e.g., learning rate, batch size).


 Parameters: Learned during training (e.g., weights in a neural network).

2. Validation Sets

 Purpose: Tune hyperparameters without touching the test set.


 Techniques:

o Holdout Validation: Split data into train/validation/test.


o k-Fold Cross-Validation: Average performance over k splits.

IV. Bias & Variance


1. Definitions
 Bias: Error due to overly simplistic assumptions.
 Variance: Error due to sensitivity to small fluctuations in training data.

2. Tradeoff

 High Bias → Underfitting.


 High Variance → Overfitting.
 Balanced Model: Optimal complexity (e.g., regularization).

V. Maximum Likelihood Estimation (MLE)


 Goal: Find parameters that maximize the likelihood of observed data.
 Formula:
θMLE=arg⁡max⁡θP(X∣θ)θMLE=argθmaxP(X∣θ)

 Example: MLE for Gaussian distribution → Sample mean & variance.

VI. Bayesian Statistics


 Bayes’ Theorem:

P(θ∣X)=P(X∣θ)P(θ)P(X)P(θ∣X)=P(X)P(X∣θ)P(θ)

 Key Concepts:

o Prior: Belief before seeing data.


o Posterior: Updated belief after observing data.
o Conjugate Priors: Simplify posterior computation (e.g., Beta for Bernoulli).
VII. Supervised Learning Algorithms
1. Linear Regression

 Objective: Minimize MSE.


 Closed-form solution:
w=(XTX)−1XTyw=(XTX)−1XTy

2. Logistic Regression

 Objective: Maximize log-likelihood (binary classification).


 Sigmoid function:
P(y=1∣x)=11+e−wTxP(y=1∣x)=1+e−wTx1

3. Decision Trees

 Splitting Criteria: Gini impurity, information gain.


 Pruning: Prevent overfitting.

VIII. Unsupervised Learning


1. Clustering (k-Means)

 Steps:

1. Initialize k centroids.
2. Assign points to nearest centroid.
3. Update centroids.
4. Repeat until convergence.

2. PCA (Dimensionality Reduction)


 Goal: Project data onto lower-dimensional space.
 Eigenvalue decomposition:
XTX=VΛVTXTX=VΛVT

IX. Stochastic Gradient Descent (SGD)


 Update Rule:

wt+1=wt−η∇L(wt)wt+1=wt−η∇L(wt)

 Variants:

o Momentum: Accumulate past gradients.


o Adam: Adaptive learning rates.

X. Building a Machine Learning Algorithm


1. Define the task (e.g., classification).
2. Choose a model (e.g., neural network).
3. Select a loss function (e.g., cross-entropy).
4. Optimize (e.g., SGD).
5. Evaluate (e.g., test set accuracy).

XI. Challenges Motivating Deep Learning


 Limitations of Shallow Models:
o Cannot learn hierarchical features.
o Struggle with high-dimensional data (e.g., images).
 Deep Learning Solutions:

o Automatic feature extraction.


o Scalability with data.

XII. Deep Feedforward Networks


1. Learning XOR

 Problem: Linear models fail (non-linearly separable).


 Solution: 2-layer NN with non-linear activation (e.g., ReLU).

2. Gradient-Based Learning

 Backpropagation: Chain rule for computing gradients.


 Vanishing Gradients: Solved with ReLU, skip connections.

3. Hidden Units

 ReLU: f(x)=max⁡(0,x)f(x)=max(0,x) (avoids vanishing gradients).


 Sigmoid/Tanh: Used in LSTMs.

4. Architecture Design

 Wide vs. Deep: Tradeoff between parameters and abstraction.


 Residual Networks (ResNet): Skip connections for deep networks.

XIII. Backpropagation & Optimization


1. Chain Rule

∂L∂w=∂L∂y∂y∂w∂w∂L=∂y∂L∂w∂y

2. Optimization Algorithms

 SGD: Basic but noisy.


 Adam: Combines momentum + adaptive learning rates.

You might also like