
Machine Learning – Detailed Notes

Unit 1: Machine Learning Fundamentals

• Bayesian Decision Theory

• Classification

• Losses and Risks

• Discriminant Functions

• Utility Theory

• Association Rules

1. Bayesian Decision Theory

Bayesian Decision Theory provides a structured way to make decisions under uncertainty using probabilities. It combines prior knowledge (priors) with evidence (likelihood) to compute a posterior probability and make decisions that minimize expected loss.

Bayes’ Theorem:
P(C_k | x) = [P(x | C_k) * P(C_k)] / P(x)
Where:

• P(C_k | x): Posterior probability of class C_k given input x

• P(x | C_k): Likelihood of x under class C_k

• P(C_k): Prior probability of class C_k

• P(x): Marginal probability of x

A classifier chooses the class with the highest posterior probability for a
given input.
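As a minimal illustration, the following Python sketch applies this rule to one input, assuming the priors and likelihoods are already known (all numbers are made up for demonstration):

```python
import numpy as np

# Illustrative two-class problem: priors P(C_k) and likelihoods P(x | C_k)
# for one observed input x (hypothetical values).
priors = np.array([0.7, 0.3])        # P(C_0), P(C_1)
likelihoods = np.array([0.2, 0.6])   # P(x | C_0), P(x | C_1)

evidence = np.sum(likelihoods * priors)        # P(x), the marginal
posteriors = likelihoods * priors / evidence   # P(C_k | x) by Bayes' theorem

predicted_class = np.argmax(posteriors)        # choose the highest posterior
print(posteriors, predicted_class)             # [0.4375 0.5625] 1
```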

2. Classification

Classification is the process of assigning a class label to input data based on learned decision boundaries. It includes binary and multiclass problems.

Types of classification methods:

• Parametric: These methods assume that the data follows a certain distribution (like Gaussian) and rely on a fixed number of parameters to define the model. Once trained, the model uses these parameters to make predictions. Examples include Logistic Regression and Linear Discriminant Analysis. They are efficient and require less training data, but may not perform well if the assumed distribution is incorrect.

• Non-parametric: These methods make no strong assumptions about the underlying data distribution. Instead, they adapt their structure based on the training data. They are more flexible and can model complex patterns, but typically require more data and computation. Examples include K-Nearest Neighbors (K-NN) and Decision Trees.

Classification is used in spam detection, medical diagnosis, image recognition, etc.

3. Losses and Risks

Loss represents the cost of making incorrect predictions, while risk is the
expected value of that loss over possible outcomes. These concepts help
guide decision-making in probabilistic classification systems.

Loss Function:
L(i | j): This function defines the penalty incurred when the actual class is j,
but the system predicts class i. For example, misclassifying a spam email as
non-spam may incur a different cost than the reverse.

Expected Risk:
R(α_i | x) = Σ_j L(α_i | y_j) * P(y_j | x)
This calculates the average loss for choosing action α_i (e.g., predicting class
i) given the input x. It accounts for all possible true classes y_j and their
posterior probabilities.

Bayes Risk:
R*(x) = min_i R(α_i | x)
This is the minimum possible risk achievable for input x. The classification
decision rule that always chooses the class with the lowest expected risk is
called the Bayes decision rule.

Real-world example: In medical diagnosis, predicting 'no disease' when the patient actually has a disease (false negative) could be far more harmful than the opposite error (false positive). Hence, the loss assigned to a false negative is much higher. Similarly, in fraud detection, failing to catch a fraudulent transaction (false negative) may cost the company much more than mistakenly flagging a valid transaction (false positive). These considerations directly shape the loss matrix and influence classification decisions.

In practice, we may define different loss matrices based on how critical each
type of misclassification is. This allows tailoring classifiers to specific
application needs—e.g., false negatives in medical diagnosis may carry much
higher cost than false positives.
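The following sketch applies the Bayes decision rule under an asymmetric loss matrix in the spirit of the medical example above (the loss values and posterior are hypothetical):

```python
import numpy as np

# L[i, j]: loss of predicting class i when the true class is j.
# Here a false negative (predict 0 = 'no disease' when truth is 1) costs 10,
# while a false positive costs only 1 (hypothetical values for illustration).
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])

posterior = np.array([0.8, 0.2])   # P(y_j | x), assumed already computed

# R(a_i | x) = sum_j L(a_i | y_j) * P(y_j | x)
risks = loss @ posterior
decision = np.argmin(risks)        # Bayes decision rule: minimize expected risk

print(risks)      # [2.0 0.8] -> predicting class 1 is safer despite P = 0.2
print(decision)   # 1
```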

4. Discriminant Functions

Discriminant functions compute a score for each class, and the class with the highest score is chosen for classification. These functions form decision boundaries in the feature space that separate the different classes.

Linear Discriminant Function:
This assumes that the classes share the same covariance matrix and leads to linear decision boundaries.
g_i(x) = w^T x + w_0
Where:

• x: feature vector

• w: weight vector

• w_0: bias or threshold term

Quadratic Discriminant Function:
Used when each class has its own covariance matrix. This results in quadratic decision surfaces, allowing for more flexibility in separating classes.

Discriminant functions are fundamental to methods like Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), where they are derived based on class conditional probabilities and Bayes' rule.

Linear vs. Quadratic Decision Boundaries (Sketch Explanation):
Imagine two classes in a 2D feature space. If the decision boundary is a straight line, it's linear (LDA). If the boundary is a curve (e.g., ellipse or parabola), it's quadratic (QDA). Linear boundaries are efficient but limited in flexibility; quadratic boundaries are more adaptive to real-world data distributions.

A helpful sketch might show:

• A straight line separating two clusters (LDA)

• A curved boundary adjusting to differently shaped clusters (QDA)

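A minimal sketch of classification with linear discriminant scores; the weight vectors here are hypothetical placeholders, whereas in LDA they would be derived from class means, the shared covariance matrix, and the priors:

```python
import numpy as np

# Hypothetical per-class weights and biases (illustrative only).
W = np.array([[1.0, -0.5],    # w for class 0
              [-0.2, 0.8]])   # w for class 1
w0 = np.array([0.1, -0.3])    # bias terms

x = np.array([0.4, 1.2])      # feature vector to classify

scores = W @ x + w0           # g_i(x) = w_i^T x + w_0i for each class
print(scores)                 # [-0.1  0.58]
print(np.argmax(scores))      # class with the highest discriminant score: 1
```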

5. Utility Theory

Utility theory focuses on choosing actions that maximize expected benefit. Unlike loss-based decision frameworks that minimize cost, utility-based approaches aim to maximize rewards associated with specific outcomes. This is particularly useful when outcomes have varying degrees of desirability.

Utility Function: U(i | j) gives the utility of predicting class i when the true
class is j.

For example, in a medical context, detecting a serious disease correctly may offer very high utility, while a false alarm may only result in minor inconvenience. By defining utility scores for each outcome, we tailor the decision-making process to real-world priorities.

Expected Utility:
EU(α_i | x) = Σ_j U(α_i | y_j) * P(y_j | x)

A decision rule that maximizes expected utility can lead to more effective
outcomes in applications where some correct predictions are much more
valuable than others.

Decisions are made to maximize expected utility rather than minimize risk.
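The computation mirrors the expected-risk sketch in Section 3, except the decision is an argmax rather than an argmin (utility values are hypothetical):

```python
import numpy as np

# U[i, j]: utility of predicting class i when the true class is j
# (hypothetical values, e.g., for a disease-screening setting).
U = np.array([[ 5.0, -10.0],
              [-1.0,  20.0]])
posterior = np.array([0.8, 0.2])       # P(y_j | x)

expected_utility = U @ posterior       # EU(a_i | x) = sum_j U(a_i | y_j) P(y_j | x)
print(expected_utility)                # [2.0 3.2]
print(np.argmax(expected_utility))     # choose the action with the highest EU: 1
```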

6. Association Rules

Association rule mining uncovers relationships among variables in large datasets. These relationships are expressed as rules that highlight how the presence of certain items influences the occurrence of others. This is commonly applied in market basket analysis.

Metrics:

• Support(A → B): The proportion of transactions that contain both A and B. It indicates how frequently a rule occurs.

• Confidence(A → B): The conditional probability that a transaction containing A also contains B.

• Lift(A → B): The ratio of observed support to expected support if A and B were independent. Lift > 1 indicates a positive correlation between A and B.

Formulas:

• Support(A → B) = P(A ∩ B)

• Confidence(A → B) = P(A ∩ B) / P(A)

• Lift(A → B) = Confidence(A → B) / P(B)

These metrics are used to evaluate the strength and usefulness of rules.

Apriori Algorithm:

1. Identify frequent itemsets using a minimum support threshold.

2. Generate strong rules from these itemsets with high confidence.
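A small sketch computing the three metrics above for one rule over a toy transaction list (items invented for illustration):

```python
# Support, confidence, and lift for the rule {bread} -> {butter},
# computed over a toy transaction list (items are made up).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a = sum(1 for t in transactions if "bread" in t)
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)
count_b = sum(1 for t in transactions if "butter" in t)

support = count_ab / n                # P(A ∩ B)
confidence = count_ab / count_a       # P(A ∩ B) / P(A)
lift = confidence / (count_b / n)     # Confidence / P(B)
print(support, confidence, lift)      # 0.6 0.75 0.9375 -> slightly negative association
```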


Unit 2: Supervised Learning Algorithms & Non-Parametric Methods

1. Supervised Learning Overview

Supervised learning uses labeled datasets to teach models to map inputs to outputs. The goal is to learn a function that accurately predicts the output from unseen inputs. It forms the basis for most real-world machine learning applications.

In this approach, each training example is a pair consisting of an input vector and a desired output label. The model iteratively adjusts its internal parameters to minimize the difference between predicted and actual outputs, often using techniques such as gradient descent.

Applications include:

• Spam email detection

• Medical diagnosis

• Fraud detection

• Sentiment analysis

The typical workflow includes data preprocessing, model training, validation, testing, and tuning. Algorithms vary from simple (like linear regression) to complex (like deep neural networks). Supervised learning is foundational in applications like classification and regression.

Examples of algorithms:

• Linear Regression

• Logistic Regression

• Decision Trees

• Support Vector Machines (SVM)

• Neural Networks

The process includes training, validation, and testing phases to ensure generalization.

2. Histogram Estimator

A histogram estimator approximates the probability density function (PDF) of a continuous variable by grouping the data into discrete intervals (bins) and counting the number of observations in each bin.

Formula:
f̂(x) = (Number of points in bin) / (n × h)
Where:

 n: number of total observations

 h: bin width (size of each interval)

This method provides a simple and intuitive way to estimate distributions, but it is highly sensitive to bin width and bin edge placement. A small h may lead to overfitting (high variance), while a large h may lead to underfitting (high bias).

Advantages: Easy to compute, interpretable
Disadvantages: Discontinuous, sensitive to binning choice.

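A minimal sketch of the estimator, assuming bins aligned at multiples of h:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)   # toy 1-D sample
h = 0.5                       # bin width: the key tuning choice
n = len(data)

def histogram_density(x, data, h):
    """f̂(x) = (# points in the bin containing x) / (n * h)."""
    left = np.floor(x / h) * h                      # left edge of x's bin
    in_bin = np.sum((data >= left) & (data < left + h))
    return in_bin / (n * h)

print(histogram_density(0.0, data, h))   # estimated density near 0
```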

3. Kernel Estimator

Kernel density estimation (KDE) is a non-parametric method to estimate the probability density function (PDF) of a random variable. It smooths each data point using a kernel function.

KDE Formula:
f̂(x) = (1 / nh) Σ K((x - xᵢ) / h)
Where:

• K: Kernel function (e.g., Gaussian, Epanechnikov)

• h: bandwidth (smoothing parameter)

• n: total number of data points

The choice of bandwidth (h) greatly affects the smoothness of the curve:

• Small h → overfitting

• Large h → oversmoothing

Advantages: Smooth estimate, more accurate than histograms
Disadvantages: Computationally expensive; bandwidth selection is non-trivial.

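A minimal Gaussian-kernel sketch of the estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)   # toy 1-D sample
h = 0.3                       # bandwidth: small -> wiggly, large -> oversmoothed

def kde(x, data, h):
    """f̂(x) = (1/nh) Σ K((x - xᵢ)/h) with a Gaussian kernel K."""
    u = (x - data) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    return kernel.sum() / (len(data) * h)

print(kde(0.0, data, h))   # smooth density estimate near 0
```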

4. K-Nearest Neighbor (K-NN)

K-NN is a non-parametric, instance-based learning algorithm. It classifies a data point based on the labels of its k nearest neighbors in the training set.

Distance Metric (Euclidean):
d(x, y) = √Σ(xᵢ - yᵢ)²

Classification Rule:
Assign the label most common among the k closest samples.

Characteristics:

• No training phase (lazy learner)

• Stores all data points

• Performs well with locally clustered data

Challenges:

• Sensitive to irrelevant or redundant features

• Requires scaling of features

• Suffers in high-dimensional spaces (curse of dimensionality)

K-NN is simple and effective but scales poorly with large datasets and high
dimensionality.
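A minimal sketch of the classifier on toy 2-D data:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two clusters with labels 0 and 1.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))  # -> 0
```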

Unit 3: Dimensionality Reduction

1. Introduction

Dimensionality reduction simplifies datasets by reducing the number of input variables. It helps mitigate the curse of dimensionality, improves computational efficiency, and enhances visualization and interpretability. By focusing on the most relevant features, it can improve the performance of learning algorithms and reduce overfitting.

There are two broad types of dimensionality reduction:

• Feature Selection: Selecting a subset of original features.

• Feature Extraction: Creating new features from transformations of original ones (e.g., PCA).

It is especially important in high-dimensional spaces, such as image data, gene expression datasets, or natural language processing.

2. Subset Selection

This technique selects a subset of the original variables based on criteria such as correlation with the output or information gain. Methods include forward selection, backward elimination, and recursive feature elimination.
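A minimal forward-selection sketch using scikit-learn utilities; the scoring model (logistic regression with cross-validated accuracy) is an illustrative choice, not the only option:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)   # feature with the best CV score
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)          # only features 0 and 2 matter
print(forward_selection(X, y, max_features=2))   # likely picks 0 and 2
```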

3. Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique that identifies the directions (principal components) in which the data varies the most. It transforms correlated variables into a smaller number of uncorrelated variables while preserving as much variability as possible.

Steps:

1. Standardize the data (mean = 0, variance = 1)

2. Compute the covariance matrix

3. Calculate eigenvalues and eigenvectors of the covariance matrix

4. Sort eigenvectors by eigenvalue magnitude

5. Project original data onto top k eigenvectors

PCA Formula:
Z = XW
Where:

• X: mean-centered data matrix

• W: matrix of selected eigenvectors (principal components)

• Z: projected data

Advantages:

• Reduces redundancy in features

• Speeds up learning algorithms

• Helps in data visualization

Disadvantages:

• Components may be hard to interpret

• Only captures linear relationships

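A minimal sketch of these steps using an eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (Z = XW)."""
    Xc = X - X.mean(axis=0)                   # 1. center the data
    cov = np.cov(Xc, rowvar=False)            # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigendecomposition (symmetric)
    order = np.argsort(eigvals)[::-1]         # 4. sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 #    keep the top-k eigenvectors
    return Xc @ W                             # 5. Z = XW

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy data: 100 samples, 5 features
print(pca(X, 2).shape)          # (100, 2)
```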

4. Factor Analysis

Factor analysis is a statistical method used to explain variability among observed variables in terms of fewer unobserved latent variables called factors. It assumes that each observed variable is influenced by common factors and unique error components.

Model:
X = LF + ε
Where:

• X: observed variables (vector)

• L: loading matrix (weights mapping factors to variables)

• F: latent common factors (vector)

• ε: specific errors or noise

Applications:

• Psychometrics (e.g., intelligence testing)

• Social sciences (e.g., personality factors)

Advantages:

• Reveals hidden structures

• Reduces complexity in correlated data

Disadvantages:

• Assumes linear relationships

• Sensitive to data scaling and initial assumptions

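A small sketch using scikit-learn's FactorAnalysis on synthetic data generated from the X = LF + ε model (dimensions and noise level are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Toy data: 2 latent factors generating 6 observed variables plus noise.
F = rng.normal(size=(200, 2))                  # latent factors
L = rng.normal(size=(6, 2))                    # loading matrix
X = F @ L.T + 0.1 * rng.normal(size=(200, 6))  # observed variables X = LF + ε

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)        # estimated factor scores (F)
print(fa.components_.shape)         # (2, 6): estimated loadings (Lᵀ)
```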

5. Singular Value Decomposition (SVD)

SVD is a powerful matrix decomposition technique used in signal processing, statistics, and machine learning. It decomposes any m×n matrix A into three matrices:

Formula:
A = UΣVᵀ
Where:

• A: original data matrix

• U: matrix of left singular vectors

• Σ: diagonal matrix of singular values

• Vᵀ: transpose of matrix of right singular vectors

Applications:

• Image compression

• Noise reduction

• Latent Semantic Analysis (LSA) in NLP

Advantages:

• Works on any matrix (square or rectangular)

• Captures important patterns in data

Disadvantages:

• Computationally expensive

• Singular vectors can be hard to interpret


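A minimal sketch using NumPy's SVD, including a rank-1 approximation built from the largest singular value:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])   # any m×n matrix works

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Σ Vᵀ

# Rank-1 approximation: keep only the largest singular value.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: exact reconstruction
print(A1)                                    # best rank-1 approximation of A
```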

6. Matrix Factorization

Matrix factorization approximates a matrix as the product of two smaller matrices. It is widely used in recommender systems (e.g., collaborative filtering).

Model:
R ≈ P × Qᵀ
Where:

• R: rating matrix

• P: user features

• Q: item features
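A minimal stochastic-gradient sketch of this factorization on a toy rating matrix (learning rate, regularization, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Toy rating matrix R (0 = missing rating).
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(0)
P = 0.1 * rng.normal(size=(n_users, k))    # user factor matrix
Q = 0.1 * rng.normal(size=(n_items, k))    # item factor matrix

lr, reg = 0.01, 0.02
for _ in range(2000):
    for u, i in zip(*R.nonzero()):         # iterate over observed ratings only
        err = R[u, i] - P[u] @ Q[i]        # prediction error on this entry
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))   # R ≈ P Qᵀ on the observed entries
```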

7. Multidimensional Scaling (MDS)

MDS is a set of techniques that visualize the level of similarity or dissimilarity of data in lower dimensions. It attempts to place each object in N-dimensional space such that the between-object distances are preserved as well as possible.

Goal:
Given a distance matrix D, find a configuration of points X such that their
pairwise Euclidean distances approximate D.
Types:

• Metric MDS: Preserves actual distances

• Non-metric MDS: Preserves rank order of distances

Applications:

• Psychology (perceptual maps)

• Marketing (product similarity)

• Bioinformatics (gene expression)

Limitations:

• Sensitive to noise

• Less effective with non-Euclidean or sparse distance matrices

Objective:
Minimize the difference between original and projected distances.
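A small metric-MDS sketch using scikit-learn, recovering a 2-D configuration from a precomputed distance matrix (data are synthetic):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))    # toy high-dimensional points
D = squareform(pdist(X))        # pairwise Euclidean distance matrix

# Metric MDS: find 2-D points whose distances approximate D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X2 = mds.fit_transform(D)
print(X2.shape)   # (20, 2): low-dimensional configuration
```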

8. Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction method that finds linear combinations of features that best separate two or more classes. Unlike PCA, which focuses on variance, LDA maximizes the separability between classes.

Objective Function:
Maximize J(w) = (wᵀ S_B w) / (wᵀ S_W w)
Where:

• S_B: Between-class scatter matrix

• S_W: Within-class scatter matrix

• w: projection vector

Steps:

1. Compute class means and overall mean

2. Compute S_W and S_B

3. Solve the generalized eigenvalue problem

4. Select top eigenvectors as projection axes


Applications:

• Face recognition

• Medical diagnosis

• Text classification

Advantages:

• Works well when class distributions are Gaussian

• Improves class separability

Disadvantages:

• Assumes equal class covariances

• Sensitive to outliers


LDA is supervised and complements PCA by optimizing class separability instead of variance.
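For two classes, maximizing J(w) has the closed form w ∝ S_W⁻¹(μ₁ − μ₀); a minimal sketch:

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher LDA: w ∝ S_W⁻¹ (μ₁ - μ₀) maximizes J(w)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter S_W: sum of the two class scatter matrices.
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_W, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_lda_direction(X, y))   # projection axis separating the classes
```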

Unit 4: Unsupervised Learning

1. Introduction

Unsupervised learning algorithms uncover hidden patterns or intrinsic structures in data without labeled responses. These methods explore the data's internal structure and groupings without relying on known outcomes or supervisory signals.

It is especially useful when labeled data is unavailable or too expensive to obtain. The two main goals are:

• Clustering: Grouping similar data points into clusters.

• Dimensionality Reduction: Reducing the number of features while preserving structure.

Common applications include market segmentation, anomaly detection, social network analysis, and gene expression profiling. Clustering is a primary form of unsupervised learning, often used in exploratory data analysis.

2. Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters using agglomerative (bottom-up) or divisive (top-down) approaches. It doesn’t require specifying the number of clusters in advance and provides flexibility in deciding the level of clustering by cutting the dendrogram at a desired height.

Agglomerative Clustering Steps:

1. Start with each data point as its own cluster.

2. Compute pairwise distances.

3. Merge the closest pair of clusters based on a linkage method.

4. Repeat until one cluster remains or desired number is reached.

Linkage Criteria:

• Single Linkage: Minimum distance between any two points in clusters

• Complete Linkage: Maximum distance between any two points in clusters

• Average Linkage: Mean distance between all points across two clusters

Distance Function (Euclidean):
d(x, y) = √Σ(xᵢ - yᵢ)²

Advantages:

• Easy to understand and visualize

• No need to pre-specify the number of clusters

Disadvantages:

• Sensitive to outliers

• Time complexity is higher than partitional methods

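A small sketch of agglomerative clustering with SciPy, using average linkage and cutting the dendrogram into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),      # two toy clusters
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="average")                 # agglomerative, average linkage
labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 clusters
print(labels)                                    # cluster label for each point
```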

3. Partitional Clustering

Partitional clustering methods divide data into a fixed number of k non-overlapping clusters, optimizing a criterion such as intra-cluster distance. It assumes each data point belongs to one cluster.

This approach is computationally more scalable than hierarchical clustering and is widely used in large datasets.

Forgy's Algorithm

Forgy's method initializes cluster centroids by selecting k random data points. Points are then assigned to the nearest centroid, and centroids are recalculated. This process iterates until convergence.

K-Means Algorithm

K-Means is one of the most commonly used partitional clustering algorithms.

Steps:

1. Choose the number of clusters k

2. Randomly initialize k centroids

3. Assign each data point to the nearest centroid

4. Update centroids by computing the mean of assigned points

5. Repeat steps 3–4 until centroids stabilize


Objective Function:
J = Σ_k Σ_{xᵢ ∈ C_k} ||xᵢ - μ_k||²
Where:

• xᵢ is a data point in cluster C_k

• μ_k is the centroid of cluster C_k

Advantages:

• Simple and fast

• Works well when clusters are spherical and equally sized

Disadvantages:

• Requires specifying k in advance

• Sensitive to initial centroid placement

• Not suitable for non-spherical clusters or data with varying densities

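A minimal NumPy sketch of these steps, using Forgy initialization on toy data (no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means with Forgy initialization (k random data points)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Forgy init
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as the mean of assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # one centroid near (0, 0), one near (4, 4)
```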


Unit 5: Multilayer Perceptron and Backpropagation

1. The Perceptron

A perceptron is a basic neural network unit that models a single artificial neuron. It computes a weighted sum of the inputs and passes it through an activation function (usually a step function) to produce a binary output.

Mathematical Representation:
y = f(w·x + b)
Where:

• x = input vector

• w = weight vector

• b = bias term

• f = activation function (e.g., step or sign function)

Functionality:

• If w·x + b ≥ 0 → output is 1

• Otherwise → output is 0

Perceptrons are suitable for linearly separable problems like AND and OR, but fail on non-linearly separable cases like XOR.

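A minimal sketch of a step-activation perceptron; the weights shown are hypothetical values that happen to implement AND:

```python
import numpy as np

def perceptron_output(x, w, b):
    """y = f(w·x + b) with a step activation: 1 if w·x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Hypothetical weights implementing AND: fires only when both inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x), w, b))   # outputs 0, 0, 0, 1
```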

2. Training a Perceptron

The perceptron learns through a supervised learning rule that updates weights based on the classification error. The objective is to find weights that minimize the number of misclassifications.
Perceptron Learning Rule:
w_i ← w_i + η (y - ŷ) x_i
Where:

• w_i: weight for input x_i

• η: learning rate (0 < η < 1)

• y: actual class label

• ŷ: predicted label

Procedure:

1. Initialize weights and bias

2. For each training sample:

o Compute output

o Compare with true label

o Update weights if incorrect

3. Repeat until convergence or maximum iterations

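A minimal sketch of this procedure learning the AND function (learning rate and epoch count are arbitrary):

```python
import numpy as np

# Training data for the AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = 1 if w @ xi + b >= 0 else 0       # current prediction
        w += eta * (yi - y_hat) * xi              # w_i += η (y - ŷ) x_i
        b += eta * (yi - y_hat)                   # same rule for the bias

print(w, b)
print([1 if w @ xi + b >= 0 else 0 for xi in X])  # [0, 0, 0, 1]
```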

3. Learning Boolean Functions

Perceptrons are capable of learning simple Boolean functions that are linearly separable.

Examples:

• AND: Output is 1 only when both inputs are 1

• OR: Output is 1 if at least one input is 1

• NOT: A unary function returning the opposite value

However, they cannot learn XOR because the data is not linearly separable. This limitation led to the development of multilayer perceptrons, which can model non-linear decision boundaries.

4. Multilayer Perceptrons (MLP)

MLPs consist of multiple layers of neurons: input layer, one or more hidden
layers, and an output layer. Each neuron in one layer is connected to all
neurons in the next layer.

Architecture:

• Input Layer: Receives input features

• Hidden Layer(s): Learns intermediate representations using non-linear activation functions (ReLU, sigmoid, tanh)

• Output Layer: Produces predictions (e.g., softmax for classification)

MLPs can learn complex patterns and approximate any continuous function (universal approximation theorem).

Advantages:

• Handles non-linearly separable data

• Highly expressive with enough hidden units

It overcomes the limitations of single-layer perceptrons by introducing hidden layers that enable hierarchical feature learning.
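A minimal forward-pass sketch of a one-hidden-layer MLP with ReLU activations (layer sizes and random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: input -> ReLU hidden layer -> linear output."""
    h = relu(W1 @ x + b1)     # hidden representation
    return W2 @ h + b2        # output scores (could be passed to softmax)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 outputs
print(mlp_forward(rng.normal(size=3), W1, b1, W2, b2))
```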

5. Backpropagation Algorithm

Backpropagation is a supervised learning algorithm used for training MLPs. It computes gradients of the loss function with respect to each weight by the chain rule and updates the weights to minimize the loss.

Steps:

1. Forward Pass: Compute outputs layer-by-layer

2. Compute Error: Compare predicted outputs with the targets using a loss function

3. Backward Pass: Compute gradients using derivatives of activation functions

4. Weight Update: Apply gradient descent

Gradient Descent Formula:
w ← w - η ∂E/∂w
Where:

• η: learning rate

• E: error function

• ∂E/∂w: partial derivative of error w.r.t. weight

Loss Functions:

• Mean Squared Error (MSE)

• Cross-Entropy (for classification)

Example:
Suppose we have a simple neural network with one input neuron, one output
neuron, and no hidden layer. Let the input x = 1.0, the target output t = 0.0,
initial weight w = 0.5, and learning rate η = 0.1. Let the activation function
be identity (f(x) = x).

1. Forward pass:

o Output y = x * w = 1.0 * 0.5 = 0.5

2. Compute error:

o E = (1/2)(t - y)^2 = (1/2)(0.0 - 0.5)^2 = 0.125

3. Backward pass (gradient):

o dE/dw = (y - t) * x = (0.5 - 0.0) * 1.0 = 0.5

4. Update weight:

o w_new = w - η * dE/dw = 0.5 - 0.1 * 0.5 = 0.45

This demonstrates one iteration of backpropagation updating the weight to reduce the error.

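The same worked example, executed step by step in Python:

```python
# The worked example above: one gradient step on a single linear neuron
# with identity activation f(x) = x.
x, t = 1.0, 0.0        # input and target
w, eta = 0.5, 0.1      # initial weight and learning rate

y = x * w                       # forward pass: y = 0.5
E = 0.5 * (t - y) ** 2          # squared error: 0.125
dE_dw = (y - t) * x             # gradient by the chain rule: 0.5
w = w - eta * dE_dw             # gradient descent update: 0.45
print(y, E, dE_dw, w)           # 0.5 0.125 0.5 0.45
```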

6. Training Procedures

Training procedures vary based on how frequently the weights are updated.

Methods:

• Stochastic Gradient Descent (SGD): Updates weights after every training example. Fast convergence but noisier updates.

• Batch Gradient Descent: Updates after processing the entire dataset. Stable but slow.

• Mini-batch Gradient Descent: Updates weights after processing batches of samples. Balances speed and stability.

Key Hyperparameters:

• Learning Rate (η)

• Batch Size

• Number of Epochs

• Momentum, Weight Decay

Proper tuning is crucial for optimal performance and avoiding issues like vanishing gradients or overfitting.

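A minimal mini-batch SGD sketch on a linear-regression objective (model, learning rate, and batch size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noisy targets

w, eta, batch_size = np.zeros(3), 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]        # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # MSE gradient
        w -= eta * grad                             # gradient step per batch

print(np.round(w, 2))   # ≈ [ 1.  -2.   0.5]
```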

7. Tuning Network Size

Choosing the number of hidden layers and neurons is a critical design choice
in neural networks.

Too Few Parameters:

• Underfitting: The model is too simple to capture underlying patterns

Too Many Parameters:

• Overfitting: The model memorizes training data and performs poorly on unseen data

Tuning Techniques:

• Cross-validation: Assess performance across data splits

• Early Stopping: Stop training when validation error increases

• Regularization (L1/L2): Penalize large weights

• Dropout: Randomly deactivate neurons during training to promote generalization

Proper tuning ensures model robustness and better generalization to test data.

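A sketch of validation-based early stopping; train_step, validation_error, and model.copy() are hypothetical stand-ins for a real training loop and metric:

```python
def train_with_early_stopping(model, train_step, validation_error,
                              max_epochs=100, patience=5):
    """Stop when validation error fails to improve for `patience` epochs.

    train_step, validation_error, and model.copy() are hypothetical
    placeholders for an actual training framework.
    """
    best_err, best_state, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                   # one epoch of weight updates
        err = validation_error(model)
        if err < best_err:                  # validation improved: keep going
            best_err, best_state, waited = err, model.copy(), 0
        else:                               # no improvement this epoch
            waited += 1
            if waited >= patience:          # give up after `patience` bad epochs
                break
    return best_state                       # weights from the best epoch
```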
