
Machine Learning – Detailed Notes

Unit 1: Machine Learning Fundamentals

• Bayesian Decision Theory

• Classification

• Losses and Risks

• Discriminant Functions

• Utility Theory

• Association Rules

1. Bayesian Decision Theory

Bayesian Decision Theory provides a structured way to make decisions under uncertainty using probabilities. It combines prior knowledge (priors) with evidence (likelihood) to compute a posterior probability and make decisions that minimize expected loss.

Bayes’ Theorem:
P(C_k | x) = [P(x | C_k) * P(C_k)] / P(x)
Where:

• P(C_k | x): Posterior probability of class C_k given input x

• P(x | C_k): Likelihood of x under class C_k

• P(C_k): Prior probability of class C_k

• P(x): Marginal probability of x

A classifier chooses the class with the highest posterior probability for a
given input.
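As a minimal illustration, the following Python sketch applies this rule to one input, assuming the priors and likelihoods are already known (all numbers are made up for demonstration):

```python
import numpy as np

# Illustrative two-class problem: priors P(C_k) and likelihoods P(x | C_k)
# for one observed input x (hypothetical values).
priors = np.array([0.7, 0.3])        # P(C_0), P(C_1)
likelihoods = np.array([0.2, 0.6])   # P(x | C_0), P(x | C_1)

evidence = np.sum(likelihoods * priors)        # P(x), the marginal
posteriors = likelihoods * priors / evidence   # P(C_k | x) by Bayes' theorem

predicted_class = np.argmax(posteriors)        # choose the highest posterior
print(posteriors, predicted_class)             # [0.4375 0.5625] 1
```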

2. Classification

Classification is the process of assigning a class label to input data based on learned decision boundaries. It includes binary and multiclass problems.

Types of classification methods:

• Parametric: These methods assume that the data follows a certain distribution (like Gaussian) and rely on a fixed number of parameters to define the model. Once trained, the model uses these parameters to make predictions. Examples include Logistic Regression and Linear Discriminant Analysis. They are efficient and require less training data, but may not perform well if the assumed distribution is incorrect.

• Non-parametric: These methods make no strong assumptions about the underlying data distribution. Instead, they adapt their structure based on the training data. They are more flexible and can model complex patterns, but typically require more data and computation. Examples include K-Nearest Neighbors (K-NN) and Decision Trees.

Classification is used in spam detection, medical diagnosis, image recognition, etc.

3. Losses and Risks

Loss represents the cost of making incorrect predictions, while risk is the
expected value of that loss over possible outcomes. These concepts help
guide decision-making in probabilistic classification systems.

Loss Function:
L(i | j): This function defines the penalty incurred when the actual class is j,
but the system predicts class i. For example, misclassifying a spam email as
non-spam may incur a different cost than the reverse.

Expected Risk:
R(α_i | x) = Σ_j L(α_i | y_j) * P(y_j | x)
This calculates the average loss for choosing action α_i (e.g., predicting class
i) given the input x. It accounts for all possible true classes y_j and their
posterior probabilities.

Bayes Risk:
R*(x) = min_i R(α_i | x)
This is the minimum possible risk achievable for input x. The classification
decision rule that always chooses the class with the lowest expected risk is
called the Bayes decision rule.

Real-world example: In medical diagnosis, predicting 'no disease' when the patient actually has a disease (false negative) could be far more harmful than the opposite error (false positive). Hence, the loss assigned to a false negative is much higher. Similarly, in fraud detection, failing to catch a fraudulent transaction (false negative) may cost the company much more than mistakenly flagging a valid transaction (false positive). These considerations directly shape the loss matrix and influence classification decisions.

In practice, we may define different loss matrices based on how critical each
type of misclassification is. This allows tailoring classifiers to specific
application needs—e.g., false negatives in medical diagnosis may carry much
higher cost than false positives.
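The following sketch applies the Bayes decision rule under an asymmetric loss matrix in the spirit of the medical example above (the loss values and posterior are hypothetical):

```python
import numpy as np

# L[i, j]: loss of predicting class i when the true class is j.
# Here a false negative (predict 0 = 'no disease' when truth is 1) costs 10,
# while a false positive costs only 1 (hypothetical values for illustration).
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])

posterior = np.array([0.8, 0.2])   # P(y_j | x), assumed already computed

# R(a_i | x) = sum_j L(a_i | y_j) * P(y_j | x)
risks = loss @ posterior
decision = np.argmin(risks)        # Bayes decision rule: minimize expected risk

print(risks)      # [2.0 0.8] -> predicting class 1 is safer despite P = 0.2
print(decision)   # 1
```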

4. Discriminant Functions

Discriminant functions compute a score for each class, and the class with the highest score is chosen for classification. These functions form decision boundaries in the feature space that separate the different classes.

Linear Discriminant Function:
This assumes that the classes share the same covariance matrix and leads to linear decision boundaries.
g_i(x) = w^T x + w_0
Where:

• x: feature vector

• w: weight vector

• w_0: bias or threshold term

Quadratic Discriminant Function:
Used when each class has its own covariance matrix. This results in quadratic decision surfaces, allowing for more flexibility in separating classes.

Discriminant functions are fundamental to methods like Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), where they are derived based on class conditional probabilities and Bayes' rule.

Linear vs. Quadratic Decision Boundaries (Sketch Explanation):
Imagine two classes in a 2D feature space. If the decision boundary is a straight line, it's linear (LDA). If the boundary is a curve (e.g., ellipse or parabola), it's quadratic (QDA). Linear boundaries are efficient but limited in flexibility; quadratic boundaries are more adaptive to real-world data distributions.

A helpful sketch might show:

• A straight line separating two clusters (LDA)

• A curved boundary adjusting to differently shaped clusters (QDA)

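A minimal sketch of classification with linear discriminant scores; the weight vectors here are hypothetical placeholders, whereas in LDA they would be derived from class means, the shared covariance matrix, and the priors:

```python
import numpy as np

# Hypothetical per-class weights and biases (illustrative only).
W = np.array([[1.0, -0.5],    # w for class 0
              [-0.2, 0.8]])   # w for class 1
w0 = np.array([0.1, -0.3])    # bias terms

x = np.array([0.4, 1.2])      # feature vector to classify

scores = W @ x + w0           # g_i(x) = w_i^T x + w_0i for each class
print(scores)                 # [-0.1  0.58]
print(np.argmax(scores))      # class with the highest discriminant score: 1
```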

5. Utility Theory

Utility theory focuses on choosing actions that maximize expected benefit. Unlike loss-based decision frameworks that minimize cost, utility-based approaches aim to maximize rewards associated with specific outcomes. This is particularly useful when outcomes have varying degrees of desirability.

Utility Function: U(i | j) gives the utility of predicting class i when the true
class is j.

For example, in a medical context, detecting a serious disease correctly may offer very high utility, while a false alarm may only result in minor inconvenience. By defining utility scores for each outcome, we tailor the decision-making process to real-world priorities.

Expected Utility:
EU(α_i | x) = Σ_j U(α_i | y_j) * P(y_j | x)

A decision rule that maximizes expected utility can lead to more effective
outcomes in applications where some correct predictions are much more
valuable than others.

Decisions are made to maximize expected utility rather than minimize risk.
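The computation mirrors the expected-risk sketch in Section 3, except the decision is an argmax rather than an argmin (utility values are hypothetical):

```python
import numpy as np

# U[i, j]: utility of predicting class i when the true class is j
# (hypothetical values, e.g., for a disease-screening setting).
U = np.array([[ 5.0, -10.0],
              [-1.0,  20.0]])
posterior = np.array([0.8, 0.2])       # P(y_j | x)

expected_utility = U @ posterior       # EU(a_i | x) = sum_j U(a_i | y_j) P(y_j | x)
print(expected_utility)                # [2.0 3.2]
print(np.argmax(expected_utility))     # choose the action with the highest EU: 1
```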

6. Association Rules

Association rule mining uncovers relationships among variables in large datasets. These relationships are expressed as rules that highlight how the presence of certain items influences the occurrence of others. This is commonly applied in market basket analysis.

Metrics:

• Support(A → B): The proportion of transactions that contain both A and B. It indicates how frequently a rule occurs.

• Confidence(A → B): The conditional probability that a transaction containing A also contains B.

• Lift(A → B): The ratio of observed support to expected support if A and B were independent. Lift > 1 indicates a positive correlation between A and B.

Formulas:

• Support(A → B) = P(A ∩ B)

• Confidence(A → B) = P(A ∩ B) / P(A)

• Lift(A → B) = Confidence(A → B) / P(B)

These metrics are used to evaluate the strength and usefulness of rules.

Apriori Algorithm:

1. Identify frequent itemsets using a minimum support threshold.

2. Generate strong rules from these itemsets with high confidence.
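A small sketch computing the three metrics above for one rule over a toy transaction list (items invented for illustration):

```python
# Support, confidence, and lift for the rule {bread} -> {butter},
# computed over a toy transaction list (items are made up).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a = sum(1 for t in transactions if "bread" in t)
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)
count_b = sum(1 for t in transactions if "butter" in t)

support = count_ab / n                # P(A ∩ B)
confidence = count_ab / count_a       # P(A ∩ B) / P(A)
lift = confidence / (count_b / n)     # Confidence / P(B)
print(support, confidence, lift)      # 0.6 0.75 0.9375 -> slightly negative association
```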


Unit 2: Supervised Learning Algorithms & Non-Parametric Methods

1. Supervised Learning Overview

Supervised learning uses labeled datasets to teach models to map inputs to outputs. The goal is to learn a function that accurately predicts the output from unseen inputs. It forms the basis for most real-world machine learning applications.

In this approach, each training example is a pair consisting of an input vector and a desired output label. The model iteratively adjusts its internal parameters to minimize the difference between predicted and actual outputs, often using techniques such as gradient descent.

Applications include:

• Spam email detection

• Medical diagnosis

• Fraud detection

• Sentiment analysis

The typical workflow includes data preprocessing, model training, validation, testing, and tuning. Algorithms vary from simple (like linear regression) to complex (like deep neural networks). Supervised learning is foundational in applications like classification and regression.

Examples of algorithms:

• Linear Regression

• Logistic Regression

• Decision Trees

• Support Vector Machines (SVM)

• Neural Networks

The process includes training, validation, and testing phases to ensure generalization.

2. Histogram Estimator

A histogram estimator approximates the probability density function (PDF) of a continuous variable by grouping the data into discrete intervals (bins) and counting the number of observations in each bin.

Formula:
f̂(x) = (Number of points in bin) / (n × h)
Where:

 n: number of total observations

 h: bin width (size of each interval)

This method provides a simple and intuitive way to estimate distributions, but it is highly sensitive to bin width and bin edge placement. A small h may lead to overfitting (high variance), while a large h may lead to underfitting (high bias).

Advantages: Easy to compute, interpretable
Disadvantages: Discontinuous, sensitive to binning choice.

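A minimal sketch of the estimator, assuming bins aligned at multiples of h:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)   # toy 1-D sample
h = 0.5                       # bin width: the key tuning choice
n = len(data)

def histogram_density(x, data, h):
    """f̂(x) = (# points in the bin containing x) / (n * h)."""
    left = np.floor(x / h) * h                      # left edge of x's bin
    in_bin = np.sum((data >= left) & (data < left + h))
    return in_bin / (n * h)

print(histogram_density(0.0, data, h))   # estimated density near 0
```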

3. Kernel Estimator

Kernel density estimation (KDE) is a non-parametric method to estimate the probability density function (PDF) of a random variable. It smooths each data point using a kernel function.

KDE Formula:
f̂(x) = (1 / nh) Σ K((x - xᵢ) / h)
Where:

• K: Kernel function (e.g., Gaussian, Epanechnikov)

• h: bandwidth (smoothing parameter)

• n: total number of data points

The choice of bandwidth (h) greatly affects the smoothness of the curve:

• Small h → overfitting

• Large h → oversmoothing

Advantages: Smooth estimate, more accurate than histograms
Disadvantages: Computationally expensive; bandwidth selection is non-trivial.

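A minimal Gaussian-kernel sketch of the estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)   # toy 1-D sample
h = 0.3                       # bandwidth: small -> wiggly, large -> oversmoothed

def kde(x, data, h):
    """f̂(x) = (1/nh) Σ K((x - xᵢ)/h) with a Gaussian kernel K."""
    u = (x - data) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    return kernel.sum() / (len(data) * h)

print(kde(0.0, data, h))   # smooth density estimate near 0
```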

4. K-Nearest Neighbor (K-NN)

K-NN is a non-parametric, instance-based learning algorithm. It classifies a data point based on the labels of its k nearest neighbors in the training set.

Distance Metric (Euclidean):
d(x, y) = √Σ(xᵢ - yᵢ)²

Classification Rule:
Assign the label most common among the k closest samples.

Characteristics:

• No training phase (lazy learner)

• Stores all data points

• Performs well with locally clustered data

Challenges:

• Sensitive to irrelevant or redundant features

• Requires scaling of features

• Suffers in high-dimensional spaces (curse of dimensionality)

K-NN is simple and effective but scales poorly with large datasets and high
dimensionality.
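A minimal sketch of the classifier on toy 2-D data:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two clusters with labels 0 and 1.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))  # -> 0
```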

Unit 3: Dimensionality Reduction

1. Introduction

Dimensionality reduction simplifies datasets by reducing the number of input variables. It helps mitigate the curse of dimensionality, improves computational efficiency, and enhances visualization and interpretability. By focusing on the most relevant features, it can improve the performance of learning algorithms and reduce overfitting.

There are two broad types of dimensionality reduction:

• Feature Selection: Selecting a subset of original features.

• Feature Extraction: Creating new features from transformations of original ones (e.g., PCA).

It is especially important in high-dimensional spaces, such as image data, gene expression datasets, or natural language processing.

2. Subset Selection

This technique selects a subset of the original variables based on criteria such as correlation with the output or information gain. Methods include forward selection, backward elimination, and recursive feature elimination.
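A minimal forward-selection sketch using scikit-learn utilities; the scoring model (logistic regression with cross-validated accuracy) is an illustrative choice, not the only option:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)   # feature with the best CV score
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)          # only features 0 and 2 matter
print(forward_selection(X, y, max_features=2))   # likely picks 0 and 2
```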

3. Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique that identifies the directions (principal components) in which the data varies the most. It transforms correlated variables into a smaller number of uncorrelated variables while preserving as much variability as possible.

Steps:

1. Standardize the data (mean = 0, variance = 1)

2. Compute the covariance matrix

3. Calculate eigenvalues and eigenvectors of the covariance matrix

4. Sort eigenvectors by eigenvalue magnitude

5. Project original data onto top k eigenvectors

PCA Formula:
Z = XW
Where:

• X: mean-centered data matrix

• W: matrix of selected eigenvectors (principal components)

• Z: projected data

Advantages:

• Reduces redundancy in features

• Speeds up learning algorithms

• Helps in data visualization

Disadvantages:

• Components may be hard to interpret

• Only captures linear relationships

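A minimal sketch of these steps using an eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (Z = XW)."""
    Xc = X - X.mean(axis=0)                   # 1. center the data
    cov = np.cov(Xc, rowvar=False)            # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigendecomposition (symmetric)
    order = np.argsort(eigvals)[::-1]         # 4. sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 #    keep the top-k eigenvectors
    return Xc @ W                             # 5. Z = XW

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy data: 100 samples, 5 features
print(pca(X, 2).shape)          # (100, 2)
```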

4. Factor Analysis

Factor analysis is a statistical method used to explain variability among observed variables in terms of fewer unobserved latent variables called factors. It assumes that each observed variable is influenced by common factors and unique error components.

Model:
X = LF + ε
Where:

• X: observed variables (vector)

• L: loading matrix (weights mapping factors to variables)

• F: latent common factors (vector)

• ε: specific errors or noise

Applications:

• Psychometrics (e.g., intelligence testing)

• Social sciences (e.g., personality factors)

Advantages:

• Reveals hidden structures

• Reduces complexity in correlated data

Disadvantages:

• Assumes linear relationships

• Sensitive to data scaling and initial assumptions

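A small sketch using scikit-learn's FactorAnalysis on synthetic data generated from the X = LF + ε model (dimensions and noise level are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Toy data: 2 latent factors generating 6 observed variables plus noise.
F = rng.normal(size=(200, 2))                  # latent factors
L = rng.normal(size=(6, 2))                    # loading matrix
X = F @ L.T + 0.1 * rng.normal(size=(200, 6))  # observed variables X = LF + ε

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)        # estimated factor scores (F)
print(fa.components_.shape)         # (2, 6): estimated loadings (Lᵀ)
```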

5. Singular Value Decomposition (SVD)

SVD is a powerful matrix decomposition technique used in signal processing, statistics, and machine learning. It decomposes any m×n matrix A into three matrices:

Formula:
A = UΣVᵀ
Where:

• A: original data matrix

• U: matrix of left singular vectors

• Σ: diagonal matrix of singular values

• Vᵀ: transpose of matrix of right singular vectors

Applications:

• Image compression

• Noise reduction

• Latent Semantic Analysis (LSA) in NLP

Advantages:

• Works on any matrix (square or rectangular)

• Captures important patterns in data

Disadvantages:

• Computationally expensive

• Singular vectors can be hard to interpret


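A minimal sketch using NumPy's SVD, including a rank-1 approximation built from the largest singular value:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])   # any m×n matrix works

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Σ Vᵀ

# Rank-1 approximation: keep only the largest singular value.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: exact reconstruction
print(A1)                                    # best rank-1 approximation of A
```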

6. Matrix Factorization

Matrix factorization approximates a matrix as the product of two smaller matrices. It is widely used in recommender systems (e.g., collaborative filtering).

Model:
R ≈ P × Qᵀ
Where:

• R: rating matrix

• P: user features

• Q: item features
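A minimal stochastic-gradient sketch of this factorization on a toy rating matrix (learning rate, regularization, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Toy rating matrix R (0 = missing rating).
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(0)
P = 0.1 * rng.normal(size=(n_users, k))    # user factor matrix
Q = 0.1 * rng.normal(size=(n_items, k))    # item factor matrix

lr, reg = 0.01, 0.02
for _ in range(2000):
    for u, i in zip(*R.nonzero()):         # iterate over observed ratings only
        err = R[u, i] - P[u] @ Q[i]        # prediction error on this entry
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))   # R ≈ P Qᵀ on the observed entries
```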

7. Multidimensional Scaling (MDS)

MDS is a set of techniques that visualize the level of similarity or dissimilarity of data in lower dimensions. It attempts to place each object in N-dimensional space such that the between-object distances are preserved as well as possible.

Goal:
Given a distance matrix D, find a configuration of points X such that their
pairwise Euclidean distances approximate D.
Types:

• Metric MDS: Preserves actual distances

• Non-metric MDS: Preserves rank order of distances

Applications:

• Psychology (perceptual maps)

• Marketing (product similarity)

• Bioinformatics (gene expression)

Limitations:

• Sensitive to noise

• Less effective with non-Euclidean or sparse distance matrices

Objective:
Minimize the difference between original and projected distances.
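A small metric-MDS sketch using scikit-learn, recovering a 2-D configuration from a precomputed distance matrix (data are synthetic):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))    # toy high-dimensional points
D = squareform(pdist(X))        # pairwise Euclidean distance matrix

# Metric MDS: find 2-D points whose distances approximate D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X2 = mds.fit_transform(D)
print(X2.shape)   # (20, 2): low-dimensional configuration
```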

8. Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction method that finds linear combinations of features that best separate two or more classes. Unlike PCA, which focuses on variance, LDA maximizes the separability between classes.

Objective Function:
Maximize J(w) = (wᵀ S_B w) / (wᵀ S_W w)
Where:

• S_B: Between-class scatter matrix

• S_W: Within-class scatter matrix

• w: projection vector

Steps:

1. Compute class means and overall mean

2. Compute S_W and S_B

3. Solve the generalized eigenvalue problem

4. Select top eigenvectors as projection axes


Applications:

• Face recognition

• Medical diagnosis

• Text classification

Advantages:

• Works well when class distributions are Gaussian

• Improves class separability

Disadvantages:

• Assumes equal class covariances

• Sensitive to outliers


LDA is supervised and complements PCA by optimizing class separability instead of variance.
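For two classes, maximizing J(w) has the closed form w ∝ S_W⁻¹(μ₁ − μ₀); a minimal sketch:

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher LDA: w ∝ S_W⁻¹ (μ₁ - μ₀) maximizes J(w)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter S_W: sum of the two class scatter matrices.
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_W, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_lda_direction(X, y))   # projection axis separating the classes
```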

Unit 4: Unsupervised Learning

1. Introduction

Unsupervised learning algorithms uncover hidden patterns or intrinsic structures in data without labeled responses. These methods explore the data's internal structure and groupings without relying on known outcomes or supervisory signals.

It is especially useful when labeled data is unavailable or too expensive to obtain. The two main goals are:

• Clustering: Grouping similar data points into clusters.

• Dimensionality Reduction: Reducing the number of features while preserving structure.

Common applications include market segmentation, anomaly detection, social network analysis, and gene expression profiling. Clustering is a primary form of unsupervised learning, often used in exploratory data analysis.

2. Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters using agglomerative (bottom-up) or divisive (top-down) approaches. It doesn’t require specifying the number of clusters in advance and provides flexibility in deciding the level of clustering by cutting the dendrogram at a desired height.

Agglomerative Clustering Steps:

1. Start with each data point as its own cluster.

2. Compute pairwise distances.

3. Merge the closest pair of clusters based on a linkage method.

4. Repeat until one cluster remains or desired number is reached.

Linkage Criteria:

• Single Linkage: Minimum distance between any two points in clusters

• Complete Linkage: Maximum distance between any two points in clusters

• Average Linkage: Mean distance between all points across two clusters

Distance Function (Euclidean):
d(x, y) = √Σ(xᵢ - yᵢ)²

Advantages:

• Easy to understand and visualize

• No need to pre-specify the number of clusters

Disadvantages:

• Sensitive to outliers

• Time complexity is higher than partitional methods

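A small sketch of agglomerative clustering with SciPy, using average linkage and cutting the dendrogram into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),      # two toy clusters
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="average")                 # agglomerative, average linkage
labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 clusters
print(labels)                                    # cluster label for each point
```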

3. Partitional Clustering

Partitional clustering methods divide data into a fixed number of k non-overlapping clusters, optimizing a criterion such as intra-cluster distance. It assumes each data point belongs to one cluster.

This approach is computationally more scalable than hierarchical clustering and is widely used in large datasets.

Forgy's Algorithm

Forgy's method initializes cluster centroids by selecting k random data points. Points are then assigned to the nearest centroid, and centroids are recalculated. This process iterates until convergence.

K-Means Algorithm

K-Means is one of the most commonly used partitional clustering algorithms.

Steps:

1. Choose the number of clusters k

2. Randomly initialize k centroids

3. Assign each data point to the nearest centroid

4. Update centroids by computing the mean of assigned points

5. Repeat steps 3–4 until centroids stabilize


Objective Function:
J = Σ_k Σ_{xᵢ ∈ C_k} ||xᵢ - μ_k||²
Where:

• xᵢ is a data point in cluster C_k

• μ_k is the centroid of cluster C_k

Advantages:

• Simple and fast

• Works well when clusters are spherical and equally sized

Disadvantages:

• Requires specifying k in advance

• Sensitive to initial centroid placement

• Not suitable for non-spherical clusters or data with varying densities

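A minimal NumPy sketch of these steps, using Forgy initialization on toy data (no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means with Forgy initialization (k random data points)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Forgy init
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as the mean of assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # one centroid near (0, 0), one near (4, 4)
```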


Unit 5: Multilayer Perceptron and Backpropagation

1. The Perceptron

A perceptron is a basic neural network unit that models a single artificial neuron. It computes a weighted sum of the inputs and passes it through an activation function (usually a step function) to produce a binary output.

Mathematical Representation:
y = f(w·x + b)
Where:

• x = input vector

• w = weight vector

• b = bias term

• f = activation function (e.g., step or sign function)

Functionality:

• If w·x + b ≥ 0 → output is 1

• Otherwise → output is 0

Perceptrons are suitable for linearly separable problems like AND and OR, but fail on non-linearly separable cases like XOR.

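A minimal sketch of a step-activation perceptron; the weights shown are hypothetical values that happen to implement AND:

```python
import numpy as np

def perceptron_output(x, w, b):
    """y = f(w·x + b) with a step activation: 1 if w·x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Hypothetical weights implementing AND: fires only when both inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x), w, b))   # outputs 0, 0, 0, 1
```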

2. Training a Perceptron

The perceptron learns through a supervised learning rule that updates weights based on the classification error. The objective is to find weights that minimize the number of misclassifications.
Perceptron Learning Rule:
w_i ← w_i + η (y - ŷ) x_i
Where:

• w_i: weight for input x_i

• η: learning rate (0 < η < 1)

• y: actual class label

• ŷ: predicted label

Procedure:

1. Initialize weights and bias

2. For each training sample:

o Compute output

o Compare with true label

o Update weights if incorrect

3. Repeat until convergence or maximum iterations

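A minimal sketch of this procedure learning the AND function (learning rate and epoch count are arbitrary):

```python
import numpy as np

# Training data for the AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = 1 if w @ xi + b >= 0 else 0       # current prediction
        w += eta * (yi - y_hat) * xi              # w_i += η (y - ŷ) x_i
        b += eta * (yi - y_hat)                   # same rule for the bias

print(w, b)
print([1 if w @ xi + b >= 0 else 0 for xi in X])  # [0, 0, 0, 1]
```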

3. Learning Boolean Functions

Perceptrons are capable of learning simple Boolean functions that are linearly separable.

Examples:

• AND: Output is 1 only when both inputs are 1

• OR: Output is 1 if at least one input is 1

• NOT: A unary function returning the opposite value

However, they cannot learn XOR because the data is not linearly separable. This limitation led to the development of multilayer perceptrons, which can model non-linear decision boundaries.

4. Multilayer Perceptrons (MLP)

MLPs consist of multiple layers of neurons: input layer, one or more hidden
layers, and an output layer. Each neuron in one layer is connected to all
neurons in the next layer.

Architecture:

• Input Layer: Receives input features

• Hidden Layer(s): Learns intermediate representations using non-linear activation functions (ReLU, sigmoid, tanh)

• Output Layer: Produces predictions (e.g., softmax for classification)

MLPs can learn complex patterns and approximate any continuous function (universal approximation theorem).

Advantages:

• Handles non-linearly separable data

• Highly expressive with enough hidden units

It overcomes the limitations of single-layer perceptrons by introducing hidden layers that enable hierarchical feature learning.
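A minimal forward-pass sketch of a one-hidden-layer MLP with ReLU activations (layer sizes and random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: input -> ReLU hidden layer -> linear output."""
    h = relu(W1 @ x + b1)     # hidden representation
    return W2 @ h + b2        # output scores (could be passed to softmax)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 outputs
print(mlp_forward(rng.normal(size=3), W1, b1, W2, b2))
```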

5. Backpropagation Algorithm

Backpropagation is a supervised learning algorithm used for training MLPs. It computes gradients of the loss function with respect to each weight by the chain rule and updates the weights to minimize the loss.

Steps:

1. Forward Pass: Compute outputs layer-by-layer

2. Compute Error: Compare predicted outputs with the targets using a loss function

3. Backward Pass: Compute gradients using derivatives of activation functions

4. Weight Update: Apply gradient descent

Gradient Descent Formula:
w ← w - η ∂E/∂w
Where:

• η: learning rate

• E: error function

• ∂E/∂w: partial derivative of error w.r.t. weight

Loss Functions:

• Mean Squared Error (MSE)

• Cross-Entropy (for classification)

Example:
Suppose we have a simple neural network with one input neuron, one output
neuron, and no hidden layer. Let the input x = 1.0, the target output t = 0.0,
initial weight w = 0.5, and learning rate η = 0.1. Let the activation function
be identity (f(x) = x).

1. Forward pass:

o Output y = x * w = 1.0 * 0.5 = 0.5

2. Compute error:

o E = (1/2)(t - y)^2 = (1/2)(0.0 - 0.5)^2 = 0.125

3. Backward pass (gradient):

o dE/dw = (y - t) * x = (0.5 - 0.0) * 1.0 = 0.5

4. Update weight:

o w_new = w - η * dE/dw = 0.5 - 0.1 * 0.5 = 0.45

This demonstrates one iteration of backpropagation updating the weight to reduce the error.

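The same worked example, executed step by step in Python:

```python
# The worked example above: one gradient step on a single linear neuron
# with identity activation f(x) = x.
x, t = 1.0, 0.0        # input and target
w, eta = 0.5, 0.1      # initial weight and learning rate

y = x * w                       # forward pass: y = 0.5
E = 0.5 * (t - y) ** 2          # squared error: 0.125
dE_dw = (y - t) * x             # gradient by the chain rule: 0.5
w = w - eta * dE_dw             # gradient descent update: 0.45
print(y, E, dE_dw, w)           # 0.5 0.125 0.5 0.45
```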

6. Training Procedures

Training procedures vary based on how frequently the weights are updated.

Methods:

• Stochastic Gradient Descent (SGD): Updates weights after every training example. Fast convergence but noisier updates.

• Batch Gradient Descent: Updates after processing the entire dataset. Stable but slow.

• Mini-batch Gradient Descent: Updates weights after processing batches of samples. Balances speed and stability.

Key Hyperparameters:

• Learning Rate (η)

• Batch Size

• Number of Epochs

• Momentum, Weight Decay

Proper tuning is crucial for optimal performance and avoiding issues like vanishing gradients or overfitting.

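A minimal mini-batch SGD sketch on a linear-regression objective (model, learning rate, and batch size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noisy targets

w, eta, batch_size = np.zeros(3), 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]        # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # MSE gradient
        w -= eta * grad                             # gradient step per batch

print(np.round(w, 2))   # ≈ [ 1.  -2.   0.5]
```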

7. Tuning Network Size

Choosing the number of hidden layers and neurons is a critical design choice
in neural networks.

Too Few Parameters:

• Underfitting: The model is too simple to capture underlying patterns

Too Many Parameters:

• Overfitting: The model memorizes training data and performs poorly on unseen data

Tuning Techniques:

• Cross-validation: Assess performance across data splits

• Early Stopping: Stop training when validation error increases

• Regularization (L1/L2): Penalize large weights

• Dropout: Randomly deactivate neurons during training to promote generalization

Proper tuning ensures model robustness and better generalization to test data.

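A sketch of validation-based early stopping; train_step, validation_error, and model.copy() are hypothetical stand-ins for a real training loop and metric:

```python
def train_with_early_stopping(model, train_step, validation_error,
                              max_epochs=100, patience=5):
    """Stop when validation error fails to improve for `patience` epochs.

    train_step, validation_error, and model.copy() are hypothetical
    placeholders for an actual training framework.
    """
    best_err, best_state, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                   # one epoch of weight updates
        err = validation_error(model)
        if err < best_err:                  # validation improved: keep going
            best_err, best_state, waited = err, model.copy(), 0
        else:                               # no improvement this epoch
            waited += 1
            if waited >= patience:          # give up after `patience` bad epochs
                break
    return best_state                       # weights from the best epoch
```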
