Unit 1 Notes Final
1. Artificial Neurons
Artificial neurons are inspired by biological neurons in the human brain. They
are the fundamental building blocks of neural networks.
Structure of an Artificial Neuron:
● Inputs (x1, x2, ..., xn): Features or data inputs.
● Weights (w1, w2, ..., wn): Adjustable parameters that influence input
importance.
● Bias (b): An additional parameter to adjust the output.
● Activation Function: Applies a non-linear transformation to the input.
● Output (y): Computed result of the neuron.
Mathematical Model:
y = f(w1*x1 + w2*x2 + ... + wn*xn + b), where f is the activation function.
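As an illustration of this model, here is a minimal Python sketch of a single artificial neuron; the specific input values, weights, bias, and the choice of a sigmoid activation are assumptions made only for the example.

```python
import math

def sigmoid(z):
    # Activation function: applies a non-linear transformation to the weighted sum
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through the activation function
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Assumed example values
x = [0.5, 0.2, 0.1]     # inputs x1, x2, x3
w = [0.4, -0.6, 0.9]    # weights w1, w2, w3
b = 0.1                 # bias
print(neuron(x, w, b))  # output y of the neuron
```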
2. Perceptron
The perceptron is the simplest type of artificial neural network that can be used
for binary classification problems.
Characteristics:
● Consists of a single layer of neurons.
● Uses the step function as an activation function.
● Can solve linearly separable problems.
Perceptron Learning Rule:
The Perceptron Learning Rule is a supervised learning algorithm used to train a
single-layer perceptron model, where the goal is to classify inputs into one of
two classes (binary classification). It is based on adjusting the weights of the
perceptron to minimize the classification error.
● The perceptron consists of an input layer, weights, a bias term, and an
activation function (usually a step function).
● Input: The perceptron receives inputs x1,x2,…,xn (features).
● Weights: The perceptron has weights w1,w2,…,wn for each input, and a
bias b
● Activation Function: The output of the perceptron is determined by
applying an activation function (commonly a step function) to the
weighted sum of inputs.
The output y of the perceptron is given by:
y = 1 if (w1*x1 + w2*x2 + ... + wn*xn + b) ≥ 0, otherwise y = 0
The goal is to adjust the weights w1,w2,…,wn and the bias b such that the
perceptron correctly classifies the training data points. The Perceptron Learning
Rule is an iterative algorithm that updates the weights based on the
classification error.
The perceptron learning rule updates the weights and bias after each
misclassification. The update rule is as follows:
For each training example (x(i), y(i)), where:
● x(i) is the input vector,
● y(i) is the true label (either 0 or 1),
If the perceptron makes an incorrect prediction, it updates the weights and bias:
● If the perceptron predicts 0 but the true label is 1 (false negative), the weights are increased:
wi = wi + η*xi and b = b + η
● If the perceptron predicts 1 but the true label is 0 (false positive), the weights are decreased:
wi = wi − η*xi and b = b − η
Both cases are captured by the single update rule:
wi = wi + η*(y(i) − y^(i))*xi and b = b + η*(y(i) − y^(i))
where:
● η is the learning rate, which controls the magnitude of the update,
● y^(i) is the predicted output of the perceptron for the current example,
● y(i) is the actual output (true label).
Consider a simple binary classification problem where you want to classify two classes: class 0 and class 1 (here, the logical AND function). Suppose you have the following dataset:
x1 | x2 | True Label y
0  | 0  | 0
0  | 1  | 0
1  | 0  | 0
1  | 1  | 1
The training procedure is:
1. Initialize the weights and bias (here w1 = 0, w2 = 0, b = 0) and choose a learning rate (here η = 1).
2. For each training example, compute the predicted output by applying the step function to w1*x1 + w2*x2 + b.
3. If the predicted output does not match the true label y, update the weights and bias as described above.
One pass over the dataset then proceeds as follows:
● Example (x1=0, x2=0, y=0): weighted sum = 0, so the predicted output = 1, but the true label y = 0, so there is a misclassification. Update the weights and bias: w1 = 0, w2 = 0, b = −1.
● Example (x1=0, x2=1, y=0): weighted sum = −1, so the predicted output = 0, and the true label y = 0, so no update is needed. Updated values: w1 = 0, w2 = 0, b = −1 (no change).
● Example (x1=1, x2=0, y=0): weighted sum = −1, so the predicted output = 0, and the true label y = 0, so no update is needed. Updated values: w1 = 0, w2 = 0, b = −1 (no change).
● Example (x1=1, x2=1, y=1): weighted sum = −1, so the predicted output = 0, but the true label y = 1, so there is a misclassification. Update the weights and bias: w1 = 1, w2 = 1, b = 0.
The process repeats over further epochs until every training example is classified correctly.
where wi are the weights, xi are the inputs, and θ is the threshold.
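The learning rule above can be turned into a short Python sketch; this is a minimal implementation assuming the step function, the AND dataset from the table, learning rate η = 1, and zero initial weights, as in the worked example.

```python
def step(z):
    # Step activation: fire (1) when the weighted sum is non-negative
    return 1 if z >= 0 else 0

def train_perceptron(data, eta=1.0, epochs=10):
    w1, w2, b = 0.0, 0.0, 0.0                     # initial weights and bias
    for _ in range(epochs):
        for x1, x2, y in data:
            y_hat = step(w1 * x1 + w2 * x2 + b)   # predicted output
            error = y - y_hat                     # 0 if correct, ±1 if misclassified
            # Perceptron rule: w_i <- w_i + eta*(y - y_hat)*x_i, b <- b + eta*(y - y_hat)
            w1 += eta * error * x1
            w2 += eta * error * x2
            b  += eta * error
    return w1, w2, b

# AND dataset from the table above
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
print(train_perceptron(data))   # converges to weights that separate class 0 from class 1
```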
B. Hebbian Learning Model: Learning based on synaptic strength. The
Hebbian Learning Model is based on the idea of learning through
reinforcement of connections between neurons that are activated together.
This principle, often summarized as “cells that fire together, wire
together,” was first proposed by psychologist Donald Hebb in 1949. It is a
biologically-inspired learning rule where the strength of a synaptic
connection between two neurons increases if both neurons are active at
the same time.
Components:
Pre-synaptic Neuron: The neuron sending the signal.
Post-synaptic Neuron: The neuron receiving the signal.
Synaptic Strength: The weight or strength of the connection between
neurons, which adjusts based on the activity of the neurons.
Mathematical Representation: The Hebbian learning rule is typically
described as:
Δw = η x ⋅ y
where:
● w is the synaptic weight,
● η is the learning rate,
● x is the activity of the pre-synaptic neuron,
● y is the activity of the post-synaptic neuron.
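Below is a minimal Python sketch of the Hebbian update for a single connection; the activity values and the learning rate are assumed numbers chosen only to show that neurons which are active together strengthen their connection.

```python
def hebbian_update(w, x, y, eta=0.1):
    # Hebbian rule: delta_w = eta * x * y, so the weight grows when
    # pre-synaptic activity x and post-synaptic activity y are both high
    return w + eta * x * y

w = 0.0
# Both neurons active together: the synaptic strength increases each time
for _ in range(5):
    w = hebbian_update(w, x=1.0, y=1.0)
print(w)                                   # approximately 0.5 after five co-activations

# If either neuron is inactive (activity 0), the weight does not change
print(hebbian_update(w, x=1.0, y=0.0))     # unchanged
```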
C. Spiking Neuron Model: Simulates real biological neuron firing patterns.
A spiking neuron model is a mathematical description of how neurons
fire electrical signals. These models are used to simulate the behavior of
biological neurons in the brain.
How does a spiking neuron model work?
1. The neuron receives input information as a series of spikes over time.
2. The neuron's membrane potential increases as it accumulates input.
3. When the membrane potential exceeds a threshold, the neuron fires a
spike.
4. The neuron's state variable is reset to a lower value after firing.
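As a rough illustration of steps 1 to 4, here is a simple leaky integrate-and-fire style sketch in Python; the threshold, reset value, leak factor, and input sequence are assumed values for demonstration rather than parameters taken from these notes.

```python
def simulate_spiking_neuron(inputs, threshold=1.0, reset=0.0, leak=0.9):
    """Integrate input over time, fire a spike when the membrane potential
    crosses the threshold, then reset the state variable."""
    v = 0.0                  # membrane potential (the neuron's state variable)
    spikes = []
    for i in inputs:
        v = leak * v + i     # membrane potential accumulates (leaky) input
        if v >= threshold:   # threshold crossed: the neuron fires a spike
            spikes.append(1)
            v = reset        # state variable is reset to a lower value
        else:
            spikes.append(0)
    return spikes

# A series of input values arriving over time (assumed example)
inputs = [0.3, 0.4, 0.5, 0.1, 0.6, 0.7, 0.2]
print(simulate_spiking_neuron(inputs))   # [0, 0, 1, 0, 0, 1, 0]
```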
Feed-Forward Neural Network: This is the most fundamental kind of neural network, in which input data travels in only one direction, passing through the artificial neural nodes and leaving through the output nodes. Input and output layers are always present, while hidden layers may or may not be present. Based on this, feed-forward networks are further divided into single-layered and multi-layered feed-forward neural networks. The information only flows forward in the neural network, first through the input nodes, then through the hidden layers (single or many layers), and finally through the output nodes, which is why this family of models is termed feedforward. Information cannot propagate backward; it can only move forward.
Multi-layer Feed Forward Neural Network: This Artificial Neural Network has multiple hidden layers, which make it a multilayer neural network, and it is feed-forward because information flows in a single direction from the input layer toward the output layer. In this network there are the following layers:
1. Input Layer: It is the starting layer of the network; each input signal entering the network has a weight associated with it.
2. Hidden Layer: This layer lies after the input layer and contains multiple neurons that perform all computations and pass the result to the output unit.
3. Output Layer: It is the layer that contains the output units or neurons and receives the processed data from the hidden layer. If further hidden layers are connected, the weighted values are passed on to those hidden layers for further processing to get the desired result.
Inputs are multiplied by weights and fed into the activation function; during backpropagation the weights are adjusted to minimize the loss. Weights are simply values learned by the network: they are updated based on the gap between the predicted and the training outcomes. Softmax is commonly used as the output-layer activation function, after nonlinear activation functions in the earlier layers.
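To make the forward pass concrete, here is a small Python sketch of a multi-layer feed-forward network with one hidden layer; the layer sizes, the random weights, and the use of a sigmoid hidden activation followed by a softmax output are assumptions chosen only for illustration.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Output-layer activation: turns scores into probabilities that sum to 1
    exps = [math.exp(z - max(zs)) for z in zs]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # One feed-forward layer: weighted sum of inputs plus a bias per neuron
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

random.seed(0)
n_in, n_hidden, n_out = 3, 4, 2
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

x = [0.5, 0.1, 0.9]                          # input layer values
h = [sigmoid(z) for z in dense(x, W1, b1)]   # hidden layer (sigmoid)
y = softmax(dense(h, W2, b2))                # output layer (softmax)
print(y, sum(y))                             # class probabilities summing to 1
```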
Application of Multilayer Feed-Forward Neural Network:
1. Medical field
2. Speech regeneration
3. Data processing and compression
4. Image processing
Limitations:
This ANN is a basic form of Neural Network that has no cycles and computes only in the forward direction. It has some limitations: information about the neighborhood is sometimes lost, and in that case it becomes difficult to process further, so all the steps need to be performed again. It also does not support back-propagation, so the network cannot learn from or correct the fault of a previous stage.
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model:
σ(z) = 1 / (1 + e^(−z))
Advantages of Backpropagation for Neural Network Training
The key benefits of using the backpropagation algorithm are:
● Ease of Implementation: Backpropagation is beginner-friendly,
requiring no prior neural network knowledge, and simplifies
programming by adjusting weights via error derivatives.
● Simplicity and Flexibility: Its straightforward design suits a range of
tasks, from basic feedforward to complex convolutional or recurrent
networks.
● Efficiency: Backpropagation accelerates learning by directly updating
weights based on error, especially in deep networks.
● Generalization: It helps models generalize well to new data, improving
prediction accuracy on unseen examples.
● Scalability: The algorithm scales efficiently with larger datasets and
more complex networks, making it ideal for large-scale tasks.
Challenges with Backpropagation
While backpropagation is powerful, it does face some challenges:
1. Vanishing Gradient Problem: In deep networks, the gradients can
become very small during backpropagation, making it difficult for the
network to learn. This is common when using activation functions like
sigmoid or tanh.
2. Exploding Gradients: The gradients can also become excessively large,
causing the network to diverge during training.
3. Overfitting: If the network is too complex, it might memorize the
training data instead of learning general patterns.
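For a concrete picture of how backpropagation adjusts weights via error derivatives, here is a minimal numeric sketch for a single sigmoid neuron trained with squared error; the training example, learning rate, and initial weights are assumed values used only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One assumed training example: two inputs and a target output
x1, x2, target = 0.5, 0.3, 1.0
w1, w2, b, lr = 0.1, -0.2, 0.0, 0.5

for step in range(50):
    # Forward pass
    z = w1 * x1 + w2 * x2 + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule gives dLoss/dw_i = (y - target) * y * (1 - y) * x_i
    delta = (y - target) * y * (1 - y)
    w1 -= lr * delta * x1
    w2 -= lr * delta * x2
    b  -= lr * delta

print(round(loss, 4))   # the squared error shrinks as training proceeds
```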
Gradient-Based Learning
Gradient descent is the backbone of the learning process for various algorithms, including linear regression, logistic regression, support vector machines, and neural networks. It serves as a fundamental optimization technique that minimizes the cost function of a model by iteratively adjusting the model parameters to reduce the difference between predicted and actual values, improving the model’s performance.
Gradient descent is an optimization algorithm that’s used when training a
machine learning model. It’s based on a convex function and tweaks its
parameters iteratively to minimize a given function to its local minimum.
A gradient simply measures the change in all weights with regard to the change
in error. You can also think of a gradient as the slope of a function. The higher
the gradient, the steeper the slope and the faster a model can learn. But if the
slope is zero, the model stops learning. In mathematical terms, a gradient is a
partial derivative with respect to its inputs.
Imagine a blindfolded man who wants to climb to the top of a hill with the
fewest steps possible. He might start climbing the hill by taking really big steps
in the steepest direction. But as he comes closer to the top, his steps will get
smaller and smaller to avoid overshooting it.
Assume the image illustrates our hill from a top-down view and the red arrows
are the steps of our climber. A gradient in this context is a vector that contains
the direction of the steepest step the blindfolded man can take and how long that
step should be.
Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3 to X4. This is because the steepness/slope of the hill, which determines the length of the vector, is greater at the start of the climb and smaller near the top. This matches the hill example: the hill gets less steep the higher it is climbed, so a reduced slope goes along with a reduced gradient and a reduced step size for the hill climber.
How Does Gradient Descent Work?
Now, instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley. The equation below describes what the gradient descent algorithm does:
b = a − γ ∇f(a)
Here b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of the gradient descent algorithm. The gamma in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.
Imagine you have a machine learning problem and want to train your algorithm
with gradient descent to minimize your cost-function J(w, b) and reach its local
minimum by tweaking its parameters (w and b). The image below shows the
horizontal axes representing the parameters (w and b), while the cost
function J(w, b) is represented on the vertical axes.
We want to find the values of w and b that correspond to the minimum of the
cost function (marked with the red arrow). To start, we initialize w and b with
some random numbers. Gradient descent then starts at that point (somewhere around the top of our illustration), and it takes one step after another in the steepest downhill direction (i.e., from the top to the bottom of the illustration) until it reaches the point where the cost function is as small as possible.
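A minimal gradient descent loop in Python is sketched below; the one-parameter cost function J(w) = (w − 3)², the starting point, and the learning rate are assumed values chosen only to show the iterative update.

```python
def cost(w):
    # Assumed example cost function, minimized at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative (slope) of the cost with respect to the parameter w
    return 2.0 * (w - 3.0)

w = 0.0            # initial parameter value
lr = 0.1           # learning rate (the gamma / weighting factor above)
for _ in range(100):
    w = w - lr * gradient(w)            # step in the direction of steepest descent
print(round(w, 4), round(cost(w), 8))   # w approaches 3, cost approaches 0
```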
Types of Gradient Descent
There are three popular types of gradient descent that mainly differ in the
amount of data they use — batch, stochastic and mini-batch.
Batch Gradient Descent
Batch gradient descent, also called vanilla gradient descent, calculates the error
for each example within the training dataset, but it only gets updated after all
training examples have been evaluated. This process is like a cycle and called a
training epoch.
An advantage of batch gradient descent is its computational efficiency: it
produces a stable error gradient and a stable convergence. But the stable error
gradient can sometimes result in a state of convergence that isn’t the best the
model can achieve. It also requires the entire training dataset to be in memory
and available to the algorithm.
Stochastic Gradient Descent
By contrast, stochastic gradient descent (SGD) does this for each training
example within the dataset, meaning it updates the parameters for each training
example one by one. Depending on the problem, this can make SGD faster than
batch gradient descent. One advantage is that frequent updates allow us to have
a pretty detailed rate of improvement.
Mini-Batch Gradient Descent
Mini-batch gradient descent is the go-to method since it’s a combination of the
concepts of SGD and batch gradient descent. It simply splits the training dataset
into small batches and performs an update for each of those batches. This
creates a balance between the robustness of stochastic gradient descent and the
efficiency of batch gradient descent.
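The three variants differ only in how many training examples contribute to each parameter update. The sketch below shows this on a simple linear model y = w*x fitted by least squares; the synthetic data and hyperparameters are assumptions, and setting batch_size to the full dataset gives batch gradient descent, 1 gives stochastic gradient descent, and anything in between gives mini-batch gradient descent.

```python
import random

# Synthetic data (assumed): y is roughly 2x plus a little noise
random.seed(0)
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in [i / 10 for i in range(1, 21)]]

def train(samples, batch_size, lr=0.05, epochs=50):
    w = 0.0
    for _ in range(epochs):
        random.shuffle(samples)                        # reshuffle each epoch
        for i in range(0, len(samples), batch_size):   # split into batches
            batch = samples[i:i + batch_size]
            # Average gradient of the squared error over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                             # one update per batch
    return w

print(train(list(data), batch_size=len(data)))  # batch gradient descent
print(train(list(data), batch_size=1))          # stochastic gradient descent (SGD)
print(train(list(data), batch_size=4))          # mini-batch gradient descent
```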
Learning Rate
How big the steps gradient descent takes in the direction of the local minimum is determined by the learning rate, which figures out how fast or slow we will move towards the optimal weights.
For the gradient descent algorithm to reach the local minimum, we must set the learning rate to an appropriate value, which is neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum, since it bounces back and forth across the convex valley of the cost function. If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a long time.
Challenges with gradient descent
● Local minima and saddle points: For convex problems, gradient descent
can find the global minimum with ease, but as nonconvex problems
emerge, gradient descent can struggle to find the global minimum, where
the model achieves the best results. Recall that when the slope of the cost
function is at or close to zero, the model stops learning.
● Vanishing and Exploding Gradients:
Vanishing gradients: This occurs when the gradient is too small. As we
move backwards during backpropagation, the gradient continues to
become smaller, causing the earlier layers in the network to learn more
slowly than later layers. When this happens, the weight parameters update
until they become insignificant—i.e. 0—resulting in an algorithm that is
no longer learning.
Exploding gradients: This happens when the gradient is too large,
creating an unstable model. In this case, the model weights will grow too
large, and they will eventually be represented as NaN. One solution to
this issue is to leverage a dimensionality reduction technique, which can
help to minimize complexity within the model.
1. Mean Squared Error (MSE)
o Formula: MSE = (1/n) Σ (y_i − ŷ_i)²
2. Mean Absolute Error (MAE)
o Another regression loss function, but less sensitive to outliers than MSE.
o Formula: MAE = (1/n) Σ |y_i − ŷ_i|
Regularization in Cost Functions
The total cost function used for training a neural network often includes a
regularization term to prevent overfitting.
● L2 Regularization (Weight Decay): λ ∑ w²
o Penalizes large weights to improve generalization.
● L1 Regularization (Lasso): λ ∑ ∣w∣
o Encourages sparse models by driving some weights to zero.
More advanced regularization strategies, such as dropout and batch
normalization, are discussed later.
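The sketch below computes the MSE and MAE losses from the formulas above and adds the L1 and L2 regularization penalties to show how they enter the total cost; the predictions, targets, weights, and λ value are assumed example numbers.

```python
# Example predictions, targets, and model weights (assumed values)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
weights = [0.5, -1.2, 0.3]
lam = 0.01   # regularization strength (lambda)

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n   # Mean Squared Error
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n     # Mean Absolute Error

l2_penalty = lam * sum(w ** 2 for w in weights)   # penalizes large weights
l1_penalty = lam * sum(abs(w) for w in weights)   # drives some weights toward zero

total_cost_l2 = mse + l2_penalty   # regularized training cost with weight decay
print(mse, mae, l2_penalty, l1_penalty, total_cost_l2)
```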
The cost function derived from maximum likelihood is the negative log-likelihood (the cross-entropy between the training data and the model distribution):
J(θ) = −E_{x,y∼p̂_data} [ log p_model(y | x) ]
This function measures how well the model’s predicted probability distribution p_model(y | x) matches the true data distribution p̂_data(y | x). By minimizing this function, we maximize the likelihood that the model assigns to the observed training data.
Cost Function Variability Across Models
The exact form of the cost function depends on the type of model and output
distribution.
For example:
● If the model predicts Gaussian-distributed outputs, minimizing the negative log-likelihood reduces to minimizing Mean Squared Error (MSE):
J(θ) = ½ E_{x,y∼p̂_data} [ ‖y − f(x;θ)‖² ] + const
Here, f(x;θ) represents the model’s predicted mean for a Gaussian distribution.
● The discarded constant term comes from the Gaussian variance, which is not parameterized in this case.
Thus, the equivalence between MLE and minimizing MSE is not limited to
linear models—it holds for any model predicting the mean of a Gaussian.
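A quick numeric check of this equivalence, assuming a unit-variance Gaussian and made-up targets and predicted means, is sketched below: the negative log-likelihood equals half the sum of squared errors plus a constant that does not depend on the model parameters.

```python
import math

# Assumed example targets and model-predicted means f(x; theta)
y      = [1.0, 2.5, -0.3]
f_mean = [0.8, 2.0,  0.1]

# Negative log-likelihood of y under Gaussian N(f_mean, 1)
nll = sum(0.5 * math.log(2 * math.pi) + 0.5 * (yi - mi) ** 2
          for yi, mi in zip(y, f_mean))

half_sse = 0.5 * sum((yi - mi) ** 2 for yi, mi in zip(y, f_mean))
const = 0.5 * math.log(2 * math.pi) * len(y)   # term independent of theta

print(round(nll, 6) == round(half_sse + const, 6))   # True: NLL = 0.5*SSE + const
```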
Advantages of Maximum Likelihood-Based Cost Functions
1. Automated Cost Function Selection:
o Specifying the output probability distribution p(y | x)
automatically determines the cost function as −log p(y | x).
o This removes the need to manually design different cost
functions for different models.
2. Better Gradient Properties for Learning:
o A good learning algorithm requires a gradient that is large and
predictable.
o Some functions, like sigmoid, saturate (become very flat), making
their gradients too small for effective learning.
o The negative log-likelihood (NLL) helps mitigate this issue
because:
▪ Many output activation functions involve an exponential
(exp) function.
▪ The log function in NLL cancels out the exp term, reducing
saturation effects.
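As a small illustration of the log cancelling the exp, the sketch below evaluates log(softmax(z)) through the algebraically simplified form z − logsumexp(z); the example scores are assumed, and the simplified form stays well-behaved even for large scores where computing the softmax probabilities directly would saturate or overflow.

```python
import math

def log_softmax(z):
    # log(softmax(z)_i) = z_i - log(sum_j exp(z_j)); the log cancels the exp,
    # so we never have to form the (possibly saturated) softmax probabilities
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

scores = [2.0, 1.0, 0.1]          # assumed class scores
print(log_softmax(scores))        # well-scaled log-probabilities

big = [1000.0, 999.0, 998.0]      # scores where exp() alone would overflow
print(log_softmax(big))           # still finite thanks to the log/exp cancellation
```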
Cross-Entropy Cost and Its Unique Properties
Unlike standard cost functions, the cross-entropy loss (which comes from
maximizing likelihood) has some unusual properties:
1. No Well-Defined Minimum Value:
o In many neural network models, the cost function does not have a
minimum value in the traditional sense.
o For discrete classification problems (e.g., softmax or logistic
regression), the model’s probability output can never be exactly 0
or 1 but can approach these values arbitrarily closely.
2. Divergence in Real-Valued Models:
o If the model learns to adjust the variance of a Gaussian output
distribution, it can assign an extremely high probability density to
the correct output values.
o This makes the cross-entropy loss tend toward negative infinity.
The activation function must be efficient, and it should reduce the computation time, because the neural network is sometimes trained on millions of data points.
Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions
Sigmoid Activation Function:
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the output of each neuron.
Disadvantages:
∙ Vanishing gradient—for very high or very low values of X, there is almost no
change to the prediction, causing a vanishing gradient problem. This can result
in the network refusing to learn further, or being too slow to reach an accurate
prediction.
∙ Outputs not zero centered.
∙ Computationally expensive
Leaky ReLU: R(z) = max(0.1*z, z)
Advantages
∙ Prevents dying ReLU problem—this variation of ReLU has a small positive
slope in the negative area, so it does enable backpropagation, even for negative
input values
∙ Otherwise like ReLU
Disadvantages
∙ Results not consistent—leaky ReLU does not provide consistent predictions
for negative input values.
Softmax:
∙ Sigmoid is able to handle only two cases (class labels).
∙ Softmax can handle multiple cases. The softmax function squeezes the output for each class between 0 and 1, with the outputs summing to 1.
∙ It is ideally used in the final output layer of the classifier, where we are actually trying to obtain the probabilities.
∙ Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes instead of only a binary-class solution.
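A small Python sketch of the activation functions discussed above (sigmoid, leaky ReLU, and softmax) is given below; the input values are assumed, and the 0.1 slope of leaky ReLU matches the formula R(z) = max(0.1*z, z) used earlier.

```python
import math

def sigmoid(z):
    # Squashes any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def leaky_relu(z, slope=0.1):
    # Small positive slope in the negative region prevents the dying ReLU problem
    return max(slope * z, z)

def softmax(zs):
    # Squeezes each class score into (0, 1) so that all outputs sum to 1
    exps = [math.exp(z - max(zs)) for z in zs]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), sigmoid(5.0), sigmoid(-5.0))   # 0.5, close to 1, close to 0
print(leaky_relu(3.0), leaky_relu(-3.0))           # 3.0, -0.3
print(softmax([2.0, 1.0, 0.1]))                    # probabilities summing to 1
```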