
Unit 1: Introduction to Neural Networks

1. Artificial Neurons
Artificial neurons are inspired by biological neurons in the human brain. They
are the fundamental building blocks of neural networks.
Structure of an Artificial Neuron:
●​ Inputs (x1, x2, ..., xn): Features or data inputs.
●​ Weights (w1, w2, ..., wn): Adjustable parameters that influence input
importance.
●​ Bias (b): An additional parameter to adjust the output.
●​ Activation Function: Applies a non-linear transformation to the input.
●​ Output (y): Computed result of the neuron.

Mathematical Model:
y = f(w1·x1 + w2·x2 + ... + wn·xn + b) = f(∑ wi·xi + b)
where f is the activation function.
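As a small illustration, here is a minimal Python sketch of this model; the input values, weights, bias, and the choice of a sigmoid activation are arbitrary assumptions for the example:

```python
import numpy as np

def neuron(x, w, b, activation):
    """Single artificial neuron: y = f(w.x + b)."""
    z = np.dot(w, x) + b          # weighted sum of inputs plus bias
    return activation(z)

# Illustrative inputs, weights, and bias
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, 0.8, 0.1])
b = 0.1
y = neuron(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z)))  # sigmoid
print(y)
```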

2. Perceptron
The perceptron is the simplest type of artificial neural network that can be used
for binary classification problems.

Characteristics:
●​ Consists of a single layer of neurons.
●​ Uses the step function as an activation function.
●​ Can solve linearly separable problems.
Perceptron Learning Rule:
The Perceptron Learning Rule is a supervised learning algorithm used to train a
single-layer perceptron model, where the goal is to classify inputs into one of
two classes (binary classification). It is based on adjusting the weights of the
perceptron to minimize the classification error.
●​ The perceptron consists of an input layer, weights, a bias term, and an
activation function (usually a step function).
●​ Input: The perceptron receives inputs x1,x2,…,xn (features).
●​ Weights: The perceptron has weights w1,w2,…,wn for each input, and a
bias b
●​ Activation Function: The output of the perceptron is determined by
applying an activation function (commonly a step function) to the
weighted sum of inputs.
The output y of the perceptron is given by:
y = f(w1·x1 + w2·x2 + ... + wn·xn + b)
where the activation function is typically defined as:
f(z) = 1 if z ≥ 0, and f(z) = 0 if z < 0
The goal is to adjust the weights w1,w2,…,wn​ and the bias b such that the
perceptron correctly classifies the training data points. The Perceptron Learning
Rule is an iterative algorithm that updates the weights based on the
classification error.
The perceptron learning rule updates the weights and bias after each
misclassification. The update rule is as follows:
For each training example (x(i),y(i)), where:
A.​x(i) is the input vector,
B.​ y(i) is the true label (either 0 or 1),
If the perceptron makes an incorrect prediction, it updates the weights and
bias:
● If the perceptron predicts 0 but the true label is 1 (false negative), the weights are increased:
wj ← wj + η·xj(i)
● If the perceptron predicts 1 but the true label is 0 (false positive), the weights are decreased:
wj ← wj − η·xj(i)
The bias is updated similarly:
b ← b + η·(y(i) − ŷ(i))
Both weight cases can be written compactly as wj ← wj + η·(y(i) − ŷ(i))·xj(i),
where:
● η is the learning rate, which controls the magnitude of the update,
● ŷ(i) is the predicted output of the perceptron for the current example,
● y(i) is the actual output (true label).
Consider a simple binary classification problem where you want to classify
two classes: class 0 and class 1. Suppose you have the following dataset:
x1 | x2 | True Label y
0  | 0  | 0
0  | 1  | 0
1  | 0  | 0
1  | 1  | 1

Initialize the weights w1=0, w2=0, and b=0


The learning rate is η=1
Perceptron Learning Rule
For each training example (x1, x2, y) we will:
1. Compute the weighted sum z = w1·x1 + w2·x2 + b
2. Apply the activation function (step function) to get the predicted output ŷ = f(z)
3. If the predicted output ŷ does not match the true label y, update the weights and bias:
w1 ← w1 + η·(y − ŷ)·x1, w2 ← w2 + η·(y − ŷ)·x2, b ← b + η·(y − ŷ)
First Iteration: (0, 0) with True Label 0

Compute the weighted sum: z = 0·0 + 0·0 + 0 = 0
Apply the activation function: ŷ = f(0) = 1 (since z ≥ 0)
The predicted output ŷ = 1, but the true label y = 0, so there is a misclassification.
Update the weights and bias:
w1 ← 0 + 1·(0 − 1)·0 = 0, w2 ← 0 + 1·(0 − 1)·0 = 0, b ← 0 + 1·(0 − 1) = −1
Updated values: w1=0, w2=0, b=−1

Second Iteration: (0, 1) with True Label 0

Compute the weighted sum: z = 0·0 + 0·1 + (−1) = −1
Apply the activation function: ŷ = f(−1) = 0
The predicted output ŷ = 0, and the true label y = 0, so no update is needed.
Updated values: w1=0, w2=0, b=−1 (no change)

Third Iteration: (1, 0) with True Label 0

Compute the weighted sum: z = 0·1 + 0·0 + (−1) = −1
Apply the activation function: ŷ = f(−1) = 0
The predicted output ŷ = 0, and the true label y = 0, so no update is needed.
Updated values: w1=0, w2=0, b=−1 (no change)

Fourth Iteration: (1, 1) with True Label 1

Compute the weighted sum: z = 0·1 + 0·1 + (−1) = −1
Apply the activation function: ŷ = f(−1) = 0
The predicted output ŷ = 0, but the true label y = 1, so there is a misclassification.
Update the weights and bias:
w1 ← 0 + 1·(1 − 0)·1 = 1, w2 ← 0 + 1·(1 − 0)·1 = 1, b ← −1 + 1·(1 − 0) = 0
Updated values: w1=1, w2=1, b=0
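The whole procedure condenses into a short Python sketch. The dataset, learning rate, and zero initialization below come from the worked example; continuing over the data until no example is misclassified is an added assumption (the AND data is linearly separable, so the loop terminates):

```python
import numpy as np

def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

# AND-gate dataset from the example above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # w1 = w2 = 0
b = 0.0           # bias
eta = 1.0         # learning rate

converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        y_hat = step(np.dot(w, xi) + b)
        if y_hat != yi:                    # misclassification: apply the update rule
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
            converged = False

print(w, b)  # final weights and bias that separate the two classes
```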

3. Computational Models of Neurons


Computational models of neurons are mathematical abstractions inspired by
biological neurons, and they help in understanding how information is
processed in the brain. Different models simulate different aspects of neuron
behavior, from basic decision-making processes to more complex biological
firing patterns. Different computational models exist to simulate the functioning
of neurons:
A.​McCulloch-Pitts Model: Basic threshold-based neuron.
Overview: The McCulloch-Pitts model is one of the earliest
computational models of a neuron, proposed in 1943 by Warren
McCulloch and Walter Pitts. This model is based on the concept of a
neuron that sums its inputs and produces an output depending on whether
the sum exceeds a certain threshold.
Components:
Inputs: The model receives binary inputs (usually represented as 0 or 1).
Weights: Each input is multiplied by a corresponding weight,
determining the influence of each input on the output.
Threshold: The neuron has a fixed threshold value θ. If the weighted sum
of inputs exceeds this threshold, the neuron "fires" and outputs a 1;
otherwise, it outputs a 0.
Output: Binary output, which is either 0 or 1 based on the weighted sum
of inputs and the threshold.
Mathematical Representation:
y = 1 if ∑ wi·xi ≥ θ, otherwise y = 0
where wi are the weights, xi are the inputs, and θ is the threshold.
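A minimal Python sketch of the model; configuring the weights and threshold to implement an AND gate is an illustrative choice, not part of the original definition:

```python
def mcculloch_pitts(x, w, theta):
    """McCulloch-Pitts neuron: fires (outputs 1) if the weighted sum reaches the threshold."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum >= theta else 0

# An AND gate: both binary inputs must be 1 for the sum to reach theta = 2
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(x, w=(1, 1), theta=2))
```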
B.​ Hebbian Learning Model: Learning based on synaptic strength. The
Hebbian Learning Model is based on the idea of learning through
reinforcement of connections between neurons that are activated together.
This principle, often summarized as “cells that fire together, wire
together,” was first proposed by psychologist Donald Hebb in 1949. It is a
biologically-inspired learning rule where the strength of a synaptic
connection between two neurons increases if both neurons are active at
the same time.
Components:
Pre-synaptic Neuron: The neuron sending the signal.
Post-synaptic Neuron: The neuron receiving the signal.
Synaptic Strength: The weight or strength of the connection between
neurons, which adjusts based on the activity of the neurons.
Mathematical Representation: The Hebbian learning rule is typically described as (a small code sketch follows the list below):
Δw = η · x · y
where:
●​ w is the synaptic weight,
●​ η is the learning rate,
●​ x is the activity of the pre-synaptic neuron,
●​ y is the activity of the post-synaptic neuron.
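A tiny Python sketch of this rule, assuming an arbitrary learning rate and a made-up sequence of paired activities:

```python
eta = 0.1                      # learning rate (illustrative value)
w = 0.0                        # initial synaptic weight

# Paired pre-/post-synaptic activities; correlated activity strengthens the weight
activity = [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
for x, y in activity:
    w += eta * x * y           # Hebbian rule: delta_w = eta * x * y
print(w)                       # the weight grew only on trials where both neurons fired
```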

C.​ Spiking Neuron Model: Simulates real biological neuron firing patterns.
A spiking neuron model is a mathematical description of how neurons
fire electrical signals. These models are used to simulate the behavior of
biological neurons in the brain.
How does a spiking neuron model work?
1.​ The neuron receives input information as a series of spikes over time.
2.​ The neuron's membrane potential increases as it accumulates input.
3.​ When the membrane potential exceeds a threshold, the neuron fires a
spike.
4.​ The neuron's state variable is reset to a lower value after firing.
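These steps can be sketched with a simple leaky integrate-and-fire model, one common spiking neuron model; all constants below (threshold, leak rate, input currents) are illustrative assumptions:

```python
v = 0.0                 # membrane potential (state variable)
v_rest, v_thresh = 0.0, 1.0
leak, dt = 0.1, 1.0     # leak rate and time step

spikes = []
inputs = [0.3, 0.4, 0.5, 0.0, 0.6, 0.7]      # input arriving over time
for t, i_in in enumerate(inputs):
    v += dt * (-leak * (v - v_rest) + i_in)  # accumulate input, leak toward rest
    if v >= v_thresh:                        # threshold crossed: fire a spike
        spikes.append(t)
        v = v_rest                           # reset the state variable after firing
print(spikes)                                # time steps at which the neuron fired
```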

D. Perceptron Model: The Perceptron is one of the simplest artificial neural network architectures, introduced by Frank Rosenblatt in 1957, and is primarily used for binary classification. It consists of a single layer of neurons connected to the inputs. The perceptron model works by applying a linear function to the input data (a weighted sum) followed by an activation function (e.g., a step or sigmoid function); training starts by initializing the parameters (weights and bias).
4. Structure of Neural Networks
Neural networks are composed of layers of interconnected artificial neurons.
The main types of neural networks include:
●​ Single-layer Networks: Contain only an input and output layer.

● Multi-layer Networks: Include hidden layers between the input and output layers.
Types of Layers:
1.​ Input Layer: Accepts input features.
2.​ Hidden Layer(s): Intermediate processing layers with neurons.
3.​ Output Layer: Produces the final prediction.

What is a Feed Forward Neural Network?

The feedforward neural network is the most fundamental kind of neural network: input data travels in only one direction, passing through the artificial neural nodes and leaving through the output nodes. Input and output layers are always present, while hidden layers may or may not be; based on this, feedforward networks are further divided into single-layered and multi-layered. The information flows only forward in the network, first through the input nodes, then through the hidden layers (single or many), and ultimately through the output nodes, which is why this family of models is termed feedforward.
Signals cannot propagate backward; they can only move forward.
Multi-layer Feed Forward Neural Network: This neural network has multiple hidden layers, which makes it a multilayer network, and it is feed-forward because information flows from the input layer toward the output layer. The network contains the following layers:
1. Input Layer: The starting layer of the network, whose signals carry the associated weights.
2. Hidden Layer: This layer lies after the input layer and contains multiple neurons that perform all computations and pass the result onward.
3. Output Layer: A layer that contains the output units or neurons and receives processed data from the hidden layer; if further hidden layers are connected, each hidden layer passes its weighted output to the next one for further processing until the desired result is produced.

Inputs are multiplied by weights and fed into the activation function, and during backpropagation the weights are adjusted to minimize the loss. Weights are simply machine-learned values: the network modifies them based on the gap between predicted and training outcomes. Softmax is commonly used as the output layer activation function, following the nonlinear activation functions of the hidden layers.
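A minimal forward-pass sketch of such a network in Python; the layer sizes, random weights, and the choice of ReLU plus softmax are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

# Forward pass through a 3-4-2 network; weights here are random placeholders
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden -> output

x = np.array([0.5, -1.0, 2.0])   # input features
h = relu(W1 @ x + b1)            # hidden layer: weighted sum + bias, then ReLU
y = softmax(W2 @ h + b2)         # output layer: class probabilities
print(y, y.sum())                # probabilities sum to 1
```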
Application of Multilayer Feed-Forward Neural Network:
1.​ Medical field
2.​ Speech regeneration
3.​ Data processing and compression
4.​ Image processing
Limitations:
This ANN is a basic form of neural network that has no cycles and computes only in the forward direction. It has some limitations: sometimes information about the neighborhood is lost, in which case further processing becomes difficult and all steps need to be performed again; and, on its own, it does not support backpropagation, so the network cannot learn from or correct the faults of a previous stage.

Backpropagation in Neural Network


Backpropagation is a method used to train artificial neural networks. Its goal is
to reduce the difference between the model’s predicted output and the actual
output by adjusting the weights and biases in the network.
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural networks, particularly feed-forward networks, by iteratively minimizing the cost function through adjustments to the weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following
the error gradient. Backpropagation often utilizes optimization algorithms like
gradient descent or stochastic gradient descent. The algorithm computes the
gradient using the chain rule from calculus, allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.

Why is Backpropagation Important?


Backpropagation plays a critical role in how neural networks improve over
time. Here's why:
1.​ Efficient Weight Update: It computes the gradient of the loss function
with respect to each weight using the chain rule, making it possible to
update weights efficiently.
2.​ Scalability: The backpropagation algorithm scales well to networks with
multiple layers and complex architectures, making deep learning feasible.
3.​ Automated Learning: With backpropagation, the learning process
becomes automated, and the model can adjust itself to optimize its
performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward
Pass and the Backward Pass.
How Does the Forward Pass Work?
In the forward pass, the input data is fed into the input layer. These inputs,
combined with their respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which returns the input if it's positive and zero otherwise. This adds non-linearity, allowing the model to learn complex relationships in the data.
Finally, the outputs from the last hidden layer are passed to the output layer,
where an activation function, such as softmax, converts the weighted outputs
into probabilities for classification.
How Does the Backward Pass Work?
In the backward pass, the error (the difference between the predicted and actual
output) is propagated back through the network to adjust the weights and biases.
One common method for error calculation is the Mean Squared Error, given by:
MSE = (1/n) ∑ (Predicted Output − Actual Output)²
Once the error is calculated, the network adjusts weights using gradients, which
are computed with the chain rule. These gradients indicate how much each
weight and bias should be adjusted to minimize the error in the next iteration.
The backward pass continues layer by layer, ensuring that the network learns
and improves its performance. The activation function, through its derivative,
plays a crucial role in computing these gradients during backpropagation.
Example of Backpropagation in Machine Learning
Assume the neurons use the sigmoid activation function for the forward and
backward pass. The target output is 0.5, and the learning rate is 1.
Forward Propagation
1. Initial Calculation
The weighted sum at each node is calculated using:
aj = ∑i (wi,j · xi)
Where,
● aj is the weighted sum of all the inputs and weights at node j,
● wi,j represents the weight connecting the ith input to the jth neuron,
● xi represents the value of the ith input.

2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model:
σ(a) = 1 / (1 + e^(−a))
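Putting the two formulas together, here is a minimal sketch of repeated forward and backward passes for a single sigmoid neuron. Only the target output (0.5) and the learning rate (1) come from the example; the inputs and initial weights are invented for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Illustrative values; only the target (0.5) and learning rate (1) come from the notes
x = [0.35, 0.9]          # inputs
w = [0.3, 0.55]          # initial weights
target, eta = 0.5, 1.0

for step in range(200):
    # Forward pass: weighted sum, then sigmoid activation
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(a)

    # Backward pass: chain rule through squared error E = 0.5 * (y - target)^2
    dE_dy = y - target               # derivative of the error w.r.t. the output
    dy_da = y * (1 - y)              # derivative of the sigmoid
    for i in range(len(w)):
        w[i] -= eta * dE_dy * dy_da * x[i]   # gradient descent weight update

print(round(sigmoid(sum(wi * xi for wi, xi in zip(w, x))), 4))  # approaches 0.5
```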
Advantages of Backpropagation for Neural Network Training
The key benefits of using the backpropagation algorithm are:
●​ Ease of Implementation: Backpropagation is beginner-friendly,
requiring no prior neural network knowledge, and simplifies
programming by adjusting weights via error derivatives.
●​ Simplicity and Flexibility: Its straightforward design suits a range of
tasks, from basic feedforward to complex convolutional or recurrent
networks.
●​ Efficiency: Backpropagation accelerates learning by directly updating
weights based on error, especially in deep networks.
●​ Generalization: It helps models generalize well to new data, improving
prediction accuracy on unseen examples.
●​ Scalability: The algorithm scales efficiently with larger datasets and
more complex networks, making it ideal for large-scale tasks.
Challenges with Backpropagation
While backpropagation is powerful, it does face some challenges:
1.​ Vanishing Gradient Problem: In deep networks, the gradients can
become very small during backpropagation, making it difficult for the
network to learn. This is common when using activation functions like
sigmoid or tanh.
2.​ Exploding Gradients: The gradients can also become excessively large,
causing the network to diverge during training.
3.​ Overfitting: If the network is too complex, it might memorize the
training data instead of learning general patterns.

Gradient-Based Learning
Gradient descent is the backbone of the learning process for various algorithms, including linear regression, logistic regression, support vector machines, and neural networks. It serves as a fundamental optimization technique that minimizes the cost function of a model by iteratively adjusting the model parameters to reduce the difference between predicted and actual values, improving the model's performance.
Gradient descent is an optimization algorithm that’s used when training a
machine learning model. It’s based on a convex function and tweaks its
parameters iteratively to minimize a given function to its local minimum.
A gradient simply measures the change in all weights with regard to the change
in error. You can also think of a gradient as the slope of a function. The higher
the gradient, the steeper the slope and the faster a model can learn. But if the
slope is zero, the model stops learning. In mathematical terms, a gradient is a
partial derivative with respect to its inputs.
Imagine a blindfolded man who wants to climb to the top of a hill with the
fewest steps possible. He might start climbing the hill by taking really big steps
in the steepest direction. But as he comes closer to the top, his steps will get
smaller and smaller to avoid overshooting it.
Picture the hill from a top-down view, with red arrows marking the steps of our climber. A gradient in this context is a vector that contains the direction of the steepest step the blindfolded man can take and how long that step should be.
The steps near the start (say, from X0 to X1) are much longer than those near the top (from X3 to X4), because the steepness of the hill, which determines the length of the vector, decreases as the climber ascends. This matches the hill example: the hill gets less steep the higher it's climbed, so a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
How Does Gradient Descent Work?
Now, instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley. The following equation describes what the gradient descent algorithm does:
b = a − γ∇f(a)
Here b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of the gradient descent algorithm, the gamma (γ) in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) points in the direction of steepest ascent, so subtracting it gives the direction of steepest descent.
Imagine you have a machine learning problem and want to train your algorithm
with gradient descent to minimize your cost-function J(w, b) and reach its local
minimum by tweaking its parameters (w and b). The image below shows the
horizontal axes representing the parameters (w and b), while the cost
function J(w, b) is represented on the vertical axes.

We want to find the values of w and b that correspond to the minimum of the
cost function (marked with the red arrow). To start, we initialize w and b with
some random numbers. Gradient descent then starts at that point (somewhere
around the top of our illustration), and it takes one step after another in the
steepest downside direction (i.e., from the top to the bottom of the illustration)
until it reaches the point where the cost function is as small as possible.
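A minimal sketch of this loop on a one-dimensional convex cost; the function, starting point, and learning rate are illustrative assumptions:

```python
# Gradient descent on the convex cost f(w) = (w - 3)^2, whose minimum is at w = 3
def grad(w):
    return 2 * (w - 3)      # derivative of (w - 3)^2

w = 0.0                     # arbitrary starting point
gamma = 0.1                 # learning rate (step size)
for _ in range(100):
    w = w - gamma * grad(w) # the update b = a - gamma * grad(f)(a)
print(w)                    # converges to ~3.0
```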
Types of Gradient Descent
There are three popular types of gradient descent that mainly differ in the
amount of data they use — batch, stochastic and mini-batch.
Batch Gradient Descent
Batch gradient descent, also called vanilla gradient descent, calculates the error
for each example within the training dataset, but it only gets updated after all
training examples have been evaluated. This process is like a cycle and called a
training epoch.
An advantage of batch gradient descent is its computational efficiency: it
produces a stable error gradient and a stable convergence. But the stable error
gradient can sometimes result in a state of convergence that isn’t the best the
model can achieve. It also requires the entire training dataset to be in memory
and available to the algorithm.
Stochastic Gradient Descent
By contrast, stochastic gradient descent (SGD) does this for each training
example within the dataset, meaning it updates the parameters for each training
example one by one. Depending on the problem, this can make SGD faster than
batch gradient descent. One advantage is that frequent updates allow us to have
a pretty detailed rate of improvement.
Mini-Batch Gradient Descent
Mini-batch gradient descent is the go-to method since it’s a combination of the
concepts of SGD and batch gradient descent. It simply splits the training dataset
into small batches and performs an update for each of those batches. This
creates a balance between the robustness of stochastic gradient descent and the
efficiency of batch gradient descent.
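The three variants differ only in how many examples feed each parameter update, as this sketch on a toy linear-regression problem illustrates; the data, batch size, and learning rate are assumptions:

```python
import numpy as np

# Toy linear-regression data: y = X @ true_w
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def gradient(Xb, yb, w):
    """Gradient of the mean squared error over one batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w, eta, batch = np.zeros(3), 0.1, 10
for epoch in range(100):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch):      # batch=len(X): batch GD; batch=1: SGD
        sel = idx[start:start + batch]
        w -= eta * gradient(X[sel], y[sel], w)
print(w)   # approaches true_w
```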
How big the steps gradient descent takes in the direction of the local
minimum is determined by the learning rate, which figures out how fast or slow
we will move towards the optimal weights.
For the gradient descent algorithm to reach the local minimum, we must set the learning rate to an appropriate value, neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum, bouncing back and forth across the convex bowl of the cost function. If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a long time.
Challenges with gradient descent
●​ Local minima and saddle points: For convex problems, gradient descent
can find the global minimum with ease, but as nonconvex problems
emerge, gradient descent can struggle to find the global minimum, where
the model achieves the best results. Recall that when the slope of the cost
function is at or close to zero, the model stops learning.
●​ Vanishing and Exploding Gradients:
Vanishing gradients: This occurs when the gradient is too small. As we
move backwards during backpropagation, the gradient continues to
become smaller, causing the earlier layers in the network to learn more
slowly than later layers. When this happens, the weight parameters update
until they become insignificant—i.e. 0—resulting in an algorithm that is
no longer learning.
Exploding gradients: This happens when the gradient is too large,
creating an unstable model. In this case, the model weights will grow too
large, and they will eventually be represented as NaN. One solution to
this issue is to leverage a dimensionality reduction technique, which can
help to minimize complexity within the model.

Key Differences Between Linear Models and Neural Networks


The main distinction between neural networks and linear models lies in
nonlinearity. Unlike linear models, where the optimization problem remains
convex, neural networks introduce non-convex loss functions. This makes
training more challenging, as non-convex functions often have multiple local
minima, saddle points, or plateaus.
Gradient-Based Optimization for Neural Networks
●​ Neural networks are usually trained using iterative, gradient-based
optimizers.
●​ Unlike linear models (which can be solved using closed-form solutions),
training neural networks relies on gradient descent to reduce the cost
function to a low value.
●​ Methods like stochastic gradient descent (SGD) do not guarantee
convergence to a global minimum but aim to reach a good local
minimum.
Convex vs. Non-Convex Optimization
● Convex optimization (e.g., logistic regression, SVMs): Converges to a global minimum from any starting point (in theory). In practice, it is robust but may encounter numerical issues.
●​ Non-convex optimization (e.g., neural networks): Has no convergence
guarantee. The final model depends on the initial parameters and the
training dynamics.
Parameter Initialization in Neural Networks
●​ Weights: Initialized to small random values to prevent symmetry issues
during training.
●​ Biases: Can be initialized to zero or small positive values.
●​ Why? Proper initialization prevents neurons from learning redundant
features and ensures stable gradient updates.
Gradient-Based Optimization Algorithms
Most deep learning models, including feedforward networks, rely on variations
of gradient descent. Some common refinements include:
1.​ Stochastic Gradient Descent (SGD): Updates weights using mini-batches
to improve efficiency.
2.​ Momentum-based Optimization: Adds velocity terms to accelerate
convergence.
3.​ Adam (Adaptive Moment Estimation): Combines momentum and
adaptive learning rates for better training stability.
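Sketches of the momentum and Adam update rules listed above; the hyperparameter defaults shown are common choices, assumed here rather than taken from the notes:

```python
import numpy as np

def sgd_momentum(w, v, g, eta=0.01, beta=0.9):
    """Momentum: accumulate a velocity term to accelerate convergence."""
    v = beta * v + g
    return w - eta * v, v

def adam(w, m, v, g, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum plus per-parameter adaptive learning rates."""
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2       # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)            # bias correction for early steps
    v_hat = v / (1 - b2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# One illustrative momentum step with a pretend gradient
w, v = np.zeros(2), np.zeros(2)
g = np.array([0.5, -1.0])
w, v = sgd_momentum(w, v, g)
print(w)
```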
Gradient Computation & Backpropagation
Computing the gradient for a neural network is more complex than for linear
models, but it is still efficient and exact. In Section 6.5, we discuss
backpropagation, the fundamental algorithm used to compute gradients in
neural networks.
Applying Gradient-Based Learning
To apply gradient-based learning to neural networks, we need to:
1.​ Choose a cost function (e.g., Mean Squared Error for regression,
Cross-Entropy for classification).
2.​ Decide on the model's output representation (e.g., softmax for
classification, linear for regression).
6.2.1 Cost Functions
An essential part of designing a deep neural network is choosing an appropriate
cost function. Fortunately, the cost functions for neural networks are similar to
those used in other parametric models, such as linear models.
Cost Functions and Maximum Likelihood Estimation
Most parametric models define a probability distribution p(y | x), and training
typically follows the principle of maximum likelihood estimation (MLE). This
principle suggests that the best model is the one that maximizes the likelihood
of the observed data.
●​ In practice, maximizing likelihood often translates into minimizing the
negative log-likelihood.
●​ For classification tasks, this naturally leads to using the cross-entropy loss
function.
Cross-Entropy Loss Function
The cross-entropy loss is widely used in classification tasks where the model
predicts a probability distribution over classes. It measures the dissimilarity
between the true and predicted probability distributions.
For a binary classification problem, where the output y ∈ {0, 1}, the cross-entropy loss is:
L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
For a multi-class classification problem, where the output is a probability distribution over multiple classes, we use the softmax function to compute probabilities and define the cross-entropy loss as:
L = −(1/N) ∑i ∑j yij·log(ŷij)
where:
● N is the number of samples and C is the number of classes (j runs from 1 to C),
● yij is 1 if sample i belongs to class j, and 0 otherwise,
● ŷij is the predicted probability that sample i belongs to class j.
Alternative Cost Functions
In some cases, instead of predicting a full probability distribution, we only
predict a specific statistic (e.g., the mean). Different loss functions can be used
depending on the type of prediction:
1. Mean Squared Error (MSE)
o Used for regression tasks where the model predicts a continuous value.
o Formula: MSE = (1/n) ∑i (yi − ŷi)²
2. Mean Absolute Error (MAE)
o Another regression loss function, but less sensitive to outliers than MSE.
o Formula: MAE = (1/n) ∑i |yi − ŷi|
3. Huber Loss
o A combination of MSE and MAE, useful for handling outliers in regression.
4. Hinge Loss
o Used in Support Vector Machines (SVMs) for classification tasks.
o Formula: L = max(0, 1 − y·ŷ), with labels y ∈ {−1, +1}.
A short code sketch of these loss functions follows.
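A minimal sketch of these losses in Python, with made-up labels and predictions:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge(y, score):          # y in {-1, +1}, score is the raw model output
    return np.mean(np.maximum(0, 1 - y * score))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred), mae(y_true, y_pred), binary_cross_entropy(y_true, y_pred))
print(hinge(np.array([1, -1, 1]), np.array([0.8, -0.5, 2.0])))
```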
Regularization in Cost Functions
The total cost function used for training a neural network often includes a
regularization term to prevent overfitting.
● L2 Regularization (Weight Decay): λ ∑ w²
o Penalizes large weights to improve generalization.
● L1 Regularization (Lasso): λ ∑ |w|
o Encourages sparse models by driving some weights to zero.
More advanced regularization strategies, such as dropout and batch
normalization, are discussed later.
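As a sketch, adding such a penalty to a data loss might look like this; the λ value is an arbitrary assumption:

```python
import numpy as np

# Total cost = data loss + L2 penalty on the weights
def regularized_cost(data_loss, w, lam=0.01):
    return data_loss + lam * np.sum(w ** 2)   # use lam * np.sum(np.abs(w)) for L1
```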

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood


Most modern neural networks are trained using maximum likelihood estimation
(MLE). This means that the cost function is the negative log-likelihood (NLL),
which is often written in terms of cross-entropy between the training data
distribution and the model’s predicted distribution.
Mathematical Formulation
Given a neural network with parameters θ, the cost function is:
J(θ) = −E(x,y)∼p̂data [log pmodel(y | x)]
This function measures how well the model's predicted probability distribution pmodel(y|x) matches the empirical data distribution p̂data(y|x). By minimizing this function, we maximize the likelihood that the model assigns to the observed training data.
Cost Function Variability Across Models
The exact form of the cost function depends on the type of model and output
distribution.
For example:
● If the model predicts Gaussian-distributed outputs, minimizing negative log-likelihood reduces to minimizing Mean Squared Error (MSE):
J(θ) = (1/2) E(x,y)∼p̂data ||y − f(x; θ)||² + const
Here, f(x; θ) represents the model's predicted mean for a Gaussian distribution.
● The discarded constant term comes from the Gaussian variance, which is not parameterized in this case.
Thus, the equivalence between MLE and minimizing MSE is not limited to
linear models—it holds for any model predicting the mean of a Gaussian.
Advantages of Maximum Likelihood-Based Cost Functions
1.​ Automated Cost Function Selection:
o​ Specifying the output probability distribution p(y | x)
automatically determines the cost function as −log p(y∣x).
o​ This removes the need to manually design different cost
functions for different models.
2.​ Better Gradient Properties for Learning:
o​ A good learning algorithm requires a gradient that is large and
predictable.
o​ Some functions, like sigmoid, saturate (become very flat), making
their gradients too small for effective learning.
o​ The negative log-likelihood (NLL) helps mitigate this issue
because:
▪​ Many output activation functions involve an exponential
(exp) function.
▪​ The log function in NLL cancels out the exp term, reducing
saturation effects.
Cross-Entropy Cost and Its Unique Properties
Unlike standard cost functions, the cross-entropy loss (which comes from
maximizing likelihood) has some unusual properties:
1.​ No Well-Defined Minimum Value:
o​ In many neural network models, the cost function does not have a
minimum value in the traditional sense.
o​ For discrete classification problems (e.g., softmax or logistic
regression), the model’s probability output can never be exactly 0
or 1 but can approach these values arbitrarily closely.
2.​ Divergence in Real-Valued Models:
o​ If the model learns to adjust the variance of a Gaussian output
distribution, it can assign an extremely high probability density to
the correct output values.
o​ This makes the cross-entropy loss tend toward negative infinity.

Preventing Instabilities in Maximum Likelihood Learning


Because models trained with MLE can sometimes over-optimize and assign
extreme likelihood values to training samples, we use regularization techniques
to stabilize learning.
Common Regularization Techniques
●​ L2 Regularization (Weight Decay): Penalizes large weights to prevent
overfitting.
●​ Dropout: Randomly deactivates neurons during training to improve
generalization.
●​ Batch Normalization: Normalizes activations to keep them in a stable
range.

5. Multilayer Feedforward Neural Networks (MLFFNN)


MLFFNNs are composed of multiple layers of neurons with connections that do
not form cycles.
Features:
●​ Information flows in one direction.
●​ Can approximate complex functions.
●​ Typically trained using backpropagation.
6. Backpropagation Learning
Backpropagation is the most common training algorithm for neural networks.
Steps in Backpropagation:
1.​ Forward Pass: Compute the output by passing input through the
network.
2.​ Loss Calculation: Measure the difference between predicted and actual
output.
3.​ Backward Pass: Compute gradients of the loss function with respect to
weights.
4.​ Weight Update: Adjust weights using gradient descent.
Loss Function:
Common loss functions include:
●​ Mean Squared Error (MSE)
●​ Cross-Entropy Loss
7. Empirical Risk Minimization
Empirical risk minimization (ERM) is the principle of minimizing the average
loss over the training data.
Remp(f) = (1/n) ∑i=1..n L(f(xi), yi)
ERM seeks to approximate the expected risk by minimizing empirical risk.
8. Bias-Variance Tradeoff
The bias-variance tradeoff refers to the balance between:
●​ Bias: Error due to overly simplistic assumptions.
●​ Variance: Error due to sensitivity to training data.
Solutions to Address the Tradeoff:
●​ Increase training data.
●​ Use regularization techniques.
●​ Choose appropriate model complexity.
9. Regularization
Regularization techniques are used to prevent overfitting by adding constraints
to the learning process.
Common Regularization Techniques:
●​ L1 Regularization (Lasso): Adds absolute weight penalties.
●​ L2 Regularization (Ridge): Adds squared weight penalties.
●​ Dropout: Randomly drops neurons during training.
10. Output Units: Linear, Softmax
Output units determine the type of prediction the neural network makes.
Linear Output Units:
●​ Suitable for regression tasks.
●​ No activation function applied.
Softmax Output Units:
●​ Used for multi-class classification.
●​ Converts logits into probabilities.
σ(zi) = e^(zi) / ∑j=1..n e^(zj)
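A numerically stable implementation sketch; subtracting the maximum logit before exponentiating is a standard trick, assumed here:

```python
import numpy as np

def softmax(z):
    """Convert logits into probabilities; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```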
11. Hidden Units: tanh, ReLU
Hidden units apply activation functions to introduce non-linearity.
Common Activation Functions:
1.​ Tanh (Hyperbolic Tangent):
o​ Ranges from -1 to 1.
o​ Helps with gradient flow.
2.​ ReLU (Rectified Linear Unit):
o​ Outputs zero for negative inputs and linear for positive.
o​ Efficient and widely used in deep learning.
ReLU(x)=max(0,x)

Define activation function. Explain different types of activation functions.


Activation functions are an extremely important feature of the artificial neural network. They basically decide whether a neuron should be activated or not, and they limit the output signal to a finite value.
The activation function applies a non-linear transformation to the input, making the network capable of learning more complex relations between input and output and of learning more complex patterns.
Without an activation function, the neural network is just a linear regression model, as it performs only a summation of the products of inputs and weights. For example, some data requires a complex, curved relation rather than a simple linear one.

An activation function must also be efficient and should reduce computation time, because the neural network is sometimes trained on millions of data points.

Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions

Binary Step Function

A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it stays deactivated. We choose a threshold value that decides whether the neuron should be activated or deactivated. It is very simple and useful for binary classification problems. E.g., f(x) = 1 if x > 0, else 0 if x <= 0.
Linear or Identity Activation Function: As you can see, the function is a line, so the output of the function is not confined to any range.

Equation: f(x) = x. Range: (−infinity to infinity). It doesn't help with the complexity or various parameters of the usual data fed to neural networks.
Non-linear Activation Function
Nonlinear activation functions are the most used activation functions. Nonlinearity lets the network fit curved relationships in the data.

The main terminologies needed to understand nonlinear functions are:
Derivative or Differential: the change along the y-axis with respect to the change along the x-axis; also known as the slope.
Monotonic function: a function which is either entirely non-increasing or non-decreasing.
The nonlinear activation functions are mainly divided on the basis of their range or curves.
Advantages of non-linear functions over the linear function: differentiation is possible for all the non-linear functions; stacking of layers is possible, which helps in creating deep neural nets; and it makes it easier for the model to generalize.
Sigmoid (Logistic AF) (σ): The main reason we use the sigmoid function is that its output lies between 0 and 1. It is especially used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the output of each neuron.
Disadvantages:
∙ Vanishing gradient—for very high or very low values of X, there is almost no
change to the prediction, causing a vanishing gradient problem. This can result
in the network refusing to learn further, or being too slow to reach an accurate
prediction.
∙ Outputs not zero centered.
∙ Computationally expensive

TanH (Hyperbolic Tangent AF):

TanH is similar to the logistic sigmoid, but often behaves better. The range of the TanH function is from −1 to +1. TanH is often preferred over the sigmoid neuron because it is zero centred. The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero on the tanh graph.
ReLU (Rectified Linear Unit):
The ReLU is the most used activation function. It is used in the hidden layers of almost all convolutional neural networks. The ReLU is half rectified (from the bottom):
f(z) = 0 if z < 0, and f(z) = z otherwise; equivalently, R(z) = max(0, z).
The range is 0 to infinity.
Advantages
∙ Avoids vanishing gradient problem.
∙ Computationally efficient—allows the network to converge very quickly
∙ Non-linear—although it looks like a linear function, ReLU has a derivative
function and allows for backpropagation
Disadvantages
∙ Can only be used within hidden layers.
∙ Hard to train on small datasets; it needs a lot of data to learn non-linear behavior.
∙ The Dying ReLU problem—when inputs approach zero, or are negative, the
gradient of the function becomes zero, the network cannot perform
backpropagation and cannot learn.
Leaky ReLU Activation Function
The Leaky ReLU activation function was introduced to solve the "Dying ReLU" problem. With Leaky ReLU we do not set all negative inputs to zero but to a value near zero, which solves the major issue of the ReLU activation function.

R(z) = max(0.1*z,z)
Advantages
∙ Prevents dying ReLU problem—this variation of ReLU has a small positive
slope in the negative area, so it does enable backpropagation, even for negative
input values
∙ Otherwise like ReLU
Disadvantages
∙ Results not consistent—leaky ReLU does not provide consistent predictions
for negative input values.

Softmax:
∙ Sigmoid is not able to handle more than two cases (class labels).
∙ Softmax can handle multiple classes. The softmax function squeezes the output for each class between 0 and 1, with their sum equal to 1.
∙ It is ideally used in the final output layer of a classifier, where we are actually trying to obtain the probabilities.
∙ Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes instead of only binary solutions.
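To tie the section together, here is a sketch of the activation functions discussed above; the Leaky ReLU slope of 0.1 matches the formula given earlier, and the test inputs are arbitrary:

```python
import numpy as np

def binary_step(x):
    return np.where(x > 0, 1, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.1):          # small positive slope for negative inputs
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - np.max(x))          # stabilized exponentials
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (binary_step, sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
print("softmax", softmax(x))           # outputs sum to 1
```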
