Deep Learning Assignment 01
Date: [30-03-2025]
Answer:
Neural Network Architecture:
The architecture of a neural network is made up of an input layer, one or more hidden layers, and an output layer. Neural networks
themselves, or artificial neural networks (ANNs), are a subset of machine learning designed to mimic the
way the human brain processes information. Neural networks function by passing data through layers of
artificial neurons.
A typical CNN architecture is made up of three main components: the input layer, the hidden layers, and the
output layer. The input layer receives the input image and passes it to the hidden layers, which are made up of
multiple convolutional and pooling layers. The output layer provides the predicted class label or probability
scores for each class.
A common architecture for a CNN is to have multiple convolutional layers, followed by one or more pooling
layers, and then a fully connected layer that provides the final output.
The layers of a Convolutional Neural Network (CNN) can be broadly classified into the following categories:
Convolutional Layer:
The convolutional layer is responsible for extracting features from the input image. It performs a convolution
operation on the input image, where a filter or kernel is applied to the image to identify and extract specific
features.
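As an illustration of this sliding-window operation, here is a small NumPy sketch of a 2D convolution (implemented as cross-correlation, the form typically used in CNN layers); the image size and kernel values are arbitrary examples, and conv2d is a hypothetical helper rather than a library function.

```python
# Slide a small filter (kernel) over an image to produce a feature map.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value is the sum of an element-wise product
            # between the kernel and the image patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
feature_map = conv2d(image, edge_kernel)          # shape (4, 4)
```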
Pooling Layer:
The pooling layer is responsible for reducing the spatial dimensions of the feature maps produced by the
convolutional layer. It performs a down-sampling operation to reduce the size of the feature maps and reduce
computational complexity.
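A brief NumPy sketch of 2x2 max pooling, using an arbitrary 4x4 feature map, shows how the spatial dimensions are halved:

```python
# 2x2 max pooling: keep the largest value in each non-overlapping 2x2 window,
# halving the height and width of the feature map.
import numpy as np

def max_pool2d(fmap: np.ndarray, size: int = 2) -> np.ndarray:
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(fmap)   # shape (2, 2): [[5, 7], [13, 15]]
```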
Activation Layer:
The activation layer applies a non-linear activation function, such as the ReLU function, to the output of the
pooling layer. This function helps to introduce non-linearity into the model, allowing it to learn more complex
representations of the input data.
Fully Connected Layer:
The fully connected layer is a traditional neural network layer that connects all the neurons in the previous
layer to all the neurons in the next layer. This layer is responsible for combining the features learned by the
convolutional and pooling layers to make a prediction.
Real-World Applications of CNNs
1. Facial Recognition and Social Media
CNNs are widely used for facial recognition in security systems, such as Apple’s Face ID and Amazon
Rekognition. In social media, platforms like Facebook and Instagram use CNNs for automatic tagging and
content moderation by detecting inappropriate images or videos.
2. Autonomous Vehicles
Self-driving car companies like Tesla, Waymo, and NVIDIA use CNN-based deep learning models to
analyze road conditions, detect obstacles, and interpret traffic signs in real-time. This technology improves the
safety and efficiency of autonomous navigation.
3. Healthcare Imaging
CNNs are revolutionizing medical imaging by enabling early disease detection. Companies like Google
Health and IBM Watson use CNNs to analyze X-rays, MRIs, and CT scans to detect diseases such as cancer,
pneumonia, and diabetic retinopathy with higher accuracy than traditional methods.
4. Fraud Detection
Banks and financial institutions like JP Morgan and Mastercard use CNNs to analyze transaction patterns
and detect anomalies that indicate fraud. CNNs can identify suspicious activities in credit card transactions
and stock market trading by learning from vast financial datasets.
5. Recommendation Systems and Retail
CNNs power recommendation systems in platforms like Amazon, Netflix, and Alibaba by analyzing user
behavior. These models suggest personalized products, improving customer engagement and increasing
sales. In fashion retail, CNNs help companies like Zalando and ASOS with virtual try-ons and clothing
recognition.
6. Predictive Maintenance in Manufacturing
Manufacturing companies like Siemens, General Electric (GE), and Bosch use CNNs for predictive
maintenance. CNN models analyze sensor data and machine images to detect potential failures before
they occur, reducing downtime and increasing operational efficiency.
The input layer x receives and processes the neural network’s input before passing it on to the middle layer.
In the middle layer h, multiple hidden layers can be found, each with its own activation functions, weights, and
biases. You can utilize a recurrent neural network if the various parameters of the different hidden layers are not
impacted by the preceding layer, i.e., if there is no memory in the neural network.
RNN Architecture
RNNs are a type of neural network with hidden states that allow past outputs to be used as inputs. A typical
RNN is organized as follows:
Input Layer: This layer receives the initial element of the sequence data. For example, in a sentence, it
might receive the first word as a vector representation.
Hidden Layer: The heart of the RNN, the hidden layer contains a set of interconnected neurons. Each
neuron processes the current input along with the hidden state from the previous time
step. This “state” captures the network’s memory of past inputs, allowing it to understand the current
element in context.
Activation Function: This function introduces non-linearity into the network, enabling it to learn
complex patterns. It transforms the combined input from the current input layer and the previous
hidden layer state before passing it on.
Output Layer: The output layer generates the network’s prediction based on the processed
information. In a language model, it might predict the next word in the sequence.
Recurrent Connection: A key distinction of RNNs is the recurrent connection within the hidden
layer. This connection allows the network to pass the hidden state information (the network’s
memory) to the next time step. It’s like passing a baton in a relay race, carrying information about
previous inputs forward.
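As a rough illustration of this recurrence, the NumPy sketch below unrolls a single RNN cell over a short sequence; the dimensions and weight values are arbitrary, and rnn_forward is a hypothetical helper, not a library function.

```python
# Minimal RNN cell unrolled over a sequence: the hidden state h carries
# "memory" of earlier inputs into each new time step.
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.zeros(W_hh.shape[0])                      # initial hidden state
    outputs = []
    for x_t in inputs:                               # one sequence element per step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # recurrent connection
        outputs.append(W_hy @ h + b_y)               # prediction at this time step
    return outputs, h

# Illustrative dimensions: 4-dim inputs, 8-dim hidden state, 3-dim outputs.
rng = np.random.default_rng(0)
inputs = [rng.standard_normal(4) for _ in range(5)]
W_xh = rng.standard_normal((8, 4))
W_hh = rng.standard_normal((8, 8))
W_hy = rng.standard_normal((3, 8))
outputs, final_state = rnn_forward(inputs, W_xh, W_hh, W_hy, np.zeros(8), np.zeros(3))
```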
Reference: [Link] networks-rnn/
| Aspect | CNN / Feedforward Network | RNN |
| --- | --- | --- |
| Input Processing | Processes data as a single, static input vector | Processes sequential data, where each input is related to the previous ones |
Real-World Applications of RNNs
1. Speech Recognition
Virtual assistants like Apple’s Siri, Amazon Alexa, and Google Assistant use RNNs to process spoken
language and generate appropriate responses. These systems analyze voice inputs in real-time to convert
them into text and understand intent.
2. Machine Translation
Google Translate, DeepL, and Microsoft Translator use RNN-based models (like LSTMs and GRUs) to
translate languages by understanding sentence structure, grammar, and context. These models provide
context-aware translations rather than word-to-word conversions.
3. Text Generation and Chatbots
Chatbots like OpenAI’s ChatGPT, Meta’s BlenderBot, and Google’s Bard use advanced RNN models to
generate text, hold conversations, and even assist in content creation. AI-driven writing tools like Jasper AI
and [Link] leverage RNNs for automated blog writing, email drafting, and creative storytelling.
4. Time Series Prediction and Forecasting
Financial firms like Goldman Sachs and Bloomberg use RNNs to analyze historical stock market data and
predict future trends. Similarly, weather forecasting agencies like The National Weather Service (NWS)
and AccuWeather use RNNs to analyze past climate data for more accurate weather predictions.
5. Music Generation
AI music platforms like OpenAI’s MuseNet, Google Magenta, and AIVA use RNNs to compose new music by
learning from thousands of existing pieces. These models can generate melodies, harmonies, and even full
compositions in different styles.
6. Automatic Caption Generation
YouTube, Facebook, and Netflix use RNN-based models for automatic caption generation. These models
analyze video content and generate subtitles, making video browsing more accessible to people with hearing
impairments.
7. Fraud Detection and Cybersecurity
Banks like JPMorgan Chase and PayPal use RNNs for fraud detection by analyzing transaction sequences
and detecting unusual spending patterns. Similarly, cybersecurity firms like Darktrace and CrowdStrike
employ RNNs to monitor network traffic and identify suspicious activity.
8. Sentiment Analysis
Social media platforms like Twitter and Facebook use RNNs for sentiment analysis, helping companies track
public opinion and brand perception. Companies like Amazon and Yelp analyze customer reviews to
understand user satisfaction and improve their services.
9. Automated Trading and Investment Recommendations
Investment platforms like Robinhood and Bloomberg Terminal use RNNs to analyze market trends, news,
and economic indicators to provide investment recommendations and automated trading strategies.
10. Genomics and Drug Discovery
Biotech companies like Deep Genomics and Illumina use RNNs to study DNA sequences and predict
mutations, helping researchers identify genetic disorders, drug interactions, and disease risks.
Advantages:
Handle sequential data effectively, including text, speech, and time series.
Disadvantages:
Reference: [Link] architectures-f571dd6a39c7
Two commonly used activation functions beyond Sigmoid are ReLU and Tanh.
The Rectified Linear Unit (ReLU) activation function has become the default choice in deep learning
architectures due to its simplicity and effectiveness. It is widely used in Convolutional Neural Networks
(CNNs) and other deep learning models because of its ability to introduce non-linearity while maintaining
computational efficiency. This document analyzes ReLU, discussing its advantages, limitations, and possible
alternatives.
Mathematical Definition

f(x) = max(0, x)

where:
x is the input to the activation function; negative inputs are mapped to 0, while positive inputs pass through unchanged.
This simple thresholding operation makes ReLU computationally efficient compared to traditional
activation functions like sigmoid and tanh, which require expensive exponential calculations (Goodfellow et
al., 2016).
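A small NumPy comparison (with arbitrary input values) illustrates the point: ReLU is a plain threshold, while sigmoid and tanh both rely on the exponential function.

```python
# ReLU is a simple threshold; sigmoid and tanh require exponentials.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # f(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # requires exp()

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(sigmoid(x))    # smooth values in (0, 1)
print(np.tanh(x))    # smooth values in (-1, 1)
```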
Advantages of ReLU
1. Computational Efficiency
Unlike sigmoid and tanh, ReLU does not require exponential calculations. It only involves a
simple comparison, making it highly efficient (Nair & Hinton, 2010).
2. Mitigates the Vanishing Gradient Problem
The vanishing gradient problem occurs when gradients become too small during
backpropagation, preventing deep networks from learning effectively. Since ReLU’s derivative is
either 1 (for x > 0) or 0 (for x ≤ 0), it avoids gradient shrinkage for positive inputs (Glorot et al.,
2011).
3. Sparse Activation
Many neurons output zero due to ReLU’s thresholding behavior, which leads to sparsity in
activations. This sparsity helps in reducing overfitting and enhancing model generalization
(Glorot et al., 2011).
4. Faster Convergence
Empirical studies show that networks trained with ReLU tend to converge faster compared to
those using sigmoid or tanh (He et al., 2015).
Limitations of ReLU
1. Dying ReLU Problem
If a neuron’s weights are updated in such a way that it always outputs a negative value, it will
permanently output zero, effectively becoming inactive. This is known as the dying ReLU
problem (Goodfellow et al., 2016).
Studies show that up to 40% of neurons can become inactive in certain networks if the learning
rate is too high (Maas et al., 2013).
2. Unbounded Outputs
While ReLU does not suffer from vanishing gradients, its unbounded nature can lead to
exploding gradients in deep networks. This often requires additional techniques such as
gradient clipping or batch normalization (Ioffe & Szegedy, 2015).
Alternatives to ReLU
Leaky ReLU: Instead of outputting zero for negative inputs, Leaky ReLU applies a small fixed slope (alpha), so some gradient can still flow through otherwise inactive neurons.
Parametric ReLU (PReLU): Similar to Leaky ReLU, but the slope alpha is learned during training instead of being fixed.
Exponential Linear Unit (ELU): ELU allows small negative outputs instead of zero, helping to smooth gradient updates.
Use Leaky ReLU or PReLU if you notice many neurons becoming inactive.
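For illustration, here is a minimal NumPy sketch of these variants; the alpha values shown are common defaults, not requirements.

```python
# Illustrative ReLU variants that keep a small signal for negative inputs,
# which helps avoid "dead" neurons.
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                    # fixed small negative slope

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)                    # alpha is learned during training

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))    # smooth negative tail

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))          # [-0.03 -0.01  0.    2.  ]
print(prelu(x, alpha=0.25))   # [-0.75 -0.25  0.    2.  ]
print(elu(x))                 # approx [-0.95 -0.63  0.    2.  ]
```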
Conclusion
ReLU is a powerful activation function that has revolutionized deep learning by providing computational
efficiency, better convergence, and reduced vanishing gradients. However, it also has some limitations,
particularly the dying neuron problem. Alternative versions like Leaky ReLU, PReLU, and ELU can help
mitigate these issues. Proper learning rate tuning and batch normalization are recommended when using
ReLU in deep networks.
Tanh (Hyperbolic Tangent) Activation Function

Mathematical Definition

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Where:
x is the input; the output is bounded between −1 and 1.
The tanh function has several advantages that make it widely used in neural networks:
1. Non-linearity:
Tanh introduces non-linearity to the model, which allows neural networks to learn complex patterns and
relationships in the data. Without non-linear activation functions, a neural network would essentially behave
as a linear model, no matter how many layers it has.
2. Zero-Centered Output:
The output of the tanh function is centered around 0, unlike the sigmoid function, which outputs values
between 0 and 1. This makes the tanh activation function more useful for many types of tasks, as the mean of
the output is closer to zero, leading to more efficient training and faster convergence.
3. Gradient Behavior:
Tanh helps mitigate the vanishing gradient problem (to some extent), especially when compared to
sigmoid activation. This is because the gradient of the tanh function is generally higher than that of the
sigmoid, enabling better weight updates during backpropagation.
Derivative of Tanh
The derivative of the tanh function is also useful in the backpropagation step of training neural networks:
d/dx tanh(x) = 1 − tanh²(x)
The derivative is always positive and reaches its maximum value of 1 at x = 0, which helps with gradient-based optimization.
However, just like the sigmoid, tanh also suffers from the vanishing gradient problem when the input values
are too large or too small. This can cause gradients to become very small, leading to slower training in
deeper networks.
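The following NumPy sketch (with arbitrary sample inputs) shows both behaviors: the derivative 1 − tanh²(x) peaks at 1 for x = 0 and collapses toward 0 for large |x|.

```python
# tanh and its derivative 1 - tanh(x)^2: the derivative is largest at x = 0
# and shrinks toward 0 as |x| grows (the saturation behind vanishing gradients).
import numpy as np

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))           # approx [-1.00 -0.76  0.    0.76  1.00]
print(tanh_derivative(x))   # approx [ 0.0002  0.42  1.    0.42  0.0002]
```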
1. Balanced Weight Updates:
Since the output is centered around zero, the network has a better chance of balancing the weights. This
helps to keep the gradients from steadily growing or shrinking in magnitude, making training
faster and more stable.
2. Improved Convergence:
The tanh function is differentiable, making it a good candidate for training deep networks using gradient-
based optimization algorithms like stochastic gradient descent (SGD).
Unlike the sigmoid, which is constrained between 0 and 1, the tanh function’s output between -1 and 1
helps in better weight updates during training, leading to improved convergence speed.
Disadvantages of Tanh
1. Vanishing Gradient Problem:
Similar to the sigmoid function, tanh suffers from the vanishing gradient problem for large values of the
input (both positive and negative). When the input to the tanh function is very large or very small, the
gradient approaches zero, which can slow down or halt learning during backpropagation, especially in
deep networks.
2. Not Ideal for All Architectures:
While tanh works well in many cases, it might not be the best option for all types of neural network
architectures. For instance, ReLU (Rectified Linear Unit) has gained popularity for deep networks due to its
simplicity and efficiency in mitigating the vanishing gradient problem.
3. Sensitive to Outliers:
Extreme values in the input can lead to saturated regions where the gradient is close to zero, making
learning slow or ineffective. This could happen if the inputs to the tanh function are not scaled properly.
Reference: [Link]
Role of the Loss Function
1. Guide Optimization:
The loss function is the basis for the optimization process. During training, algorithms such as Gradient
Descent use the loss function to adjust the model's parameters, aiming to reduce the error and improve
the model’s predictions.
2. Measure Performance:
By quantifying the difference between predicted and actual values, the loss function provides a benchmark
for evaluating the model's performance. Lower loss values generally indicate better performance.
3. Influence Learning Dynamics:
The choice of loss function affects the learning dynamics, including how fast the model learns and what
kind of errors are penalized more heavily. Different loss functions can lead to different learning behaviors
and results.
Mean Squared Error (MSE)
Mathematical Definition

MSE = (1/n) Σ (y_i − ŷ_i)²

where n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value.
MSE is widely used due to its convex nature, making it easy to optimize with Gradient Descent. However, its
sensitivity to outliers can sometimes be a drawback.
Interpretation: The model’s average squared error is 175 (in $1000s squared). A lower MSE means better
predictions.
Advantages of MSE
1. Differentiability
MSE is differentiable everywhere, ensuring that Gradient Descent converges efficiently (Deep Learning by
Ian Goodfellow, MIT Press). This property is crucial in deep learning, where non-differentiable loss functions
can cause optimization problems.
2. Penalizes Large Errors More Heavily
Because the error is squared, larger errors have a bigger impact on the loss than smaller ones. This is useful
when large deviations must be minimized, such as in medical diagnosis models (Machine Learning: A
Probabilistic Perspective by Kevin P. Murphy, MIT Press).
3. Optimal Under Gaussian Noise
MSE is the optimal estimator under a Gaussian noise assumption, meaning it works well when data follows
a normal distribution (Pattern Recognition and Machine Learning by Christopher M. Bishop, Springer).
Limitations of MSE
1. Sensitive to Outliers
A single large error can dominate the total loss, leading to poor model performance (U-Net: Convolutional
Networks for Biomedical Image Segmentation by Ronneberger et al., arXiv). For example:
Here, the outlier at House #4 has inflated the MSE dramatically. In such cases, Huber Loss or Mean
Absolute Error (MAE) is preferred.
2. Scale Dependency
MSE is not scale-invariant, meaning the same model can have very different MSE values depending on the
dataset. For example:
Predicting house prices ($100,000s) results in a much larger MSE than predicting student test scores
(out of 100), even if both models perform similarly.
Solution? Use Root Mean Squared Error (RMSE) to get a comparable metric.
3. No Indication of Error Direction
Since squaring removes sign information, MSE does not indicate whether predictions overestimate or
underestimate actual values. Mean Bias Deviation (MBD) is often used alongside MSE to track prediction
trends.
Alternatives to MSE
Mean Absolute Error (MAE): More robust to outliers, but lacks smooth derivatives for optimization (Deep Learning by Ian
Goodfellow, MIT Press).
Huber Loss: Less sensitive to outliers, widely used in robust regression models (Machine Learning: A Probabilistic
Perspective by Kevin P. Murphy, MIT Press).
Real-World Applications of MSE
1. Weather Forecasting:
Predicting temperature, humidity, and rainfall using MSE as an accuracy metric. Source: American
Meteorological Society's Glossary of Meteorology
2. Financial Modeling:
Used in stock market price prediction models. Source: "Forecast Skill" - Wikipedia
3. Autonomous Vehicles:
MSE helps train deep learning models for object detection and lane tracking. Source: A Real-Time Object Detector
for Autonomous Vehicles Based on Improved YOLOv2
4. Energy and Load Forecasting:
In power grids, MSE helps predict electricity consumption trends, ensuring stable power supply
management. Source: "Load Forecasting Techniques and Their Applications in Smart Grids" by Lazos et al.
Conclusion
MSE is a powerful and widely used loss function in regression but has limitations in handling outliers and
scale dependency. While it is optimal for Gaussian-distributed errors, alternative loss functions like Huber
Loss or MAE should be considered for datasets with outliers or skewed distributions.
Cross-Entropy Loss (CEL)

Formula

L = − Σ y_i log(ŷ_i)

where y_i is the true label (1 for the correct class, 0 otherwise) and ŷ_i is the predicted probability for class i.

Core Intuition
This forces the model to assign higher probabilities to the correct class, improving classification
accuracy.
Unlike Mean Squared Error (MSE), which treats errors linearly, CEL penalizes incorrect predictions
exponentially, pushing the model toward sharper decision boundaries.
Softmax ensures that the raw class scores are converted into a valid probability distribution: each output lies between 0 and 1, and the outputs sum to 1.
When used with Sigmoid/Softmax, the gradient becomes small as confidence increases.
MSE treats errors linearly, failing to distinguish between slight and major misclassifications.
Example:
If a model predicts 0.49 instead of 0.51 for a binary class, MSE sees this as a small error, even though
it leads to misclassification.
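A quick NumPy check of this scenario (binary case, true label 1, with illustrative predicted probabilities) shows how cross-entropy and squared error penalize the same predictions differently:

```python
# Cross-entropy punishes a confident wrong prediction far more harshly than
# squared error, while both treat 0.49 vs 0.51 as nearly the same loss.
import numpy as np

def binary_cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

y = 1.0
for p in (0.51, 0.49, 0.05):   # near the decision boundary vs confidently wrong
    print(p, squared_error(y, p), binary_cross_entropy(y, p))
# 0.51 -> squared error 0.2401, cross-entropy 0.67
# 0.49 -> squared error 0.2601, cross-entropy 0.71
# 0.05 -> squared error 0.9025, cross-entropy 3.00
```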
Reference: ResNet: Deep Residual Learning for Image Recognition (CVPR 2016)
c) Autonomous Systems
Self-driving cars classify objects like pedestrians, traffic lights, and vehicles.
Reference: YOLO: You Only Look Once - Object Detection (CVPR 2016)
Limitations of Cross-Entropy Loss
1. Overconfidence:
Deep networks can become too confident, making small mistakes costly.
Solution: Add Label Smoothing to prevent the model from assigning 100% probability to one class (a minimal sketch follows this list).
2. Sensitivity to Noisy Labels:
If the dataset contains incorrect labels, CEL still forces the model to fit them, leading to overfitting.
3. Class Imbalance:
If a dataset is highly imbalanced (e.g., 95% Class A, 5% Class B), CEL can fail because it optimizes for
the majority class.
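As referenced above, here is a minimal sketch of label smoothing, assuming NumPy and a hypothetical smooth_labels helper; eps = 0.1 is just an illustrative value.

```python
# Label smoothing: move a small amount of probability mass (eps) from the true
# class to all classes, so the target is never a hard 100% for one class.
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

y = np.array([0.0, 0.0, 1.0, 0.0])   # hard one-hot target
print(smooth_labels(y))               # [0.025 0.025 0.925 0.025]
```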
Final Verdict
Cross-Entropy Loss is the gold standard for classification because it directly optimizes predicted class probabilities, penalizes confident wrong predictions heavily, and works naturally with Sigmoid/Softmax outputs, while its main limitations can be managed with techniques such as label smoothing.