Deep Learning Assignment 01
Date: [30-03-2025]
Answer:
Neural Network Architecture:
The architecture of a neural network is made up of an input layer, one or more hidden layers, and an output layer. Neural networks
themselves, or artificial neural networks (ANNs), are a subset of machine learning designed to mimic the
way the human brain processes information. Neural networks function by passing data through layers of
artificial neurons.
A typical CNN architecture is made up of three main components: the input layer, the hidden layers, and the
output layer. The input layer receives the input image and passes it to the hidden layers, which are made up of
multiple convolutional and pooling layers. The output layer provides the predicted class label or probability
scores for each class.
A common architecture for a CNN is to have multiple convolutional layers, followed by one or more pooling
layers, and then a fully connected layer that provides the final output.
The layers of a Convolutional Neural Network (CNN) can be broadly classified into the following categories:
Convolutional Layer:
The convolutional layer is responsible for extracting features from the input image. It performs a convolution
operation on the input image, where a filter or kernel is applied to the image to identify and extract specific
features.
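As an illustration of this sliding-window operation, here is a small NumPy sketch of a 2D convolution (implemented as cross-correlation, the form typically used in CNN layers); the image size and kernel values are arbitrary examples, and conv2d is a hypothetical helper rather than a library function.

```python
# Slide a small filter (kernel) over an image to produce a feature map.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value is the sum of an element-wise product
            # between the kernel and the image patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
feature_map = conv2d(image, edge_kernel)          # shape (4, 4)
```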
Pooling Layer:
The pooling layer is responsible for reducing the spatial dimensions of the feature maps produced by the
convolutional layer. It performs a down-sampling operation to reduce the size of the feature maps and reduce
computational complexity.
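A brief NumPy sketch of 2x2 max pooling, using an arbitrary 4x4 feature map, shows how the spatial dimensions are halved:

```python
# 2x2 max pooling: keep the largest value in each non-overlapping 2x2 window,
# halving the height and width of the feature map.
import numpy as np

def max_pool2d(fmap: np.ndarray, size: int = 2) -> np.ndarray:
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(fmap)   # shape (2, 2): [[5, 7], [13, 15]]
```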
Activation Layer:
The activation layer applies a non-linear activation function, such as the ReLU function, to the output of the
pooling layer. This function helps to introduce non-linearity into the model, allowing it to learn more complex
representations of the input data.
Fully Connected Layer:
The fully connected layer is a traditional neural network layer that connects all the neurons in the previous
layer to all the neurons in the next layer. This layer is responsible for combining the features learned by the
convolutional and pooling layers to make a prediction.
Real-World Applications of CNNs
1. Facial Recognition and Social Media
CNNs are widely used for facial recognition in security systems, such as Apple’s Face ID and Amazon
Rekognition. In social media, platforms like Facebook and Instagram use CNNs for automatic tagging and
content moderation by detecting inappropriate images or videos.
2. Autonomous Vehicles
Self-driving car companies like Tesla, Waymo, and NVIDIA use CNN-based deep learning models to
analyze road conditions, detect obstacles, and interpret traffic signs in real-time. This technology improves the
safety and efficiency of autonomous navigation.
3. Healthcare Imaging
CNNs are revolutionizing medical imaging by enabling early disease detection. Companies like Google
Health and IBM Watson use CNNs to analyze X-rays, MRIs, and CT scans to detect diseases such as cancer,
pneumonia, and diabetic retinopathy with higher accuracy than traditional methods.
4. Fraud Detection
Banks and financial institutions like JP Morgan and Mastercard use CNNs to analyze transaction patterns
and detect anomalies that indicate fraud. CNNs can identify suspicious activities in credit card transactions
and stock market trading by learning from vast financial datasets.
5. Recommendation Systems and Retail
CNNs power recommendation systems in platforms like Amazon, Netflix, and Alibaba by analyzing user
behavior. These models suggest personalized products, improving customer engagement and increasing
sales. In fashion retail, CNNs help companies like Zalando and ASOS with virtual try-ons and clothing
recognition.
6. Predictive Maintenance in Manufacturing
Manufacturing companies like Siemens, General Electric (GE), and Bosch use CNNs for predictive
maintenance. CNN models analyze sensor data and machine images to detect potential failures before
they occur, reducing downtime and increasing operational efficiency.
The input layer x receives and processes the neural network’s input before passing it on to the middle layer.
In the middle layer h, multiple hidden layers can be found, each with its own activation functions, weights, and
biases. You can utilize a recurrent neural network if the various parameters of the different hidden layers are not
impacted by the preceding layer, i.e., if there is no memory in the neural network.
RNN Architecture
RNNs are a type of neural network with hidden states that allow past outputs to be used as inputs. A typical
RNN is organized as follows:
Input Layer: This layer receives the initial element of the sequence data. For example, in a sentence, it
might receive the first word as a vector representation.
Hidden Layer: The heart of the RNN, the hidden layer contains a set of interconnected neurons. Each
neuron processes the current input along with the hidden state from the previous time
step. This “state” captures the network’s memory of past inputs, allowing it to understand the current
element in context.
Activation Function: This function introduces non-linearity into the network, enabling it to learn
complex patterns. It transforms the combined input from the current input layer and the previous
hidden layer state before passing it on.
Output Layer: The output layer generates the network’s prediction based on the processed
information. In a language model, it might predict the next word in the sequence.
Recurrent Connection: A key distinction of RNNs is the recurrent connection within the hidden
layer. This connection allows the network to pass the hidden state information (the network’s
memory) to the next time step. It’s like passing a baton in a relay race, carrying information about
previous inputs forward.
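As a rough illustration of this recurrence, the NumPy sketch below unrolls a single RNN cell over a short sequence; the dimensions and weight values are arbitrary, and rnn_forward is a hypothetical helper, not a library function.

```python
# Minimal RNN cell unrolled over a sequence: the hidden state h carries
# "memory" of earlier inputs into each new time step.
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.zeros(W_hh.shape[0])                      # initial hidden state
    outputs = []
    for x_t in inputs:                               # one sequence element per step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # recurrent connection
        outputs.append(W_hy @ h + b_y)               # prediction at this time step
    return outputs, h

# Illustrative dimensions: 4-dim inputs, 8-dim hidden state, 3-dim outputs.
rng = np.random.default_rng(0)
inputs = [rng.standard_normal(4) for _ in range(5)]
W_xh = rng.standard_normal((8, 4))
W_hh = rng.standard_normal((8, 8))
W_hy = rng.standard_normal((3, 8))
outputs, final_state = rnn_forward(inputs, W_xh, W_hh, W_hy, np.zeros(8), np.zeros(3))
```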
Reference: [Link] networks-rnn/
| Aspect | CNN / Feedforward Network | RNN |
| --- | --- | --- |
| Input Processing | Processes data as a single, static input vector | Processes sequential data, where each input is related to the previous ones |
Real-World Applications of RNNs
1. Speech Recognition
Virtual assistants like Apple’s Siri, Amazon Alexa, and Google Assistant use RNNs to process spoken
language and generate appropriate responses. These systems analyze voice inputs in real-time to convert
them into text and understand intent.
2. Machine Translation
Google Translate, DeepL, and Microsoft Translator use RNN-based models (like LSTMs and GRUs) to
translate languages by understanding sentence structure, grammar, and context. These models provide
context-aware translations rather than word-to-word conversions.
3. Text Generation and Chatbots
Chatbots like OpenAI’s ChatGPT, Meta’s BlenderBot, and Google’s Bard use advanced RNN models to
generate text, hold conversations, and even assist in content creation. AI-driven writing tools like Jasper AI
and [Link] leverage RNNs for automated blog writing, email drafting, and creative storytelling.
4. Time Series Prediction and Forecasting
Financial firms like Goldman Sachs and Bloomberg use RNNs to analyze historical stock market data and
predict future trends. Similarly, weather forecasting agencies like The National Weather Service (NWS)
and AccuWeather use RNNs to analyze past climate data for more accurate weather predictions.
5. Music Generation
AI music platforms like OpenAI’s MuseNet, Google Magenta, and AIVA use RNNs to compose new music by
learning from thousands of existing pieces. These models can generate melodies, harmonies, and even full
compositions in different styles.
6. Automatic Caption Generation
YouTube, Facebook, and Netflix use RNN-based models for automatic caption generation. These models
analyze video content and generate subtitles, making video browsing more accessible to people with hearing
impairments.
7. Fraud Detection and Cybersecurity
Banks like JPMorgan Chase and PayPal use RNNs for fraud detection by analyzing transaction sequences
and detecting unusual spending patterns. Similarly, cybersecurity firms like Darktrace and CrowdStrike
employ RNNs to monitor network traffic and identify suspicious activity.
8. Sentiment Analysis
Social media platforms like Twitter and Facebook use RNNs for sentiment analysis, helping companies track
public opinion and brand perception. Companies like Amazon and Yelp analyze customer reviews to
understand user satisfaction and improve their services.
9. Automated Trading and Investment Recommendations
Investment platforms like Robinhood and Bloomberg Terminal use RNNs to analyze market trends, news,
and economic indicators to provide investment recommendations and automated trading strategies.
10. Genomics and Drug Discovery
Biotech companies like Deep Genomics and Illumina use RNNs to study DNA sequences and predict
mutations, helping researchers identify genetic disorders, drug interactions, and disease risks.
Advantages:
Handle sequential data effectively, including text, speech, and time series.
Disadvantages:
Reference: [Link] architectures-f571dd6a39c7
Two commonly used activation functions beyond Sigmoid are ReLU and Tanh.
The Rectified Linear Unit (ReLU) activation function has become the default choice in deep learning
architectures due to its simplicity and effectiveness. It is widely used in Convolutional Neural Networks
(CNNs) and other deep learning models because of its ability to introduce non-linearity while maintaining
computational efficiency. This document analyzes ReLU, discussing its advantages, limitations, and possible
alternatives.
Mathematical Definition

f(x) = max(0, x)

where:
x is the input to the activation function; negative inputs are mapped to 0, while positive inputs pass through unchanged.
This simple thresholding operation makes ReLU computationally efficient compared to traditional
activation functions like sigmoid and tanh, which require expensive exponential calculations (Goodfellow et
al., 2016).
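A small NumPy comparison (with arbitrary input values) illustrates the point: ReLU is a plain threshold, while sigmoid and tanh both rely on the exponential function.

```python
# ReLU is a simple threshold; sigmoid and tanh require exponentials.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # f(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # requires exp()

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(sigmoid(x))    # smooth values in (0, 1)
print(np.tanh(x))    # smooth values in (-1, 1)
```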
Advantages of ReLU
1. Computational Efficiency
Unlike sigmoid and tanh, ReLU does not require exponential calculations. It only involves a
simple comparison, making it highly efficient (Nair & Hinton, 2010).
2. Mitigates the Vanishing Gradient Problem
The vanishing gradient problem occurs when gradients become too small during
backpropagation, preventing deep networks from learning effectively. Since ReLU’s derivative is
either 1 (for x > 0) or 0 (for x ≤ 0), it avoids gradient shrinkage for positive inputs (Glorot et al.,
2011).
3. Sparse Activation
Many neurons output zero due to ReLU’s thresholding behavior, which leads to sparsity in
activations. This sparsity helps in reducing overfitting and enhancing model generalization
(Glorot et al., 2011).
4. Faster Convergence
Empirical studies show that networks trained with ReLU tend to converge faster compared to
those using sigmoid or tanh (He et al., 2015).
Limitations of ReLU
1. Dying ReLU Problem
If a neuron’s weights are updated in such a way that it always outputs a negative value, it will
permanently output zero, effectively becoming inactive. This is known as the dying ReLU
problem (Goodfellow et al., 2016).
Studies show that up to 40% of neurons can become inactive in certain networks if the learning
rate is too high (Maas et al., 2013).
2. Unbounded Outputs
While ReLU does not suffer from vanishing gradients, its unbounded nature can lead to
exploding gradients in deep networks. This often requires additional techniques such as
gradient clipping or batch normalization (Ioffe & Szegedy, 2015).
Alternatives to ReLU
Leaky ReLU: Instead of outputting zero for negative inputs, Leaky ReLU applies a small fixed slope (alpha), so some gradient can still flow through otherwise inactive neurons.
Parametric ReLU (PReLU): Similar to Leaky ReLU, but the slope alpha is learned during training instead of being fixed.
Exponential Linear Unit (ELU): ELU allows small negative outputs instead of zero, helping to smooth gradient updates.
Use Leaky ReLU or PReLU if you notice many neurons becoming inactive.
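For illustration, here is a minimal NumPy sketch of these variants; the alpha values shown are common defaults, not requirements.

```python
# Illustrative ReLU variants that keep a small signal for negative inputs,
# which helps avoid "dead" neurons.
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                    # fixed small negative slope

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)                    # alpha is learned during training

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))    # smooth negative tail

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))          # [-0.03 -0.01  0.    2.  ]
print(prelu(x, alpha=0.25))   # [-0.75 -0.25  0.    2.  ]
print(elu(x))                 # approx [-0.95 -0.63  0.    2.  ]
```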
Conclusion
ReLU is a powerful activation function that has revolutionized deep learning by providing computational
efficiency, better convergence, and reduced vanishing gradients. However, it also has some limitations,
particularly the dying neuron problem. Alternative versions like Leaky ReLU, PReLU, and ELU can help
mitigate these issues. Proper learning rate tuning and batch normalization are recommended when using
ReLU in deep networks.
Tanh (Hyperbolic Tangent) Activation Function

Mathematical Definition

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Where:
x is the input; the output is bounded between −1 and 1.
The tanh function has several advantages that make it widely used in neural networks:
1. Non-linearity:
Tanh introduces non-linearity to the model, which allows neural networks to learn complex patterns and
relationships in the data. Without non-linear activation functions, a neural network would essentially behave
as a linear model, no matter how many layers it has.
2. Zero-Centered Output:
The output of the tanh function is centered around 0, unlike the sigmoid function, which outputs values
between 0 and 1. This makes the tanh activation function more useful for many types of tasks, as the mean of
the output is closer to zero, leading to more efficient training and faster convergence.
3. Gradient Behavior:
Tanh helps mitigate the vanishing gradient problem (to some extent), especially when compared to
sigmoid activation. This is because the gradient of the tanh function is generally higher than that of the
sigmoid, enabling better weight updates during backpropagation.
Derivative of Tanh
The derivative of the tanh function is also useful in the backpropagation step of training neural networks:
d/dx tanh(x) = 1 − tanh²(x)
The derivative is always positive and reaches its maximum value of 1 at x = 0, which helps with gradient-based optimization.
However, just like the sigmoid, tanh also suffers from the vanishing gradient problem when the input values
are too large or too small. This can cause gradients to become very small, leading to slower training in
deeper networks.
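The following NumPy sketch (with arbitrary sample inputs) shows both behaviors: the derivative 1 − tanh²(x) peaks at 1 for x = 0 and collapses toward 0 for large |x|.

```python
# tanh and its derivative 1 - tanh(x)^2: the derivative is largest at x = 0
# and shrinks toward 0 as |x| grows (the saturation behind vanishing gradients).
import numpy as np

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))           # approx [-1.00 -0.76  0.    0.76  1.00]
print(tanh_derivative(x))   # approx [ 0.0002  0.42  1.    0.42  0.0002]
```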
1. Balanced Weight Updates:
Since the output is centered around zero, the network has a better chance of balancing the weights. This
helps to keep the gradients from steadily growing or shrinking in magnitude, making training
faster and more stable.
2. Improved Convergence:
The tanh function is differentiable, making it a good candidate for training deep networks using gradient-
based optimization algorithms like stochastic gradient descent (SGD).
Unlike the sigmoid, which is constrained between 0 and 1, the tanh function’s output between -1 and 1
helps in better weight updates during training, leading to improved convergence speed.
Disadvantages of Tanh
1. Vanishing Gradient Problem:
Similar to the sigmoid function, tanh suffers from the vanishing gradient problem for large values of the
input (both positive and negative). When the input to the tanh function is very large or very small, the
gradient approaches zero, which can slow down or halt learning during backpropagation, especially in
deep networks.
2. Not Ideal for All Architectures:
While tanh works well in many cases, it might not be the best option for all types of neural network
architectures. For instance, ReLU (Rectified Linear Unit) has gained popularity for deep networks due to its
simplicity and efficiency in mitigating the vanishing gradient problem.
3. Sensitive to Outliers:
Extreme values in the input can lead to saturated regions where the gradient is close to zero, making
learning slow or ineffective. This could happen if the inputs to the tanh function are not scaled properly.
Reference: [Link]
Role of the Loss Function
1. Guide Optimization:
The loss function is the basis for the optimization process. During training, algorithms such as Gradient
Descent use the loss function to adjust the model's parameters, aiming to reduce the error and improve
the model’s predictions.
2. Measure Performance:
By quantifying the difference between predicted and actual values, the loss function provides a benchmark
for evaluating the model's performance. Lower loss values generally indicate better performance.
3. Influence Learning Dynamics:
The choice of loss function affects the learning dynamics, including how fast the model learns and what
kind of errors are penalized more heavily. Different loss functions can lead to different learning behaviors
and results.
Mean Squared Error (MSE)
Mathematical Definition

MSE = (1/n) Σ (y_i − ŷ_i)²

where n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value.
MSE is widely used due to its convex nature, making it easy to optimize with Gradient Descent. However, its
sensitivity to outliers can sometimes be a drawback.
Interpretation: The model’s average squared error is 175 (in $1000s squared). A lower MSE means better
predictions.
Advantages of MSE
1. Differentiability
MSE is differentiable everywhere, ensuring that Gradient Descent converges efficiently (Deep Learning by
Ian Goodfellow, MIT Press). This property is crucial in deep learning, where non-differentiable loss functions
can cause optimization problems.
2. Penalizes Large Errors More Heavily
Because the error is squared, larger errors have a bigger impact on the loss than smaller ones. This is useful
when large deviations must be minimized, such as in medical diagnosis models (Machine Learning: A
Probabilistic Perspective by Kevin P. Murphy, MIT Press).
3. Optimal Under Gaussian Noise
MSE is the optimal estimator under a Gaussian noise assumption, meaning it works well when data follows
a normal distribution (Pattern Recognition and Machine Learning by Christopher M. Bishop, Springer).
Limitations of MSE
1. Sensitive to Outliers
A single large error can dominate the total loss, leading to poor model performance (U-Net: Convolutional
Networks for Biomedical Image Segmentation by Ronneberger et al., arXiv). For example:
Here, the outlier at House #4 has inflated the MSE dramatically. In such cases, Huber Loss or Mean
Absolute Error (MAE) is preferred.
2. Scale Dependency
MSE is not scale-invariant, meaning the same model can have very different MSE values depending on the
dataset. For example:
Predicting house prices ($100,000s) results in a much larger MSE than predicting student test scores
(out of 100), even if both models perform similarly.
Solution? Use Root Mean Squared Error (RMSE) to get a comparable metric.
3. No Indication of Error Direction
Since squaring removes sign information, MSE does not indicate whether predictions overestimate or
underestimate actual values. Mean Bias Deviation (MBD) is often used alongside MSE to track prediction
trends.
Alternatives to MSE
Mean Absolute Error (MAE): More robust to outliers, but lacks smooth derivatives for optimization (Deep Learning by Ian
Goodfellow, MIT Press).
Huber Loss: Less sensitive to outliers, widely used in robust regression models (Machine Learning: A Probabilistic
Perspective by Kevin P. Murphy, MIT Press).
Real-World Applications of MSE
1. Weather Forecasting:
Predicting temperature, humidity, and rainfall using MSE as an accuracy metric. Source: American
Meteorological Society's Glossary of Meteorology
2. Financial Modeling:
Used in stock market price prediction models. Source: "Forecast Skill" - Wikipedia
3. Autonomous Vehicles:
MSE helps train deep learning models for object detection and lane tracking. Source: A Real-Time Object Detector
for Autonomous Vehicles Based on Improved YOLOv2
4. Energy and Load Forecasting:
In power grids, MSE helps predict electricity consumption trends, ensuring stable power supply
management. Source: "Load Forecasting Techniques and Their Applications in Smart Grids" by Lazos et al.
Conclusion
MSE is a powerful and widely used loss function in regression but has limitations in handling outliers and
scale dependency. While it is optimal for Gaussian-distributed errors, alternative loss functions like Huber
Loss or MAE should be considered for datasets with outliers or skewed distributions.
Cross-Entropy Loss (CEL)

Formula

L = − Σ y_i log(ŷ_i)

where y_i is the true label (1 for the correct class, 0 otherwise) and ŷ_i is the predicted probability for class i.

Core Intuition
This forces the model to assign higher probabilities to the correct class, improving classification
accuracy.
Unlike Mean Squared Error (MSE), which treats errors linearly, CEL penalizes incorrect predictions
exponentially, pushing the model toward sharper decision boundaries.
Softmax ensures that the raw class scores are converted into a valid probability distribution: each output lies between 0 and 1, and the outputs sum to 1.
When used with Sigmoid/Softmax, the gradient becomes small as confidence increases.
MSE treats errors linearly, failing to distinguish between slight and major misclassifications.
Example:
If a model predicts 0.49 instead of 0.51 for a binary class, MSE sees this as a small error, even though
it leads to misclassification.
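A quick NumPy check of this scenario (binary case, true label 1, with illustrative predicted probabilities) shows how cross-entropy and squared error penalize the same predictions differently:

```python
# Cross-entropy punishes a confident wrong prediction far more harshly than
# squared error, while both treat 0.49 vs 0.51 as nearly the same loss.
import numpy as np

def binary_cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(y, p):
    return (y - p) ** 2

y = 1.0
for p in (0.51, 0.49, 0.05):   # near the decision boundary vs confidently wrong
    print(p, squared_error(y, p), binary_cross_entropy(y, p))
# 0.51 -> squared error 0.2401, cross-entropy 0.67
# 0.49 -> squared error 0.2601, cross-entropy 0.71
# 0.05 -> squared error 0.9025, cross-entropy 3.00
```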
Reference: ResNet: Deep Residual Learning for Image Recognition (CVPR 2016)
c) Autonomous Systems
Self-driving cars classify objects like pedestrians, traffic lights, and vehicles.
Reference: YOLO: You Only Look Once - Object Detection (CVPR 2016)
Limitations of Cross-Entropy Loss
1. Overconfidence:
Deep networks can become too confident, making small mistakes costly.
Solution: Add Label Smoothing to prevent the model from assigning 100% probability to one class (a minimal sketch follows this list).
2. Sensitivity to Noisy Labels:
If the dataset contains incorrect labels, CEL still forces the model to fit them, leading to overfitting.
3. Class Imbalance:
If a dataset is highly imbalanced (e.g., 95% Class A, 5% Class B), CEL can fail because it optimizes for
the majority class.
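As referenced above, here is a minimal sketch of label smoothing, assuming NumPy and a hypothetical smooth_labels helper; eps = 0.1 is just an illustrative value.

```python
# Label smoothing: move a small amount of probability mass (eps) from the true
# class to all classes, so the target is never a hard 100% for one class.
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

y = np.array([0.0, 0.0, 1.0, 0.0])   # hard one-hot target
print(smooth_labels(y))               # [0.025 0.025 0.925 0.025]
```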
Final Verdict
Cross-Entropy Loss is the gold standard for classification because it directly optimizes predicted class probabilities, penalizes confident wrong predictions heavily, and works naturally with Sigmoid/Softmax outputs, while its main limitations can be managed with techniques such as label smoothing.