Deep Learning Report
Table of Contents:
2. Historical Evolution
2.1. Foundational Exploration (1950s–1980s)
2.2. Revival and Theoretical Advances (1990s–2000s)
2.3. Data and Hardware Revolution (2000s)
2.4. Breakthrough Architectures (2010s–Present)
6. Future Perspectives
6.1. Emerging Trends in Deep Learning
6.2. Ethical and Societal Implications
6.3. Innovations in Computational Techniques
7. References
Deep learning represents a transformative approach within the broader field of artificial
intelligence (AI). By leveraging artificial neural networks that simulate the structure and
function of the human brain, deep learning excels at analyzing vast amounts of data to
uncover patterns and make decisions.
3. Complex Task Handling: Deep learning powers advanced applications such as:
o Image recognition
o Language translation
o Speech synthesis
o Autonomous driving
5. Scalability: Its performance scales significantly with larger datasets and higher
computational power, benefiting from advancements in GPU and TPU technology.
Deep learning’s development is rooted in early AI research but gained momentum due to
technological advancements:
1. Understanding: It can extract meaningful insights from complex data formats such
as text, images, and audio. This capability powers applications like sentiment
analysis, medical imaging diagnostics, and audio transcription.
3. Automation: Deep learning reduces the need for human intervention in data-driven
tasks, enabling automation in areas like robotic process automation, autonomous
vehicles, and smart home systems.
This foundation positions deep learning as a pivotal technology shaping industries from
healthcare to finance and beyond. It continues to unlock new possibilities by tackling
complex challenges with unprecedented accuracy and efficiency.
Deep learning operates through artificial neural networks that process data across multiple
layers. Each layer extracts progressively complex features from the input, enabling the
system to handle intricate patterns and relationships effectively. Below is a detailed
breakdown of its operational principles:
Mathematical Approach:
1. Forward Propagation:
z = \sum_{i=1}^{n} w_i \cdot x_i + b
§ Here, z is the weighted sum of inputs, w_i represents the weights, x_i the inputs,
and b the bias.
2. Loss Calculation:
o The loss measures the difference between predicted outputs and actual
targets. Common loss functions include:
o Cross-Entropy Loss:
L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)
3. Backpropagation:
o Gradients of the loss function with respect to weights are computed using
the chain rule:
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
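The following NumPy sketch illustrates these three steps for a single linear layer with a sigmoid output and binary cross-entropy loss. The data, layer size, and learning rate are made up purely for illustration and are not taken from the report.

import numpy as np

# Toy data: 4 samples, 3 features, binary targets (illustrative values only)
X = np.array([[0.2, 0.7, 0.1],
              [0.9, 0.4, 0.5],
              [0.3, 0.8, 0.6],
              [0.5, 0.1, 0.9]])
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w = rng.normal(size=3)   # weights
b = 0.0                  # bias

# 1. Forward propagation: z = sum_i w_i * x_i + b, followed by a sigmoid activation
z = X @ w + b
y_hat = 1.0 / (1.0 + np.exp(-z))

# 2. Loss calculation: binary cross-entropy
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# 3. Backpropagation via the chain rule:
#    dL/dw = dL/dy_hat * dy_hat/dz * dz/dw, which simplifies to (y_hat - y) * x
grad_w = X.T @ (y_hat - y) / len(y)
grad_b = np.mean(y_hat - y)

# One gradient-descent update
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b
print(f"loss={loss:.4f}, updated weights={w}")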
1. Supervised Learning:
o Models learn from labeled examples, mapping inputs to known target outputs.
2. Unsupervised Learning:
o Models discover patterns and structure in unlabeled data, such as clusters or
low-dimensional representations.
3. Reinforcement Learning:
o Models learn through trial-and-error by interacting with dynamic
environments and receiving rewards or penalties.
Convolutional Neural Networks (CNNs) are a specialized class of deep learning models
designed to process grid-like data structures, such as images and videos. By leveraging
convolutional operations, CNNs effectively capture spatial hierarchies and patterns within
the data, making them the backbone of computer vision applications like image
recognition, object detection, and medical imaging analysis.
A CNN typically consists of multiple layers designed to progressively extract features from
input data. These layers include convolutional layers, activation functions, pooling layers,
and fully connected layers.
1. Convolutional Layer
The convolutional layer is the core building block of a CNN. It applies filters (kernels)
to the input data to detect local patterns such as edges, textures, or shapes. The
mathematical operation performed is:
y_{i,j}^{k} = (x * w^{k})_{i,j} + b^{k}
Where:
o y_{i,j}^{k} is the output of the k-th filter at position (i, j)
o x is the input, w^{k} is the k-th filter (kernel), and b^{k} is its bias
The result of this operation is a feature map, which highlights regions in the input that
correspond to the learned pattern of the filter.
2. Activation Function
After convolution, an activation function, typically ReLU (Rectified Linear Unit), is
applied to introduce non-linearity:
ReLU(z) = \max(0, z)
This ensures that the network can model complex, non-linear relationships in the data.
3. Pooling Layer
Pooling layers reduce the spatial dimensions of feature maps, making the model
computationally efficient and less prone to overfitting. The most common pooling
method is max pooling:
y_{i,j} = \max\left(\{ x_{m,n} \}\right)
Here, x_{m,n} represents the values within a pooling window, and the maximum value is
selected as the output.
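The report's own example is written in MATLAB; purely as a language-agnostic illustration of the convolution → ReLU → max-pooling stack described above, a minimal PyTorch sketch (layer sizes chosen arbitrarily, not the report's model) might look like this:

import torch
import torch.nn as nn

# A minimal CNN for 28x28 grayscale inputs (sizes are illustrative only)
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                                           # non-linearity
    nn.MaxPool2d(kernel_size=2),                                         # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                                                     # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                                           # fully connected classifier
)

x = torch.randn(1, 1, 28, 28)      # one dummy grayscale image
logits = model(x)
print(logits.shape)                # torch.Size([1, 10])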
CNNs excel at detecting spatial features in images, making them indispensable for
computer vision tasks. By sharing parameters (filters) across spatial locations, they
significantly reduce the number of learnable parameters compared to fully connected
networks, improving generalization and computational efficiency. However, CNNs can be
computationally expensive for very large images or datasets, and they may struggle with
invariance to rotations or scale changes in the input.
Applications of CNNs
CNNs are applied in image classification, object detection, facial recognition, and medical
imaging analysis, as well as in video analysis and other computer vision tasks.
Recurrent Neural Networks (RNNs) are a type of neural network architecture designed to
process sequential data, where the order of the data points is essential. RNNs are capable
of maintaining an internal memory by leveraging feedback connections, allowing them to
use information from previous steps to inform the current computation. This makes them
particularly suited for tasks such as natural language processing, time series forecasting,
and speech recognition.
At each time step t, an RNN takes an input vector x_t and updates its hidden state h_t,
which serves as the network's memory of past inputs. The hidden state is computed as
follows:
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
The output y_t at each time step is then calculated based on the hidden state:
y_t = W_{hy} h_t + b_y
Where:
o W_{hh}, W_{xh}, and W_{hy} are weight matrices, and b_h and b_y are bias vectors.
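As a sketch of these update equations, the loop below runs a vanilla RNN over a short random sequence in NumPy; the dimensions and parameter values are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 4, 8, 3, 5

# Randomly initialised parameters (for illustration only)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))   # a toy input sequence
h = np.zeros(hidden_dim)                     # initial hidden state

for t in range(seq_len):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    h = np.tanh(W_hh @ h + W_xh @ xs[t] + b_h)
    # y_t = W_hy h_t + b_y
    y_t = W_hy @ h + b_y
    print(f"step {t}: output = {np.round(y_t, 3)}")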
To address the challenges of standard RNNs, advanced variants such as Long Short-Term
Memory (LSTM) networks and Gated Recurrent Units (GRUs) have been introduced. These
architectures incorporate gating mechanisms to control the flow of information. For
instance, LSTMs use three gates—input, forget, and output gates—to manage what
information to add, forget, or output from their memory cells, enabling them to capture
long-term dependencies effectively.
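The gating details are handled internally by standard library implementations; a brief PyTorch usage sketch, with arbitrary sizes rather than any specific model from this report:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)  # one LSTM layer
x = torch.randn(2, 5, 4)             # batch of 2 sequences, 5 time steps, 4 features
output, (h_n, c_n) = lstm(x)         # output: per-step hidden states; (h_n, c_n): final states
print(output.shape, h_n.shape, c_n.shape)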
Applications of RNNs
RNNs are widely used in natural language processing tasks such as machine translation,
sentiment analysis, and text generation. In time series analysis, they are employed for
tasks like stock price prediction and weather forecasting. Additionally, in speech
recognition, RNNs process audio signals sequentially to transcribe spoken language into
text.
Despite their limitations with long sequences, RNNs have been foundational in advancing
deep learning for sequential data. While newer architectures like Transformers have
surpassed them in many tasks, RNNs remain a crucial concept and are still used in
applications where sequential modeling is essential.
Transformers
Transformers have revolutionized the field of natural language processing (NLP) and
sequential data tasks. Unlike traditional recurrent neural networks (RNNs), which process
data sequentially, transformers use self-attention mechanisms that allow them to consider
all elements of a sequence simultaneously. This enables them to capture long-range
dependencies more efficiently, making them highly effective for tasks like language
modeling, text generation, translation, and even image processing.
Transformers rely heavily on a mechanism called self-attention, which allows the model to
weigh the importance of different words (or elements) in a sequence relative to each other,
regardless of their positions. This is done through multiple layers of attention and
feedforward neural networks. The architecture consists of two main parts: the encoder
and the decoder.
1. Self-Attention Mechanism
The self-attention mechanism computes attention scores for each word in a sequence with
respect to every other word. For each word in the input, self-attention computes a weighted
sum of all words, where the weights reflect how much focus the model should place on
each word relative to the current word. The formula for computing the attention score is as
follows:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
Where:
o Q, K, and V are the query, key, and value matrices, and d_k is the dimensionality
of the keys.
The result of this operation is a weighted sum of the values, which is passed to the next
layer.
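A NumPy sketch of this scaled dot-product self-attention over a toy sequence follows; the sequence length, embedding size, and projection matrices are invented for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 8

X = rng.normal(size=(seq_len, d_model))       # token embeddings for a toy sequence
W_q = rng.normal(scale=0.1, size=(d_model, d_k))
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores, axis=-1)            # each row sums to 1
output = weights @ V
print(weights.round(2))
print(output.shape)                           # (4, 8)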
2. Positional Encoding
Since transformers do not inherently capture the order of the sequence, positional
encodings are added to the input embeddings to inject information about the position of
each element in the sequence. The positional encoding PE is typically computed using
sine and cosine functions of different frequencies:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
Where:
o pos is the position in the sequence, i is the dimension index, and d_model is the
dimensionality of the embeddings.
These encodings are added to the input embeddings to provide the model with information
about the relative positions of words in the sequence.
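A direct NumPy implementation of these sinusoidal encodings (an even d_model is assumed):

import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]                 # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]              # dimension indices
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)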
3. Multi-Head Attention
Multi-head attention runs several self-attention operations in parallel, each with its own
learned projections of the queries, keys, and values, and concatenates their outputs. This
allows the model to attend to different types of relationships in the sequence at once.
4. Position-Wise Feedforward Network
After the self-attention layer, each attention output is passed through a position-wise
feedforward neural network, which consists of two layers with a ReLU activation in
between. This enables the model to learn complex, non-linear transformations of the
attention outputs.
5. Encoder-Decoder Architecture
The encoder and decoder are stacked in multiple layers, allowing the model to learn
hierarchical representations of the input and output sequences.
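Stacked encoder-decoder layers are available off the shelf in deep learning libraries; a brief PyTorch usage sketch with arbitrary hyperparameters, not tied to any model discussed in this report:

import torch
import torch.nn as nn

# A small encoder-decoder transformer (hyperparameters chosen arbitrarily)
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 10, 64)   # batch of 2 source sequences, length 10
tgt = torch.randn(2, 7, 64)    # batch of 2 target sequences, length 7
out = model(src, tgt)
print(out.shape)               # torch.Size([2, 7, 64])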
Strengths and Challenges
Transformers are highly efficient at capturing long-range dependencies due to their self-
attention mechanism, which allows them to process all parts of the sequence
simultaneously. This makes them more parallelizable than RNNs, enabling faster training
and more effective handling of long sequences. Additionally, transformers can scale better
as the data size increases, making them highly effective for large datasets.
Applications of Transformers
Transformers have transformed the field of natural language processing and beyond:
• Language Modeling and Text Generation: Models like GPT and BERT are based on
transformers and have been used for a wide range of NLP tasks, including text
generation, summarization, and question answering.
• Text Classification and Sentiment Analysis: Transformers are used for classifying
text data, such as detecting sentiment in social media posts or categorizing news
articles.
Transformers have redefined the landscape of deep learning by enabling more efficient
handling of sequential data, especially in NLP. Their ability to model complex relationships
in data has led to state-of-the-art results in various fields, making them the go-to
architecture for many modern AI systems.
This project focuses on handwritten character recognition using neural networks,
specifically trained on the EMNIST dataset. The code begins by setting up the environment,
clearing any previous variables or figures, and loading the pre-trained neural network from
a file named [Link]. The necessary folders, such as those containing test images,
are added to the path to ensure smooth access to resources.
The first step is preprocessing the input image, which is read from a file called [Link].
This image, initially in RGB format, is displayed to show its original state. It is then
converted into a grayscale image, simplifying the data for further processing. After this,
adaptive thresholding is applied to transform the grayscale image into a binary format,
where the text appears white on a black background. Noise is reduced by removing small
objects containing fewer than 30 pixels. The modified image is displayed, and bounding
boxes are drawn around the connected components, visually segmenting the individual
characters or regions of interest.
Next, the segmented regions are processed to extract individual characters. Each
character is resized to a standard dimension of 128x128 pixels and smoothed using a
Gaussian filter to remove any irregularities. It is then further resized to 20x20 pixels and
padded symmetrically to create a uniform size of 28x28 pixels, matching the input
requirements of the neural network. These preprocessed character images are saved as
individual files in a folder named segmentedImages.
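The project's preprocessing is written in MATLAB and is not reproduced in this document. Purely as an illustration of the same pipeline (grayscale conversion, adaptive thresholding, small-object removal, segmentation, resizing and padding to 28x28), the Python/OpenCV sketch below follows the described steps; the file name and parameter values are placeholders, not the project's actual ones.

import cv2
import numpy as np

# Placeholder input path; the project's actual file name is not reproduced here
img = cv2.imread("input.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding: white text on a black background
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 25, 10)

# Remove small noise components (fewer than ~30 pixels) and crop each remaining region
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
chars = []
for i in range(1, n):                       # label 0 is the background
    x, y, w, h, area = stats[i]
    if area < 30:
        continue
    crop = binary[y:y + h, x:x + w]
    crop = cv2.resize(crop, (128, 128))     # standardise the size, then smooth
    crop = cv2.GaussianBlur(crop, (5, 5), 0)
    crop = cv2.resize(crop, (20, 20))
    crop = np.pad(crop, 4)                  # symmetric padding -> 28x28
    chars.append(crop)

print(f"segmented {len(chars)} candidate characters")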
Once the characters are prepared, they are fed into the neural network for recognition.
Each character image is reshaped into a vector format suitable for the network’s input
layer. The network processes the image and outputs a probability distribution across all
possible character classes. The character with the highest probability is selected as the
predicted output. This label is then mapped to the corresponding character (either a digit,
an uppercase letter, or a selected lowercase letter) using the imageLabeler function. This
function ensures consistency with the EMNIST dataset’s label structure.
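The recognition itself runs in MATLAB through the saved network and the project's imageLabeler mapping. Purely to illustrate the argmax-and-lookup idea described above, here is a generic Python sketch; the label order and the decode helper are hypothetical placeholders, not the project's code.

import numpy as np

# Hypothetical EMNIST-like label order: digits, uppercase letters, then a selection of
# lowercase letters (the project's exact mapping is defined in its MATLAB code)
LABELS = list("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabdefghnqrt")

def decode(probabilities: np.ndarray) -> str:
    """Map a vector of class probabilities to its most likely character."""
    return LABELS[int(np.argmax(probabilities))]

# Example: a dummy probability vector with most of its mass on class index 11 ('B')
probs = np.zeros(len(LABELS))
probs[11] = 0.9
print(decode(probs))   # B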
Finally, the detected characters are concatenated to form the complete text, which is
displayed in the MATLAB command window. This marks the culmination of the process,
where a handwritten image is successfully converted into digital text using a combination
of image processing and neural network techniques. The project showcases the practical
application of neural networks in solving real-world problems like handwritten character
recognition, integrating multiple steps seamlessly from preprocessing to final output.
The net object represents the neural network and contains fields that define its
architecture, parameters, and functions. It includes general information about the network,
such as the version, name, efficiency metrics, and any user-defined metadata stored in the
userdata field.
Visualization of the neural network stored in the [Link] file using the following Python code:
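The original visualization script is not included in this document. As a minimal sketch of the kind of code that could inspect and plot the network's weight matrices, the snippet below assumes the network was exported from MATLAB as plain matrices in a .mat file; the file name and variable names ("net.mat", "IW", "LW") are assumptions, not the project's actual identifiers.

import scipy.io
import matplotlib.pyplot as plt

# Load the MATLAB file (assumed name and layout; the report's actual file is not shown)
mat = scipy.io.loadmat("net.mat")
print(mat.keys())                       # inspect which variables the file contains

# If the input-to-layer and layer-to-layer weights were exported as plain matrices
# (e.g. under variables named 'IW' and 'LW'), they can be shown as heat maps:
for name in ("IW", "LW"):
    if name in mat:
        plt.figure()
        plt.imshow(mat[name], aspect="auto", cmap="viridis")
        plt.title(f"{name} weight matrix")
        plt.colorbar()
plt.show()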
The network's architecture is defined by several parameters, including the number of input
nodes (numInputs), layers (numLayers), and output nodes (numOutputs). Additionally, it
specifies delays, such as input delays (numInputDelays), layer delays (numLayerDelays),
and feedback delays (numFeedbackDelays). The total number of weights in the network is
indicated by the numWeightElements field.
Connections and weights are a critical part of the network's structure. These include the
bias configuration (biasConnect), input connections (inputConnect), connections between
layers (layerConnect), and output connections (outputConnect). The weights applied to
inputs (inputWeights) and within layers (layerWeights) are also detailed, along with the bias
values (biases) for each layer.
The network's functions and parameters are equally important. It uses a specific training
algorithm (trainFcn) with associated training parameters (trainParam). The network's
performance is evaluated using a performance function (performFcn) and its parameters
(performParam). Other functions include the adaptation function (adaptFcn), which
adjusts weights during training, the data division method (divideFcn) for splitting datasets,
and the initialization function (initFcn) for setting initial weights and biases. The gradient
function (gradientFcn) is responsible for calculating gradients during optimization.
Finally, the network contains additional elements like the input weights matrix (IW), which
connects inputs to layers, and the layer weights matrix (LW), which connects layers to each
other. Bias values for each layer are stored in the b field. Visualization tools for plotting
network behaviour are provided by the plotFcns field. These components collectively
describe the structure and functionality of the net object.
References:
[Link]
recognition
[Link]