
Introduction to AI
Session 5 – Deep Learning Fundamentals

Presented by: Pr. Khanh T. P. NGUYEN (UTTOP)


Table of Contents

1. Introduction & History
2. Neural Network Basics
3. Training & Optimization
4. Deep Neural Networks
5. Convolutional Neural Networks
6. Recurrent Neural Networks
7. Encoder–Decoder Architecture
8. Advanced Topics & Future Directions
What is Deep Learning?

Deep learning is a subset of machine learning focused on algorithms inspired by the structure and function of the brain's neural networks.

• Uses multiple processing layers to learn representations of data with multiple levels of abstraction
• Automatically discovers features needed for detection or classification, replacing manual feature engineering
• Evolution: from simple networks to modern deep architectures (MLPs, CNNs, RNNs, Transformers)
• Applications: image recognition, machine translation, voice assistants, autonomous vehicles

[Figure: a feedforward network with input, hidden layers, and output]
History of Deep Learning – Key Milestones

• 1943 – McCulloch & Pitts: first computational model of a neuron
• 1957 – Perceptron invented by Frank Rosenblatt
• 1986 – Backpropagation: Rumelhart, Hinton & Williams
• 1998 – LeNet-5 by Yann LeCun
• 2006 – Deep Belief Networks: Geoffrey Hinton
• 2012 – AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
• 2017 – Transformer architecture: "Attention Is All You Need", Google Research
• 2018 – BERT by Google AI
• 2020 – GPT-3 by OpenAI
• 2022 – DALL·E 2 by OpenAI
• 2025 – GPT-5 by OpenAI
Why Deep Learning Now?

Deep learning's recent explosive growth and success have been driven by several converging factors:

• Data Explosion: Massive growth in available training data from the internet, IoT devices, digitization, and cheap sensors
• Computational Power: GPU performance has increased by 1000x per decade since 2000, making previously infeasible neural networks tractable
• Algorithmic Innovations: Better activation functions (ReLU), regularization techniques (dropout), and efficient frameworks (PyTorch, TensorFlow)
• End-to-End Learning: Replacing hand-engineered features with learned representations directly from raw data

[Chart: growth in data volume, GPU performance, and DL frameworks, 2000–2020]
Deep Learning vs. Traditional Machine Learning

Traditional ML:
• Manual feature extraction and engineering
• Works well with smaller datasets
• Lower computational requirements
• More interpretable models
• Task-specific algorithms

Deep Learning:
• Automatic feature learning
• Requires large amounts of data
• Computationally intensive (GPU acceleration)
• More complex "black box" models
• End-to-end learning approach

Aspect            | Traditional ML                 | Deep Learning
Suitable problems | Structured, tabular data       | Unstructured data: images, text, audio
Training time     | Minutes to hours               | Hours to weeks
Scalability       | Limited by feature engineering | Scales well with more data
Artificial Neuron & Perceptron

An artificial neuron is a mathematical function modeled after biological neurons, serving as the fundamental building block of neural networks.

y = f(Σᵢ wᵢxᵢ + b)

• Inputs (xᵢ): Data features or outputs from previous neurons
• Weights (wᵢ): Parameters that determine input importance
• Bias (b): Additional parameter allowing the neuron to fit data better
• Activation Function (f): Introduces non-linearity (e.g., step, sigmoid, ReLU)

[Figure: inputs x₁, x₂, x₃ weighted by w₁, w₂, w₃, summed with bias b, and passed through f]
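A minimal NumPy sketch of the formula above; the input and weight values are illustrative, not taken from the slide:

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Single artificial neuron: y = f(sum_i w_i * x_i + b). Default f is the sigmoid."""
    z = np.dot(w, x) + b          # weighted sum plus bias (pre-activation)
    return activation(z)          # non-linearity

# Illustrative values only
x = np.array([0.5, 0.8, 0.2])     # inputs
w = np.array([0.1, -0.3, 0.6])    # weights
print(neuron(x, w, b=0.05))       # sigmoid of the weighted sum
```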

Core Components of Deep Learning Models

• Models

◦ The computational machinery for ingesting data of one type and outputting predictions of a possibly different type

◦ Often statistical models that can be estimated from data

• Layers

◦ A single neuron (input, scalar output, tunable parameters)

◦ An entire layer of neurons (set of inputs, corresponding outputs, tunable parameters)

◦ Layers are also referred to as the individual mappings f⁽ᵈ⁾ in a composition of functions

Parameters and Hyperparameters

Parameters:
• Learnable values within the model, updated during training
• Automatically optimized to minimize the loss function
• E.g.: weights and biases between neurons

Hyperparameters:
• Configurations set by the user that are not learned during training
• Manually chosen, often through experimentation or systematic search
• E.g.: learning rate, batch size, regularization parameters

• Scale of Parameters
  ◦ Vision models: tens of millions of parameters
  ◦ Language models: up to hundreds of billions of parameters

• Critical Hyperparameters
  ◦ Learning rate: influences speed and stability of learning
  ◦ Batch size: impacts generalization and training speed
  ◦ Dropout rate: controls regularization to prevent overfitting
  ◦ Weight decay: regularizes the model by penalizing large weights

Network Structures: Layers

Neural networks are structured as interconnected layers of neurons, with each layer performing specific transformations on the data.

• Input Layer: Receives raw data (features) and distributes it to the first hidden layer without any transformation
• Hidden Layers: Perform computations and transfer information from the input to the output layer; "deep" networks have multiple hidden layers
• Output Layer: Produces the final network prediction (e.g., classification scores, regression values)
• Layer Connectivity: Fully-connected (dense) layers connect every neuron to every neuron in adjacent layers

[Figure: input layer (features X₁, X₂, X₃) → hidden layers (feature extraction) → output layer (predictions ŷ₁, ŷ₂)]

Activation Functions

Activation functions introduce non-linearity to neural networks, allowing them to learn complex patterns. Without them, neural networks would be limited to learning only linear relationships.

Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)
• Maps input to the (0, 1) range
• Used for binary classification
• Suffers from vanishing gradients

Tanh: tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
• Maps input to the (-1, 1) range
• Zero-centered output
• Still has the vanishing gradient issue

ReLU: f(x) = max(0, x)
• Linear for positive inputs, zero for negative inputs
• Avoids vanishing gradients
• Most popular in deep networks
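A small NumPy sketch of the three activations above, handy for checking their ranges on a few sample inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)          # identity for x > 0, zero otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # e.g. sigmoid(0) = 0.5
print(tanh(x))      # e.g. tanh(0) = 0.0
print(relu(x))      # negative inputs are clipped to 0
```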

Forward Propagation

Forward propagation passes the inputs through the network layer by layer to produce the prediction ŷ.

Example (see the sketch below): a 3–2–1 network (input layer → hidden layer → output layer) with inputs x₁ = 0.5, x₂ = 0.8, x₃ = 0.2, the input-to-hidden weights shown in the diagram (0.1, 0.2, -0.1, 0.3, 0.1, 0.4) and output weights 0.5 and 0.6. The two hidden activations come out to 0.56 and 0.53, and the network output is ŷ = 0.65.
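A NumPy sketch of this forward pass with sigmoid units. The exact assignment of the hidden-layer weights to edges is an assumption read off the figure, so the hidden values only approximately match the slide; the output weights (0.5, 0.6) and the final ŷ ≈ 0.65 do match.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_out):
    """Forward pass for a 3-2-1 fully-connected network with sigmoid units
    (biases omitted, as in the slide's example)."""
    h = sigmoid(W_hidden @ x)   # hidden activations, shape (2,)
    y = sigmoid(W_out @ h)      # scalar output
    return h, y

x = np.array([0.5, 0.8, 0.2])
W_hidden = np.array([[0.2, 0.1, 0.3],     # assumed edge assignment
                     [-0.1, 0.1, 0.6]])
W_out = np.array([0.5, 0.6])

h, y = forward(x, W_hidden, W_out)
print(h)   # roughly [0.56, 0.54] under this assumed assignment
print(y)   # close to the slide's 0.65
```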

Loss / Cost Functions

Loss functions quantify how well a model is performing by measuring the discrepancy between predicted outputs and actual target values.

• Mean Squared Error (MSE): Used for regression problems to measure the average squared difference between predictions and actual values

  MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

• Cross-Entropy Loss: Used for classification problems to measure the performance of a model whose output is a probability value between 0 and 1

  CE = -Σᵢ yᵢ log(ŷᵢ)

• The goal of training is to minimize these loss functions through optimization algorithms like gradient descent

[Figure: actual values (y) vs. predictions (ŷ) for MSE; predicted probabilities over Class 1, Class 2, Class 3 for cross-entropy]

Question: Suppose we're training a classifier that predicts whether an email is spam or not. If the true label is "spam" (1) and the model predicts 0.01 (1% chance of spam), what happens with Cross-Entropy Loss compared to MSE? A worked check follows below.
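A quick numeric check of the question above (a sketch using binary cross-entropy for a single example): cross-entropy blows up to about 4.6 for a confident wrong prediction, while the squared error stays below 1, so cross-entropy penalizes confident mistakes much more strongly.

```python
import math

y_true, y_pred = 1.0, 0.01   # true label "spam", predicted probability 1%

# Binary cross-entropy for one example: -[y log(p) + (1-y) log(1-p)]
ce = -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# Squared error for the same example
mse = (y_true - y_pred) ** 2

print(f"Cross-entropy: {ce:.2f}")   # ~4.61 -> large gradient, strong correction
print(f"Squared error: {mse:.2f}")  # ~0.98 -> much weaker penalty
```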

Backpropagation Algorithm

Backpropagation is the core algorithm for training neural networks, efficiently computing gradients of the cost function with respect to all weights and biases.

• Applies the chain rule from calculus to compute gradients recursively
• Propagates error backward through the network from output to input

Output-layer error: δᴸ = ∇ₐC ⊙ σ'(zᴸ)

Weight update (simplified): w_new = w_old - η × ∂C/∂w

In practice, all weights are updated simultaneously based on their contribution to the error.
Backpropagation Algorithm

Interpretation of the worked example's output-layer deltas:
• Neuron 1 at the output should decrease its pre-activation (negative delta).
• Neuron 2 should increase slightly (positive delta).

Optimization Algorithms

• Role: Used to update model parameters to minimize the loss function

• Minibatch Stochastic Gradient Descent (SGD):
  ◦ Uses minibatches to estimate the gradient
  ◦ A workhorse of deep learning optimization
  ◦ Key parameter: learning rate

• Adam (Adaptive Moment Estimation):
  ◦ Combines advantages of AdaGrad and RMSProp
  ◦ Adapts the learning rate for each parameter
  ◦ Uses first and second moment estimates of the gradient
  ◦ Generally faster than classic SGD

• Convergence: Optimization quality directly influences model performance

[Figure: different trajectories of optimization algorithms (SGD vs adaptive methods)]
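A sketch of how these two optimizers are typically set up in PyTorch; the model, data, and learning rates are placeholders, not values prescribed by the course:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # placeholder model
loss_fn = nn.MSELoss()

# Minibatch SGD: one learning rate shared by all parameters
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Adam: per-parameter adaptive rates from first/second moment estimates
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # one random minibatch
for optimizer in (sgd, adam):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass on the minibatch
    loss.backward()              # backpropagate gradients
    optimizer.step()             # parameter update
```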

Optimization: Gradient Descent

Gradient Descent is the primary optimization algorithm for neural networks. It updates parameters to minimize the loss function.

• Batch Gradient Descent: Uses the entire training set to compute gradients
• Stochastic Gradient Descent (SGD): Uses one example at a time
• Mini-batch Gradient Descent: Compromise using small batches (e.g., 32, 64, 128)

Weight update rule: θ = θ - η·∇J(θ), where η is the learning rate and ∇J(θ) is the gradient of the cost function.

Variant     | Speed  | Stability
Batch       | Slow   | Stable
Mini-batch  | Medium | Good
Stochastic  | Fast   | Noisy

[Figure: trajectories of batch, mini-batch, and stochastic gradient descent toward the minimum]

Reference: PyTorch Tutorial; Andrew Ng, Deep Learning Specialization
Regularization & Preventing Overfitting

Overfitting occurs when a model performs well on training data but poorly on unseen data. Regularization techniques help prevent this by constraining model complexity.

• L2 Regularization (Weight Decay): Adds a penalty term to the loss function proportional to the squared magnitude of the weights

  Loss = Original Loss + λ·Σ w²

• Dropout: Randomly deactivates neurons during training with some probability p, forcing the network to learn redundant representations
• Early Stopping: Monitor validation performance and stop training when it begins to degrade
• Batch Normalization: Normalizes the activations of a layer by adjusting and scaling them during training; helps stabilize training by addressing internal covariate shift, allowing higher learning rates and faster convergence

[Figure: an overfitted model vs. a regularized model fitted to the same training data points]
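A sketch of how dropout and weight decay are commonly wired up in PyTorch; the layer sizes and rates are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

# MLP with dropout between hidden layers
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes 50% of activations during training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)

# L2 regularization (weight decay) is applied through the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active during training
# ... training loop ...
model.eval()    # dropout disabled for validation / early-stopping checks
```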
Training Deep Neural Networks

• Training Process: "Programming with Data"
  ◦ Instead of coding a specific recognizer, a program is coded that can learn from a large, labeled dataset

• Iterative Approach
  ◦ Grabbing data
  ◦ Tweaking model "knobs" (parameters) to improve performance
  ◦ Repeating until convergence

• Supervised Learning
  ◦ The network learns from examples with known answers
  ◦ The goal is to generalize to unseen data

[Figure: visualization of the gradient descent optimization process]

Hyperparameters & Training Process

Hyperparameters are settings that control the training process and significantly impact model performance.

• Batch Size: Number of training examples used in one iteration. Smaller batches provide more noise in gradient estimates (potentially beneficial), while larger batches give more stable updates.
• Learning Rate: Controls how much we adjust model weights during training. Too high: may diverge. Too low: slow convergence or getting stuck in local minima.
• Epochs: Number of complete passes through the entire training dataset. More epochs allow more learning but risk overfitting.

• Use a validation set to monitor for overfitting and guide early stopping
• Learning rate schedules can improve training by reducing the learning rate over time

[Figure: training vs. validation loss over epochs, with the early stopping point where validation loss begins to rise]
Data Handling in Deep Learning

• Tensors: N-dimensional arrays fundamental for storing and manipulating data
• Modern frameworks: PyTorch and TensorFlow use tensors similar to NumPy ndarrays but with extended capabilities
• Key capabilities: Automatic differentiation and leveraging GPUs for accelerated computations
• Applications: Representation of signals, storage of trainable parameters, and intermediate calculations (activations)
• Vectorization: Enables efficient execution of operations on complete data sets instead of explicit iterations

[Figure: representation of tensors with different dimensions (vector, matrix, 3D tensor)]
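A short PyTorch sketch of the points above: tensor creation, GPU placement, a vectorized operation, and automatic differentiation.

```python
import torch

# Tensors of different ranks: vector, matrix, 3D tensor
v = torch.tensor([1.0, 2.0, 3.0])
m = torch.randn(3, 4)
t = torch.zeros(2, 3, 4)

# Move a tensor to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
m = m.to(device)

# Vectorized operation: no explicit Python loop over elements
y = (m * 2.0 + 1.0).sum()

# Automatic differentiation
w = torch.tensor([2.0, -1.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)   # tensor([ 4., -2.]) — d(w1^2 + w2^2)/dw = 2w
```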

From Shallow to Deep Networks

Deep networks are more expressive and powerful than shallow networks with the same number of parameters.

• Hierarchical representation learning: Each layer learns increasingly abstract features, capturing complex patterns that would require exponentially more neurons in a shallow network
• Compositional structure: Deep networks can efficiently represent compositional functions (functions of functions)
• Universal approximation: While shallow networks are universal approximators in theory, they often need impractically wide architectures
• Empirical success: Deep networks have consistently outperformed shallow counterparts across domains

[Figure: a shallow network with one wide hidden layer vs. a deep network with multiple hidden layers]

Challenges in Deep Networks
As networks grow deeper, training becomes increasingly difficult due
to fundamental issues with gradient propagation.

Problems:
• Overfitting: High performance on training data, poor on test data
• Vanishing Gradients: Gradients become extremely small in early
layers, causing slow or no learning
• Exploding Gradients: Gradients grow exponentially large, causing
unstable updates

Solutions:
• ReLU Activation: Prevents gradient saturation unlike sigmoid/tanh
• Proper Initialization: Xavier/He initialization maintains variance
across layers

• Batch Normalization: Normalizes layer inputs, stabilizing learning
• Residual Connections: Allow gradients to flow directly through the network

Numerical Stability & Initialization

Proper weight initialization is crucial for neural network training, affecting both convergence speed and final performance.

• Xavier/Glorot Initialization: Designed for symmetric activation functions (tanh, sigmoid). Weights are drawn from a normal distribution with:

  σ = sqrt(2 / (n_in + n_out))

• He Initialization: Optimized for ReLU activations, mitigating the dying ReLU problem. Uses variance:

  σ = sqrt(2 / n_in)

• Benefits: Prevents vanishing/exploding gradients, stabilizes learning by maintaining variance across layers, and accelerates convergence

Key Takeaway:
• Random init → unstable (exploding/vanishing).
• Xavier init → good for sigmoid/tanh, keeps variance under control.
• He init → best for ReLU, maintains strong gradients even in deep networks.
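A sketch of applying these schemes with PyTorch's built-in initializers; the layer sizes are placeholders:

```python
import torch.nn as nn

layer_tanh = nn.Linear(256, 128)
layer_relu = nn.Linear(256, 128)

# Xavier/Glorot: variance scaled by fan_in + fan_out, suited to tanh/sigmoid
nn.init.xavier_normal_(layer_tanh.weight)

# He/Kaiming: variance scaled by fan_in only, suited to ReLU
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")

# Biases are commonly started at zero
nn.init.zeros_(layer_tanh.bias)
nn.init.zeros_(layer_relu.bias)
```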

Reference: PyTorch, NCC Tutorial
Multilayer Perceptrons (MLPs)

Multilayer Perceptrons (MLPs) are the standard feedforward neural networks with multiple fully-connected hidden layers between input and output layers.

• Structure: Input layer → multiple hidden layers → output layer, with each neuron connected to every neuron in adjacent layers
• Key features: Non-linear activation functions, backpropagation learning, universal function approximation capabilities
• Applications: Classification, regression, pattern recognition, anomaly detection, and as components in more complex models
• Limitations: Less efficient for spatial data (images) and sequential data (text/time series) compared to specialized architectures

Reference: PyTorch Tutorial; Michael Nielsen, Neural Networks and Deep Learning
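A minimal PyTorch sketch of the structure described above; the 784-dimensional input and 10-class output are illustrative (MNIST-sized), not prescribed by the slide:

```python
import torch
import torch.nn as nn

# Input -> two fully-connected hidden layers -> output
mlp = nn.Sequential(
    nn.Linear(784, 256),   # input layer -> hidden layer 1
    nn.ReLU(),             # non-linear activation
    nn.Linear(256, 128),   # hidden layer 1 -> hidden layer 2
    nn.ReLU(),
    nn.Linear(128, 10),    # hidden layer 2 -> output scores (10 classes)
)

x = torch.randn(32, 784)   # a batch of 32 flattened inputs
logits = mlp(x)
print(logits.shape)        # torch.Size([32, 10])
```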

Introduction to Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are specialized deep neural networks designed to process data with grid-like topology, particularly effective for image analysis.

• Local Connectivity: Neurons connect only to a small region of the input, capturing local patterns efficiently
• Parameter Sharing: The same filters are applied across the entire image, drastically reducing parameters compared to fully-connected networks
• Translation Invariance: Ability to recognize patterns regardless of their position in the image
• Hierarchical Feature Learning: Early layers detect simple features (edges, corners); deeper layers compose these into complex concepts
• Computationally efficient with fewer parameters than fully-connected architectures
• Easy to parallelize across GPU cores, leading to speedups

[Figure: input image → feature maps → pooling → fully-connected layers → output]

CNN Architecture – Layers Explained

Convolutional Neural Networks are composed of specialized layers designed to extract and process visual features:

• Convolutional Layers: Apply filters to detect patterns (edges, textures, shapes), preserving spatial relationships through shared weights and local connectivity
• Pooling Layers: Downsample feature maps to reduce dimensionality, achieve spatial invariance, and control overfitting (max pooling, average pooling)
• Fully Connected Layers: Flatten feature maps and connect all neurons to the previous layer; used for high-level reasoning and final classification
• Additional Components: Activation functions (ReLU), batch normalization, dropout for regularization

[Figure: 224×224×3 input image → Conv → Pool → Conv → Pool → Flatten → FC layers → output; feature extraction followed by classification]
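A small PyTorch sketch of that layer pattern; the 224×224×3 input follows the figure, while the channel counts and class count are illustrative choices:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # Feature extraction
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters over the RGB image
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 112 -> 56
    # Classification
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),                  # fully-connected output layer
)

x = torch.randn(1, 3, 224, 224)   # one 224x224 RGB image
print(cnn(x).shape)               # torch.Size([1, 10])
```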

Pooling & Feature Maps

Pooling operations reduce the spatial dimensions of the feature maps, improving computational efficiency and providing spatial invariance.

• Max Pooling: Takes the maximum value from each window, preserving the most prominent features and dominant edges
• Average Pooling: Takes the average of all values in each window, preserving background information better
• Feature Maps: Outputs from each convolutional filter that represent detected patterns in the input (edges, textures, etc.)
• Pooling reduces sensitivity to exact feature locations, creating spatial invariance critical for classification tasks

Example: for the 4×4 feature map
  3 7 2 5
  1 9 8 4
  6 2 7 3
  4 5 1 2
a 2×2 max pooling gives [[9, 8], [6, 7]], while a 2×2 average pooling gives approximately [[5.0, 4.8], [4.3, 3.3]]. Max pooling selects the largest value in each window, while average pooling computes the mean value.
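The example above can be checked directly with PyTorch's pooling layers:

```python
import torch
import torch.nn as nn

fmap = torch.tensor([[3., 7., 2., 5.],
                     [1., 9., 8., 4.],
                     [6., 2., 7., 3.],
                     [4., 5., 1., 2.]]).reshape(1, 1, 4, 4)  # (batch, channel, H, W)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(fmap).squeeze())   # tensor([[9., 8.], [6., 7.]])
print(avg_pool(fmap).squeeze())   # tensor([[5.0000, 4.7500], [4.2500, 3.2500]])
```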

CNN Visualization

• Feature maps across layers: how input images are transformed as they pass through CNN layers (Conv1 detects edges, circles, and corners; Conv2 detects shapes and patterns; Conv3 detects faces and object parts)
• Filter interpretations: what different CNN filters actually detect in images (diagonal edges, horizontal and vertical lines, circular patterns, corners and rectangles, complex shapes)
• Deeper layers detect progressively more complex features: Edges → Textures → Patterns → Parts → Objects

CNN Example: LeNet, AlexNet

Landmark CNN architectures that shaped deep learning history. These networks pioneered techniques that remain fundamental to modern CNN design.

LeNet-5 (1998) – key features:
• First successful CNN architecture for digit recognition (MNIST)
• Pioneered the use of convolutional and pooling layers
• 7-layer network: 2 convolutional, 2 pooling, 3 fully-connected
(Ref. PyTorch tutorial)

AlexNet (2012) – key features:
• Won the 2012 ImageNet competition, revolutionizing computer vision
• Deeper architecture: 8 layers (5 convolutional, 3 fully-connected)
• Key innovations: ReLU activation, dropout regularization, GPU training
(Ref. Paravision Lab)

Modern CNNs - VGG and Blocks Concept
• Concept of "Blocks"

◦ VGG introduced the idea of using repeating patterns of layers, or "blocks", as a general template for designing deep networks

◦ This abstraction simplified implementation using loops and subroutines in modern deep learning frameworks

• VGG Block Structure

◦ Sequence of 3×3 convolutions with padding of 1 (maintaining height and width)

◦ Non-linearity such as ReLU

◦ 2×2 max-pooling layer with stride of 2 (halving height and width)

• VGG Network Architecture

◦ Composed by connecting several VGG blocks in succession

◦ Similar to AlexNet and LeNet, has a convolutional part and a fully connected part

◦ Example (VGG-11): Five convolutional blocks, output channels doubling from 64 to 512

• Family of Networks

◦ VGG defines a family of networks (VGG-16, VGG-19) with different speed-accuracy trade-offs

◦ This modularity simplifies network assembly through simple Python code
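One way to express the VGG block described above in PyTorch, a sketch along the lines of the referenced PyTorch/d2l tutorials:

```python
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """A VGG block: repeated 3x3 convolutions (padding 1) + ReLU,
    followed by 2x2 max pooling with stride 2 (halves height and width)."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-11-style convolutional part: five blocks, output channels doubling from 64 to 512
conv_arch = [(1, 64), (1, 128), (2, 256), (2, 512), (2, 512)]
blocks, in_ch = [], 3
for num_convs, out_ch in conv_arch:
    blocks.append(vgg_block(num_convs, in_ch, out_ch))
    in_ch = out_ch
body = nn.Sequential(*blocks)
```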


Network in Network (NiN)
• Fully Connected Layer Parameter Issue:
◦ Architectures like LeNet, AlexNet, and VGG use fully connected layers that consume
tremendous numbers of parameters
◦ VGG-11 requires approximately 400MB RAM for these layers alone
◦ Significant impediment for memory-limited devices like mobile phones

• 1×1 Convolution:

◦ Acts as a fully connected layer at each pixel location


◦ Introduces non-linearity at the channel level without increasing receptive field
◦ Key innovation in NiN architecture

• Global Average Pooling:


◦ Replaces final fully connected layers completely
◦ Uses NiN block with output channels equal to number of classes, followed by
global average pooling
◦ Significantly reduces model parameters without harming accuracy
◦ Adds translation invariance as a beneficial side effect

Ref. PyTorch tutorial
GoogLeNet and Inception

Ref. Visio.ai

• Multi-Branch Networks: GoogLeNet (2014) won the ImageNet Challenge by combining strengths of NiN and repeated blocks

• Stem-Body-Head Model: First network to clearly distinguish between data ingest (stem), processing (body) and prediction
(head)

• Inception Block Structure:


• 4 parallel branches with different convolution sizes (1×1, 3×3, 5×5)

• Uses 1×1 convolutions to reduce number of channels

• Pooling branch to capture different features

• Computational Efficiency: More efficient than predecessors (AlexNet, VGG) while providing improved accuracy

• Inspiration: The name "Inception" was inspired by the meme "We need to go deeper"
Designing Modern CNN Architectures

• AnyNet Design Space: A modern template composed of three parts:
  ◦ Stem: Initial image processing (larger convolutions, batch norm)
  ◦ Body: Network core with multiple stages at decreasing resolutions
  ◦ Head: Converts features into final predictions (e.g., via global average pooling)

• Automated Search vs Human Intuition: Evolution toward automated architecture exploration (RegNetX/Y)

• Shift to Transformers: While CNNs dominated due to inductive biases (locality, translation invariance), Transformers have begun to displace CNNs in accuracy for large-scale vision tasks

• Transformers succeed due to their flexibility and the availability of massive image collections for training

Ref. PyTorch tutorial

Recurrent Neural Networks (RNNs): Introduction

Recurrent Neural Networks (RNNs) are specialized neural networks designed for processing sequential data by maintaining a memory of previous inputs.

• Unlike feedforward networks, RNNs have connections that form cycles, allowing information to persist
• Ideal for tasks with temporal dependencies: text, speech, time series, video frames
• RNNs can process inputs of variable length, maintaining context across the sequence
• Applications: language modeling, translation, speech recognition, video analysis

[Figure: an RNN cell with hidden state h unfolded over time steps t-1, t, t+1, producing outputs o_{t-1}, o_t, o_{t+1}]
RNN Core Concepts
• Autoregressive Models

◦ Predict the current observation (xₜ) based on past observations (xₜ₋₁, ..., x₁)

◦ Often, only a window of recent past observations is considered to keep the number of arguments constant

◦ Latent autoregressive models maintain a summary (hₜ) of past observations, updating it along with the prediction

• Sequence Models / Language Models

◦ Estimate the joint probability of an entire sequence

◦ For natural language data, they are called language models

◦ Used for evaluating sequence likelihood or sampling sequences

• Gradient Challenges

◦ Due to the "depth" introduced by long sequences, RNNs can suffer from vanishing or exploding gradients

◦ Similar to issues in deep MLPs but exacerbated by the temporal dimension

◦ Gradient Clipping: A technique to mitigate exploding gradients by bounding the maximum norm of gradients
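A sketch of gradient clipping as mentioned in the last point, using PyTorch's built-in utility; the RNN, loss, and threshold are placeholders:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # placeholder RNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 steps, 8 features
output, h_n = model(x)
loss = output.pow(2).mean()        # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
# Bound the gradient norm before the update to tame exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```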
The Vanishing Gradient Problem in RNNs

Standard RNNs struggle with learning long-term dependencies in sequential data due to the vanishing gradient problem:

• During backpropagation through time, gradients are multiplied by the same weight matrix repeatedly
• If eigenvalues are < 1, gradients shrink exponentially with sequence length
• Early timesteps receive essentially no gradient signal, preventing learning of long-range dependencies

  ∂L/∂W₁ = ∏ᵏⱼ₌₁ Wⱼ · ∂L/∂Wₖ → 0 as k increases

• Example: An RNN might fail to predict "I grew up in France... I speak fluent French/English" because the relevant context is too far back

[Figure: RNN cells unrolled from t=1 to t=n; forward propagation flows left to right while the gradient flow diminishes toward earlier timesteps. As the sequence length increases, the gradient signal becomes increasingly weak for earlier timesteps.]

Long Short-Term Memory (LSTMs)

• Foundational "modern" RNN architecture (Hochreiter & Schmidhuber, 1997)
• Gated memory cell with an internal state and multiplicative gates
• Three types of gates:
  ◦ Input Gate (I): Determines whether a given input should impact the internal state
  ◦ Forget Gate (F): Controls whether the internal state should be reset, allowing the model to forget irrelevant past information
  ◦ Output Gate (O): Determines whether the internal state can influence the cell's output
• Internal state (C) and hidden state (H):
  ◦ The internal state acts as a "conveyor belt" for long-term memory
  ◦ Only the hidden state (H) is passed to the output layer
• Effectively addresses the vanishing/exploding gradient problem through the gating mechanisms
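A minimal PyTorch sketch showing the tensors an LSTM layer produces; the sizes are illustrative. It returns the per-step hidden states plus the final hidden and internal (cell) states described above.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)        # batch of 4 sequences, 20 time steps, 8 features
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 20, 16]) — hidden state H at every step
print(h_n.shape)     # torch.Size([1, 4, 16])  — final hidden state H
print(c_n.shape)     # torch.Size([1, 4, 16])  — final internal (cell) state C
```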
GRUs & RNN Variants

The Gated Recurrent Unit (GRU) is a simplified alternative to LSTMs that maintains similar performance with reduced complexity.

• Simplified Design: Combines the forget and input gates into a single "update gate"
• Memory Efficiency: Merges the cell state and hidden state, requiring fewer parameters
• Key Advantages:
  - Faster training time
  - Often better performance on smaller datasets
  - Simpler implementation
• Other RNN Variants: Depth Gated RNNs, Clockwork RNNs, Bidirectional RNNs

GRU vs LSTM:
Aspect | LSTM                                             | GRU
Gates  | 3 gates (forget, input, output)                  | 2 gates (update, reset)
State  | Separate cell state & hidden state               | Single state vector
Cost   | More parameters, better for complex dependencies | Fewer parameters, faster training

Deep and Bidirectional RNNs

• Stacking RNNs: Building networks that are "deep" not only in the time direction but also from input to output
○ Each RNN layer processes the sequence of outputs from the previous layer

• Hidden State Propagation: At time step t, each RNN cell depends on:
○ The state of the same layer at the previous step
○ The value from the previous layer at the same step

• Bidirectional RNNs (BiRNNs): Condition predictions on both leftward AND rightward context
○ Two unidirectional RNNs: one forward (→Ht) and one backward (←Ht)

○ Ideal for sequence encoding where entire context is available

• Training Complexity: BiRNNs are more costly to train due to long gradient chains through two independent RNNs
Encoder-Decoder Architecture
• Handling Variable-Length, Unaligned Sequences

◦ Standard approach for sequence-to-sequence problems (e.g., machine translation)

◦ Addresses limitation of fixed-shape context variables in traditional RNNs

• Components

◦ Encoder: Processes variable-length input sequence into a fixed-shape state (context variable)

◦ Decoder: Acts as a conditional language model, using the context to predict target sequence tokens

◦ Encoder and decoder can be different types of neural networks

• Example: Machine Translation (English to French)

◦ Encoder processes English input ("They", "are", "watching", ".") into a state

◦ Decoder generates French translation token by token ("Ils", "regardent", ".")

• Fixed-Shape Context Variable

◦ Early seq2seq models (Sutskever et al., 2014) compressed entire input into single vector

◦ Creates bottleneck for long sequences - difficult to store all information in fixed-dimensional state
Detailed Encoder-Decoder Components
• Encoder:
◦ Transforms input sequence (x₁, x₂, ..., xₙ) into a context
representation
◦ Can be implemented using RNNs, LSTMs, GRUs, or
Transformers
◦ Outputs hidden states that capture input sequence
information

• Decoder:
◦ Generates output sequence (y₁, y₂, ..., yₘ) token by token
◦ Uses encoder's context and previous predictions at each step
◦ Training typically uses teacher forcing (feeding ground truth
as input)
• Implementation Options:
  ◦ RNN-based: Simplest but struggles with long sequences
  ◦ LSTM/GRU-based: Better handling of long-term dependencies
  ◦ Transformer-based: State-of-the-art performance via self-attention

[Figure: LSTM-based Encoder-Decoder architecture showing in detail how information flows from the input sequence through memory cells to output sequence generation]
Applications and Extensions of Encoder-Decoders

• Limitation of the Basic Encoder-Decoder: The "bottleneck problem", where all input information must be compressed into a fixed-size context vector, which is problematic for long sequences
• Attention Mechanism: A breakthrough extension that allows the decoder to focus on different parts of the input sequence at each decoding step
• Benefits of Attention: Solves the bottleneck problem, provides interpretability through visualization of attention weights, and significantly improves performance on long sequences

Source: Lena Voita's NLP Course (lena-voita.github.io)

Attention Mechanisms - Core Ideas

• Queries, Keys, and Values Analogy:
  ◦ Query (Q): What you're looking for in the input
  ◦ Keys (K): Labels or indices for different pieces of information
  ◦ Values (V): The actual information associated with the keys

• Attention Scoring Functions:
  ◦ Dot-Product Attention: Calculates similarity as the dot product between query and key vectors
  ◦ Additive Attention: Used when queries and keys have different lengths

• Limitations of Fixed-Length Context: Earlier sequence models compressed the entire input into a single fixed-length vector, which proved inefficient for long sequences
• Dynamic Focus on Relevant Parts: Attention allows the decoder to selectively "revisit" specific parts of the input sequence at each decoding step
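A sketch of dot-product attention over queries, keys, and values; the scaling by √d follows the common Transformer convention and the tensor sizes are illustrative:

```python
import math
import torch

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # query-key similarities
    weights = torch.softmax(scores, dim=-1)           # attention weights sum to 1
    return weights @ V, weights                       # weighted sum of values

Q = torch.randn(1, 3, 16)   # 3 queries of dimension 16
K = torch.randn(1, 5, 16)   # 5 keys
V = torch.randn(1, 5, 32)   # 5 values of dimension 32

context, weights = dot_product_attention(Q, K, V)
print(context.shape)   # torch.Size([1, 3, 32])
print(weights.shape)   # torch.Size([1, 3, 5]) — how much each query attends to each key
```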

Generative Models: Autoencoders, VAEs, GANs

Ref. Google developer

• Generative Models: Focus on synthesis rather than discrimination, fitting a density model to training data and sampling from it

• Variational Autoencoders (VAEs): Learn a latent representation of input data, consisting of an encoder mapping inputs to a latent space and a
decoder reconstructing inputs from this space

• Generative Adversarial Networks (GANs): Two competing networks - a Generator creates new data samples, while a Discriminator learns to distinguish
fake from real data

• DCGAN Architecture: Uses transposed convolutions in the generator to enlarge input size from noise vector to image, with batch normalization and
ReLU/Leaky ReLU activations

• Applications: Photorealistic image synthesis, style transfer, data augmentation, and creative content generation
The Transformer Architecture

• Solely Based on Attention Mechanisms: Unlike earlier models that might still rely
on RNNs or CNNs, the Transformer model is entirely based on attention
mechanisms, without any convolutional or recurrent layers

• Encoder-Decoder Structure: The overall architecture processes the input


(source) sequence through the encoder and generates the output (target)
sequence through the decoder

• Key Components:
• Multi-head attention (processing information from different
representation subspaces)

• Self-attention (allowing tokens to attend to each other)

• Positional encoding (adding order information)

• Feed-forward networks (processing at each position independently)

• Revolutionary Impact: The "Attention Is All You Need" approach has become
pervasive across language, vision, speech, and reinforcement learning
applications

Ref. PyTorch tutorial


Large-Scale Pretraining with Transformers
• Generalization through Pretraining

◦ Training on large datasets creates "generalist" models capable of multiple tasks

◦ Models learn rich representations adaptable for specific downstream tasks

• Three Modes of Transformer Pretraining

◦ Encoder-Only (e.g., BERT): Bidirectional context for classification and text understanding tasks

◦ Encoder-Decoder (e.g., T5, BART): Text-to-text transformation tasks like translation

◦ Decoder-Only (e.g., GPT series): De facto architecture for large language models

• Scalability

◦ Performance improves with larger models, more data, and more compute

◦ Power-law scaling: predictable improvements with resource increases

• Emergent Abilities and Adaptation

◦ Fine-tuning: Adapting pretrained models for specific tasks

◦ In-context learning: Zero/few-shot task performance without parameter updates

◦ Alignment: Fine-tuning with human feedback (e.g., InstructGPT, ChatGPT)


Practical Considerations and Future Directions
• Computational Performance

◦ GPUs and TPUs: Accelerating deep learning computations through parallel processing capabilities

◦ Distributed Training: Splitting data and work across multiple GPUs or machines for large models

◦ Memory Management: Critical for efficiency between CPU and GPU memory

• Optimizing for Efficiency

◦ Quantization: Reducing parameter precision (FP32 → FP16, INT8) to decrease memory footprint and computational cost

◦ Model Merging: Techniques like TIES-Merging and Task Arithmetic to combine parameters of multiple fine-tuned models

• Foundation Models

◦ Large pretrained models (LLMs, visual language models) repurposed for various tasks

◦ "Foundational" because they capture broad knowledge from vast datasets for downstream applications

• Beyond Traditional Tasks

◦ Neural Style Transfer: Combining content and style of different images using pretrained CNNs

◦ Object Detection: Models like SSD and R-CNN predicting classes and bounding boxes for objects

◦ Semantic Segmentation: Classifying images at pixel level using Fully Convolutional Networks
Other Advanced Topics

Beyond CNNs and RNNs, several advanced deep learning paradigms are driving innovation in AI research and applications:

Transfer Learning:
• Uses knowledge gained from solving one problem to help solve a related one
• Pre-trained models serve as starting points, requiring less data and training time
• Fine-tuning strategies: feature extraction vs. end-to-end fine-tuning

Self-Supervised Learning:
• Learns without explicit labels by discovering patterns and structure in data
• Can adapt to new data distributions without human intervention
• Applications: anomaly detection, representation learning for sensor data, pretraining models for downstream tasks

Reinforcement Learning:
• Agents learn optimal behavior through trial-and-error with environment feedback
• Deep RL combines neural networks with reinforcement learning frameworks
• Applications: game playing (AlphaGo), robotics, autonomous systems, resource management

Deep Learning in the Real World

Moving deep learning models from research to production presents unique challenges and requires careful consideration of deployment pipelines and operational requirements.

• ML Lifecycle Management: Data preparation, model training, evaluation, deployment, and continuous monitoring
• Model Optimization: Quantization, pruning, distillation to reduce size and increase inference speed
• Key Challenges: Explainability, fairness, bias mitigation, computation costs, and latency requirements
• Deployment Options: Cloud services, edge devices, mobile, hardware accelerators (GPUs, TPUs, specialized chips)
• Regulatory Compliance: Data privacy, security concerns, and industry-specific regulations

[Figure: ML lifecycle — data collection → data processing → feature engineering → model development (training, validation, testing) → model optimization → integration & deployment → monitoring & feedback → continuous improvement]

Conclusion & Takeaways

Key Takeaways:
• Deep learning excels with large datasets and complex patterns where traditional ML methods fall short
• Understanding backpropagation and gradient descent is fundamental to training neural networks
• Different architectures (CNNs, RNNs, Transformers) are specialized for different data types and tasks
• Regularization, proper initialization, and optimization techniques are crucial for effective training

Conclusion:
Deep learning has revolutionized artificial intelligence, enabling breakthroughs across numerous domains from computer vision to natural language processing. As the field continues to evolve, understanding these fundamental concepts provides the foundation for both applying existing models and developing new approaches.

Where to Go From Here:
Explore practical implementations with frameworks like PyTorch and TensorFlow, and stay current with research papers and open-source projects in the rapidly evolving field of deep learning.

References

Courses & Tutorials:
• Ng, Andrew. (2018–2025). Deep Learning Specialization. Coursera.
• Khapra, Mitesh M. (2017). Deep Learning Tutorial. The National Conference on Communications (NCC).
• PyTorch Team. (2023–2025). Learn the Basics — PyTorch Tutorials. pytorch.org.
• PyTorch Team. (2023–2025). Deep Learning with PyTorch: A 60 Minute Blitz. pytorch.org.

Books:
• Nielsen, Michael A. (2015). Neural Networks and Deep Learning. Determination Press. neuralnetworksanddeeplearning.com
• Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Websites & Blogs:
• Olah, Chris. (2015). Understanding LSTM Networks. colah's blog. colah.github.io
• GeeksforGeeks. (2023–2025). Backpropagation in Neural Network. geeksforgeeks.org
• Serokell. (2023). What is backpropagation in neural networks? serokell.io
• Mazur, Matt. (2015). A Step by Step Backpropagation Example. mattmazur.com

Papers:
• Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.

Note: If you identify any inaccurate content, unclear citations, or incorrect references, please contact: [email protected]
