Course 5 - Deep Learning Fundamentals
Introduction to AI
Perceptron invented by Frank Rosenblatt
LeNet-5 by Yann LeCun
AlexNet by Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
• Data Explosion: Massive growth in available training data from the internet, IoT devices, digitization, and cheap sensors
Suitable problems: structured and tabular data (traditional machine learning) vs. unstructured data such as images, text, and audio (deep learning)
y = f(∑ᵢ wᵢxᵢ + b)
• Inputs (xᵢ): Data features or outputs from previous neurons
• Weights (wᵢ): Parameters that determine input importance
• Bias (b): Additional parameter allowing the neuron to fit data better
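As a quick illustration (not from the course materials), the neuron equation can be evaluated directly with NumPy; the input, weight, and bias values below are made up, and sigmoid is one possible choice of activation f:

```python
import numpy as np

def sigmoid(z):
    # Squashes the pre-activation into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs, weights, and bias for a single neuron
x = np.array([0.5, 0.8, 0.2])   # inputs x_i
w = np.array([0.1, -0.3, 0.4])  # weights w_i
b = 0.05                        # bias b

net = np.dot(w, x) + b          # weighted sum: sum_i w_i * x_i + b
y = sigmoid(net)                # activation: y = f(net)
print(y)
```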
• Models
◦ The computational machinery for ingesting data of one type and outputting predictions of a possibly different type
• Layers
Parameters
• Learnable values within the model, updated during training
• Automatically optimized to minimize the loss function
• E.g.: weights and biases between neurons
◦ Vision models: tens of millions of parameters
◦ Language models: up to hundreds of billions of parameters
Hyperparameters
• Configurations set by the user that are not learned during training
• Manually chosen, often through experimentation or systematic search
• E.g.: learning rate, batch size, regularization parameters
◦ Learning rate: influences speed and stability of learning
◦ Batch size: impacts generalization and training speed
◦ Dropout rate: controls regularization to prevent overfitting
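For illustration, a minimal PyTorch sketch of this split (layer sizes and values are arbitrary): the weights and biases inside the model are parameters, while the learning rate and batch size are hyperparameters chosen by the user:

```python
import torch
from torch import nn

# Parameters: learnable weights and biases inside the model
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
num_params = sum(p.numel() for p in model.parameters())
print(f"learnable parameters: {num_params}")

# Hyperparameters: chosen by the user, not learned
learning_rate = 1e-3
batch_size = 64
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
```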
• Layer Connectivity: Fully-connected (dense) layers connect every neuron to all neurons in the previous layer
Sigmoid
• Maps input to (0,1) range
• Used for binary classification
• Suffers from vanishing gradient
Tanh
• Maps input to (-1,1) range
• Zero-centered output
• Still has vanishing gradient issue
ReLU
• Linear for positive inputs
• Zero for negative inputs
• Avoids vanishing gradient
• Most popular in deep networks
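A small NumPy sketch of the three activations (illustrative, not the course's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                  # maps to (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)          # linear for z > 0, zero otherwise

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```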
Forward Propagation
Input Layer → Hidden Layer → Output Layer
[Figure: worked example of a 3-2-1 network with inputs x₁=0.5, x₂=0.8, x₃=0.2 and fixed weights; the hidden activations evaluate to 0.56 and 0.53, and the network output is ŷ=0.65]
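A hedged reconstruction of this forward pass in NumPy; the weight placement below is an assumption for illustration and may not match the figure's exact assignment, though the output comes out near ŷ ≈ 0.65:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs from the example above; the weight layout is illustrative.
x = np.array([0.5, 0.8, 0.2])

W1 = np.array([[0.2, 0.3, 0.4],     # weights into hidden neuron 1
               [0.1, -0.1, 0.1]])   # weights into hidden neuron 2
W2 = np.array([[0.5, 0.6]])         # weights into the output neuron

h = sigmoid(W1 @ x)                 # hidden-layer activations
y_hat = sigmoid(W2 @ h)             # network output
print(h, y_hat)
```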
δᴸ = ∇ₐC ⊙ σ'(zᴸ)
Weight Update (Simplified)
Interpretation:
• Neuron 1 at the output should decrease its pre-activation (negative delta).
• Neuron 2 should increase slightly (positive delta).
In practice, all weights are updated simultaneously based on their contribution to the error.
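A minimal sketch of the output-layer delta and a single gradient-descent weight update, assuming a sigmoid activation and a quadratic cost (both assumptions made for illustration; the numbers are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values: hidden activations h, output weights W2, target y
h = np.array([0.56, 0.53])
W2 = np.array([0.5, 0.6])
z_L = W2 @ h                                # output pre-activation
a_L = sigmoid(z_L)                          # prediction
y = 1.0                                     # target

delta_L = (a_L - y) * sigmoid_prime(z_L)    # delta^L = grad_a C (.) sigma'(z^L)
grad_W2 = delta_L * h                       # gradient of the cost w.r.t. W2
eta = 0.5                                   # learning rate (hyperparameter)
W2 = W2 - eta * grad_W2                     # gradient descent step
print(delta_L, W2)
```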
• Batch Gradient Descent: Uses the entire training set to compute gradients
Reference: PyTorch Tutorial, Andrew Ng Deep Learning Specialization
Regularization & Preventing Overfitting
Overfitting occurs when a model performs well on training data but
poorly on unseen data. Regularization techniques help prevent this
by constraining model complexity.
• L2 Regularization (Weight Decay): Adds a penalty term to the loss function proportional to the squared magnitude of weights
Loss = Original Loss + λ·∑w²
• Dropout: Randomly deactivates neurons during training with a fixed probability, forcing the network to learn redundant representations
• Early Stopping: Monitor validation performance and stop training when it begins to degrade
[Figure: an overfitted model vs. a regularized model fitted to the same training data points]
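A short PyTorch sketch (illustrative sizes) showing dropout as a layer and L2 weight decay applied through the optimizer:

```python
import torch
from torch import nn

# Dropout layer inside the model; the drop probability is a hyperparameter.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes activations during training
    nn.Linear(64, 1),
)

# L2 regularization (weight decay): loss = original loss + lambda * sum(w^2),
# here applied via the optimizer's weight_decay argument.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active during training
model.eval()    # dropout disabled at inference time
```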
• Iterative Approach
◦ Grabbing data
◦ Tweaking model "knobs" (parameters) to improve performance
◦ Repeating until convergence
[Figure: visualization of the gradient descent optimization process]
• Supervised Learning
◦ The network learns from examples with known answers
Batch Size
Number of training examples used in one iteration. Smaller batches provide more noise in gradient estimates (potentially beneficial), while larger batches give more stable updates.
Learning Rate
Controls how much we adjust model weights during training. Too high: may diverge. Too low: slow convergence or getting stuck in local minima.
Epochs
Number of complete passes through the entire training dataset. More epochs allow more learning but risk overfitting.
[Figure: training vs. validation loss across epochs, with the early stopping point where validation loss begins to rise]
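Putting these hyperparameters together, a minimal PyTorch training loop with early stopping; the data, model, and patience value are placeholders for illustration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data just to make the loop runnable
X, y = torch.randn(512, 10), torch.randn(512, 1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(25):                      # epochs
    for xb, yb in train_loader:              # one iteration per mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

    val_loss = loss_fn(model(X), y).item()   # stand-in for a real validation set
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping
            break
```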
• Hierarchical representation learning: Each layer learns increasingly
abstract features, capturing complex patterns that would require
exponentially more neurons in a shallow network
Problems:
• Overfitting: High performance on training data, poor on test data
• Vanishing Gradients: Gradients become extremely small in early
layers, causing slow or no learning
• Exploding Gradients: Gradients grow exponentially large, causing
unstable updates
Solutions:
• ReLU Activation: Prevents gradient saturation unlike sigmoid/tanh
• Proper Initialization: Xavier/He initialization maintains variance
across layers
σ = √(2 / n_in)  (He initialization, where n_in is the number of inputs to the layer)
• Benefits: Prevents vanishing/exploding gradients, stabilizes learning by
maintaining variance across layers, and accelerates convergence
Key Takeaway:
• Random init → unstable (exploding/vanishing).
• Xavier init → good for sigmoid/tanh, keeps variance under control.
• He init → best for ReLU, maintains strong gradients even in deep networks.
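A brief PyTorch sketch of these initializers (layer sizes are arbitrary); kaiming_normal_ implements He initialization with σ = √(2/n_in), and xavier_normal_ the Xavier scheme:

```python
import torch
from torch import nn

layer = nn.Linear(256, 128)

# He initialization: weights ~ N(0, 2 / n_in), suited to ReLU networks
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Xavier initialization: keeps variance balanced for sigmoid/tanh networks
nn.init.xavier_normal_(layer.weight)

nn.init.zeros_(layer.bias)
```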
• Parameter Sharing: Same filters applied across the entire image, drastically reducing parameters compared to fully-connected networks
• Pooling Layers: Reduce spatial dimensionality, achieve spatial invariance, and control overfitting (max pooling, average pooling)
• Fully Connected Layers: Flatten feature maps and connect all neurons to the previous layer; used for high-level reasoning and final classification
[Figure: CNN pipeline - Input Image → Convolution → Pooling → Flatten → Output, i.e., feature extraction followed by classification]
• Max Pooling: Takes the maximum value from each window, preserving the most prominent features and dominant edges
• Average Pooling: Takes the average of all values in each window, preserving background information better
• Pooling reduces sensitivity to exact feature locations, creating spatial invariance critical for classification tasks
[Figure: 2×2 max pooling selects the largest value in each window, while average pooling computes the mean value]
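A small PyTorch sketch of 2×2 pooling; the first row of the input reuses the values shown above, the remaining rows are made up for illustration:

```python
import torch
from torch import nn

x = torch.tensor([[[[4., 5., 1., 2.],
                    [3., 6., 0., 7.],
                    [1., 2., 8., 3.],
                    [5., 4., 2., 9.]]]])   # shape (N=1, C=1, H=4, W=4)

max_pool = nn.MaxPool2d(kernel_size=2)     # keeps the largest value per 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)     # averages each 2x2 window

print(max_pool(x))   # 2x2 map of window maxima
print(avg_pool(x))   # 2x2 map of window means
```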
CNN
Deeper layers detect progressively more complex features:
Edges → Textures → Patterns → Parts → Objects
[Figure: learned features per layer - Conv2: shapes and patterns, Conv3: faces, final layer (Ref. Paravision Lab)]
LeNet-5 - Key Features:
• First successful CNN architecture for digit recognition (MNIST)
• Pioneered use of convolutional and pooling layers
• 7-layer network: 2 convolutional, 2 pooling, 3 fully-connected
AlexNet - Key Features:
• Won the 2012 ImageNet competition, revolutionizing computer vision
• Deeper architecture: 8 layers (5 convolutional, 3 fully-connected)
• Key innovations: ReLU activation, dropout regularization, GPU training
Modern CNNs - VGG and Blocks Concept
• Concept of "Blocks"
◦ VGG introduced the idea of using repeating patterns of layers, or "blocks", as a general template for designing deep networks
◦ This abstraction simplified implementation using loops and subroutines in modern deep learning frameworks (see the sketch after this list)
◦ Similar to AlexNet and LeNet, has a convolutional part and a fully connected part
◦ Example (VGG-11): Five convolutional blocks, output channels doubling from 64 to 512
• Family of Networks
◦ VGG defines a family of networks (VGG-16, VGG-19) with different speed-accuracy trade-offs
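As noted in the "Blocks" concept above, a VGG-style convolutional part can be assembled with a simple loop; the sketch below follows the VGG-11 layout (five blocks, output channels 64 → 512) and is illustrative rather than a faithful reimplementation:

```python
import torch
from torch import nn

def vgg_block(num_convs, in_channels, out_channels):
    # A VGG-style block: several 3x3 convolutions followed by 2x2 max pooling
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Convolutional part of a VGG-11-like network
conv_arch = [(1, 64), (1, 128), (2, 256), (2, 512), (2, 512)]
blocks, in_ch = [], 3
for num_convs, out_ch in conv_arch:
    blocks.append(vgg_block(num_convs, in_ch, out_ch))
    in_ch = out_ch
body = nn.Sequential(*blocks)
print(body(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 512, 7, 7])
```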
• 1×1 Convolution: Acts like a fully connected layer applied at each spatial position, mixing information across channels (the key idea behind NiN blocks)
Ref. Visio.ai
• Multi-Branch Networks: GoogLeNet (2014) won the ImageNet Challenge by combining strengths of NiN and repeated blocks
• Stem-Body-Head Model: First network to clearly distinguish between data ingest (stem), processing (body) and prediction
(head)
• Computational Efficiency: More efficient than predecessors (AlexNet, VGG) while providing improved accuracy
• Inspiration: The name "Inception" was inspired by the meme "We need to go deeper"
Designing Modern CNN Architectures
• Head: Converts features into final predictions (e.g., via global average
pooling)
[Figure: an RNN cell unfolded through time, with inputs xₜ, hidden states hₜ, and outputs oₜ]
Recurrent Neural Networks (RNNs) are specialized neural networks designed for processing sequential data by
maintaining a memory of previous inputs.
• Unlike feedforward networks, RNNs have connections that form cycles, allowing information to persist
• Ideal for tasks with temporal dependencies: text, speech, time series, video frames
• RNNs can process inputs of variable length, maintaining context across the sequence
• Applications: language modeling, translation, speech recognition, video analysis
RNN Core Concepts
• Autoregressive Models
◦ Predict the current observation (xt) based on past observations (xt-1, ..., x1)
◦ Often, only a window of recent past observations is considered to keep the number of arguments constant
◦ Latent autoregressive models maintain a summary (ht) of past observations, updating it along with the prediction
• Gradient Challenges
◦ Due to the "depth" introduced by long sequences, RNNs can suffer from vanishing or exploding gradients
◦ Gradient Clipping: A technique to mitigate exploding gradients by bounding the maximum norm of gradients
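A minimal PyTorch sketch of gradient clipping applied between the backward pass and the optimizer step (model, data, and loss are placeholders):

```python
import torch
from torch import nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 20, 8)            # (batch, sequence length, features)
output, h_n = model(x)
loss = output.pow(2).mean()          # placeholder loss just for illustration
loss.backward()

# Gradient clipping: bound the total gradient norm before the update
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```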
The Vanishing Gradient Problem in RNNs
Standard RNNs struggle with learning long-term
dependencies in sequential data due to the vanishing
gradient problem:
[Figure: RNN cells unrolled across time steps t=1, t=2, …, t=n]
• During backpropagation through time, gradients are multiplied by the same weight matrix repeatedly
LSTMs address this with gating mechanisms and an internal cell state:
◦ Forget Gate (F): Controls whether the internal state should be reset, allowing the model to forget irrelevant past information
◦ Output Gate (O): Determines whether the internal state can influence the cell's output
◦ Internal state acts as a "conveyor belt" for long-term memory
◦ Only the hidden state (H) is passed to the output layer
• Effectively addresses the vanishing/exploding gradient problem through the gating mechanisms
GRUs & RNN Variants
Gated Recurrent Unit (GRU) is a simplified alternative to LSTMs that maintains similar performance with reduced complexity.
• Simplified Design: Combines the forget and input gates into a single "update gate"
• Memory Efficiency: Merges cell state and hidden state, requiring fewer parameters
• Key Advantages:
- Faster training time
- Often better performance on smaller datasets
- Simpler implementation
LSTM vs. GRU comparison:
• LSTM: 3 gates (forget, input, output); separate cell state and hidden state; more parameters, better for complex dependencies
• GRU: 2 gates (update, reset); single state vector (hidden state only); fewer parameters, faster training
• Other RNN Variants: Depth Gated RNNs, Clockwork RNNs,
Bidirectional RNNs
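The parameter difference can be checked directly in PyTorch (sizes are arbitrary); an LSTM carries four gate weight sets versus three for a GRU, so the GRU ends up with roughly three quarters of the parameters:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=32, hidden_size=64)   # 4 weight sets (input, forget, candidate, output)
gru = nn.GRU(input_size=32, hidden_size=64)     # 3 weight sets (update, reset, candidate)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM parameters:", count(lstm))   # larger
print("GRU parameters:", count(gru))     # roughly 3/4 of the LSTM's
```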
• Stacking RNNs: Building networks that are "deep" not only in the time direction but also from input to output
○ Each RNN layer processes the sequence of outputs from the previous layer
• Hidden State Propagation: At time step t, each RNN cell depends on:
○ The state of the same layer at the previous step
○ The value from the previous layer at the same step
• Bidirectional RNNs (BiRNNs): Condition predictions on both leftward AND rightward context
○ Two unidirectional RNNs: one forward (→Ht) and one backward (←Ht)
• Training Complexity: BiRNNs are more costly to train due to long gradient chains through two independent RNNs
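A short PyTorch sketch of a deep bidirectional RNN (an LSTM here; sizes are arbitrary), showing how the forward and backward hidden states are concatenated at each time step:

```python
import torch
from torch import nn

# Two stacked layers, processing the sequence in both directions
birnn = nn.LSTM(input_size=16, hidden_size=32,
                num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 16)        # (batch, time steps, features)
output, (h_n, c_n) = birnn(x)
print(output.shape)               # (8, 50, 64): forward and backward states concatenated
print(h_n.shape)                  # (4, 8, 32): num_layers * num_directions
```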
Encoder-Decoder Architecture
• Handling Variable-Length, Unaligned Sequences
• Components
◦ Encoder: Processes variable-length input sequence into a fixed-shape state (context variable)
◦ Decoder: Acts as a conditional language model, using the context to predict target sequence tokens
◦ Encoder processes English input ("They", "are", "watching", ".") into a state
◦ Early seq2seq models (Sutskever et al., 2014) compressed entire input into single vector
◦ Creates bottleneck for long sequences - difficult to store all information in fixed-dimensional state
Detailed Encoder-Decoder Components
• Encoder:
◦ Transforms input sequence (x₁, x₂, ..., xₙ) into a context
representation
◦ Can be implemented using RNNs, LSTMs, GRUs, or
Transformers
◦ Outputs hidden states that capture input sequence
information
• Decoder:
◦ Generates output sequence (y₁, y₂, ..., yₘ) token by token
◦ Uses encoder's context and previous predictions at each step
◦ Training typically uses teacher forcing (feeding ground truth
as input)
• Implementation Options:
◦ RNN-based: Simplest but struggles with long sequences
◦ LSTM/GRU-based: Better handling of long-term dependencies
◦ Transformer-based: State-of-the-art performance via self-attention
[Figure: LSTM-based Encoder-Decoder architecture showing how information flows from the input sequence through memory cells to output sequence generation]
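A compact, illustrative sketch tying together the encoder, decoder, and teacher forcing described above, using GRUs and random tensors in place of real embeddings (all sizes are assumptions):

```python
import torch
from torch import nn

# Hypothetical sizes; embeddings are omitted and inputs are already vectors.
src = torch.randn(1, 7, 16)        # source sequence: (batch, src_len, features)
tgt = torch.randn(1, 5, 16)        # ground-truth target sequence (for teacher forcing)

encoder = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
decoder = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
out_proj = nn.Linear(32, 16)       # maps decoder states to output vectors

# Encoder compresses the variable-length input into a fixed-shape state
_, context = encoder(src)

# Decoder is conditioned on the context; with teacher forcing, the ground-truth
# target (shifted by one step in a real system) is fed as the decoder input.
dec_states, _ = decoder(tgt, context)
predictions = out_proj(dec_states)
print(predictions.shape)           # (1, 5, 16)
```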
Applications and Extensions of Encoder-Decoders
• Dynamic Focus on Relevant Parts: Attention allows the decoder to selectively "revisit"
specific parts of the input sequence at each decoding step
• Generative Models: Focus on synthesis rather than discrimination, fitting a density model to training data and sampling from it
• Variational Autoencoders (VAEs): Learn a latent representation of input data, consisting of an encoder mapping inputs to a latent space and a
decoder reconstructing inputs from this space
• Generative Adversarial Networks (GANs): Two competing networks - a Generator creates new data samples, while a Discriminator learns to distinguish
fake from real data
• DCGAN Architecture: Uses transposed convolutions in the generator to enlarge input size from noise vector to image, with batch normalization and
ReLU/Leaky ReLU activations
• Applications: Photorealistic image synthesis, style transfer, data augmentation, and creative content generation
The Transformer Architecture
• Solely Based on Attention Mechanisms: Unlike earlier models that might still rely
on RNNs or CNNs, the Transformer model is entirely based on attention
mechanisms, without any convolutional or recurrent layers
• Key Components:
• Multi-head attention (processing information from different
representation subspaces)
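At the heart of each attention head is scaled dot-product attention; a minimal sketch (dimensions are illustrative), with multi-head attention running several such heads in parallel over different learned projections:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # how much each query attends to each key
    return weights @ V

# One "head": 10 query/key/value vectors of dimension 64
Q = torch.randn(10, 64)
K = torch.randn(10, 64)
V = torch.randn(10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (10, 64)
```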
• Revolutionary Impact: The "Attention Is All You Need" approach has become
pervasive across language, vision, speech, and reinforcement learning
applications
◦ Encoder-Only (e.g., BERT): Bidirectional context for classification and text understanding tasks
◦ Decoder-Only (e.g., GPT series): De facto architecture for large language models
• Scalability
◦ Performance improves with larger models, more data, and more compute
◦ GPUs and TPUs: Accelerating deep learning computations through parallel processing capabilities
◦ Distributed Training: Splitting data and work across multiple GPUs or machines for large models
◦ Memory Management: Critical for efficiency between CPU and GPU memory
◦ Quantization: Reducing parameter precision (FP32 → FP16, INT8) to decrease memory footprint and computational cost (see the sketch after this list)
◦ Model Merging: Techniques like TIES-Merging and Task Arithmetic to combine parameters of multiple fine-tuned models
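A hedged sketch of the quantization point above using PyTorch utilities (the model and sizes are placeholders): half precision for FP16 and post-training dynamic quantization for INT8:

```python
import copy
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# FP32 -> FP16: half precision roughly halves the memory footprint
model_fp16 = copy.deepcopy(model).half()

# FP32 -> INT8: post-training dynamic quantization of the linear layers
model_int8 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```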
• Foundation Models
◦ Large pretrained models (LLMs, visual language models) repurposed for various tasks
◦ "Foundational" because they capture broad knowledge from vast datasets for downstream applications
◦ Neural Style Transfer: Combining content and style of different images using pretrained CNNs
◦ Object Detection: Models like SSD and R-CNN predicting classes and bounding boxes for objects
◦ Semantic Segmentation: Classifying images at pixel level using Fully Convolutional Networks
Other Advanced Topics
Beyond CNNs and RNNs, several advanced deep learning paradigms are driving innovation in AI research and applications:
• Key Challenges: Explainability, fairness, bias mitigation, computation costs, and latency requirements
• Deployment Options: Cloud services, edge devices, mobile, hardware accelerators (GPUs, TPUs, specialized chips)
[Figure: deployment lifecycle - Model Optimization → Integration & Deployment → Monitoring & Feedback → Continuous Improvement]
• Regulatory Compliance: Data privacy, security concerns, and industry-
specific regulations