Advanced Natural Language Processing
Lecture 2: Basics of Deep Learning
陈冠华 CHEN Guanhua
Department of Statistics and Data Science
Content
• Introduction
• History
• Model
• Optimization
• Training
• Coding
Data Analyses
• To create a function that maps an input X to an output Y
• Examples: machine translation (X = source sentence, Y = its translation), text classification (X = a document, Y = its label)
• To create such a system, we can use
• Manual creation of rules
• Machine learning from paired data <X, Y>
Machine Learning
• Statistical approach
• Generalized linear model, linear regression, logistic regression
• Gaussian mixture models
• Support vector machine (SVM)
• Decision trees, random forests
• Deep learning approach
• Modeling with different deep neural networks
Why Didn’t They Work Before?
• Datasets too small
• For machine translation, neural models are not really better until you have 1M+ parallel sentences (and really need a lot more)
• Optimization not well understood
• Good initialization
• Momentum-based and adaptive optimizers (Adagrad/Adam) work best out of the box
• Other innovations
• Word embedding
• Dropout, layer normalization, residual connection
• Large-scale computing system
Deep Learning
• Modeling with deep neural networks
• Optimized on big data
• Research contributions can come at every stage of the pipeline:
• Design a task
• Collect/create the dataset
• Design an evaluation metric
• Design a model
• Optimize a model
• Analyze and understand the model and results
Deep Learning Algorithm Sketch
• Create a model and define a loss
• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters
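A minimal PyTorch skeleton of this loop (a sketch; model, dataloader, loss_fn, and optimizer are hypothetical names assumed to be defined elsewhere):

for x, y in dataloader:
    pred = model(x)            # forward process: prediction
    loss = loss_fn(pred, y)    # forward process: loss
    optimizer.zero_grad()      # clear gradients from the previous step
    loss.backward()            # back propagation
    optimizer.step()           # update parameters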
Deep Learning Algorithm Sketch
Forward Pass: Data Input → Neural Network → Model Output, compared with the Ground Truth (Golden Label) to compute the Training Loss
Backward Pass: Training Loss → Gradient → Optimizer Update
Python files:
• [Link]
• [Link]
• [Link]
• [Link]
• [Link]
Dataset Split
When creating a system, use three sets of data
• Training Set
• Generally larger dataset, used during system design, creation, and learning of
parameters.
• Development/Validation Set
• Smaller dataset for testing different design decisions ("hyper-parameters").
• Test Set
• Dataset reflecting the final test scenario; do not use it for making design decisions.
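For instance, a random split with PyTorch (a sketch; the 80/10/10 ratio and toy data are illustrative assumptions):

import torch
from torch.utils.data import TensorDataset, random_split

# 1000 toy examples: 16 random features, binary labels.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Training / development (validation) / test split.
train_set, dev_set, test_set = random_split(
    dataset, [800, 100, 100],
    generator=torch.Generator().manual_seed(0),  # reproducible split
)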
Data is Very Important
Data matters along two dimensions:
• Scale: diverse tasks, difficulties
• Quality: correctness
Deep Learning Algorithm Sketch
• Create a model and define a loss
• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters
Different Model Structures
• Feed-forward NNs
• Recurrent NNs
• Convolutional NNs
• Transformer
Feed-forward Neural Network
• The units are connected with no cycles
• The outputs from units in each layer are passed to units in the next higher layer
• No outputs are passed back to lower layers
• Fully-connected (FC) layers
Feed-forward Neural Network
[Figure: inputs $x_1, x_2, x_3$ feed hidden layer 1 (units $h_1^{(1)}$ to $h_1^{(4)}$), which feeds hidden layer 2 (units $h_2^{(1)}$ to $h_2^{(4)}$)]
Feed-forward Neural Network
[Figure: the same network, highlighting how unit $h_2^{(3)}$ is computed from all units of layer 1]

$$h_2^{(3)} = f\left(w_{3,1}^{(2)} h_1^{(1)} + w_{3,2}^{(2)} h_1^{(2)} + w_{3,3}^{(2)} h_1^{(3)} + w_{3,4}^{(2)} h_1^{(4)}\right)$$

Non-linearity (activation function) $f$: e.g., sigmoid or ReLU
Feed-forward Neural Network for Classification
• Use softmax to turn the output scores into a probability distribution (see the sketch below)
• Neural networks are difficult to optimize: SGD is only guaranteed to converge to a local minimum, so initialization and the choice of optimizer matter a lot!
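A minimal sketch of such a classifier in PyTorch (all layer sizes are illustrative):

import torch
import torch.nn as nn

# Two fully-connected layers with a non-linearity in between.
model = nn.Sequential(
    nn.Linear(3, 4),   # input (3 features) -> hidden layer (4 units)
    nn.ReLU(),
    nn.Linear(4, 5),   # hidden layer -> scores for 5 classes (logits)
)

x = torch.randn(1, 3)                     # one input example
probs = torch.softmax(model(x), dim=-1)   # softmax -> probability distribution
print(probs.sum())                        # tensor(1.) up to rounding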
Activation Functions
• Add non-linearities into neural networks
• Allow the networks to learn more powerful functions
• Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$; ReLU: $f(x) = \max(x, 0)$
Activation Functions
• GeLU (Gaussian Error Linear Unit): $f(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF
• Used in GPT-3, BERT, and many other models
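The three activations side by side (a quick sketch using PyTorch's built-in implementations):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x))   # squashes values into (0, 1)
print(F.relu(x))          # zeroes out negative values
print(F.gelu(x))          # smooth, ReLU-like: x * Phi(x)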
Loss Functions
• Given a labeled example $(x, y)$, we use a neural network to estimate the conditional probability $p(y \mid x)$ and predict the label as $\hat{y} = \arg\max_y p(y \mid x)$
• We compute how close our prediction is to the true label $y$ with a loss function
• Classification: cross-entropy
• Regression: L1 loss, L2 loss (a.k.a. mean squared error)
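Both loss families in PyTorch (a sketch; note that nn.CrossEntropyLoss takes raw logits, not probabilities):

import torch
import torch.nn as nn

# Classification: cross-entropy over raw logits.
logits = torch.randn(2, 5)        # 2 examples, 5 classes
labels = torch.tensor([1, 3])     # true class indices
ce_loss = nn.CrossEntropyLoss()(logits, labels)

# Regression: L1 and L2 (mean squared error) losses.
pred, target = torch.randn(2), torch.randn(2)
l1_loss = nn.L1Loss()(pred, target)
l2_loss = nn.MSELoss()(pred, target)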
Deep Learning Algorithm Sketch
• Create a model and define a loss (i.e., construct a computation graph)
• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters
Backpropagation
• Forward propagation: from the input layer to the output layer
• Back propagation: from the output layer to the input layer
Back-propagation in PyTorch
PyTorch does back-propagation for you in this one line of code: loss.backward()
A toy PyTorch example to train an NN model is sketched below.
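A minimal sketch, assuming a toy regression task (sizes and learning rate are illustrative):

import torch
import torch.nn as nn

# Toy regression data: the target is the sum of the input features.
x = torch.randn(100, 3)
y = x.sum(dim=1, keepdim=True)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    loss = loss_fn(model(x), y)   # forward pass
    optimizer.zero_grad()
    loss.backward()               # back-propagation happens in this one line
    optimizer.step()              # parameter update
print(loss.item())                # should be close to 0 after training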
Deep Learning Algorithm Sketch
• Create a model and define a loss (i.e., construct a computation graph)
• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters
Optimizer Update
• Most deep learning toolkits implement the parameter update as a single function call; in PyTorch this is optimizer.step()
• [Figure: parameter values before vs. after the optimizer update]
• The parameters can be updated with a standard SGD or Adam optimizer
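The effect on a single parameter, before and after the update (a sketch with a hypothetical one-parameter "model"):

import torch

w = torch.tensor([1.0], requires_grad=True)   # one learnable parameter
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()          # w.grad is now 2*w = 2.0
print(w.data)            # before optimizer update: tensor([1.0000])
optimizer.step()         # w <- w - lr * w.grad = 1.0 - 0.1 * 2.0
print(w.data)            # after optimizer update:  tensor([0.8000])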
Standard SGD
• Standard stochastic gradient descent: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$, where $\eta$ is the learning rate
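Written out by hand, the update is one line (equivalent to torch.optim.SGD without momentum):

import torch

eta = 0.1                                       # learning rate
theta = torch.tensor([1.0], requires_grad=True)
loss = (theta ** 2).sum()
loss.backward()                                 # compute the gradient

with torch.no_grad():
    theta -= eta * theta.grad                   # theta <- theta - eta * grad
theta.grad.zero_()                              # reset for the next step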
Adam Optimizer
• The most standard optimization option in NLP and beyond
• Considers a rolling average of the squared gradient, $v_t$, and momentum, $m_t$:
• Momentum: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
• Rolling average of the squared gradient: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
• Correction of bias: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
• Final parameter update: $\theta_t = \theta_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Further reading: how to use the optimizer in PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, betas=(0.99, 0.999))
Adam Optimizer
• Gradient descent | Khan Academy
• Intuition of Adam Optimizer
• Blog: An updated overview of recent gradient descent algorithms
• (paper) Convex Optimization: Algorithms and Complexity
• Course: Optimization for Machine Learning
Learning Rate
• Learning rate schedule [another link]: typically warmup at the start of training, then decay (see the sketch below)
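A sketch of warmup followed by decay using LambdaLR (the warmup length and decay form are illustrative assumptions, loosely Transformer-style):

import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
warmup_steps = 1000   # hypothetical choice

def lr_lambda(step):
    # Linear warmup to the base lr, then inverse-square-root decay.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return (warmup_steps / (step + 1)) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per training step, after optimizer.step().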
Tensors
• An n-dimensional array
• Widely used in neural networks
• Parameters in NNs consist of tensors of different shapes, which store both their values and their gradients (e.g., x, x.grad)
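A sketch showing a tensor carrying both a value and, after backward(), a gradient:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()    # y = x1^2 + x2^2
y.backward()

print(x)       # the tensor's values: tensor([2., 3.], requires_grad=True)
print(x.grad)  # dy/dx = 2x:          tensor([4., 6.])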
Tensor Operations
• Create tensors from a list, [Link]
• Matrix multiply vs. element-wise matrix multiply
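Both operations in code (a sketch):

import torch

a = torch.tensor([[1., 2.], [3., 4.]])   # create a tensor from a nested list
b = torch.ones(2, 2)

print(a @ b)   # matrix multiply (same as torch.matmul(a, b))
print(a * b)   # element-wise multiply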
Efficiency Tricks: Mini-batching
• On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10
• Mini-batching combines smaller operations into one big one
• Padding: examples in a batch can have different lengths, so shorter ones are padded to a common length (see the sketch below)
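Padding in code: shorter sequences are padded so the whole batch fits in one rectangular tensor (a sketch using pad_sequence; the token ids are illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" of different lengths.
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)   # shape (3, 3); the shorter rows are padded with 0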
Deep Learning Algorithm Sketch
Forward Pass: Data Input → Neural Network → Model Output, compared with the Ground Truth (Golden Label) to compute the Training Loss
Backward Pass: Training Loss → Gradient → Optimizer Update
Python files:
• [Link]
• [Link]
• [Link]
• [Link]
• [Link]
Different Learning Paradigms
• Supervised/unsupervised learning
• Self-supervised learning
• Transfer learning
• Few-shot/zero-shot learning
Further reading: Stanford STATS214 / CS229M: Machine Learning Theory
Thank you