Advanced Natural Language Processing

Lecture 2: Basics of Deep Learning

陈冠华 CHEN Guanhua


Department of Statistics and Data Science
Content

• Introduction
• History
• Model
• Optimization
• Training
• Coding

Data Analyses

• Goal: create a function that maps an input X to an output Y
• Examples: machine translation (X: source sentence, Y: translation), text classification (X: document, Y: label)

• To create such a system, we can use
• Manual creation of rules
• Machine learning from paired data <X, Y>

Machine Learning

• Statistical approach
• Generalized linear model, linear regression, logistic regression
• Gaussian mixture models
• Support vector machine (SVM)
• Decision trees, random forests
• Deep learning approach
• Modeling with different deep neural networks

Why Didn’t They Work Before?

• Datasets were too small
• For machine translation, neural models were not really better until you had 1M+ parallel sentences (and really need a lot more)
• Optimization was not well understood
• Good initialization
• Momentum and adaptive optimizers (Adagrad/Adam) work well out-of-the-box
• Other innovations
• Word embeddings
• Dropout, layer normalization, residual connections
• Large-scale computing system

Deep Learning

• Modeling with deep neural networks


• Optimized on big data
• Research Contributions

• Design a task
• Collect/create the dataset
• Design an evaluation metric
• Design a model
• Optimize a model
• Analyze and understand the model and results

Deep Learning Algorithm Sketch

• Create a model and define a loss


• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters

Deep Learning Algorithm Sketch

Forward pass: Data Input → Neural Network → Model Output, compared with the Ground Truth (golden label) to compute the Training Loss

Backward pass: Training Loss → Gradient → Optimizer Update

Accompanying Python files: [Link], [Link], [Link], [Link], [Link]

Dataset Split

When creating a system, use three sets of data


• Training Set
• Generally larger dataset, used during system design, creation, and learning of
parameters.
• Development/validation Set
• Smaller dataset for testing different design decisions ("hyper-parameters").
• Test Set
• Dataset reflecting the final test scenario, do not use for making design decisions.
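
A minimal sketch of such a split in plain Python (the 80/10/10 ratio and the random seed are illustrative assumptions, not values from the slide):

import random

def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a list of examples and split it into train/dev/test sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)      # fixed seed for reproducibility
    n_train = int(len(examples) * train_frac)
    n_dev = int(len(examples) * dev_frac)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]          # remaining examples
    return train, dev, test

train_set, dev_set, test_set = split_dataset(range(1000))
print(len(train_set), len(dev_set), len(test_set))   # 800 100 100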

Data is Very Important

• Data: both scale and quality matter
• Scale: diverse data, covering many tasks and difficulty levels
• Quality: correctness

Deep Learning Algorithm Sketch

• Create a model and define a loss


• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters

Different Model Structures
• Feed-forward NNs
• Recurrent NNs
• Convolutional NNs
• Transformer

Feed-forward Neural Network

• The units are connected with no cycles


• The outputs from units in each layer are passed to units in the next higher
layer
• No outputs are passed back to lower layers
• Fully-connected (FC) layers

Feed-forward Neural Network

[Diagram: a feed-forward network with inputs x1, x2, x3, a first hidden layer with units h1(1)–h1(4), and a second hidden layer with units h2(1)–h2(4)]

Feed-forward Neural Network

[Diagram: the same network, highlighting how unit h2(3) is computed from all four units of the first hidden layer]

$h_2^{(3)} = f\big(w_{3,1}^{(2)} h_1^{(1)} + w_{3,2}^{(2)} h_1^{(2)} + w_{3,3}^{(2)} h_1^{(3)} + w_{3,4}^{(2)} h_1^{(4)}\big)$

Non-linearity (activation function) $f$: e.g., sigmoid or ReLU (see the following slides)
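
A sketch of the same computation in PyTorch, with layer sizes 3 → 4 → 4 following the diagram (the ReLU non-linearity is an assumption; note that nn.Linear also adds a bias term, which the formula above omits):

import torch
import torch.nn as nn

x = torch.randn(3)            # inputs x1, x2, x3
layer1 = nn.Linear(3, 4)      # weights W(1): 4 x 3
layer2 = nn.Linear(4, 4)      # weights W(2): 4 x 4
f = torch.relu                # non-linearity

h1 = f(layer1(x))             # hidden layer 1: h1(1)..h1(4)
h2 = f(layer2(h1))            # hidden layer 2: h2(1)..h2(4)
print(h2.shape)               # torch.Size([4])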

Feedforward Neural Network for Classification

• Use softmax to turn the output layer into a probability distribution over classes

Neural networks are difficult to optimize: the loss surface is non-convex, so SGD may converge only to a local minimum. Initialization and the choice of optimizer matter a lot!
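
A small sketch of the softmax step in PyTorch (the three logit values are made up for illustration):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0])    # raw network outputs for 3 classes
probs = F.softmax(logits, dim=-1)          # probability distribution (sums to 1)
pred = probs.argmax()                      # predicted class index
print(probs, pred)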

Activation Functions

• Add non-linearities into neural networks


• Allowing the network to learn powerful non-linear functions

Sigmoid: $f(x) = \dfrac{1}{1 + e^{-x}}$        ReLU: $f(x) = \max(x, 0)$
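
Both activations as they could be evaluated in PyTorch (the input values are arbitrary):

import torch

x = torch.tensor([-2.0, 0.0, 3.0])
print(torch.sigmoid(x))   # 1 / (1 + exp(-x)) -> values in (0, 1)
print(torch.relu(x))      # max(x, 0) -> tensor([0., 0., 3.])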

Activation Functions

• GeLU (Gaussian Error Linear Unit)


• Used in GPT-3, BERT, and many other models
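
For reference, GELU is defined as $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard Gaussian CDF; a minimal PyTorch call (input values are illustrative):

import torch
import torch.nn as nn

gelu = nn.GELU()    # uses the exact Gaussian-CDF formulation by default
print(gelu(torch.tensor([-1.0, 0.0, 1.0])))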

Loss Functions

• Given a labeled example (x, y), we use a neural network to estimate the conditional probability p(y | x) and predict the label as ŷ = argmax_y p(y | x)

• We measure how close our prediction is to the true label with a loss function
• Classification: cross-entropy, e.g. $\mathcal{L} = -\log p(y \mid x)$ for the true label y

• Regression: L1 loss, L2 loss (a.k.a. mean squared error)
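
A small PyTorch sketch of both kinds of loss (the tensor values are illustrative):

import torch
import torch.nn.functional as F

# Classification: cross-entropy between predicted logits and the true class index
logits = torch.tensor([[2.0, 0.5, -1.0]])   # batch of 1 example, 3 classes
target = torch.tensor([0])                  # true class index
print(F.cross_entropy(logits, target))

# Regression: L2 loss (mean squared error) between prediction and true value
pred = torch.tensor([2.5])
true = torch.tensor([3.0])
print(F.mse_loss(pred, true))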

Deep Learning Algorithm Sketch

• Create a model and define a loss (i.e., construct a computation graph)


• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters

Backpropagation

Forward propagation: from the input layer to the output layer

Back propagation: from the output layer to the input layer

Back-propagation in PyTorch

PyTorch does back-propagation for you in one line of code (loss.backward())!

A toy PyTorch example of training an NN model is sketched below.
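
A minimal sketch of such a toy example, assuming a tiny two-layer classifier on synthetic data (the model sizes, data, and hyper-parameters are all illustrative):

import torch
import torch.nn as nn

# Toy model: 3 input features -> 4 hidden units -> 2 classes
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Synthetic training data: 32 examples with random features and labels
x = torch.randn(32, 3)
y = torch.randint(0, 2, (32,))

for epoch in range(10):
    optimizer.zero_grad()          # clear gradients from the previous step
    logits = model(x)              # forward pass: prediction
    loss = loss_fn(logits, y)      # forward pass: loss
    loss.backward()                # back-propagation in one line
    optimizer.step()               # parameter update
    print(epoch, loss.item())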

Deep Learning Algorithm Sketch

• Create a model and define a loss (i.e., construct a computation graph)


• For each example
• Forward process: calculate the result (prediction & loss) of that example
• if training
• Perform back propagation
• Update parameters

Optimizer Update

• Most deep learning toolkits implement the parameter update by calling a single function on the optimizer ([Link]; in PyTorch, optimizer.step())

[Figure: parameter values before vs. after the optimizer update]

• The parameters can be updated with a standard SGD or Adam optimizer
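
A tiny sketch of what such an update looks like in PyTorch (the single scalar parameter, the toy loss, and the learning rate are made up):

import torch

w = torch.tensor([1.0], requires_grad=True)    # one trainable parameter
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w * 2.0 - 1.0).pow(2).sum()            # toy loss
loss.backward()                                # fills w.grad

print("before optimizer update:", w.data)      # tensor([1.0000])
optimizer.step()                               # w <- w - lr * w.grad
print("after optimizer update:", w.data)       # tensor([0.6000])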

Standard SGD

• Standard stochastic gradient descent
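
For reference, the standard SGD update rule (presumably what the slide's formula showed) is

$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the (mini-batch) loss with respect to the parameters.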

Adam Optimizer

• Most standard optimization option in NLP and beyond


• Considers a rolling average of the (squared) gradient, and momentum

Momentum: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$

Rolling average of the squared gradient: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

Correction of bias: $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$

Final parameter update: $\theta_{t+1} = \theta_t - \eta\, \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

Further reading: how to use the optimizer in PyTorch

optimizer = [Link]([Link](), lr=0.0005, betas=(0.99, 0.999))
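
If the two [Link] placeholders refer to torch.optim.Adam and model.parameters() (an assumption, not stated on the slide), the full call would look like:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # any model with trainable parameters (hypothetical)
# Assumption: the [Link] placeholders stand for torch.optim.Adam and model.parameters()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, betas=(0.99, 0.999))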

Adam Optimizer

• Gradient descent | Khan Academy


• Intuition of Adam Optimizer
• Blog: An updated overview of recent gradient descent algorithms
• (paper) Convex Optimization: Algorithms and Complexity
• Course: Optimization for Machine Learning

Learning Rate

Learning rate schedule [another link] and warmup: the learning rate is typically increased from a small value during an initial warmup phase and then decayed over the rest of training.
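
A sketch of one common warmup-then-decay schedule using PyTorch's LambdaLR (the warmup length, step count, and inverse-square-root decay are illustrative choices, not from the slide):

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

warmup_steps = 1000

def lr_lambda(step):
    # Linear warmup for the first warmup_steps, then inverse-square-root decay
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return (warmup_steps / (step + 1)) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(5000):
    optimizer.step()     # (gradient computation omitted in this sketch)
    scheduler.step()     # update the learning rate for the next step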

Tensors

• An n-dimensional array

• Widely used in neural networks


• Parameters in NNs are tensors of different shapes, which store both their values and their gradients (e.g., x, [Link])
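
A small sketch of a tensor that tracks its gradient (the values are arbitrary):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # the tensor's values
y = (x ** 2).sum()                                      # a scalar computed from x
y.backward()                                            # compute dy/dx
print(x)         # the values
print(x.grad)    # the gradient: tensor([2., 4., 6.])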

Tensor Operations

• Create tensors from a list, [Link]
• Matrix multiply
• Element-wise matrix multiply
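
A sketch of these operations in PyTorch (the small matrices are illustrative):

import torch

# Create tensors from Python lists
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

print(a @ b)     # matrix multiply (same as torch.matmul(a, b))
print(a * b)     # element-wise multiply (same as torch.mul(a, b))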

Efficiency Tricks: Mini-batching

• On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10
• Mini-batching combines many smaller operations into one big one
• About padding: sequences in a mini-batch have different lengths, so shorter ones are padded to the length of the longest (see the sketch below)
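
A minimal sketch of padding variable-length sequences into one batch tensor (the token IDs are made up):

import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" of different lengths, as tensors of token IDs
seqs = [torch.tensor([5, 7, 2]),
        torch.tensor([9, 4]),
        torch.tensor([3, 8, 1, 6])]

# Pad with 0 so all rows have the same length, giving one (batch, max_len) tensor
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
print(batch.shape)   # torch.Size([3, 4])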

Deep Learning Algorithm Sketch

Forward pass: Data Input → Neural Network → Model Output, compared with the Ground Truth (golden label) to compute the Training Loss

Backward pass: Training Loss → Gradient → Optimizer Update

Accompanying Python files: [Link], [Link], [Link], [Link], [Link]

Different Learning Paradigms
• Supervised/unsupervised learning
• Self-supervised learning
• Transfer learning
• Few-shot/zero-shot learning

Stanford STATS214 / CS229M: Machine Learning Theory

Thank you
