Deep Learning
• NSF/ICERM Workshop on SciML, January 28-30, 2019 (Organizers: J. Hesthaven & G.E. Karniadakis)
• https://icerm.brown.edu/events/ht19-1-sml/
Fundamental Questions
Johns Hopkins Turbulence Database
Workflow in a Neural Network
[Figure: workflow schematic. The input x passes through hidden layers of σ units in the forward pass to produce the output, which is compared with the data; the error ∆y drives the backward pass.]
A Neural Network for Regression
❑ Define the affine transformation in the $\ell$-th layer: $T_\ell(z) = W_\ell z + b_\ell$
❑ Activation function $\sigma$, applied element-wise
Popular choice: $\sigma(x) = \max(0, x)$ (Rectified Linear Unit, ReLU)
❑ The hidden layers of a feedforward neural network are the composition
$u_\theta(x) = T_L\bigl(\sigma(T_{L-1}(\cdots \sigma(T_1(x))\cdots))\bigr)$
Hidden layers: input dim. $d$, width $N$
❑ Network parameters: $\theta = \{W_\ell, b_\ell\}_{\ell=1}^{L}$
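A minimal sketch of such a regression network in PyTorch; the input dimension, width, depth, and tanh activation below are illustrative choices, not values fixed by the slides.

```python
import torch
import torch.nn as nn

# Feedforward regression network: input dim d, several hidden layers of width N, scalar output.
class MLP(nn.Module):
    def __init__(self, d=2, N=20, n_hidden=3):
        super().__init__()
        layers = [nn.Linear(d, N), nn.Tanh()]            # first affine map T_1 + activation
        for _ in range(n_hidden - 1):
            layers += [nn.Linear(N, N), nn.Tanh()]        # interior affine maps + activations
        layers += [nn.Linear(N, 1)]                       # final linear layer (no activation)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
y = model(torch.rand(16, 2))   # forward pass on a batch of 16 points in R^2
```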
A Neural Network for Classification
❑ For classification with $K$ classes, apply the softmax function to the final layer's output $z \in \mathbb{R}^K$
• $\mathrm{softmax}(z)_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad k = 1, \ldots, K$
• The outputs are nonnegative and sum to one, so they can be interpreted as class probabilities.
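A quick numerical check of the softmax in PyTorch; the logits below are made up.

```python
import torch

z = torch.tensor([2.0, 0.5, -1.0])   # example logits for K = 3 classes (made-up values)
p = torch.softmax(z, dim=0)          # p_k = exp(z_k) / sum_j exp(z_j)
print(p, p.sum())                    # nonnegative entries that sum to 1
```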
Building Different NNs: ResNet
❑ Replace the layer map $z \mapsto \sigma(T_\ell(z))$ with $z \mapsto z + \sigma(T_\ell(z))$
❑ The added term is the identity function: a skip connection that lets information and gradients pass through the layer unchanged
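A minimal residual-block sketch in PyTorch; the tanh activation and width 20 are purely illustrative.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: output = input + sigma(T(input)), i.e. an identity skip connection."""
    def __init__(self, N=20):
        super().__init__()
        self.lin = nn.Linear(N, N)

    def forward(self, z):
        return z + torch.tanh(self.lin(z))   # identity shortcut added to the layer output

block = ResBlock()
out = block(torch.rand(4, 20))
```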
Universal Function Approximation (single layer)
Definition. We say that $\sigma$ is discriminatory if, for a measure $\mu \in M(I_n)$,
$\int_{I_n} \sigma(w^{T}x + b)\, d\mu(x) = 0$ for all $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ implies $\mu = 0$.
Here $I_n = [0,1]^n$, and the space of finite, signed regular Borel measures on $I_n$ is denoted by $M(I_n)$.
Theorem (Cybenko, 1989). Let $\sigma$ be any continuous discriminatory function. Then finite sums of the form
$y(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(w_j^{T}x + b_j)$
are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $y(x)$ of the above form for which $|y(x) - f(x)| < \varepsilon$ for all $x \in I_n$.
➢ Note: For a fixed number of terms $N$, the set of all such functions $y$ does not form a vector space, since it is not closed under addition.
G. Cybenko, "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals and Systems, 2(4), 303-314, 1989.
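A hedged numerical illustration of the theorem: a single hidden layer of sigmoidal units, $\sum_j \alpha_j\,\sigma(w_j x + b_j)$, trained to approximate a smooth target. The width 50, the target $\sin(2\pi x)$, and the optimizer settings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.linspace(0.0, 1.0, 200).reshape(-1, 1)
f = torch.sin(2.0 * torch.pi * x)                      # made-up target function on [0, 1]

# One hidden sigmoid layer followed by a linear combination: sum_j alpha_j * sigma(w_j x + b_j)
net = nn.Sequential(nn.Linear(1, 50), nn.Sigmoid(), nn.Linear(50, 1, bias=False))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(3000):
    opt.zero_grad()
    loss = torch.mean((net(x) - f) ** 2)
    loss.backward()
    opt.step()
print(loss.item())   # small MSE, consistent with density of such sums in C([0, 1])
```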
Universal Functional Approximation (single layer)
❑ Chen & Chen extend the universal approximation result from functions to continuous functionals: single-hidden-layer networks can approximate continuous functionals, with application to dynamical systems.
T.P. Chen and H. Chen, "Approximations of continuous functionals by neural networks with application to dynamic systems", IEEE Transactions on Neural Networks, 4(6), 910-918, 1993.
Adaptive Basis Viewpoint
❑ Write the network output as $u_\theta(x) = \sum_{i=1}^{N} c_i\, \phi_i(x; \theta_h)$,
where $c = \{c_i\}$ and $\theta_h$ are the parameters corresponding to the final linear layer and the hidden layers,
respectively. We interpret $\theta$ as the concatenation of $c$ and $\theta_h$.
Number of hidden layers: $L-1$; input dim. $d$; hidden-layer width $N$
❑ This viewpoint makes it clear that $\theta_h$ parameterizes the basis (like the FEM mesh & shape functions), while
$c$ are just the coefficients for these basis functions.
Shallow networks vs Deep networks
❑ Universal approximation can be achieved by increasing either the width or the depth
❑ Shallow networks: a single hidden layer suffices, provided the width is allowed to grow arbitrarily large
❑ Deep networks: a bounded width suffices if the depth grows (for ReLU NNs) [Hanin & Sellke, 2017]
❑ From the approximation point of view: deep networks perform better than shallow ones of comparable size [Mhaskar, 1996]
❑ Fewer neurons are needed to reach a prescribed approximation error when depth is exploited [Mhaskar, 1996]
❑ e.g., a 3-layer NN with 10 neurons per layer may be better than a 1-layer NN with 30 neurons (see the sketch after this list)
❑ To learn: choose the parameters $\theta$ so that the network fits the data
❑ Given a dataset $\{(x_i, y_i)\}_{i=1}^{M}$, minimize a loss such as the mean squared error $\frac{1}{M}\sum_i |u_\theta(x_i) - y_i|^2$
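As a small sanity check of the example above, a hedged sketch comparing a 3-hidden-layer net with 10 neurons per layer against a 1-hidden-layer net with 30 neurons; the 1-D input and output are assumed for illustration.

```python
import torch.nn as nn

deep = nn.Sequential(nn.Linear(1, 10), nn.ReLU(),
                     nn.Linear(10, 10), nn.ReLU(),
                     nn.Linear(10, 10), nn.ReLU(),
                     nn.Linear(10, 1))
shallow = nn.Sequential(nn.Linear(1, 30), nn.ReLU(), nn.Linear(30, 1))

count = lambda m: sum(p.numel() for p in m.parameters())
# Both use 30 hidden neurons in total; the deep net spends extra parameters on layer-to-layer weights.
print("deep:", count(deep), "shallow:", count(shallow))
```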
Activation Functions: Conventional vs Parameterized*
❑ Tanh: conventional $\tanh(x)$; parameterized $\tanh(a x)$
❑ Sigmoid: conventional $1/(1+e^{-x})$; parameterized $1/(1+e^{-a x})$
❑ Leaky ReLU: conventional $\max(0, x) + 0.01\,\min(0, x)$; parameterized $\max(0, x) + a\,\min(0, x)$
❑ ELU: conventional $x$ for $x > 0$ and $\alpha(e^{x} - 1)$ for $x \le 0$; parameterized version with a trainable scale $a$
❑ Swish: conventional $x\,\mathrm{sigmoid}(x)$; parameterized $x\,\mathrm{sigmoid}(a x)$
* $a$ is a trainable parameter (adaptive activation functions); see the sketch below.
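A hedged sketch of one parameterized activation, $\tanh(a x)$ with a trainable scale $a$; the initial value $a = 1$ and the surrounding layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """tanh(a * x) with a trainable activation parameter a."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return torch.tanh(self.a * x)

net = nn.Sequential(nn.Linear(2, 20), AdaptiveTanh(), nn.Linear(20, 1))
out = net(torch.rand(8, 2))   # a is updated by the optimizer along with the weights
```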
Differentiation: Four ways, but only one counts: Automatic Differentiation (AD)
• Hand-coded analytical derivative: error prone and tedious for deep networks
• Numerical differentiation (finite differences): truncation errors
• Symbolic differentiation (used in software programs such as Mathematica, Maxima, Maple, and the Python library SymPy)
• Automatic differentiation (AD): applies the chain rule to the elementary operations of the program
❑ High-order derivatives:
❑ Nested-derivative approach: apply first-order AD repeatedly
❑ Cost scales exponentially in the order of differentiation
❑ This is what we will use in this class, because of the simplicity of its implementation
❑ More efficient approaches exist, such as Taylor-mode AD (high-order chain rule)
❑ Not yet supported in TensorFlow/PyTorch
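A hedged sketch of the nested-derivative approach in PyTorch: first-order reverse-mode AD applied twice to obtain a second derivative. The toy function $u(x) = \sin(x)$ is illustrative.

```python
import torch

x = torch.linspace(0.0, 3.0, 50, requires_grad=True)
u = torch.sin(x)

# First application of AD: du/dx
du = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
# Second application (nested): d2u/dx2
d2u = torch.autograd.grad(du, x, grad_outputs=torch.ones_like(du), create_graph=True)[0]

print(torch.max(torch.abs(d2u + torch.sin(x))).item())   # ~0, since u'' = -sin(x)
```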
Backpropagation
❑ We apply the chain rule recursively to implement backprop
❑ Use computational graphs to accomplish backprop
❑ Example: with $c = a + b$ and $d = a \cdot c$, traversing the graph backwards gives
$\dfrac{\partial d}{\partial b} = \dfrac{\partial d}{\partial c} \cdot \dfrac{\partial c}{\partial b} = a \cdot 1 = a, \qquad \dfrac{\partial d}{\partial a} = c + a$
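The same example checked with autograd; the numerical values a = 2, b = 3 are made up.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a + b
d = a * c
d.backward()              # reverse traversal of the computational graph
print(b.grad, a.grad)     # dd/db = a = 2,  dd/da = c + a = 7
```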
Backpropagation
Lu et al., DeepXDE: A deep learning library for solving differential equations, SIAM Review, 2021.
Gradient Descent (GD)
▪ Some GD pathologies in non-convex loss landscapes
[Figure: GD trajectories near global minima when the learning rate is too small vs. too large; convergence depends strongly on the learning rate.]
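A minimal gradient-descent sketch on a toy 1-D loss, showing the update θ ← θ − η ∇L(θ); the quadratic loss and the learning rate 0.1 are illustrative.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # parameter to be learned
lr = 0.1                                    # learning rate (illustrative)

for _ in range(50):
    loss = (w - 3.0) ** 2                   # toy convex loss with minimizer w = 3
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad                    # gradient-descent update
        w.grad.zero_()
print(w.item())                             # close to 3; try lr = 1.1 to see divergence
```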
❑ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set (low-capacity models).
❑ Overfitting occurs when the gap between the training error and test error is too high (high-capacity models).
Overfitting
❑ The predicted function passes exactly through the 10 training points, making the training loss essentially zero while the test error remains large.
Vanishing and Exploding Gradients
❑ Different layers may learn at hugely different rates: for most NN architectures the gradients become smaller and smaller during backpropagation, leaving the weights of the lower layers essentially unchanged (vanishing gradients). In recurrent NNs, in addition, the gradients may explode.
❑ Exploding gradient: multiplying 100 Gaussian random matrices (all linear layers)
❑ This was the main obstacle to training DNNs until the early 2000s.
❑ 2010: breakthrough paper of Xavier Glorot & Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", Proc. 13th Int. Conf. on AI and Statistics, pp. 249-256, 2010.
❑ The main reasons were the then-popular sigmoid activation function and the normal distribution used to initialize the weights. The variance of the layer outputs increases monotonically from input to output, and the activation function saturates at 0 and 1 in the deep layers. Note that the mean of this activation function is 0.5.
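A hedged sketch of the "multiplying 100 Gaussian random matrices" experiment mentioned above (all linear layers, no activation); the matrix size 100×100 and standard-normal entries are illustrative.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 1)
for layer in range(100):
    W = torch.randn(100, 100)               # N(0, 1) weights, no variance scaling
    x = W @ x
    if (layer + 1) % 20 == 0:
        print(layer + 1, x.norm().item())   # the norm explodes (overflows to inf in float32)
```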
Xavier and He Weight Initializations
❑ Gradients should have equal variance before and after flowing through a layer in the reverse direction (fan-in/fan-out) – this led to the Xavier (or Glorot) initialization: $W_{ij} \sim \mathcal{N}\left(0,\ \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}\right)$
❑ He Normal (designed for ReLU activations): $W_{ij} \sim \mathcal{N}\left(0,\ \frac{2}{n_{\mathrm{in}}}\right)$
K. He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, 2015.
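A hedged sketch of applying these initializations with PyTorch's built-in initializers; the layer sizes are illustrative.

```python
import torch.nn as nn

tanh_layer = nn.Linear(20, 20)
relu_layer = nn.Linear(20, 20)

nn.init.xavier_normal_(tanh_layer.weight)                        # Var = 2 / (n_in + n_out)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He normal: Var = 2 / n_in
nn.init.zeros_(tanh_layer.bias)
nn.init.zeros_(relu_layer.bias)
```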
Data Normalization (Ref: https://zaffnet.github.io/batch-normalization)
In 1998, Yann LeCun, in his paper "Efficient BackProp", highlighted the importance of normalizing the inputs.
Preprocessing the inputs using normalization is a standard machine learning procedure and is known to help with faster convergence. Normalization is done to achieve the following objectives:
❑ The average of each input variable (or feature) over the training set is close to zero (mean subtraction).
❑ The input variables are scaled so that their variances are of comparable magnitude.
[Figures: gradient-descent behavior without data normalization vs. with data normalization]
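A minimal sketch of input normalization (mean subtraction and scaling); the synthetic data and feature scales are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((200, 3)) * np.array([1.0, 100.0, 0.01])   # features on very different scales

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_norm = (X_train - mean) / std        # each feature: zero mean, unit variance

# At test time, reuse the statistics computed on the training set:
X_test = rng.random((50, 3)) * np.array([1.0, 100.0, 0.01])
X_test_norm = (X_test - mean) / std
```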
An Overview of Gradient Descent Optimization Algorithms
https://ruder.io/optimizing-gradient-descent/
https://arxiv.org/abs/1609.04747
▪ This post explores how many of the most popular gradient-based optimization algorithms actually work.
A. This movie shows the behavior of the algorithms at a saddle point. Notice that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.
B. In this movie, we see their behavior on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness gained by looking ahead, and heads to the minimum.
➢ These two animations (image credit: Alec Radford) provide some intuition about the optimization behavior of most of the presented methods.
What Optimizer to Use?
Adam Optimizer: adaptive moment estimation
❑ Adam (adaptive moment estimation) is a hybrid method that combines the ideas of momentum optimization and RMSProp.
❑ Like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
❑ Steps 1, 2, and 5 in the algorithm below reveal Adam's close similarity to both momentum optimization and RMSProp.
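For reference, the five-step form of the standard Adam update (assumed to match the steps referred to above; $t$ is the iteration index):

$$
\begin{aligned}
1.\;& m \leftarrow \beta_1 m + (1-\beta_1)\, \nabla_\theta J(\theta) \\
2.\;& s \leftarrow \beta_2 s + (1-\beta_2)\, \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta) \\
3.\;& \hat{m} \leftarrow m / (1-\beta_1^{\,t}) \\
4.\;& \hat{s} \leftarrow s / (1-\beta_2^{\,t}) \\
5.\;& \theta \leftarrow \theta - \eta\, \hat{m} / (\sqrt{\hat{s}} + \varepsilon)
\end{aligned}
$$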
What Optimizer to Use?
Adam Optimizer: adaptive moment estimation (continued)
❑ Steps 3 and 4 can be explained as follows: since m and s are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the beginning of training.
❑ The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often
initialized to 0.999. The smoothing term ε is usually initialized to a small number such as 10⁻⁷.
❑ Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter η. We can
often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
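A hedged usage sketch in PyTorch with the hyperparameter values quoted above (note that PyTorch's own default for ε is 1e-8; here it is set explicitly to 1e-7 to match the slide).

```python
import torch

params = [torch.nn.Parameter(torch.zeros(10))]    # stand-in for the network parameters
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-7)

loss = (params[0] ** 2).sum()
loss.backward()
optimizer.step()                                  # one Adam update
```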
What Optimizer to Use?
▪ Liu, D.C. & Nocedal, J., On the limited memory BFGS method for large scale optimization, Mathematical Programming, 45(1), 503-528, 1989.
L-BFGS optimizer
❑ BFGS is the most popular of all quasi-Newton methods; it stores a dense approximation of the inverse Hessian and therefore has $O(n^2)$ storage complexity.
❑ L-BFGS (limited-memory BFGS) does not explicitly store the inverse Hessian approximation; instead it stores the previous $m$ update pairs and computes the search direction directly from this data. L-BFGS has storage complexity of $O(mn)$.
❑ An L-BFGS implementation is not straightforward in PyTorch or TF2. A detailed implementation will be discussed in Lecture 4, but here we provide the simple APIs for both.
# PyTorch
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None,
                  tolerance_grad=1e-07, tolerance_change=1e-09,
                  history_size=100, line_search_fn=None)

# TensorFlow (TensorFlow Probability)
tfp.optimizer.lbfgs_minimize(f,
                             initial_position=self.get_weights(),
                             num_correction_pairs=50,
                             max_iterations=2000)
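A hedged usage sketch for torch.optim.LBFGS, which requires a closure that re-evaluates the loss; the toy quadratic objective is illustrative.

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.LBFGS([w], lr=1, max_iter=20, history_size=100)

def closure():
    opt.zero_grad()
    loss = ((w - torch.tensor([1.0, -2.0])) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)      # each step runs up to max_iter L-BFGS iterations
print(w)                   # converges toward (1, -2)
```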
Loss Regularizers
Collapse of Deep and Narrow ReLU Neural Networks
L. Lu, Y. Shin, Y. Su, & G. E. Karniadakis. Dying ReLU and initialization: Theory and numerical examples. Communications in Computational Physics, 28(5), 1671–1706, 2020.
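A hedged illustration of the collapse phenomenon: with default random initialization, a very deep and narrow ReLU network often becomes a constant function at initialization. The depth, width, and 1-D input below are illustrative choices in the spirit of the cited paper, not its exact experiment.

```python
import torch
import torch.nn as nn

def build(depth=20, width=2):
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 200).reshape(-1, 1)
collapsed = sum(float(build()(x).std() < 1e-6) for _ in range(100))
print(f"{collapsed:.0f}/100 random initializations give a constant (collapsed) network")
```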
ReLU Collapse: One-Dimensional Examples
ReLU Collapse: Two-Dimensional Example
Does the Loss Type Matter?
Theoretical Analysis - I
Theoretical Analysis - II
Theoretical Analysis - III
Theoretical versus Numerical Results
Motivation and need for other neural networks
Cause: Diverse data formats and learning tasks
Rescue: Different neural network architectures
[Figure: example datatypes, namely tabular data (Breast Cancer Wisconsin Prognostic Data Set), image data (velocity and pressure fields for fluid flow), and time series data (contiguous U.S. average temperature).]
Convolutional Neural Network (CNN)
[Figure: a convolution layer applied to an input volume (width × height × depth) with 8 filters of size 2×2.]
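A hedged sketch matching the figure: a convolution layer with 8 filters of size 2×2; the single-channel 28×28 input is an illustrative assumption.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=2)   # 8 filters of size 2x2
img = torch.rand(1, 1, 28, 28)                                    # (batch, channels, height, width)
out = conv(img)
print(out.shape)                                                  # torch.Size([1, 8, 27, 27])
```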
FNN vs CNN
[Figure: side-by-side comparison of a fully connected network (FNN) and a convolutional neural network (CNN).]
CNN kernel vs finite difference stencil
❑ A convolution with a fixed kernel is exactly a finite difference stencil applied on a uniform grid.
❑ Partial derivative: the central difference $\partial u/\partial x \approx (u_{i+1,j} - u_{i-1,j})/(2h)$ corresponds to the kernel $\frac{1}{2h}[-1,\ 0,\ 1]$
❑ 2D Laplacian ($\Delta$): the five-point stencil corresponds to the 3×3 kernel $\frac{1}{h^2}\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
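A hedged sketch of the correspondence: the five-point Laplacian stencil applied as a fixed convolution kernel to a grid sampling of $u(x,y) = x^2 + y^2$, for which $\Delta u = 4$ exactly. The grid size and spacing are illustrative.

```python
import torch
import torch.nn.functional as F

h = 0.01
xs = torch.arange(-1.0, 1.0, h)
X, Y = torch.meshgrid(xs, xs, indexing="ij")
u = (X**2 + Y**2).reshape(1, 1, len(xs), len(xs))    # (batch, channel, H, W)

# Five-point Laplacian stencil written as a 3x3 convolution kernel, scaled by 1/h^2
kernel = torch.tensor([[0.0,  1.0, 0.0],
                       [1.0, -4.0, 1.0],
                       [0.0,  1.0, 0.0]]).reshape(1, 1, 3, 3) / h**2

lap = F.conv2d(u, kernel)          # valid convolution: interior grid points only
print(lap.mean().item())           # ~4, as expected for u = x^2 + y^2
```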
Resources
• Deep Learning. Ian Goodfellow, Yoshua Bengio, & Aaron Courville.
MIT Press, 2016. http://www.deeplearningbook.org
• Dive into Deep Learning. https://d2l.ai
• https://github.com/lululxvi/tutorials