Deep Learning
• NSF/ICERM Workshop on SciML, January 28-30, 2019 (Organizers: J. Hesthaven & G.E. Karniadakis)
• https://icerm.brown.edu/events/ht19-1-sml/
Fundamental Questions
Johns Hopkins Turbulence Database
Workflow in a Neural Network
[Figure: workflow schematic. The input x passes through hidden layers of σ units in the forward pass to produce the output, which is compared with the data; the error ∆y drives the backward pass.]
A Neural Network for Regression
❑ Define the affine transformation in the $\ell$-th layer: $T_\ell(z) = W_\ell z + b_\ell$
❑ Activation function $\sigma$, applied element-wise
Popular choice: $\sigma(x) = \max(0, x)$ (Rectified Linear Unit, ReLU)
❑ The hidden layers of a feedforward neural network are the composition
$u_\theta(x) = T_L\bigl(\sigma(T_{L-1}(\cdots \sigma(T_1(x))\cdots))\bigr)$
Hidden layers: input dim. $d$, width $N$
❑ Network parameters: $\theta = \{W_\ell, b_\ell\}_{\ell=1}^{L}$
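A minimal sketch of such a regression network in PyTorch; the input dimension, width, depth, and tanh activation below are illustrative choices, not values fixed by the slides.

```python
import torch
import torch.nn as nn

# Feedforward regression network: input dim d, several hidden layers of width N, scalar output.
class MLP(nn.Module):
    def __init__(self, d=2, N=20, n_hidden=3):
        super().__init__()
        layers = [nn.Linear(d, N), nn.Tanh()]            # first affine map T_1 + activation
        for _ in range(n_hidden - 1):
            layers += [nn.Linear(N, N), nn.Tanh()]        # interior affine maps + activations
        layers += [nn.Linear(N, 1)]                       # final linear layer (no activation)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
y = model(torch.rand(16, 2))   # forward pass on a batch of 16 points in R^2
```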
A Neural Network for Classification
❑ For classification with $K$ classes, apply the softmax function to the final layer's output $z \in \mathbb{R}^K$
• $\mathrm{softmax}(z)_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad k = 1, \ldots, K$
• The outputs are nonnegative and sum to one, so they can be interpreted as class probabilities.
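A quick numerical check of the softmax in PyTorch; the logits below are made up.

```python
import torch

z = torch.tensor([2.0, 0.5, -1.0])   # example logits for K = 3 classes (made-up values)
p = torch.softmax(z, dim=0)          # p_k = exp(z_k) / sum_j exp(z_j)
print(p, p.sum())                    # nonnegative entries that sum to 1
```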
Building Different NNs: ResNet
❑ Replace the layer map $z \mapsto \sigma(T_\ell(z))$ with $z \mapsto z + \sigma(T_\ell(z))$
❑ The added term is the identity function: a skip connection that lets information and gradients pass through the layer unchanged
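A minimal residual-block sketch in PyTorch; the tanh activation and width 20 are purely illustrative.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: output = input + sigma(T(input)), i.e. an identity skip connection."""
    def __init__(self, N=20):
        super().__init__()
        self.lin = nn.Linear(N, N)

    def forward(self, z):
        return z + torch.tanh(self.lin(z))   # identity shortcut added to the layer output

block = ResBlock()
out = block(torch.rand(4, 20))
```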
Universal Function Approximation (single layer)
Definition. We say that $\sigma$ is discriminatory if, for a measure $\mu \in M(I_n)$,
$\int_{I_n} \sigma(w^{T}x + b)\, d\mu(x) = 0$ for all $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ implies $\mu = 0$.
Here $I_n = [0,1]^n$, and the space of finite, signed regular Borel measures on $I_n$ is denoted by $M(I_n)$.
Theorem (Cybenko, 1989). Let $\sigma$ be any continuous discriminatory function. Then finite sums of the form
$y(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(w_j^{T}x + b_j)$
are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $y(x)$ of the above form for which $|y(x) - f(x)| < \varepsilon$ for all $x \in I_n$.
➢ Note: For a fixed number of terms $N$, the set of all such functions $y$ does not form a vector space, since it is not closed under addition.
G. Cybenko, "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals and Systems, 2(4), 303-314, 1989.
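A hedged numerical illustration of the theorem: a single hidden layer of sigmoidal units, $\sum_j \alpha_j\,\sigma(w_j x + b_j)$, trained to approximate a smooth target. The width 50, the target $\sin(2\pi x)$, and the optimizer settings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.linspace(0.0, 1.0, 200).reshape(-1, 1)
f = torch.sin(2.0 * torch.pi * x)                      # made-up target function on [0, 1]

# One hidden sigmoid layer followed by a linear combination: sum_j alpha_j * sigma(w_j x + b_j)
net = nn.Sequential(nn.Linear(1, 50), nn.Sigmoid(), nn.Linear(50, 1, bias=False))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(3000):
    opt.zero_grad()
    loss = torch.mean((net(x) - f) ** 2)
    loss.backward()
    opt.step()
print(loss.item())   # small MSE, consistent with density of such sums in C([0, 1])
```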
Universal Functional Approximation (single layer)
❑ Chen & Chen extend the universal approximation result from functions to continuous functionals: single-hidden-layer networks can approximate continuous functionals, with application to dynamical systems.
T.P. Chen and H. Chen, "Approximations of continuous functionals by neural networks with application to dynamic systems", IEEE Transactions on Neural Networks, 4(6), 910-918, 1993.
Adaptive Basis Viewpoint
❑ Write the network output as $u_\theta(x) = \sum_{i=1}^{N} c_i\, \phi_i(x; \theta_h)$,
where $c = \{c_i\}$ and $\theta_h$ are the parameters corresponding to the final linear layer and the hidden layers,
respectively. We interpret $\theta$ as the concatenation of $c$ and $\theta_h$.
Number of hidden layers: $L-1$; input dim. $d$; hidden-layer width $N$
❑ This viewpoint makes it clear that $\theta_h$ parameterizes the basis (like the FEM mesh & shape functions), while
$c$ are just the coefficients for these basis functions.
Shallow networks vs Deep networks
❑ Universal approximation can be achieved by increasing either the width or the depth
❑ Shallow networks: a single hidden layer suffices, provided the width is allowed to grow arbitrarily large
❑ Deep networks: a bounded width suffices if the depth grows (for ReLU NNs) [Hanin & Sellke, 2017]
❑ From the approximation point of view: deep networks perform better than shallow ones of comparable size [Mhaskar, 1996]
❑ Fewer neurons are needed to reach a prescribed approximation error when depth is exploited [Mhaskar, 1996]
❑ e.g., a 3-layer NN with 10 neurons per layer may be better than a 1-layer NN with 30 neurons (see the sketch after this list)
❑ To learn: choose the parameters $\theta$ so that the network fits the data
❑ Given a dataset $\{(x_i, y_i)\}_{i=1}^{M}$, minimize a loss such as the mean squared error $\frac{1}{M}\sum_i |u_\theta(x_i) - y_i|^2$
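As a small sanity check of the example above, a hedged sketch comparing a 3-hidden-layer net with 10 neurons per layer against a 1-hidden-layer net with 30 neurons; the 1-D input and output are assumed for illustration.

```python
import torch.nn as nn

deep = nn.Sequential(nn.Linear(1, 10), nn.ReLU(),
                     nn.Linear(10, 10), nn.ReLU(),
                     nn.Linear(10, 10), nn.ReLU(),
                     nn.Linear(10, 1))
shallow = nn.Sequential(nn.Linear(1, 30), nn.ReLU(), nn.Linear(30, 1))

count = lambda m: sum(p.numel() for p in m.parameters())
# Both use 30 hidden neurons in total; the deep net spends extra parameters on layer-to-layer weights.
print("deep:", count(deep), "shallow:", count(shallow))
```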
Activation Functions: Conventional vs Parameterized*
❑ Tanh: conventional $\tanh(x)$; parameterized $\tanh(a x)$
❑ Sigmoid: conventional $1/(1+e^{-x})$; parameterized $1/(1+e^{-a x})$
❑ Leaky ReLU: conventional $\max(0, x) + 0.01\,\min(0, x)$; parameterized $\max(0, x) + a\,\min(0, x)$
❑ ELU: conventional $x$ for $x > 0$ and $\alpha(e^{x} - 1)$ for $x \le 0$; parameterized version with a trainable scale $a$
❑ Swish: conventional $x\,\mathrm{sigmoid}(x)$; parameterized $x\,\mathrm{sigmoid}(a x)$
* $a$ is a trainable parameter (adaptive activation functions); see the sketch below.
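A hedged sketch of one parameterized activation, $\tanh(a x)$ with a trainable scale $a$; the initial value $a = 1$ and the surrounding layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """tanh(a * x) with a trainable activation parameter a."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return torch.tanh(self.a * x)

net = nn.Sequential(nn.Linear(2, 20), AdaptiveTanh(), nn.Linear(20, 1))
out = net(torch.rand(8, 2))   # a is updated by the optimizer along with the weights
```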
Differentiation: Four ways, but only one counts: Automatic Differentiation (AD)
• Hand-coded analytical derivative: error prone and tedious for deep networks
• Numerical differentiation (finite differences): truncation errors
• Symbolic differentiation (used in software programs such as Mathematica, Maxima, Maple, and the Python library SymPy)
• Automatic differentiation (AD): applies the chain rule to the elementary operations of the program
❑ High-order derivatives:
❑ Nested-derivative approach: apply first-order AD repeatedly
❑ Cost scales exponentially in the order of differentiation
❑ This is what we will use in this class, because of the simplicity of its implementation
❑ More efficient approaches exist, such as Taylor-mode AD (high-order chain rule)
❑ Not yet supported in TensorFlow/PyTorch
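A hedged sketch of the nested-derivative approach in PyTorch: first-order reverse-mode AD applied twice to obtain a second derivative. The toy function $u(x) = \sin(x)$ is illustrative.

```python
import torch

x = torch.linspace(0.0, 3.0, 50, requires_grad=True)
u = torch.sin(x)

# First application of AD: du/dx
du = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
# Second application (nested): d2u/dx2
d2u = torch.autograd.grad(du, x, grad_outputs=torch.ones_like(du), create_graph=True)[0]

print(torch.max(torch.abs(d2u + torch.sin(x))).item())   # ~0, since u'' = -sin(x)
```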
Backpropagation
❑ We apply the chain rule recursively to implement backprop
❑ Use computational graphs to accomplish backprop
❑ Example: with $c = a + b$ and $d = a \cdot c$, traversing the graph backwards gives
$\dfrac{\partial d}{\partial b} = \dfrac{\partial d}{\partial c} \cdot \dfrac{\partial c}{\partial b} = a \cdot 1 = a, \qquad \dfrac{\partial d}{\partial a} = c + a$
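The same example checked with autograd; the numerical values a = 2, b = 3 are made up.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a + b
d = a * c
d.backward()              # reverse traversal of the computational graph
print(b.grad, a.grad)     # dd/db = a = 2,  dd/da = c + a = 7
```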
Backpropagation
Lu et al., DeepXDE: A deep learning library for solving differential equations, SIAM Review, 2021.
Gradient Descent (GD)
▪ Some GD pathologies in non-convex loss landscapes
[Figure: GD trajectories near global minima when the learning rate is too small vs. too large; convergence depends strongly on the learning rate.]
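A minimal gradient-descent sketch on a toy 1-D loss, showing the update θ ← θ − η ∇L(θ); the quadratic loss and the learning rate 0.1 are illustrative.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # parameter to be learned
lr = 0.1                                    # learning rate (illustrative)

for _ in range(50):
    loss = (w - 3.0) ** 2                   # toy convex loss with minimizer w = 3
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad                    # gradient-descent update
        w.grad.zero_()
print(w.item())                             # close to 3; try lr = 1.1 to see divergence
```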
❑ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set (low-capacity models).
❑ Overfitting occurs when the gap between the training error and test error is too high (high-capacity models).
Overfitting
❑ The predicted function passes exactly through the 10 training points, making the training loss essentially zero while the test error remains large.
Vanishing and Exploding Gradients
❑ Different layers may learn at hugely different rates: for most NN architectures the gradients become smaller and smaller during backpropagation, leaving the weights of the lower layers essentially unchanged (vanishing gradients). In recurrent NNs, in addition, the gradients may explode.
❑ Exploding gradient: multiplying 100 Gaussian random matrices (all linear layers)
❑ This was the main obstacle to training DNNs until the early 2000s.
❑ 2010: breakthrough paper of Xavier Glorot & Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", Proc. 13th Int. Conf. on AI and Statistics, pp. 249-256, 2010.
❑ The main reasons were the then-popular sigmoid activation function and the normal distribution used to initialize the weights. The variance of the layer outputs increases monotonically from input to output, and the activation function saturates at 0 and 1 in the deep layers. Note that the mean of this activation function is 0.5.
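A hedged sketch of the "multiplying 100 Gaussian random matrices" experiment mentioned above (all linear layers, no activation); the matrix size 100×100 and standard-normal entries are illustrative.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 1)
for layer in range(100):
    W = torch.randn(100, 100)               # N(0, 1) weights, no variance scaling
    x = W @ x
    if (layer + 1) % 20 == 0:
        print(layer + 1, x.norm().item())   # the norm explodes (overflows to inf in float32)
```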
Xavier and He Weight Initializations
❑ Gradients should have equal variance before and after flowing through a layer in the reverse direction (fan-in/fan-out) – this led to the Xavier (or Glorot) initialization: $W_{ij} \sim \mathcal{N}\left(0,\ \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}\right)$
❑ He Normal (designed for ReLU activations): $W_{ij} \sim \mathcal{N}\left(0,\ \frac{2}{n_{\mathrm{in}}}\right)$
K. He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, 2015.
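A hedged sketch of applying these initializations with PyTorch's built-in initializers; the layer sizes are illustrative.

```python
import torch.nn as nn

tanh_layer = nn.Linear(20, 20)
relu_layer = nn.Linear(20, 20)

nn.init.xavier_normal_(tanh_layer.weight)                        # Var = 2 / (n_in + n_out)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He normal: Var = 2 / n_in
nn.init.zeros_(tanh_layer.bias)
nn.init.zeros_(relu_layer.bias)
```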
Data Normalization (Ref: https://zaffnet.github.io/batch-normalization)
In 1998, Yann LeCun, in his paper "Efficient BackProp", highlighted the importance of normalizing the inputs.
Preprocessing the inputs using normalization is a standard machine learning procedure and is known to help with faster convergence. Normalization is done to achieve the following objectives:
❑ The average of each input variable (or feature) over the training set is close to zero (mean subtraction).
❑ The input variables are scaled so that their variances are of comparable magnitude.
[Figures: gradient-descent behavior without data normalization vs. with data normalization]
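A minimal sketch of input normalization (mean subtraction and scaling); the synthetic data and feature scales are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((200, 3)) * np.array([1.0, 100.0, 0.01])   # features on very different scales

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_norm = (X_train - mean) / std        # each feature: zero mean, unit variance

# At test time, reuse the statistics computed on the training set:
X_test = rng.random((50, 3)) * np.array([1.0, 100.0, 0.01])
X_test_norm = (X_test - mean) / std
```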
An Overview of Gradient Descent Optimization Algorithms
https://ruder.io/optimizing-gradient-descent/
https://arxiv.org/abs/1609.04747
▪ This post explores how many of the most popular gradient-based optimization algorithms actually work.
A. This movie shows the behavior of the algorithms at a saddle point. Notice that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.
B. In this movie, we see their behavior on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness gained by looking ahead, and heads to the minimum.
➢ These two animations (image credit: Alec Radford) provide some intuition about the optimization behavior of most of the presented methods.
What Optimizer to Use?
Adam Optimizer: adaptive moment estimation
❑ Adam (adaptive moment estimation) is a hybrid method that combines the ideas of momentum optimization and RMSProp.
❑ Like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
❑ Steps 1, 2, and 5 in the algorithm below reveal Adam's close similarity to both momentum optimization and RMSProp.
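For reference, the five-step form of the standard Adam update (assumed to match the steps referred to above; $t$ is the iteration index):

$$
\begin{aligned}
1.\;& m \leftarrow \beta_1 m + (1-\beta_1)\, \nabla_\theta J(\theta) \\
2.\;& s \leftarrow \beta_2 s + (1-\beta_2)\, \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta) \\
3.\;& \hat{m} \leftarrow m / (1-\beta_1^{\,t}) \\
4.\;& \hat{s} \leftarrow s / (1-\beta_2^{\,t}) \\
5.\;& \theta \leftarrow \theta - \eta\, \hat{m} / (\sqrt{\hat{s}} + \varepsilon)
\end{aligned}
$$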
What Optimizer to Use?
Adam Optimizer: adaptive moment estimation (continued)
❑ Steps 3 and 4 can be explained as follows: since m and s are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the beginning of training.
❑ The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often
initialized to 0.999. The smoothing term ε is usually initialized to a small number such as 10⁻⁷.
❑ Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter η. We can
often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
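A hedged usage sketch in PyTorch with the hyperparameter values quoted above (note that PyTorch's own default for ε is 1e-8; here it is set explicitly to 1e-7 to match the slide).

```python
import torch

params = [torch.nn.Parameter(torch.zeros(10))]    # stand-in for the network parameters
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-7)

loss = (params[0] ** 2).sum()
loss.backward()
optimizer.step()                                  # one Adam update
```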
What Optimizer to Use?
▪ Liu, D.C. & Nocedal, J., On the limited memory BFGS method for large scale optimization, Mathematical Programming, 45(1), 503-528, 1989.
L-BFGS optimizer
❑ BFGS is the most popular of all quasi-Newton methods; it stores a dense approximation of the inverse Hessian and therefore has $O(n^2)$ storage complexity.
❑ L-BFGS (limited-memory BFGS) does not explicitly store the inverse Hessian approximation; instead it stores the previous $m$ update pairs and computes the search direction directly from this data. L-BFGS has storage complexity of $O(mn)$.
❑ An L-BFGS implementation is not straightforward in PyTorch or TF2. A detailed implementation will be discussed in Lecture 4, but here we provide the simple APIs for both.
# PyTorch
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None,
                  tolerance_grad=1e-07, tolerance_change=1e-09,
                  history_size=100, line_search_fn=None)

# TensorFlow (TensorFlow Probability)
tfp.optimizer.lbfgs_minimize(f,
                             initial_position=self.get_weights(),
                             num_correction_pairs=50,
                             max_iterations=2000)
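A hedged usage sketch for torch.optim.LBFGS, which requires a closure that re-evaluates the loss; the toy quadratic objective is illustrative.

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.LBFGS([w], lr=1, max_iter=20, history_size=100)

def closure():
    opt.zero_grad()
    loss = ((w - torch.tensor([1.0, -2.0])) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)      # each step runs up to max_iter L-BFGS iterations
print(w)                   # converges toward (1, -2)
```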
Loss Regularizers
Collapse of Deep and Narrow ReLU Neural Networks
L. Lu, Y. Shin, Y. Su, & G. E. Karniadakis. Dying ReLU and initialization: Theory and numerical examples. Communications in Computational Physics, 28(5), 1671–1706, 2020.
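A hedged illustration of the collapse phenomenon: with default random initialization, a very deep and narrow ReLU network often becomes a constant function at initialization. The depth, width, and 1-D input below are illustrative choices in the spirit of the cited paper, not its exact experiment.

```python
import torch
import torch.nn as nn

def build(depth=20, width=2):
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 200).reshape(-1, 1)
collapsed = sum(float(build()(x).std() < 1e-6) for _ in range(100))
print(f"{collapsed:.0f}/100 random initializations give a constant (collapsed) network")
```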
ReLU Collapse: One-Dimensional Examples
ReLU Collapse: Two-Dimensional Example
Does the Loss Type Matter?
Theoretical Analysis - I
Theoretical Analysis - II
Theoretical Analysis - III
Theoretical versus Numerical Results
Motivation and need for other neural networks
Cause: Diverse data formats and learning tasks
Rescue: Different neural network architectures
[Figure: example datatypes, namely tabular data (Breast Cancer Wisconsin Prognostic Data Set), image data (velocity and pressure fields for fluid flow), and time series data (contiguous U.S. average temperature).]
Convolutional Neural Network (CNN)
[Figure: a convolution layer applied to an input volume (width × height × depth) with 8 filters of size 2×2.]
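A hedged sketch matching the figure: a convolution layer with 8 filters of size 2×2; the single-channel 28×28 input is an illustrative assumption.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=2)   # 8 filters of size 2x2
img = torch.rand(1, 1, 28, 28)                                    # (batch, channels, height, width)
out = conv(img)
print(out.shape)                                                  # torch.Size([1, 8, 27, 27])
```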
FNN vs CNN
[Figure: side-by-side comparison of a fully connected network (FNN) and a convolutional neural network (CNN).]
CNN kernel vs finite difference stencil
❑ A convolution with a fixed kernel is exactly a finite difference stencil applied on a uniform grid.
❑ Partial derivative: the central difference $\partial u/\partial x \approx (u_{i+1,j} - u_{i-1,j})/(2h)$ corresponds to the kernel $\frac{1}{2h}[-1,\ 0,\ 1]$
❑ 2D Laplacian ($\Delta$): the five-point stencil corresponds to the 3×3 kernel $\frac{1}{h^2}\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
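A hedged sketch of the correspondence: the five-point Laplacian stencil applied as a fixed convolution kernel to a grid sampling of $u(x,y) = x^2 + y^2$, for which $\Delta u = 4$ exactly. The grid size and spacing are illustrative.

```python
import torch
import torch.nn.functional as F

h = 0.01
xs = torch.arange(-1.0, 1.0, h)
X, Y = torch.meshgrid(xs, xs, indexing="ij")
u = (X**2 + Y**2).reshape(1, 1, len(xs), len(xs))    # (batch, channel, H, W)

# Five-point Laplacian stencil written as a 3x3 convolution kernel, scaled by 1/h^2
kernel = torch.tensor([[0.0,  1.0, 0.0],
                       [1.0, -4.0, 1.0],
                       [0.0,  1.0, 0.0]]).reshape(1, 1, 3, 3) / h**2

lap = F.conv2d(u, kernel)          # valid convolution: interior grid points only
print(lap.mean().item())           # ~4, as expected for u = x^2 + y^2
```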
Resources
• Deep Learning. Ian Goodfellow, Yoshua Bengio, & Aaron Courville.
MIT Press, 2016. http://www.deeplearningbook.org
• Dive into Deep Learning. https://d2l.ai
• https://github.com/lululxvi/tutorials