Explaining the physics of transfer learning in data-driven turbulence modeling
[Link]
Advance access publication 23 January 2023
Research Report
Adam Subel^{a,1}, Yifei Guan^{a}, Ashesh Chattopadhyay^{a}, and Pedram Hassanzadeh^{a,b,*}
^{a}Department of Mechanical Engineering, Rice University, Houston, TX 77005, USA
^{b}Department of Earth, Environmental and Planetary Sciences, Rice University, Houston, TX 77005, USA
*To whom correspondence should be addressed: Email: pedram@[Link]
^{1}Present address: Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
Edited By: Yannis Yortsos
Abstract
Transfer learning (TL), which enables neural networks (NNs) to generalize out-of-distribution via targeted re-training, is becoming a
powerful tool in scientific machine learning (ML) applications such as weather/climate prediction and turbulence modeling. Effective
TL requires knowing (1) how to re-train NNs and (2) what physics are learned during TL. Here, we present novel analyses and a
framework addressing (1)–(2) for a broad range of multi-scale, nonlinear, dynamical systems. Our approach combines spectral (e.g.
Fourier) analyses of such systems with spectral analyses of convolutional NNs, revealing physical connections between the systems
and what the NN learns (a combination of low-, high-, band-pass filters and Gabor filters). Integrating these analyses, we introduce a
general framework that identifies the best re-training procedure for a given problem based on physics and NN theory. As a test case, we
explain the physics of TL in subgrid-scale modeling of several setups of 2D turbulence. Furthermore, these analyses show that in
these cases, the shallowest convolution layers are the best to re-train, which is consistent with our physics-guided framework but is
against the common wisdom guiding TL in the ML literature. Our work provides a new avenue for optimal and explainable TL, and a
step toward fully explainable NNs, for wide-ranging applications in science and engineering, such as climate change modeling.
Keywords: transfer learning, neural networks, subgrid-scale parameterization, turbulence modeling, climate modeling
Significance Statement
The use of deep neural networks (NNs) in critical applications such as weather/climate prediction and turbulence modeling is growing
rapidly. Transfer learning (TL) is a technique that enhances NNs’ capabilities, e.g. enabling them to extrapolate from one system to
another. This is crucial in applications such as climate change prediction, where the system substantially evolves in time. For effective
and reliable TL, we need to (a) understand the physics that is learned in TL and (b) have a framework guiding the TL procedure. Here, we
present novel analysis techniques and a general framework for (a)–(b) applicable to a broad range of multi-scale, nonlinear dynamical
systems. This is a major step toward developing interpretable and generalizable NNs for scientific machine learning.
[…] from a BNN that works with similar accuracy for a target system whose statistical properties could be different from those of the base system. For instance, this could be because of a change in physical properties (e.g. in the context of turbulence, an increase in Reynolds number, Re) or in external forcing (e.g. in the context of climate change, a higher radiative forcing due to increased greenhouse gases). We refer to this network as a TLNN. In TL, a (usually small) number of the layers of the BNN are re-trained, starting from their current weights, with a small number of re-training samples from the target system (e.g. Mtr/10 or Mtr/100 samples). The TL procedure, if properly formulated (as discussed later), can produce a TLNN whose out-of-sample accuracy for the target system is comparable to that of the BNN, despite using only a small amount of re-training data from the target system.

In thermo-fluid sciences and weather/climate modeling, a few studies have reported such success with TL for SGS closure modeling and spatio-temporal forecasting [18, 28, 22, 21, 27, 29, 26]. For example, in data-driven closure modeling with a convolutional NN (CNN) for large-eddy simulation (LES) of decaying 2D turbulence, Guan et al. [21] showed stable and accurate a posteriori (online)^b LES using only Mtr/100 re-training samples from a target system that had a 16× higher Re number. Aside from enabling generalization for one system when parameters change, TL can also be used to effectively blend datasets of different quality and length for training, e.g. a large, high-fidelity training set from high-resolution simulations and a very small but higher-quality re-training set from observations/experiments or much higher-resolution simulations [5, 32, 31, 30]. Such an application of TL in blending large climate model outputs and small observational datasets has shown promising results in forecasting El Niño–Southern Oscillation and daily weather [5, 32, 33]. Even further, TL has been suggested as a way to improve the training of physics-informed NNs, a novel PDE-solving technique [35, 34].

In the TL procedure, there is one critical decision to make: Which layer(s) to re-train? This is an important question, considering that the goal of TL is to find the best-performing TLNN given […]

In this paper, we use CNN-based non-local SGS closure modeling for LES of several setups of forced 2D turbulence as the test case. We first demonstrate the power of TL in enabling out-of-distribution generalization to 100× higher Re numbers, and even more challenging target flows. We further show that here, against the conventional wisdom in the ML literature, the shallowest layers are the best to re-train. Next, we leverage the fundamentals of turbulence physics and recent theoretical advances in ML to

1. explain what is learned during TL to a different turbulent flow, which is based around changes in the convolution kernels of the BNN after re-training to the TLNN, and these kernels' physical interpretation,
2. explain why the shallowest layers, rather than the deepest ones, are the best to re-train in these setups,
3. introduce a general framework to guide TL of similar systems based on a number of analysis steps that could be performed before re-training any TLNN.

While we use the SGS modeling of canonical 2D turbulence as the test case, the methods used for (1)–(2) and the framework in (3) can be readily applied to any other TL applications in turbulence or weather/climate modeling. More broadly, this framework can be used for TL applications beyond SGS modeling and for any multi-scale, nonlinear, high-dimensional dynamical systems.

2D turbulence: DNS and LES
The dimensionless governing equations of 2D turbulence in a doubly periodic square domain are:

∂ω/∂t + ∂ψ/∂y ∂ω/∂x − ∂ψ/∂x ∂ω/∂y = (1/Re) ∇²ω − [m_f cos(m_f x) + n_f cos(n_f y)] − rω,   (1a)

where the advection term on the left-hand side is denoted N(ω, ψ) and the bracketed forcing term is f(x, y). […]

Table 1. Physical and numerical parameters for the six different systems, which are divided into three cases, each with a base and a target system.

System           Re          mf   nf   r     NDNS    NLES
Base (Case 1)    3.2 × 10^4   0    0   0     2,048   128
Target (Case 1)  1 × 10^4     4    0   0.1   1,024   128
Base (Case 2)    1 × 10^3     4    0   0.1   512     128
Target (Case 2)  1 × 10^5     4    0   0.1   2,048   128
Base (Case 3)    2 × 10^4    25   25   0.1   1,024   128
Target (Case 3)  2 × 10^4     4    4   0.1   1,024   128

See Fig. 1 for snapshots and some of the statistical properties of these distinctly different flows.

By changing Re, r, mf, and nf, we have created six distinctly different flows, divided into three cases, each with a base and a target system (Table 1 and Materials and methods). We have shown in previous studies that for various setups of 2D turbulence, CNNs trained on large training sets, or on small training sets with physics constraints incorporated, produce accurate and stable data-driven closures in a priori (offline) and a posteriori (online) tests [21, 39]. These CNN-based closures were found to accurately capture both diffusion and backscattering, and to outperform widely used physics-based SGS closures such as the Smagorinsky, dynamic Smagorinsky, and mixed models in both a priori and a posteriori tests. In this paper, we focus on TL and addressing objectives (1)–(3) listed in the Introduction.

Closing the generalization gap using transfer learning
Before attempting to explain the physics of TL, we first show that TL enables our CNN-based SGS closures to effectively generalize between the base and target systems in each of the three cases. The first three rows of Fig. 1 demonstrate the differences in spatial scales between each pair of base and target systems. In Case 1, the base system is decaying turbulence while the target system is forced turbulence. From the ω and Π snapshots, their spectra, and the kinetic energy (KE) spectra, it is clear that the two systems are different at both the large and small scales. As a result of these substantial differences across all scales, the LES of the target system using a BNN trained on the base system (BNNbase) produces a KE spectrum that does not agree with that of the target system's FDNS (the truth). This indicates that the BNNbase fails to generalize here, leading to a generalization gap that is the difference between the two KE spectra (most noticeable at wavenumbers, k, larger than 10). Note that comparing the KE spectra of FDNS and LES is the most common measure of the a posteriori (online) performance of SGS closures.

Similar failures of the BNNbase to generalize are seen for Cases 2 and 3, leading to large generalization gaps in the KE spectra. In Case 2, the base system has Re = 10^3 and the target system has Re = 10^5. This 100× increase in the Re number leads to the development of more small-scale features in the target system, and changes the spectrum of Π in both large and small scales. In Case 3, the forcing of the base system is at wavenumber mf = nf = 25, while the target system's forcing is at mf = nf = 4. This decrease in forcing wavenumbers results in more (less) large-scale (small-scale) structures in the resolved flow, as seen in the spectra of both ω̄ and KE. This change in forcing wavenumber also leads to more large-scale structures in Π without any noticeable change in its small-scale structures. In short, Cases 1–3 represent six fluid flow systems that are different in terms of both the physics that drive the differences and the spatial scales of the resolved and SGS components.

In all three cases, TL closes the out-of-distribution generalization gap: LES of the target system using a TLNN (re-trained with Mtr/10 samples) produces a KE spectrum that matches that of the target system's FDNS. For the LES of the target system, the TLNN not only significantly outperforms the BNNbase, but is almost as good as the BNN trained on Mtr samples from the target system, BNNtarget (see the insets in Fig. 1).

Impact of re-training layer(s) on accuracy
Fig. 1 shows the power of TL in closing the generalization gaps. These results also show that, in contrast to the conventional wisdom, the best layers to re-train are not the deepest, but rather the shallowest ones. For each case, we have explored all possible combinations of 1, 2, and 3 hidden layers for re-training; i.e. each layer, each pair of layers, and each 3-layer combination. Based on the correlation coefficient of the Π terms from FDNS and TLNN, which is the most common metric for a priori (offline) tests, we have found that for Cases 2 and 3, re-training layer 2 alone is enough to get the best performance. For Case 1, re-training layers 2 and 5 provides the best performance, although most of the gap can be closed by re-training layer 2 alone.

To better understand the effects of "re-training layer" selection in TL, Fig. 2 shows the offline and online performance of TLNNℓ as a function of an individual re-trained hidden layer ℓ. In Case 1, the offline performance of TLNNs substantially declines as deeper layers are used for re-training (top row). As a result, TL with the deepest layers is completely ineffective; for example, LES with TLNN10 is as poor as LES with BNNbase, leaving a large generalization gap in the KE spectrum for k > 10 (bottom row). In contrast, LES with TLNN2 has a KE spectrum that closely matches that of the FDNS and only has a small generalization gap for k > 40 (as shown in Fig. 1, this gap is further closed when both layers 2 and 5 are re-trained). Similarly, in Case 3, the offline performance of TLNNs declines as ℓ increases. That said, in this case, TL with even the worst layer to re-train (ℓ = 10) is effective in closing the generalization gap in the online test. Still, LES with TLNN2 is slightly better than LES with TLNN10 (see the inset). In these two cases, there are substantial changes in the large scales of the inputs and outputs between the base and target systems (see the spectra of ω̄ and Π in Fig. 1). The offline results show a clear deterioration of the performance when moving from shallow to deep layers, which is due to the inability of the deeper layers to learn about changes in large scales during TL, as shown later.

In Case 2, the offline performance of TL is not a monotonic function of ℓ, though ℓ = 2 is still the best layer to re-train (ℓ = 7 is the worst), based on both offline and online results. The non-monotonicity emerges because changes between the base and target systems' ω̄ and Π occur predominantly at smaller scales (see their spectra in Fig. 1), which deeper layers are also able to learn during TL. For this case, as in Case 1, there is a noticeable difference in the online performance of the LES with TLNNs that use the best and worst performing re-trained layers.

The above analysis demonstrates that a poor selection of the re-training layer can lead to poor offline and/or online performance of the TLNN. This analysis also shows that in all three cases, re-training the shallowest layers consistently yields the best-performing TLNNs. This is in contrast to the conventional wisdom of TL, which is predominantly built on studies of the classification of static images, which often do not have a broad continuous spectrum of spatial scales [16, 25, 43].
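The spectral comparisons in rows 3–4 of Fig. 1 (and Step 1 of the framework introduced in the Discussion) boil down to angle-averaged spectra of the FDNS fields. The sketch below shows one way to compute and compare them in NumPy; the snapshot file names and the exact binning convention are illustrative assumptions, not necessarily those used for the published figures.

```python
import numpy as np

def angle_averaged_spectrum(field):
    """Angle-averaged spectrum of a doubly periodic 2D field: power summed
    over annuli k - 1/2 <= |k| < k + 1/2, with k = sqrt(kx**2 + ky**2)."""
    n = field.shape[0]
    power = np.abs(np.fft.fft2(field) / n**2) ** 2
    k1d = np.fft.fftfreq(n, d=1.0 / n)              # integer wavenumbers
    kmag = np.sqrt(k1d[:, None] ** 2 + k1d[None, :] ** 2)
    kbins = np.arange(1, n // 2 + 1)
    return kbins, np.array([power[np.abs(kmag - k) < 0.5].sum() for k in kbins])

def mean_spectrum(snapshots):
    """Average the angle-averaged spectrum over a stack of snapshots."""
    k = angle_averaged_spectrum(snapshots[0])[0]
    return k, np.mean([angle_averaged_spectrum(s)[1] for s in snapshots], axis=0)

# Hypothetical FDNS training sets for one base/target pair, shape (M, 128, 128).
pi_base = np.load("Pi_base.npy")          # assumed file name
pi_target = np.load("Pi_target.npy")      # assumed file name

k, E_base = mean_spectrum(pi_base)
_, E_target = mean_spectrum(pi_target)
# Large base/target differences at small k point to large-scale changes that,
# per the text, the shallow layers are best positioned to learn during TL.
```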
A spectral approach to interpreting transfer learning

Failure of deep layers to learn changes in large scales during transfer learning
To understand why different re-training layers lead to different TL performance, next, we conduct a spectral analysis of the CNNs in this section and the next one. The mathematical representation of CNNs is discussed in Materials and methods. Explained briefly, in our CNNs, inputs u = (ω̄, ψ̄) are passed through 11 sequential convolutional layers to predict the outputs, Π (Fig. 3). The hidden layers each have 64 channels. The output of channel j of layer ℓ, called the activation g_ℓ^j, is computed using Eq. 4: 64 kernels perform convolution on the activations g_{ℓ−1} of each of the 64 channels, and the outcome of these linear operations is sent through a ReLU nonlinear activation function, σ. Fig. 3 shows examples of g_ℓ^j, which are 128 × 128 matrices (the size of the LES grid). Note that these 64² kernels in each hidden layer extract information from the activations through spatial convolution, and their weight matrices W_ℓ^{β,j} ∈ ℝ^{5×5} are the main parameters that are learned during the training of a CNN.

In the second row of Fig. 3, we compare the all-channels-averaged Fourier spectra of the activations of the last hidden layer, 〈ĝ_10^j〉, from a fully trained BNNbase, TLNN2, and TLNN10 (〈·〉 represents averaging over all channels and ·̂ means Fourier transform). The spectrum of 〈ĝ_10^j〉 from TLNN2 differs from that of the BNNbase at most wavenumbers, including the small wavenumbers. This indicates that re-training layer 2 can account for differences in the output (Π) from the base and target flows at all scales, including the large scales. In contrast, the spectra from TLNN10 are almost the same as those from BNNbase at all scales (Case 1) or at large scales k < 10 (Cases 2 and 3). This indicates that re-training layer 10 cannot account for differences in the output from the base and target flows at large scales.
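The channel-averaged activation spectra 〈ĝ_ℓ〉 compared in Fig. 3 can be extracted from any trained network with a forward hook. The sketch below assumes a PyTorch implementation in which the ℓ-th convolutional layer can be referenced directly (the `bnn_base`, `tlnn_2`, and `hidden` names are hypothetical); the paper's own code may be organized differently.

```python
import torch

def channel_averaged_spectrum(model, layer, u):
    """Return <|g_hat|>: the magnitude of the 2D FFT of `layer`'s activations
    for input u, averaged over the batch and all 64 channels."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["g"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(u)                                  # forward pass fills `captured`
    handle.remove()

    g_hat = torch.fft.fft2(captured["g"])         # FFT over the two spatial dims
    return torch.abs(g_hat).mean(dim=(0, 1))      # shape: (128, 128)

# Hypothetical usage: compare the last hidden layer (ell = 10) of BNN_base and
# a TLNN on the same target-system inputs u_target of shape (batch, 2, 128, 128).
# spec_bnn = channel_averaged_spectrum(bnn_base, bnn_base.hidden[10], u_target)
# spec_tl2 = channel_averaged_spectrum(tlnn_2, tlnn_2.hidden[10], u_target)
```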
Fig. 1. Some comparisons between the base and target systems of the three cases (rows 1–3) and the ability of TL to close the generalization gaps in a
posteriori (online) LES (row 4). Parameters of the six systems are listed in Table 1, and these cases are further described in Materials and methods. Each
case consists of a base (left column) and a target (right column) system. The first and second rows show, respectively, the DNS snapshots of one of the
inputs to the CNNs, ω, and the snapshots of the SGS terms, Π, the output of the CNNs (note that NLES = 128 for all systems). These rows visualize the
substantial differences in the length scales dominating the base and target systems in each case. To further demonstrate these differences in spatial
scales, using the entire training sets and solid blue lines for base (top of legend) and solid red lines (bottom of legend) for target systems, we show the
angle-averaged spectra of ω̄ (left) and Π (right) in the third row, and the KE spectra of FDNS in the fourth row. In these panels, the horizontal axis is the
wavenumber k = √(k_x² + k_y²), where k_x and k_y are the wavenumbers in the x and y directions. The fourth row also shows the out-of-sample accuracy of the
NN-based closures: The KE spectra are from a posteriori LES of the target systems using SGS closures that are BNNs trained on Mtr samples from the base
systems (BNNbase, dashed blue lines) or from the target systems (BNNtarget, dashed red lines), or from the TLNN (black lines) re-trained using Mtr/10
samples (see Materials and methods for details of TL). In all three cases, there is a large generalization gap (difference between the dashed blue and solid
red lines), particularly for k > 10. In each case, TL closes this gap (black and solid red lines almost overlap for all k). Note that for the TL here, layers 2 and 5
are re-trained for Case 1, and layer 2 is re-trained for Cases 2 and 3 (see Section “Impact of re-training layer(s) on accuracy” and Fig. 2 for more
discussions).
Fig. 2. Online and offline performance of TLNNs as a function of the individual re-trained layer. For each individual layer re-trained with Mtr/10 samples,
the top row shows the most common measure of a priori (offline) accuracy of a SGS model: the correlation coefficient between Π from FDNS (truth) and
from the TLNN. The vertical lines on the bar plots show uncertainty measured as the standard deviation calculated over 100 random samples from the
testing set. The bottom row shows the KE spectra of the target systems’ FDNS and the KE spectra from a posteriori (online) LES with BNNbase or TLNNℓ,
where ℓ indicates the re-trained layer. These KE spectra are calculated using five long integrations, each equivalent to 10^6 Δt_DNS. Shading shows
uncertainty, estimated as 25th–75th percentiles of standard error calculated from partitioning each of the 5 runs into 10 sub-intervals. For each case, the
best (worst) individual layer to re-train is shown in red (blue) in both rows. The best- and worst-performing layers here are chosen based on the online
performance, i.e. how closely the KE spectrum matches that of the FDNS. Note that in Fig. 1, both layers 2 and 5 are re-trained during TL for Case 1, leading
to a better TLNN with LES’ KE spectrum matching that of the FDNS even at the highest wavenumbers. See Fig. S5 for the offline results of Case 3 with the
base and target systems switched.
Given that in all three cases there are large-scale differences in the Π terms between the base and target flows (Fig. 1), this analysis explains why re-training layer 10 (or other deep layers) leads to ineffective TL, while re-training layer 2 leads to the best TL performance.

To further understand what controls the spectra of g_ℓ^j, we have examined Eq. 8, which is the analytically derived Fourier transform of Eq. 4. As discussed in Materials and methods, this analysis shows that the Fourier spectrum of g_ℓ^j depends on the spectrum of ĝ_{ℓ−1}^j ∈ ℂ^{128×128}, the spectra of the weight matrices Ŵ_ℓ^{β,j} ∈ ℂ^{128×128} (and the constant biases b̂_ℓ^j ∈ ℝ), as well as where the linear activation h_ℓ^j(x, y) > 0 (defined in Eq. 7). The latter is a result of the Fourier transform of the ReLU activation function, the only source of nonlinearity in the calculation of g_ℓ^j. In Fig. S1, we have compared the spectra of the activations from layers 2 and 10 before and after applying the ReLU activation function. From this, we find that in all three cases, linear changes due to updating the weights substantially alter the spectra of the activations, while nonlinear changes only play a significant role in Case 1. These results (and further discussions in Materials and methods) suggest that a deeper insight into TL might be obtained by examining the spectra of the weight matrices, Ŵ_ℓ^{β,j}, and how they change from BNNbase to TLNN, as done next.

Spectral analysis of the kernels' weights
Before investigating how TL changes the spectra of the kernels' weights, let us first look at the spectra from the BNNbase of the three cases. A close examination of |Ŵ_ℓ^{β,j}| in different layers shows that the learned kernels are a combination of a number of known spectral filters. While visualizing all the 64² kernels in each layer is futile, we realize that the similarity across the spectra of many kernels allows us to meaningfully cluster them using the k-means algorithm. Fig. S2 presents the cluster centers (in Fourier space) for ℓ = 2 and 10 for each case. This figure shows that the learned kernels are a combination of coherent low-pass filters (row 1), high-pass filters (row 8), as well as band-pass and Gabor filters. It should be pointed out that the learning of Gabor filters by CNNs has been reported in the past for a number of applications, such as text recognition [44]. Even more broadly, the emergence of such filters for learning multi-scale, oriented, localized features has been reported in the sparse coding and vision literature [45].

Since deep CNNs contain a very large number of parameters (O(10^6)), it is often intractable to isolate the effect of each convolution kernel for either a BNN or TLNN. Moreover, investigating the learned convolution kernels in physical space (W_ℓ^{β,j} ∈ ℝ^{5×5}) does not lead to any meaningful physical understanding. Above, we show that examining the kernels in the spectral space (Ŵ_ℓ^{β,j} ∈ ℂ^{128×128}) leads to physically interpretable insight into their role as spectral filters. Still, due to the large number of parameters and the impact of nonlinearities, it is currently challenging to understand the physics learned by the entire BNN. Fortunately, due to the over-parameterized nature of these deep CNNs, TL occurs in the lazy training regime [46]. In this regime, significant changes occur in only a small number of kernels, as shown below. This opens an avenue for explaining what is learned in TL through examining the spectra of the few kernels with the largest changes.

For each case, we quantify the change in each kernel by computing the Frobenius norm of the difference between Ŵ_ℓ^{β,j} from the BNNbase and TLNNℓ for ℓ = 2 and 10. As demonstrated in Fig. S3, in each case and each layer, there are a few kernels with substantial changes, much larger than the changes in the rest of the 64² kernels.
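Both kernel-level analyses described in this subsection (clustering the BNNbase kernels' spectra, and ranking kernels by how much TL changes them) only require zero-padding each 5 × 5 kernel to the 128 × 128 input size and taking its 2D FFT (Eqs. 5–6). A minimal NumPy sketch is given below; the array shapes, file names, and the choice of eight clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

N = 128  # kernels are zero-padded to the full 128 x 128 input domain

def padded_fft(W):
    """2D FFT of each 5x5 kernel zero-padded to N x N. W: (64, 64, 5, 5)."""
    padded = np.zeros(W.shape[:2] + (N, N))
    padded[..., :5, :5] = W
    return np.fft.fft2(padded)                     # complex, shape (64, 64, N, N)

W_bnn = np.load("W_layer2_bnn.npy")                # hypothetical file names
W_tlnn = np.load("W_layer2_tlnn.npy")
F_bnn, F_tlnn = padded_fft(W_bnn), padded_fft(W_tlnn)

# (i) Cluster the BNN kernels' spectra (cf. Fig. S2) to reveal the low-pass,
#     high-pass, band-pass, and Gabor-like filter families.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    np.abs(F_bnn).reshape(-1, N * N))

# (ii) Rank kernels by the Frobenius norm of the change in their spectra
#      between BNN_base and the TLNN (cf. Figs. S3 and 4).
change = np.linalg.norm((F_tlnn - F_bnn).reshape(-1, N * N), axis=1)
most_changed = np.argsort(change)[::-1][:4]        # the four largest changes
```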
Fig. 3. The top row shows a schematic of the CNN architecture and its governing equations. Examples of activations g_ℓ^j ∈ ℝ^{128×128} of some of the layers ℓ and channels j are shown as red shading (with σ being the ReLU nonlinear function, the values of these activations are all positive). Note that training a CNN means learning the convolution kernels' weight matrices W_ℓ^{β,j} ∈ ℝ^{5×5} and the biases' constant matrices b_ℓ^j ∈ ℝ^{128×128} (for hidden layers ℓ = 2…10, β ∈ {1, 2…64} and j ∈ {1, 2…64}). See Materials and methods for a detailed discussion of the CNN and its mathematical representation. In the bottom row, the effects of re-training layer 2 versus layer 10 on the Fourier spectrum of the averaged activation of the last hidden layer (ℓ = 10) are compared (note that the output layer ℓ = 11 has a linear activation function). The averaging is done over all channels, denoted by 〈·〉. Shading shows uncertainty, estimated as the 25th–75th percentiles of the averaged activation spectra computed with 20 random input samples.
Fig. 4 shows the spectra of the four most-changed kernels (due to TL) in layers 2 and 10 from BNNbase and TLNNℓ. We see that in all three cases, re-training layer 2 converts a few relatively inactive kernels into clear low-pass filters (one exception is the 4th most-changed kernel in Case 1, discussed later). In contrast, re-training layer 10 turns inactive or complex filters into other complex (often less coherent) filters, though some of them can be identified as band- or high-pass filters. The two panels on the right further show that the kernels learned in TL act as their spectra suggest: the new low-pass filter learned from re-training layer 2 produces an activation g_2^j that is different from that of the BNNbase (for the same input u) only in the large scales, while the most-changed kernel from re-training layer 10 (a high-pass filter) produces an activation g_10^j that is different from that of the BNNbase mainly in the small scales.

We remind the reader of the earlier discussion in this section: TL needs to capture changes in the large scales of the output Π between the base and target systems, and the inability of the re-trained layer 10 to do so is the reason for the ineffectiveness of TLNN10. Based on the above analysis, we can now explain the reason for this ineffectiveness (and the effectiveness of layer 2): layer 10 fails to learn new low-pass filters, which are essential for capturing changes in the large scales, especially at the end of the network right before the linear output layer. In contrast, layer 2 is capable of learning new low-pass filters to capture these changes in the large scales of the base and target systems' outputs. Admittedly, the nonlinearity and the subsequent layers after ℓ = 2 could impact the outcome of a low-pass filter, but it is possible to separate out the impact of the nonlinearity. Fig. 4 and Fig. S1 show the impact of the ReLU nonlinearity by comparing the spectrum of the activation before and after ReLU is applied. In Case 1, where the ReLU function plays an important role in changing the activations' spectra after TL, we find that in addition to low-pass filters, TLNN2 also learns more complex filters, such as the 4th most-changed kernel in Fig. 4, that impact the sign of the linear activations, h_2^j.

The analyses presented so far provide answers to objectives 1–2 from the Introduction. To address objective 3 (develop a general framework to guide TL), we need to understand why layer 10 cannot learn the filters needed for TL in these cases while layer 2 can. This question is investigated next by leveraging recently developed ideas in theoretical ML.

Loss landscapes: sensitivity of kernels to perturbations and re-training data
So far, we have presented post-hoc analyses, investigating changes in the spectra of activations and weights, as well as the learned physics, after a BNNbase has been re-trained to obtain a TLNN.
Fig. 4. The three left columns compare the Fourier spectra |Ŵ_ℓ^{β,j}| of the four convolution kernels that have changed the most between BNNbase and TLNN2 (top row) and TLNN10 (bottom row). The change in each kernel is quantified using the Frobenius norm ‖F(W_ℓ^{β,j}) − F(W̆_ℓ^{β,j})‖_F, where F indicates the Fourier transform (Eq. 5) and ˘· indicates that the weight matrix is from a TLNN (the absence of ˘· in this figure means that the matrix is from a BNNbase). The two panels on the right show examples of how changes in one kernel of layer 2 and one kernel of layer 10 affect the activations' spectra of layer 10 by comparing ĝ_10^j from BNNbase (solid blue) with that from the TLNNℓ (solid red). We also show the activations before the application of the ReLU nonlinearity σ with dashed lines. Note that the inputs to the networks (u) are the same and from the target system. The top panel shows that the newly learned kernel in layer 2 substantially changes the activation at low wavenumbers (k ≤ 20) without affecting the higher wavenumbers, as expected from a low-pass filter. Here, nonlinearity has little impact: the solid and dashed lines coincide. The bottom panel shows that the newly learned kernel in layer 10 only changes the activation at high wavenumbers and that in this case, the ReLU nonlinearity has a contribution.
Here, we present a non-intrusive method for gaining insight into which layers of a BNNbase are the best (or worst) to re-train for a given target system, before performing any actual re-training. This analysis exploits the concept of "loss landscapes" [43, 47, 48] and examines, for a given CNN input u, the sensitivity of the loss function L to perturbations of the weights (and biases) of the layer(s) to be re-trained. Training a deep CNN requires solving a high-dimensional non-convex optimization problem, for which the smoothness of the loss function can be a significant factor in the success of training. Previous studies [48, 43, 47, 49] show that even one- or two-dimensional approximations of the loss landscape can provide meaningful information about how easily a deep neural network, such as a CNN, can be trained. In this study, leveraging recent work in theoretical ML [43], we extend the application of loss landscape analysis to studying TL; see Materials and methods for more details and discussions about computing the loss landscapes.

Fig. 5 (rows 1 and 2) shows the loss landscape calculated for perturbations along two random directions in the parameter space of shallow or deep layers for the BNNbase, with data from the target system as the input. Fig. S4 presents the loss landscapes obtained using a second method (based on perturbations along the eigenvectors of the Hessian of the loss). These loss landscapes provide insight into whether a layer is receptive to change when re-trained with new data during TL. Two important characteristics of these landscapes are their convexity and their magnitude. Notably, the landscapes in row 1 (re-training layer 2, or 2 and 5) are both smooth and of much lower magnitude than those in row 2 (deep layers). For Case 1, we show results for combinations of two layers as this yields better performance than re-training a single layer, and this also demonstrates that the method is robust beyond perturbations of individual layers. This analysis indicates that these shallow BNNbase layers are easier to re-train for these target systems' data, and that the loss function will likely reach a better optimum during TL. This loss landscape analysis is consistent with our previous findings of TLNN2's ability (and TLNN10's inability) to perform well in these TL tasks.
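For concreteness, the first (random-directions) loss-landscape method of Eq. 9 can be sketched in a few lines, assuming a PyTorch model whose re-trained layer is accessible as a module. The normalization of the directions by the 2-norm of θ* and the δ range of [−2, 2] follow Materials and methods; everything else (object names, data shapes) is an illustrative assumption.

```python
import torch

def loss_landscape(model, layer, u_target, pi_target, deltas, seed=0):
    """L(delta1, delta2) of Eq. 9: perturb only `layer`'s weights/biases along
    two random directions, each normalized by ||theta*||_2, and evaluate the
    MSE loss on target-system data."""
    torch.manual_seed(seed)
    theta = [p.detach().clone() for p in layer.parameters()]
    theta_norm = torch.sqrt(sum((p**2).sum() for p in theta))

    def direction():
        v = [torch.randn_like(p) for p in theta]
        v_norm = torch.sqrt(sum((x**2).sum() for x in v))
        return [x * (theta_norm / v_norm) for x in v]

    v1, v2 = direction(), direction()
    mse = torch.nn.MSELoss()
    surface = torch.zeros(len(deltas), len(deltas))
    with torch.no_grad():
        for i, d1 in enumerate(deltas):
            for j, d2 in enumerate(deltas):
                for p, p0, a, b in zip(layer.parameters(), theta, v1, v2):
                    p.copy_(p0 + d1 * a + d2 * b)
                surface[i, j] = mse(model(u_target), pi_target)
        for p, p0 in zip(layer.parameters(), theta):   # restore theta*
            p.copy_(p0)
    return surface

# e.g. surface = loss_landscape(bnn_base, bnn_base.hidden[2], u_target, pi_target,
#                               deltas=torch.linspace(-2.0, 2.0, 21))
```

Smooth, low-magnitude surfaces (as obtained here for the shallow layers) indicate layers that are receptive to re-training; sharp non-convexities or very large values flag layers to avoid.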
Fig. 5. The top two rows present the loss landscape L(δ1 ,δ2 ) computed from Eq. 9. In row 1, the weights and biases of layers 2 and 5 (Case 1) or 2 (Cases 2 and
3) from the BNNbase are perturbed in two random directions by amplitudes δ1 and δ2; see Materials and methods for details. Similarly, in row 2, the deepest
layers are perturbed. Row 3 shows the convergence of the training loss when individual shallow, middle, and deep layers are re-trained for TL. In all
calculations, the inputs are from the target system.
Additionally, Fig. 5 (bottom row) shows how quickly the loss decreases as a function of the number of epochs during re-training of layer 2, 6, or 10 of the BNNbase using the target system's data. For all three cases, TLNN2 converges the fastest. This is a direct consequence of the structure of the loss landscapes shown in rows 1 and 2 of Fig. 5: landscapes obtained from perturbing layer 2 are more favorable for convergence (an absence of pathological non-convexities) compared to the landscapes obtained from perturbing layer 10.

As a final note, we point out that the concept of "spectral bias" [50, 51] from theoretical ML suggests that layer 2, which converges faster, is learning the large scales, while the slow-converging layer 10 is learning the small scales. This is consistent with the conclusions of our earlier analyses of the weights' spectra.

Discussion
In Section "2D turbulence: DNS and LES", we present a number of novel analysis steps, ranging from a) the most intrusive, computationally expensive ones, to gain insight into the learned physics, to b) non-intrusive, inexpensive analyses, which can effectively guide TL for any new problem. For (a), we examine the BNNs' and TLNNs' activations and weights (done after re-training), revealing that the newly learned kernels are meaningful spectral filters, consistent with the physics of the base and target systems and their difference in the spectral space. To the best of our knowledge, this is the first full interpretation of CNNs' kernels in an application for turbulence or weather/climate modeling. For (b), we introduce a novel use of loss landscapes, shedding light on which layers are most receptive to learning the new filters in re-training.

These steps connect the spectral analysis of turbulent flows^d and of CNNs, and further connect them to the most recent advances in analyzing deep NNs. The above analyses show that the shallowest layers are the best to re-train here, and shed light on the learned physics and the inner workings of TL for these three test cases. Admittedly, some or all of these findings, in terms of the learned physics and the best layer(s) to re-train, are likely specific to these three cases, our specific NN architecture, and the SGS modeling application.
Fig. 6. Overview of the framework for guiding and explaining TL onto a new target system. The top row shows the steps of the TL process: acquiring a
large amount of training data from the base system and a small amount from the target system, training a BNNbase using data from the base system, and
re-training it using data from the target system to obtain a TLNN. On the bottom, we present the analyses involved in this framework, listed (left to right)
in the order of when they should be used. The arrows indicate what is needed from each step of the TL process and the corresponding analyses. Here, the
blue line represents data from the target system, the red line represents the trained BNNbase, and the orange line represents the re-trained TLNN.
However, the analysis methods we introduce or employ are all general and can be used for any base–target systems, applications (SGS modeling, data-driven forecasting, or blending training sets), and most CNN architectures.^e Therefore, putting all these analysis steps together, below we propose a general framework for guiding and explaining TL, which we expect to benefit a broad range of applications involving multi-scale, nonlinear dynamical systems.

The framework is shown schematically in Fig. 6. Assuming that we have a large number of training samples from the base system, an accurate BNNbase already trained on these samples, and a small number of re-training samples from the target system, the framework involves the following steps:

1. Compare the spectra of the input and output variables from the base and target systems. The three cases studied here have shown that the change of spatial scales between the base and target systems, particularly in the output variables, significantly impacts which layers are optimal for re-training.
2. Compute the loss landscapes of the BNNbase with the target systems' data as various combinations of layers are chosen for re-training. Re-training layer(s) with favorable landscapes (smooth and small magnitudes) should be the first choices for TL. We further suggest examining the properly clustered weights' spectra of the BNNbase to see if they have clear interpretations as spectral filters.
3. Re-train a TLNN based on the outcome of Step 2. Examine the spectra of the activations from the re-trained layer(s) and the last hidden layer to see if the differences in the spatial scales identified in Step 1 are learned.
4. Examine the spectra of the most-changed kernels between BNNbase and TLNN. Investigate whether the nature of the newly learned kernels (as spectral filters) is consistent with the outcome of Steps 1 and 3 in terms of the spatial scales that need to be learned in TL.

Steps 1–2 are non-intrusive, inexpensive analyses that do not require any re-training, and will effectively guide Step 3, replacing expensive and time-consuming trial-and-error with many combinations of re-training layers. Steps 3–4 provide an explanation for what is learned in TL and act to validate decisions made based on Steps 1–2.

There are a few points about this framework that need to be further clarified. In general, turbulent flows have universal behavior in their smallest scales [52, 53] and vary in large scales due to forcing and geometry. This might seem to suggest that TL will always need to learn changes in large scales between a base and a target turbulent flow. This is not necessarily true, as even in Cases 1–2 here, in which the base and target flows differ in forcing and Re number, there are differences in the small scales of Π too. Furthermore, in the broader applications of TL (e.g. in blending different datasets) and beyond just single-physics turbulent flows, there might be differences between the base and target systems at any scale. Step 1 is intended to identify these differences.

We also emphasize that currently there is no complete theoretical understanding of which layers of a CNN are better at learning which spatial scales. Our findings for Cases 1–3 and some other studies [43, 50] in the ML community suggest that the shallower layers are better at learning large scales. If further work confirms this behavior for a variety of systems and CNN architectures, then Steps 1–2 together would be able to even better guide TL in terms of the best layer(s) to re-train.

It should be noted that in more complex, anisotropic, inhomogeneous systems (e.g. channel flows or ocean circulations), spectral analysis using other basis functions, such as Chebyshev or wavelets [54, 55], might be needed. Moreover, additional modifications of the spectral analysis component of the framework might be needed for some types of NN architectures, e.g. those involving pooling layers, fully connected layers, or other activation functions.
Recent work in the ML literature on spectral analysis of NNs, particularly on developing end-to-end analyses, could be leveraged in addressing these challenges [51, 56].

Aside from items (1)–(3) in the Introduction addressed in this study, another major question about TL is how much re-training data are needed to achieve a certain level of out-of-sample accuracy for the target system. Currently, there is no theoretical framework to answer this question, particularly for data from dynamical systems such as turbulent flows or the climate system. However, a few recent developments in the ML literature for TL error bounds of simple NNs (e.g. shallow or linear) could be leveraged as the starting point [57–59], and combined with extensive empirical explorations, may provide some insight into this critical question.

Finally, we point out that a number of recent studies have proposed improving out-of-distribution generalization via incorporating physics constraints into NNs (e.g. [60, 61]) or via data augmentation (e.g. [62, 63, 64]). The latter approach has shown promising results in image classification tasks, and could potentially be used in applications involving dynamical systems too. Incorporating physics has also shown promising results for specific applications; however, such an approach requires the existence of a physical constraint that is universal (e.g. a scaling law); otherwise, it could potentially deteriorate the performance of the NN. However, the availability of such constraints is very limited. In contrast, TL provides a flexible framework that, beyond improving out-of-distribution generalization, is also broadly useful to blend disparate datasets for training, an important application on its own. Note that the aforementioned approaches can be combined with TL to possibly reduce the amount of re-training data.

To summarize, here we have presented the first full explanation of the physics learned in TL for multi-scale, nonlinear dynamical systems, and a novel general framework to guide and explain TL for such systems. This framework will benefit a broad range of applications in areas such as turbulence modeling and weather/climate prediction. Climate change modeling, which deals with an inherently non-stationary system and also involves combining various observational and model datasets, is an application that particularly needs TL, and can benefit from the framework proposed here.

Materials and methods

Numerical solvers for DNS and LES
We have performed DNS for all six systems used in this study (see Table 1 and below). In DNS, Eqs. 1a–1b are solved using a Fourier–Fourier pseudo-spectral solver with NDNS collocation grid points and second-order Adams–Bashforth and Crank–Nicolson time-integration schemes, with time step Δt_DNS, for the advection and viscous terms, respectively. See Guan et al. [21, 39] for more details on the solvers and these simulations. For the base system in Case 1 (decaying 2D turbulence), following earlier studies [40, 21], the flow is initialized randomly using a vorticity field (ω_ic) with a prescribed power spectrum. Snapshots of (ω, ψ) in this system are obtained from 50–200τ, where τ is the initial eddy-turn-over time: τ = 1/max(ω_ic). For the other five systems (forced 2D turbulence), once the randomly initialized flow reaches statistical equilibrium after a long-term spin-up, we take sequential snapshots of (ω, ψ) that are 1000 Δt_DNS apart, in order to reduce the correlation between samples. We use the filtered and coarse-grained DNS data, referred to as FDNS data (details below), for training the CNN-based data-driven closures for Π and for testing their a priori (offline) and a posteriori (online) performance.

For LES, we solve Eqs. 2–3 employing the same numerical solver used for DNS, but with coarser grid resolutions (N_LES = 128 < N_DNS) and larger time steps (Δt_LES = 10 Δt_DNS). To represent Π, a CNN-based closure that is trained on FDNS data is coupled to the LES solver.

Filtering and coarse-graining: LES equations and FDNS data
Filtering Eqs. 1a–1b yields the governing equations for LES [39, 53, 65]:

∂ω̄/∂t + N(ω̄, ψ̄) = (1/Re) ∇²ω̄ − f̄ − rω̄ + [ N(ω̄, ψ̄) − \overline{N(ω, ψ)} ],   (2)

∇²ψ̄ = −ω̄,   (3)

where the bracketed term in Eq. 2 is the SGS term Π and the overbar denotes filtering and coarse-graining. In LES, only the large-scale structures (ψ̄ and ω̄) are resolved using a coarser grid resolution (compared to DNS). The effects of the structures smaller than the grid spacing are included in the unclosed SGS term Π, which requires a closure in terms of the resolved flow, (ψ̄, ω̄).

To obtain the FDNS data, we use the DNS snapshots of (ψ, ω), which are of size N_DNS × N_DNS, to compute snapshots of ψ̄, ω̄, and Π (defined in Eq. 2), where the overbar represents filtering and coarse-graining. The latter is needed to compute these variables on the LES grid (size: N_LES × N_LES). Here, we use a Gaussian filter and then sharp spectral cutoff coarse-graining [21, 39]. For each system, the FDNS dataset is divided into completely independent training, validation, and testing sets [21, 39].

Cases 1–3: base and target systems
By changing Re, r, mf, and nf, we have created six distinct systems of 2D turbulence, which are grouped into three cases, each with a base and a target system (Table 1). Snapshots of ω and Π, as well as the spectra of ω̄, Π, and KE of these systems, are shown in Fig. 1 to demonstrate the rich variety of fluid flow characteristics among these systems, particularly between each case's base and target systems. Case 1 involves TL from decaying to forced 2D turbulence. From the ω and Π snapshots as well as their spectra shown in Fig. 1, it is clear that the two systems are different at both the large and small scales. The significant differences across all scales make this case the most challenging one, and result in the largest generalization gap, as discussed in the main text.

Case 2 involves TL between two forced 2D turbulence systems: the base system has Re = 10^3 and the target system has a 100× higher Reynolds number (Re = 10^5), making this the largest extrapolation in Re using TL ever reported, to the best of our knowledge. The increase in Re adds more small-scale features in ω̄ (see the spectrum), and changes the spectrum of Π in both large and small scales. Case 3 involves decreasing the forcing wavenumbers of the system. Here, the base system has mf = nf = 25 while the target system has mf = nf = 4. This decrease in forcing wavenumbers, as expected, results in more (less) large-scale (small-scale) structures in the resolved flow; see the spectra of ω̄ and KE. Furthermore, more large-scale structures appear in Π without any noticeable change in the small-scale structures (see the power spectrum of Π).
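The filtering and coarse-graining operations described above (a Gaussian filter followed by a sharp spectral cutoff onto the 128 × 128 LES grid) and the resulting SGS term Π of Eq. 2 can be sketched with NumPy FFTs as below. This is only an illustration for a 2π-periodic square domain: the Gaussian transfer function exp(−k²Δ²/24) and the filter width Δ are common LES conventions assumed here, not necessarily the exact choices of [21, 39].

```python
import numpy as np

def wavenumbers(n):
    """Integer wavenumber grids (kx, ky) for an n x n, 2*pi-periodic domain."""
    k = np.fft.fftfreq(n, d=1.0 / n)
    return np.meshgrid(k, k, indexing="ij")

def filter_coarse_grain(f, n_les=128, delta=2 * np.pi / 128):
    """Gaussian filter (width delta) then sharp spectral cutoff onto n_les x n_les."""
    n = f.shape[0]
    kx, ky = wavenumbers(n)
    fhat = np.fft.fft2(f) * np.exp(-(kx**2 + ky**2) * delta**2 / 24.0)
    c = n_les // 2                                  # retain the lowest wavenumbers
    fhat_c = np.zeros((n_les, n_les), dtype=complex)
    fhat_c[:c, :c], fhat_c[:c, -c:] = fhat[:c, :c], fhat[:c, -c:]
    fhat_c[-c:, :c], fhat_c[-c:, -c:] = fhat[-c:, :c], fhat[-c:, -c:]
    return np.real(np.fft.ifft2(fhat_c)) * (n_les / n) ** 2

def jacobian(omega, psi):
    """N(omega, psi) = psi_y * omega_x - psi_x * omega_y via spectral derivatives."""
    kx, ky = wavenumbers(omega.shape[0])
    d = lambda f, k: np.real(np.fft.ifft2(1j * k * np.fft.fft2(f)))
    return d(psi, ky) * d(omega, kx) - d(psi, kx) * d(omega, ky)

def fdns_sample(omega_dns, n_les=128):
    """Return (Pi, omega_bar, psi_bar) on the LES grid from one DNS snapshot."""
    kx, ky = wavenumbers(omega_dns.shape[0])
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                  # avoid division by zero
    psi_dns = np.real(np.fft.ifft2(np.fft.fft2(omega_dns) / k2))  # solves lap(psi) = -omega
    omega_bar = filter_coarse_grain(omega_dns, n_les)
    psi_bar = filter_coarse_grain(psi_dns, n_les)
    pi = jacobian(omega_bar, psi_bar) - filter_coarse_grain(
        jacobian(omega_dns, psi_dns), n_les)
    return pi, omega_bar, psi_bar
```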
Convolutional neural network and transfer learning
Building on the success of our earlier work [21, 39], to develop non-local data-driven SGS closures for each system, we train a CNN with input u = (ω̄(x, y), ψ̄(x, y)) to predict Π(x, y) (output). These CNNs are built entirely from 11 sequential convolution layers, 9 of which are hidden layers, each with 64² kernels of size 5 × 5 (note that these numbers are hyperparameters that have been optimized for this application to avoid underfitting or overfitting [21, 39]). The outputs of a convolutional layer are called activations. For channel j of layer ℓ, the equation for the activation g_ℓ^j ∈ ℝ^{N_LES×N_LES} is:

g_ℓ^j(u) = σ( Σ_β W_ℓ^{β,j} ⊛ g_{ℓ−1}^β(u) + b_ℓ^j ).   (4)

Note that N_LES = 128 for all systems (Table 1). Here, ⊛ represents spatial convolution and σ(·) = max(0, ·) is the ReLU activation function (which is not present for the linear output layer, ℓ = 11). W_ℓ^{β,j} ∈ ℝ^{5×5} is the weight matrix of a convolution kernel, and b_ℓ^j ∈ ℝ^{128×128} is the regression bias, a constant matrix. We have β ∈ {1, 2…64} and j ∈ {1, 2…64} for all layers, with two exceptions: in the input layer (ℓ = 1), β ∈ {1, 2}, and in the output layer (ℓ = 11), j = 1, as the output is a single channel. The kernels' weights and biases together constitute the NN's trainable parameters, which we collectively refer to as θ ∈ ℝ^p. Note that g_in = g_0 = u and g_out = g_11 = Π. A visualization of these networks as well as examples of activations in the hidden layers are presented in Fig. 3. An important distinction between these CNNs and traditional CNNs is that these do not include any max-pooling layers or dense layers, such that they maintain the dimension of the input through all layers and channels in the network. Our earlier work and a few other studies have found such an architecture to lead to more accurate CNNs for SGS closures [21, 39, 66].

We train these CNNs using the Adam optimizer and a mean-squared-error (MSE) loss function L. For BNNs, all their trainable parameters θ are randomly initialized, and each CNN is trained for 100 epochs using Mtr = 2000 samples from the training set of the base system.^f Note that even when we use Mtr samples from the training set of the target system to train a CNN, we still call it a "BNN" for convenience (e.g. in Fig. 1). Subscripts on BNNs clearly indicate which system provided the Mtr training samples.

To appropriately train and evaluate the networks, for each of the six systems, we have created three independent training, validation, and testing sets from a long DNS dataset. To ensure independence, these subsets are chosen far apart, and pattern correlations between u and between Π of samples are computed and found negligible. The training set is reserved solely for the actual training procedure, and the only metric calculated with this set is the MSE loss (during training) to assess the convergence of the network parameters, θ. The validation set is used to assess both convergence and overfitting during training: alongside the training set, we compute the MSE loss on the validation set after each epoch to ensure that the network's performance is continuing to improve out-of-sample rather than overfitting. The testing set is used to evaluate the CNNs' a priori performance reported in Figs. 2–4. Furthermore, note that the FDNS data used in Figs. 1 and 2 are from the testing set of the corresponding system. No data from LES have been used during the training of any CNN.

To perform TL from a BNN, the weights and biases of the TLNN are initialized with those of the BNN. The layers to re-train are selected (trainable layers) and the remaining weights/biases are frozen (non-trainable layers). The TLNN is then re-trained using standard backpropagation and the same MSE loss function with Mtr/10 samples from the training set of the target system, updating the weights and biases of the trainable layers. The re-training continues until the loss plateaus (for TL, this happens at around 50 epochs), which helps avoid overfitting. Note that based on offline metrics such as the correlation coefficients for Π, we have not found any need for adjusting the hyperparameters, such as the learning rate, or adding additional layers between training a BNN and TLNN.

Spectral analysis of CNNs
The Fourier transform operator F is defined as

·̂ = F(·),   F : ℝ^{128×128} → ℂ^{128×128}.   (5)

To represent convolution as an operation in the spectral space, we first note that we can extend each kernel W_ℓ^{β,j} ∈ ℝ^{5×5} to the full domain of the input by padding it with zeros, as done in practice for faster training [67], to obtain W̃_ℓ^{β,j} ∈ ℝ^{128×128}. Then, the convolution theorem yields

W_ℓ^{β,j} ⊛ g_{ℓ−1}^β = F⁻¹( Ŵ_ℓ^{β,j} ⊙ ĝ_{ℓ−1}^β ),   (6)

where ⊙ is element-wise multiplication.

Next, we define the linear activation h_ℓ^j, which contains all the linear operations in Eq. 4:

h_ℓ^j = Σ_β ( W_ℓ^{β,j} ⊛ g_{ℓ−1}^β ) + b_ℓ^j.   (7)

Despite the nonlinearity of Eq. 4 due to the ReLU function, its Fourier transform can be written analytically. Using Eqs. 6 and 7 and the linearity of the Fourier transform, we obtain

ĝ_ℓ^j = ( Σ_α e^{−i(k_x x_α + k_y y_α)} ) ⊛ ĥ_ℓ^j = ( Σ_α e^{−i(k_x x_α + k_y y_α)} ) ⊛ [ Σ_β ( Ŵ_ℓ^{β,j} ⊙ ĝ_{ℓ−1}^β ) + b̂_ℓ^j ],   (8)

where (x_α, y_α) ∈ {(x, y) | h_ℓ^j(x, y) > 0} and i = √−1. The term with the sum over α is a result of the ReLU function and involves summing over grid points where h_ℓ^j > 0 (note that this term is the Fourier transform of the Heaviside function). Also note that b_ℓ^j is a constant matrix; therefore, b̂_ℓ^j is only non-zero at k_x = k_y = 0 (and is real). See [50, 51, 56] for more information and discussion about Fourier analysis of NNs.

Equation 8 shows that the spectrum of ĝ_ℓ^j depends on the spectrum of ĝ_{ℓ−1}^β, the spectra of the weights Ŵ_ℓ^{β,j} (and the constant biases b̂_ℓ^j), and where h_ℓ^j > 0 in the physical (grid) space. With TL, the weights and biases are updated, which changes their spectra as well as where h_ℓ^j > 0. Understanding the full effects of all these changes on ĝ_ℓ^j is challenging. In Fig. S1, we have examined the spectra of the activations of layers 2 and 10 from BNNbase, TLNN2, and TLNN10 before and after applying the ReLU activation function (i.e., we compare the spectra of ĥ_ℓ^j and ĝ_ℓ^j). This analysis shows that in all three cases, linear changes due to updating ĥ_ℓ^j substantially alter the spectra of the activations, while nonlinear changes only play a significant role in Case 1. These results and Eq. 8 suggest that a deeper insight into TL might be obtained by investigating Ŵ_ℓ^{β,j} and how they change from BNNbase to TLNN.
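As a concrete (but hypothetical) reference implementation of the CNN and TL set-up described in the two preceding subsections, the sketch below builds an 11-layer, 5 × 5-kernel, pooling-free CNN in PyTorch and freezes all layers except the chosen ones before re-training. Details that the text leaves open, such as the padding mode (zero vs periodic), whether the input layer is followed by ReLU, and the learning rate in the comment, are assumptions here.

```python
import torch
import torch.nn as nn

class SGSCNN(nn.Module):
    """Sketch of the closure CNN described above: 11 sequential 5x5 convolutions
    (2 -> 64 -> ... -> 64 -> 1 channels), ReLU after every layer except the
    linear output layer, and no pooling or dense layers, so the 128 x 128 input
    dimension is preserved throughout."""
    def __init__(self, channels=64, n_hidden=9):
        super().__init__()
        layers = [nn.Conv2d(2, channels, 5, padding=2), nn.ReLU()]       # layer 1
        for _ in range(n_hidden):                                        # layers 2-10
            layers += [nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 5, padding=2)]                 # layer 11 (linear)
        self.net = nn.Sequential(*layers)

    def forward(self, u):            # u: (batch, 2, 128, 128) = (omega_bar, psi_bar)
        return self.net(u)           # Pi: (batch, 1, 128, 128)

def make_tlnn(bnn, layers_to_retrain=(2,)):
    """Initialize a TLNN from the BNN and freeze every convolution layer except
    the chosen ones (layer indices 1-11, as in the text)."""
    tlnn = SGSCNN()
    tlnn.load_state_dict(bnn.state_dict())
    convs = [m for m in tlnn.net if isinstance(m, nn.Conv2d)]
    for i, conv in enumerate(convs, start=1):
        for p in conv.parameters():
            p.requires_grad = i in layers_to_retrain
    return tlnn

# Re-training uses the same Adam + MSE setup, with Mtr/10 target-system samples:
# tlnn = make_tlnn(bnn_base, layers_to_retrain=(2,))
# opt = torch.optim.Adam((p for p in tlnn.parameters() if p.requires_grad), lr=1e-4)
```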
Calculating the loss landscape samples are chosen far apart to be weakly correlated, requiring
Let us represent a CNN with input u and trainable parameters θ as a long DNS dataset (two million ΔtDNS ). See Guan et al. [39] for fur
C(u, θ). The MSE loss function of this CNN is a function of the out ther discussions about the big versus small training sets.
put: L(C). The concept of loss landscape (of L) has received much
attention in recent years and is widely used to study the training Acknowledgments
phase of NNs [48, 47, 49]. Below, leveraging recent work in theoret We thank Fabrizio Falasca and Laure Zanna for insightful discus
Downloaded from [Link] by Lib4RI - Library of Eawag, Empa, PSI, WSL user on 25 June 2025
ical ML [43], we compute the loss landscape to study the re-training sions. We are grateful to three anonymous reviewers for helpful
phase of NNs in order to gain insight into TL. comments and suggestions.
Suppose that θℓ ∈ R p are all the trainable parameters of a
∗
BNNbase from all layers ℓ. We define θ∗L ∈ R p as the subset of pa
rameters that are updated in TL, i.e. the weights and biases of Supplementary material
the re-trained layer(s), L. Next, we follow two methodologies for
Supplementary material is available at PNAS Nexus online.
constructing loss landscapes. In the first method, we follow Li
et al. [48] and select two random direction vectors v1 , v2 ∈ R p∗
and normalize them with the 2-norm of θ*. In the second method, Funding
we follow Yao et al. [68] and find the eigenvectors of the Hessian of
L(C) computed with respect to θ*L. The first two eigenvectors with This work was supported by an award from the ONR Young
largest positive eigenvalues are chosen as v1 and v2. Investigator Program (N00014-20-1-2722), a grant from the NSF
Next, in both methods, we perturb θ* along directions v1 and v2 CSSI program (OAC-2005123), and by the generosity of Eric and
by amplitudes δ1 and δ2, respectively (δ1, δ2 ∈ [ − 2, 2] for method 1, Wendy Schmidt by recommendation of the Schmidt Futures pro
[ − 1, 1] for method 2). Finally, we compute gram. The authors also benefited form discussions at the KITP
Program “Machine Learning and the Physics of Climate” sup
L(δ1 ,δ2 ) = L(C(utarget , [θℓ≠L θ∗L + δ1 v1 + δ2 v2 ])) (9) ported by NSF grant PHY-1748958. Computational resources
were provided by NSF XSEDE (allocation ATM170020) and
to generate a 2D approximation of the loss landscape and plot the
NCAR’s CISL (allocation URIC0004).
surface as a function of δ1 and δ2. Note that the input u is from the
target system. Loss landscapes from the first (second) method are
shown in Fig. 5 (Fig. S4).
In the context of TL, the shape of the loss landscape indicates how receptive the re-trained layers, L, are to change given the new re-training samples from the target system. In practice, a shallow, convex landscape suggests that the network is in a favorable region of parameter space, where gradient descent will converge easily. Deviations from this, in the form of pathological non-convexities or extremely large loss magnitudes, can cause problems during training and prevent the network from converging to a useful optimum. See Li et al. [48] and Krishnapriyan et al. [47] for further discussions on the interpretation of loss landscapes in the common setting where, in Eq. 9, u is from the base system and θ* represents parameters that are still changing during the epochs of training.
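Method 2 replaces the random directions with the two leading Hessian eigenvectors of the loss with respect to θ*_L. The paper follows the PyHessian approach of Yao et al. [68]; the sketch below instead approximates the same quantity with explicit Hessian-vector products and deflated power iteration, again using the placeholder names `model`, `layer`, `u_target`, and `Pi_target`. It illustrates the idea rather than reproducing the authors' implementation.

```python
# Minimal sketch of method 2: estimate the two leading Hessian eigenvectors of
# the loss with respect to the re-trained layer's parameters. The paper uses the
# PyHessian approach of Yao et al. [68]; here the same quantity is approximated
# with explicit Hessian-vector products and deflated power iteration.
import torch

def top_hessian_directions(model, layer, u_target, Pi_target, n_top=2, n_iter=50):
    criterion = torch.nn.MSELoss()
    params = [p for name, p in model.named_parameters() if name.startswith(layer)]
    loss = criterion(model(u_target), Pi_target)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(vecs):
        # Hessian-vector product via double backprop: d/dtheta (grad . v).
        dot = sum((g * v).sum() for g, v in zip(grads, vecs))
        return torch.autograd.grad(dot, params, retain_graph=True)

    eigvals, eigvecs = [], []
    for _ in range(n_top):
        v = [torch.randn_like(p) for p in params]
        for _ in range(n_iter):
            # Project out previously found eigenvectors (deflation).
            for u in eigvecs:
                coeff = sum((a * b).sum() for a, b in zip(v, u))
                v = [a - coeff * b for a, b in zip(v, u)]
            Hv = hvp(v)
            norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
            v = [h / norm for h in Hv]
        Hv = hvp(v)
        # Power iteration finds the largest-|eigenvalue| directions; near a
        # trained optimum these are typically the largest positive ones.
        eigvals.append(sum((a * b).sum() for a, b in zip(v, Hv)).item())
        eigvecs.append([a.detach() for a in v])
    return eigvals, eigvecs   # leading eigenvectors serve as v1, v2 in Eq. 9
```

The returned directions can then be passed to the grid evaluation above in place of the random v1 and v2, with δ1, δ2 ∈ [−1, 1].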
Notes

a. Throughout this paper, we use "out-of-distribution" to indicate cases in which the training and testing datasets have different distributions. Furthermore, we use "out-of-sample" for accuracy computed using samples from a testing set that is completely independent from the training set but has the same distribution.
b. Following the turbulence and climate literature [21, 65, 69], we use the terms "a posteriori" and "online" to refer to experiments/tests involving the data-driven closure coupled to the LES numerical solver; "a priori" and "offline" refer to experiments/tests involving the closure (e.g. the trained CNN) alone.
c. Whether the training FDNS data are from the base or target system or both is clearly explained for each analysis.
d. Spectral analysis has been the cornerstone of understanding turbulence physics since the pioneering work of Kolmogorov [52].
e. The weights' spectra analysis might have to be further modified for networks that involve dimension changes, e.g. via pooling layers. See the Discussions.
f. While M_tr = 2000 might seem like a small number of training samples, we are in fact here using a big training set, because these … further discussions about the big versus small training sets.

Acknowledgments

We thank Fabrizio Falasca and Laure Zanna for insightful discussions. We are grateful to three anonymous reviewers for helpful comments and suggestions.

Supplementary material

Supplementary material is available at PNAS Nexus online.

Funding

This work was supported by an award from the ONR Young Investigator Program (N00014-20-1-2722), a grant from the NSF CSSI program (OAC-2005123), and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. The authors also benefited from discussions at the KITP Program "Machine Learning and the Physics of Climate" supported by NSF grant PHY-1748958. Computational resources were provided by NSF XSEDE (allocation ATM170020) and NCAR's CISL (allocation URIC0004).

Authors Contributions

All authors designed the research and wrote the paper. A.S., A.C., and Y.G. contributed to the design of new analytic tools. A.S. performed the research and analyzed the data.

Data availability

The data used for this work are available at [Link] record/6621142. Codes used for transfer learning, testing, and analysis are available at [Link] TL_for_SGS_Models.

References

1 Beck A, Flad D, Munz CD. 2019. Deep neural networks for data-driven LES closure models. J Comput Phys. 398:108910.
2 Bolton T, Zanna L. 2019. Applications of deep learning to ocean data inference and subgrid parameterization. J Adv Model Earth Syst. 11(1):376–399.
3 Brenowitz ND, Bretherton CS. 2018. Prognostic validation of a neural network unified physics parameterization. Geophys Res Lett. 45(12):6289–6298.
4 Brunton SL, Noack BR, Koumoutsakos P. 2020. Machine learning for fluid mechanics. Annu Rev Fluid Mech. 52:477–508.
5 Ham Y-G, Kim J-H, Luo J-J. 2019. Deep learning for multi-year ENSO forecasts. Nature. 573(7775):568–572.
6 Han J, Jentzen A, Weinan E. 2018. Solving high-dimensional partial differential equations using deep learning. Proc Natl Acad Sci USA. 115(34):8505–8510.
7 Kochkov D, et al. 2021. Machine learning–accelerated computational fluid dynamics. Proc Natl Acad Sci USA. 118(21):e2101784118.
8 Novati G, de Laroussilhe HL, Koumoutsakos P. 2021. Automating turbulence modelling by multi-agent reinforcement learning. Nat Mach Intell. 3(1):87–96.
9 Pathak J, et al. 2022. FourCastNet: a global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, arXiv:2202.11214, preprint: not peer reviewed.
10 Raissi M, Perdikaris P, Karniadakis GE. 2019. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J Comput Phys. 378:686–707.
11 Rasp S, Pritchard MS, Gentine P. 2018. Deep learning to represent subgrid processes in climate models. Proc Natl Acad Sci USA. 115(39):9684–9689.
12 Schneider T, Lan S, Stuart A, Teixeira J. 2017. Earth system modeling 2.0: a blueprint for models that learn from observations and targeted high-resolution simulations. Geophys Res Lett. 44(24):12–396.
13 Weyn JA, Durran DR, Caruana R. 2020. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. J Adv Model Earth Syst. 12(9):e2020MS002109.
14 Yuval J, O'Gorman PA. 2020. Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions. Nat Commun. 11(1):1–10.
15 Nagarajan V, Andreassen A, Neyshabur B. 2020. Understanding the failure modes of out-of-distribution generalization. arXiv, arXiv:2010.15775, preprint: not peer reviewed.
16 Yosinski J, Clune J, Bengio Y, Lipson H. 2014. How transferable are features in deep neural networks? arXiv, arXiv:1411.1792, preprint: not peer reviewed.
17 Beucler T, et al. 2021. Enforcing analytic constraints in neural networks emulating physical systems. Phys Rev Lett. 126(9):098302.
18 Chattopadhyay A, Subel A, Hassanzadeh P. 2020. Data-driven super-parameterization using deep learning: experimentation with multiscale Lorenz 96 systems and transfer learning. J Adv Model Earth Syst. 12(11):e2020MS002084.
19 Chung WT, Mishra AA, Ihme M. 2021. Interpretable data-driven methods for subgrid-scale closure in LES for transcritical LOX/GCH4 combustion. Combust Flame. 239:111758.
20 Frezat H, Balarac G, Le Sommer J, Fablet R, Lguensat R. 2021. Physical invariance in neural networks for subgrid-scale scalar flux modeling. Phys Rev Fluids. 6(2):024607.
21 Guan Y, Chattopadhyay A, Subel A, Hassanzadeh P. 2022. Stable a posteriori LES of 2D turbulence using convolutional neural networks: backscattering analysis and generalization to higher Re via transfer learning. J Comput Phys. 458:111090.
22 Subel A, Chattopadhyay A, Guan Y, Hassanzadeh P. 2021. Data-driven subgrid-scale modeling of forced Burgers turbulence using deep learning with generalization to higher Reynolds numbers via transfer learning. Phys Fluids. 33(3):031702.
23 Taghizadeh S, Witherden FD, Girimaji SS. 2020. Turbulence closure modeling with data-driven techniques: physical compatibility and consistency considerations. New J Phys. 22(9):093023.
24 Tan C, et al. 2018. A survey on deep transfer learning. In: International Conference on Artificial Neural Networks. Springer. p. 270–279.
25 Zhuang F, et al. 2020. A comprehensive survey on transfer learning. Proc IEEE. 109(1):43–76.
26 Goswami S, Kontolati K, Shields MD, Karniadakis GE. 2022. Deep transfer operator learning for partial differential equations under conditional shift. Nat Mach Intell. 4:1155–1164.
27 Guastoni L, et al. 2021. Convolutional-network models to predict wall-bounded turbulence from wall quantities. J Fluid Mech. 928:A27.
28 Inubushi M, Goto S. 2020. Transfer learning for nonlinear dynamics and its application to fluid turbulence. Phys Rev E. 102(4):043301.
29 Yousif MZ, Yu L, Lim H-C. 2021. High-fidelity reconstruction of turbulent flow from spatially limited data using enhanced super-resolution generative adversarial network. Phys Fluids. 33(12):125119.
30 Chattopadhyay A, Pathak J, Nabizadeh E, Bhimji W, Hassanzadeh P. 2022. Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence. Environ Data Sci. 2:E1.
31 Mondal S, Chattopadhyay A, Mukhopadhyay A, Ray A. 2021. Transfer learning of deep neural networks for predicting thermoacoustic instabilities in combustion systems. Energy and AI. 5:100085.
32 Rasp S, Thuerey N. 2021. Data-driven medium-range weather prediction with a ResNet pretrained on climate simulations: a new model for WeatherBench. J Adv Model Earth Syst. 13(2):e2020MS002405.
33 Hu J, et al. 2021. Deep residual convolutional neural network combining dropout and transfer learning for ENSO forecasting. Geophys Res Lett. 48(24):e2021GL093531.
34 Chakraborty S. 2021. Transfer learning based multi-fidelity physics informed deep neural network. J Comput Phys. 426:109942.
35 Karniadakis GE, et al. 2021. Physics-informed machine learning. Nat Rev Phys. 3(6):422–440.
36 Hussain M, Bird JJ, Faria DR. 2018. A study on CNN transfer learning for image classification. In: UK Workshop on Computational Intelligence. Springer. p. 191–202.
37 Talo M, Baran Baloglu U, Yıldırım Ö, Acharya UR. 2019. Application of deep transfer learning for automated brain abnormality classification using MR images. Cogn Syst Res. 54:176–188.
38 Zeiler MD, Fergus R. 2014. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer. p. 818–833.
39 Guan Y, Subel A, Chattopadhyay A, Hassanzadeh P. 2023. Learning physics-constrained subgrid-scale closures in the small-data regime for stable and accurate LES. Physica D. 443:133568.
40 Maulik R, San O, Rasheed A, Vedula P. 2019. Subgrid modelling for two-dimensional turbulence using neural networks. J Fluid Mech. 858:122–144.
41 Page J, Brenner MP, Kerswell RR. 2021. Revealing the state space of turbulence using machine learning. Phys Rev Fluids. 6(3):034402.
42 Pawar S, San O, Rasheed A, Vedula P. 2023. Frame invariant neural network closures for Kraichnan turbulence. Physica A Stat Mech Appl. 609:128327.
43 Neyshabur B, Sedghi H, Zhang C. 2021. What is being transferred in transfer learning? arXiv, arXiv:2008.11687, preprint: not peer reviewed.
44 Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. Cambridge (MA): MIT Press.
45 Olshausen BA, Field DJ. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 381(6583):607–609.
46 Chizat L, Oyallon E, Bach F. 2019. On lazy training in differentiable programming. Adv Neural Inf Process Syst. 32:2937–2947.
47 Krishnapriyan A, Gholami A, Zhe S, Kirby R, Mahoney MW. 2021. Characterizing possible failure modes in physics-informed neural networks. Adv Neural Inf Process Syst. 34.
48 Li H, Xu Z, Taylor G, Studer C, Goldstein T. 2018. Visualizing the loss landscape of neural nets. arXiv, arXiv:1712.09913, preprint: not peer reviewed.
49 Mojgani R, Balajewicz M, Hassanzadeh P. 2023. Kolmogorov n-width and Lagrangian physics-informed neural networks: a causality-conforming manifold for convection-dominated PDEs. Comput Methods Appl Mech Eng. 404:115810.
50 Rahaman N, et al. 2019. On the spectral bias of neural networks. In: International Conference on Machine Learning. PMLR. p. 5301–5310.
51 Xu ZQJ, Zhang Y, Luo T. 2022. Overview frequency principle/spectral bias in deep learning. arXiv, arXiv:2201.07395, preprint: not peer reviewed.
52 Kolmogorov A. 1941. The local structure of turbulence in incompressible viscous fluid for very large Reynolds numbers. Cr Acad Sci URSS. 30:301–305.
53 Pope SB. 2001. Turbulent flows. Cambridge: Cambridge University Press.
54 Bruna J, Zaremba W, Szlam A, LeCun Y. 2013. Spectral networks and locally connected networks on graphs. arXiv, arXiv:1312.6203, preprint: not peer reviewed.
55 Ha W, Singh C, Lanusse F, Upadhyayula S, Yu B. 2021. Adaptive wavelet distillation from neural networks through interpretations. Adv Neural Inf Process Syst. 34.
56 Xu ZQJ, Zhang Y, Luo T, Xiao Y, Ma Z. 2019. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv, arXiv:1901.06523, preprint: not peer reviewed.
57 Lampinen AK, Ganguli S. 2018. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv, arXiv:1809.10374, preprint: not peer reviewed.
58 Kalan MM, Fabian Z, Avestimehr S, Soltanolkotabi M. 2020. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Adv Neural Inf Process Syst. 33:1959–1969.
59 Wu X, Manton JH, Aickelin U, Zhu J. 2022. An information-theoretic analysis for transfer learning: error bounds and applications. arXiv, arXiv:2207.05377, preprint: not peer reviewed.
60 Beucler T, et al. 2021. Climate-invariant machine learning. arXiv, arXiv:2112.08440, preprint: not peer reviewed.
61 Kashinath K, et al. 2021. Physics-informed machine learning: case studies for weather and climate modelling. Philos Trans R Soc A. 379(2194):20200093.
62 Erichson NB, et al. 2022. NoisyMix: boosting robustness by combining data augmentations, stability training, and noise injections. arXiv, arXiv:2202.01263, preprint: not peer reviewed.
63 Salman H, Ilyas A, Engstrom L, Kapoor A, Madry A. 2020. Do adversarially robust ImageNet models transfer better? Adv Neural Inf Process Syst. 33:3533–3545.
64 Utrera F, Kravitz E, Erichson NB, Khanna R, Mahoney MW. 2020. Adversarially-trained deep nets transfer better: illustration on image classification. arXiv, arXiv:2007.05869, preprint: not peer reviewed.
65 Sagaut P. 2006. Large eddy simulation for incompressible flows: an introduction. New York: Springer Science & Business Media.
66 Zanna L, Bolton T. 2020. Data-driven equation discovery of ocean mesoscale closures. Geophys Res Lett. 47(17):e2020GL088376.
67 Mathieu M, Henaff M, LeCun Y. 2013. Fast training of convolutional networks through FFTs. arXiv, arXiv:1312.5851, preprint: not peer reviewed.
68 Yao Z, Gholami A, Keutzer K, Mahoney MW. 2020. PyHessian: neural networks through the lens of the Hessian. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE. p. 581–590.
69 Frezat H, Le Sommer J, Fablet R, Balarac G, Lguensat R. 2022. A posteriori learning for quasi-geostrophic turbulence parametrization. J Adv Model Earth Syst. 14:e2022MS003124.