Explaining the physics of transfer learning in data-driven turbulence modeling
[Link]
Advance access publication 23 January 2023
Research Report
Adam Subel^{a,1}, Yifei Guan^{a}, Ashesh Chattopadhyay^{a}, and Pedram Hassanzadeh^{a,b,*}
^{a}Department of Mechanical Engineering, Rice University, Houston, TX 77005, USA
^{b}Department of Earth, Environmental and Planetary Sciences, Rice University, Houston, TX 77005, USA
*To whom correspondence should be addressed: Email: pedram@[Link]
^{1}Present address: Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
Edited By: Yannis Yortsos
Abstract
Transfer learning (TL), which enables neural networks (NNs) to generalize out-of-distribution via targeted re-training, is becoming a
powerful tool in scientific machine learning (ML) applications such as weather/climate prediction and turbulence modeling. Effective
TL requires knowing (1) how to re-train NNs and (2) what physics are learned during TL. Here, we present novel analyses and a
framework addressing (1)–(2) for a broad range of multi-scale, nonlinear, dynamical systems. Our approach combines spectral (e.g.
Fourier) analyses of such systems with spectral analyses of convolutional NNs, revealing physical connections between the systems
and what the NN learns (a combination of low-, high-, band-pass filters and Gabor filters). Integrating these analyses, we introduce a
general framework that identifies the best re-training procedure for a given problem based on physics and NN theory. As a test case, we
explain the physics of TL in subgrid-scale modeling of several setups of 2D turbulence. Furthermore, these analyses show that in
these cases, the shallowest convolution layers are the best to re-train, which is consistent with our physics-guided framework but is
against the common wisdom guiding TL in the ML literature. Our work provides a new avenue for optimal and explainable TL, and a
step toward fully explainable NNs, for wide-ranging applications in science and engineering, such as climate change modeling.
Keywords: transfer learning, neural networks, subgrid-scale parameterization, turbulence modeling, climate modeling
Significance Statement
The use of deep neural networks (NNs) in critical applications such as weather/climate prediction and turbulence modeling is growing
rapidly. Transfer learning (TL) is a technique that enhances NNs’ capabilities, e.g. enabling them to extrapolate from one system to
another. This is crucial in applications such as climate change prediction, where the system substantially evolves in time. For effective
and reliable TL, we need to (a) understand the physics that is learned in TL and (b) have a framework guiding the TL procedure. Here, we
present novel analysis techniques and a general framework for (a)–(b) applicable to a broad range of multi-scale, nonlinear dynamical
systems. This is a major step toward developing interpretable and generalizable NNs for scientific machine learning.
[…] from a BNN that works with similar accuracy for a target system whose statistical properties could be different from those of the base system. For instance, this could be because of a change in physical properties (e.g. in the context of turbulence, an increase in Reynolds number, Re) or in external forcing (e.g. in the context of climate change, a higher radiative forcing due to increased greenhouse gases). We refer to this network as a TLNN. In TL, a (usually small) number of the layers of the BNN are re-trained, starting from their current weights, with a small number of re-training samples from the target system (e.g. Mtr/10 or Mtr/100 samples). The TL procedure, if properly formulated (as discussed later), can produce a TLNN whose out-of-sample accuracy for the target system is comparable to that of the BNN, despite using only a small amount of re-training data from the target system.

In thermo-fluid sciences and weather/climate modeling, a few studies have reported such success with TL for SGS closure modeling and spatio-temporal forecasting [18, 28, 22, 21, 27, 29, 26]. For example, in data-driven closure modeling with a convolutional NN (CNN) for large-eddy simulation (LES) of decaying 2D turbulence, Guan et al. [21] showed stable and accurate a posteriori (online)^b LES using only Mtr/100 re-training samples from a target system that had a 16× higher Re number. Aside from enabling generalization for one system when parameters change, TL can also be used to effectively blend datasets of different quality and length for training, e.g. a large, high-fidelity training set from high-resolution simulations and a very small but higher-quality re-training set from observations/experiments or much higher-resolution simulations [5, 32, 31, 30]. Such an application of TL in blending large climate model outputs and small observational datasets has shown promising results in forecasting El Niño–Southern Oscillation and daily weather [5, 32, 33]. Even further, TL has been suggested as a way to improve the training of physics-informed NNs, a novel PDE-solving technique [35, 34].

In the TL procedure, there is one critical decision to make: Which layer(s) to re-train? This is an important question, considering that the goal of TL is to find the best-performing TLNN given […]

In this paper, we use CNN-based non-local SGS closure modeling for LES of several setups of forced 2D turbulence as the test case. We first demonstrate the power of TL in enabling out-of-distribution generalization to 100× higher Re numbers, and even more challenging target flows. We further show that here, against the conventional wisdom in the ML literature, the shallowest layers are the best to re-train. Next, we leverage the fundamentals of turbulence physics and recent theoretical advances in ML to

1. explain what is learned during TL to a different turbulent flow, which is based around changes in the convolution kernels of the BNN after re-training to the TLNN, and these kernels' physical interpretation,
2. explain why the shallowest layers, rather than the deepest ones, are the best to re-train in these setups,
3. introduce a general framework to guide TL of similar systems based on a number of analysis steps that could be performed before re-training any TLNN.

While we use the SGS modeling of canonical 2D turbulence as the test case, the methods used for (1)–(2) and the framework in (3) can be readily applied to any other TL applications in turbulence or weather/climate modeling. More broadly, this framework can be used for TL applications beyond SGS modeling and for any multi-scale, nonlinear, high-dimensional dynamical systems.

2D turbulence: DNS and LES
The dimensionless governing equations of 2D turbulence in a doubly periodic square domain are:

∂ω/∂t + ∂ψ/∂y ∂ω/∂x − ∂ψ/∂x ∂ω/∂y = (1/Re) ∇²ω − [m_f cos(m_f x) + n_f cos(n_f y)] − rω,   (1a)

where the advection term on the left-hand side is denoted N(ω, ψ) and the bracketed forcing term is f(x, y). […]

Table 1. Physical and numerical parameters for the six different systems, which are divided into three cases, each with a base and a target system.

System           Re          mf   nf   r     NDNS    NLES
Base (Case 1)    3.2 × 10^4   0    0   0     2,048   128
Target (Case 1)  1 × 10^4     4    0   0.1   1,024   128
Base (Case 2)    1 × 10^3     4    0   0.1   512     128
Target (Case 2)  1 × 10^5     4    0   0.1   2,048   128
Base (Case 3)    2 × 10^4    25   25   0.1   1,024   128
Target (Case 3)  2 × 10^4     4    4   0.1   1,024   128

See Fig. 1 for snapshots and some of the statistical properties of these distinctly different flows.

By changing Re, r, mf, and nf, we have created six distinctly different flows, divided into three cases, each with a base and a target system (Table 1 and Materials and methods). We have shown in previous studies that for various setups of 2D turbulence, CNNs trained on large training sets, or on small training sets with physics constraints incorporated, produce accurate and stable data-driven closures in a priori (offline) and a posteriori (online) tests [21, 39]. These CNN-based closures were found to accurately capture both diffusion and backscattering, and to outperform widely used physics-based SGS closures such as the Smagorinsky, dynamic Smagorinsky, and mixed models in both a priori and a posteriori tests. In this paper, we focus on TL and addressing objectives (1)–(3) listed in the Introduction.

Closing the generalization gap using transfer learning
Before attempting to explain the physics of TL, we first show that TL enables our CNN-based SGS closures to effectively generalize between the base and target systems in each of the three cases. The first three rows of Fig. 1 demonstrate the differences in spatial scales between each pair of base and target systems. In Case 1, the base system is decaying turbulence while the target system is forced turbulence. From the ω and Π snapshots, their spectra, and the kinetic energy (KE) spectra, it is clear that the two systems are different at both the large and small scales. As a result of these substantial differences across all scales, the LES of the target system using a BNN trained on the base system (BNNbase) produces a KE spectrum that does not agree with that of the target system's FDNS (the truth). This indicates that the BNNbase fails to generalize here, leading to a generalization gap that is the difference between the two KE spectra (most noticeable at wavenumbers, k, larger than 10). Note that comparing the KE spectra of FDNS and LES is the most common measure of the a posteriori (online) performance of SGS closures.

Similar failures of the BNNbase to generalize are seen for Cases 2 and 3, leading to large generalization gaps in the KE spectra. In Case 2, the base system has Re = 10^3 and the target system has Re = 10^5. This 100× increase in the Re number leads to the development of more small-scale features in the target system, and changes the spectrum of Π in both large and small scales. In Case 3, the forcing of the base system is at wavenumber mf = nf = 25, while the target system's forcing is at mf = nf = 4. This decrease in forcing wavenumbers results in more (less) large-scale (small-scale) structures in the resolved flow, as seen in the spectra of both ω̄ and KE. This change in forcing wavenumber also leads to more large-scale structures in Π without any noticeable change in its small-scale structures. In short, Cases 1–3 represent six fluid flow systems that are different in terms of both the physics that drive the differences and the spatial scales of the resolved and SGS components.

In all three cases, TL closes the out-of-distribution generalization gap: LES of the target system using a TLNN (re-trained with Mtr/10 samples) produces a KE spectrum that matches that of the target system's FDNS. For the LES of the target system, the TLNN not only significantly outperforms the BNNbase, but is almost as good as the BNN trained on Mtr samples from the target system, BNNtarget (see the insets in Fig. 1).

Impact of re-training layer(s) on accuracy
Fig. 1 shows the power of TL in closing the generalization gaps. These results also show that, in contrast to the conventional wisdom, the best layers to re-train are not the deepest, but rather the shallowest ones. For each case, we have explored all possible combinations of 1, 2, and 3 hidden layers for re-training; i.e. each layer, each pair of layers, and each 3-layer combination. Based on the correlation coefficient of the Π terms from FDNS and TLNN, which is the most common metric for a priori (offline) tests, we have found that for Cases 2 and 3, re-training layer 2 alone is enough to get the best performance. For Case 1, re-training layers 2 and 5 provides the best performance, although most of the gap can be closed by re-training layer 2 alone.

To better understand the effects of "re-training layer" selection in TL, Fig. 2 shows the offline and online performance of TLNNℓ as a function of an individual re-trained hidden layer ℓ. In Case 1, the offline performance of TLNNs substantially declines as deeper layers are used for re-training (top row). As a result, TL with the deepest layers is completely ineffective; for example, LES with TLNN10 is as poor as LES with BNNbase, leaving a large generalization gap in the KE spectrum for k > 10 (bottom row). In contrast, LES with TLNN2 has a KE spectrum that closely matches that of the FDNS and only has a small generalization gap for k > 40 (as shown in Fig. 1, this gap is further closed when both layers 2 and 5 are re-trained). Similarly, in Case 3, the offline performance of TLNNs declines as ℓ increases. That said, in this case, TL with even the worst layer to re-train (ℓ = 10) is effective in closing the generalization gap in the online test. Still, LES with TLNN2 is slightly better than LES with TLNN10 (see the inset). In these two cases, there are substantial changes in the large scales of the inputs and outputs between the base and target systems (see the spectra of ω̄ and Π in Fig. 1). The offline results show a clear deterioration of the performance when moving from shallow to deep layers, which is due to the inability of the deeper layers to learn about changes in large scales during TL, as shown later.

In Case 2, the offline performance of TL is not a monotonic function of ℓ, though ℓ = 2 is still the best layer to re-train (ℓ = 7 is the worst), based on both offline and online results. The non-monotonicity emerges because changes between the base and target systems' ω̄ and Π occur predominantly at smaller scales (see their spectra in Fig. 1), which deeper layers are also able to learn during TL. For this case, as in Case 1, there is a noticeable difference in the online performance of the LES with TLNNs that use the best and worst performing re-trained layers.

The above analysis demonstrates that a poor selection of the re-training layer can lead to poor offline and/or online performance of the TLNN. This analysis also shows that in all three cases, re-training the shallowest layers consistently yields the best-performing TLNNs. This is in contrast to the conventional wisdom of TL, which is predominantly built on studies of the classification of static images, which often do not have a broad continuous spectrum of spatial scales [16, 25, 43].
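The spectral comparisons in rows 3–4 of Fig. 1 (and Step 1 of the framework introduced in the Discussion) boil down to angle-averaged spectra of the FDNS fields. The sketch below shows one way to compute and compare them in NumPy; the snapshot file names and the exact binning convention are illustrative assumptions, not necessarily those used for the published figures.

```python
import numpy as np

def angle_averaged_spectrum(field):
    """Angle-averaged spectrum of a doubly periodic 2D field: power summed
    over annuli k - 1/2 <= |k| < k + 1/2, with k = sqrt(kx**2 + ky**2)."""
    n = field.shape[0]
    power = np.abs(np.fft.fft2(field) / n**2) ** 2
    k1d = np.fft.fftfreq(n, d=1.0 / n)              # integer wavenumbers
    kmag = np.sqrt(k1d[:, None] ** 2 + k1d[None, :] ** 2)
    kbins = np.arange(1, n // 2 + 1)
    return kbins, np.array([power[np.abs(kmag - k) < 0.5].sum() for k in kbins])

def mean_spectrum(snapshots):
    """Average the angle-averaged spectrum over a stack of snapshots."""
    k = angle_averaged_spectrum(snapshots[0])[0]
    return k, np.mean([angle_averaged_spectrum(s)[1] for s in snapshots], axis=0)

# Hypothetical FDNS training sets for one base/target pair, shape (M, 128, 128).
pi_base = np.load("Pi_base.npy")          # assumed file name
pi_target = np.load("Pi_target.npy")      # assumed file name

k, E_base = mean_spectrum(pi_base)
_, E_target = mean_spectrum(pi_target)
# Large base/target differences at small k point to large-scale changes that,
# per the text, the shallow layers are best positioned to learn during TL.
```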
A spectral approach to interpreting transfer learning

Failure of deep layers to learn changes in large scales during transfer learning
To understand why different re-training layers lead to different TL performance, next, we conduct a spectral analysis of the CNNs in this section and the next one. The mathematical representation of CNNs is discussed in Materials and methods. Explained briefly, in our CNNs, inputs u = (ω̄, ψ̄) are passed through 11 sequential convolutional layers to predict the outputs, Π (Fig. 3). The hidden layers each have 64 channels. The output of channel j of layer ℓ, called the activation g_ℓ^j, is computed using Eq. 4: 64 kernels perform convolution on the activations g_{ℓ−1} of each of the 64 channels, and the outcome of these linear operations is sent through a ReLU nonlinear activation function, σ. Fig. 3 shows examples of g_ℓ^j, which are 128 × 128 matrices (the size of the LES grid). Note that these 64² kernels in each hidden layer extract information from the activations through spatial convolution, and their weight matrices W_ℓ^{β,j} ∈ ℝ^{5×5} are the main parameters that are learned during the training of a CNN.

In the second row of Fig. 3, we compare the all-channels-averaged Fourier spectra of the activations of the last hidden layer, 〈ĝ_10^j〉, from a fully trained BNNbase, TLNN2, and TLNN10 (〈·〉 represents averaging over all channels and ·̂ means Fourier transform). The spectrum of 〈ĝ_10^j〉 from TLNN2 differs from that of the BNNbase at most wavenumbers, including the small wavenumbers. This indicates that re-training layer 2 can account for differences in the output (Π) from the base and target flows at all scales, including the large scales. In contrast, the spectra from TLNN10 are almost the same as those from BNNbase at all scales (Case 1) or at large scales k < 10 (Cases 2 and 3). This indicates that re-training layer 10 cannot account for differences in the output from the base and target flows at large scales.
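The channel-averaged activation spectra 〈ĝ_ℓ〉 compared in Fig. 3 can be extracted from any trained network with a forward hook. The sketch below assumes a PyTorch implementation in which the ℓ-th convolutional layer can be referenced directly (the `bnn_base`, `tlnn_2`, and `hidden` names are hypothetical); the paper's own code may be organized differently.

```python
import torch

def channel_averaged_spectrum(model, layer, u):
    """Return <|g_hat|>: the magnitude of the 2D FFT of `layer`'s activations
    for input u, averaged over the batch and all 64 channels."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["g"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(u)                                  # forward pass fills `captured`
    handle.remove()

    g_hat = torch.fft.fft2(captured["g"])         # FFT over the two spatial dims
    return torch.abs(g_hat).mean(dim=(0, 1))      # shape: (128, 128)

# Hypothetical usage: compare the last hidden layer (ell = 10) of BNN_base and
# a TLNN on the same target-system inputs u_target of shape (batch, 2, 128, 128).
# spec_bnn = channel_averaged_spectrum(bnn_base, bnn_base.hidden[10], u_target)
# spec_tl2 = channel_averaged_spectrum(tlnn_2, tlnn_2.hidden[10], u_target)
```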
Fig. 1. Some comparisons between the base and target systems of the three cases (rows 1–3) and the ability of TL to close the generalization gaps in a
posteriori (online) LES (row 4). Parameters of the six systems are listed in Table 1, and these cases are further described in Materials and methods. Each
case consists of a base (left column) and a target (right column) system. The first and second rows show, respectively, the DNS snapshots of one of the
inputs to the CNNs, ω, and the snapshots of the SGS terms, Π, the output of the CNNs (note that NLES = 128 for all systems). These rows visualize the
substantial differences in the length scales dominating the base and target systems in each case. To further demonstrate these differences in spatial
scales, using the entire training sets and solid blue lines for base (top of legend) and solid red lines (bottom of legend) for target systems, we show the
angle-averaged spectra of ω̄ (left) and Π (right) in the third row, and the KE spectra of FDNS in the fourth row. In these panels, the horizontal axis is the
wavenumber k = √(k_x² + k_y²), where k_x and k_y are the wavenumbers in the x and y directions. The fourth row also shows the out-of-sample accuracy of the
NN-based closures: The KE spectra are from a posteriori LES of the target systems using SGS closures that are BNNs trained on Mtr samples from the base
systems (BNNbase, dashed blue lines) or from the target systems (BNNtarget, dashed red lines), or from the TLNN (black lines) re-trained using Mtr/10
samples (see Materials and methods for details of TL). In all three cases, there is a large generalization gap (difference between the dashed blue and solid
red lines), particularly for k > 10. In each case, TL closes this gap (black and solid red lines almost overlap for all k). Note that for the TL here, layers 2 and 5
are re-trained for Case 1, and layer 2 is re-trained for Cases 2 and 3 (see Section “Impact of re-training layer(s) on accuracy” and Fig. 2 for more
discussions).
Fig. 2. Online and offline performance of TLNNs as a function of the individual re-trained layer. For each individual layer re-trained with Mtr/10 samples,
the top row shows the most common measure of a priori (offline) accuracy of a SGS model: the correlation coefficient between Π from FDNS (truth) and
from the TLNN. The vertical lines on the bar plots show uncertainty measured as the standard deviation calculated over 100 random samples from the
testing set. The bottom row shows the KE spectra of the target systems’ FDNS and the KE spectra from a posteriori (online) LES with BNNbase or TLNNℓ,
where ℓ indicates the re-trained layer. These KE spectra are calculated using five long integrations, each equivalent to 10^6 Δt_DNS. Shading shows
uncertainty, estimated as 25th–75th percentiles of standard error calculated from partitioning each of the 5 runs into 10 sub-intervals. For each case, the
best (worst) individual layer to re-train is shown in red (blue) in both rows. The best- and worst-performing layers here are chosen based on the online
performance, i.e. how closely the KE spectrum matches that of the FDNS. Note that in Fig. 1, both layers 2 and 5 are re-trained during TL for Case 1, leading
to a better TLNN with LES’ KE spectrum matching that of the FDNS even at the highest wavenumbers. See Fig. S5 for the offline results of Case 3 with the
base and target systems switched.
Given that in all three cases there are large-scale differences in the Π terms between the base and target flows (Fig. 1), this analysis explains why re-training layer 10 (or other deep layers) leads to ineffective TL, while re-training layer 2 leads to the best TL performance.

To further understand what controls the spectra of g_ℓ^j, we have examined Eq. 8, which is the analytically derived Fourier transform of Eq. 4. As discussed in Materials and methods, this analysis shows that the Fourier spectrum of g_ℓ^j depends on the spectrum of ĝ_{ℓ−1}^j ∈ ℂ^{128×128}, the spectra of the weight matrices Ŵ_ℓ^{β,j} ∈ ℂ^{128×128} (and the constant biases b̂_ℓ^j ∈ ℝ), as well as where the linear activation h_ℓ^j(x, y) > 0 (defined in Eq. 7). The latter is a result of the Fourier transform of the ReLU activation function, the only source of nonlinearity in the calculation of g_ℓ^j. In Fig. S1, we have compared the spectra of the activations from layers 2 and 10 before and after applying the ReLU activation function. From this, we find that in all three cases, linear changes due to updating the weights substantially alter the spectra of the activations, while nonlinear changes only play a significant role in Case 1. These results (and further discussions in Materials and methods) suggest that a deeper insight into TL might be obtained by examining the spectra of the weight matrices, Ŵ_ℓ^{β,j}, and how they change from BNNbase to TLNN, as done next.

Spectral analysis of the kernels' weights
Before investigating how TL changes the spectra of the kernels' weights, let us first look at the spectra from the BNNbase of the three cases. A close examination of |Ŵ_ℓ^{β,j}| in different layers shows that the learned kernels are a combination of a number of known spectral filters. While visualizing all the 64² kernels in each layer is futile, we realize that the similarity across the spectra of many kernels allows us to meaningfully cluster them using the k-means algorithm. Fig. S2 presents the cluster centers (in Fourier space) for ℓ = 2 and 10 for each case. This figure shows that the learned kernels are a combination of coherent low-pass filters (row 1), high-pass filters (row 8), as well as band-pass and Gabor filters. It should be pointed out that the learning of Gabor filters by CNNs has been reported in the past for a number of applications, such as text recognition [44]. Even more broadly, the emergence of such filters for learning multi-scale, oriented, localized features has been reported in the sparse coding and vision literature [45].

Since deep CNNs contain a very large number of parameters (O(10^6)), it is often intractable to isolate the effect of each convolution kernel for either a BNN or TLNN. Moreover, investigating the learned convolution kernels in physical space (W_ℓ^{β,j} ∈ ℝ^{5×5}) does not lead to any meaningful physical understanding. Above, we show that examining the kernels in the spectral space (Ŵ_ℓ^{β,j} ∈ ℂ^{128×128}) leads to physically interpretable insight into their role as spectral filters. Still, due to the large number of parameters and the impact of nonlinearities, it is currently challenging to understand the physics learned by the entire BNN. Fortunately, due to the over-parameterized nature of these deep CNNs, TL occurs in the lazy training regime [46]. In this regime, significant changes occur in only a small number of kernels, as shown below. This opens an avenue for explaining what is learned in TL through examining the spectra of the few kernels with the largest changes.

For each case, we quantify the change in each kernel by computing the Frobenius norm of the difference between Ŵ_ℓ^{β,j} from the BNNbase and TLNNℓ for ℓ = 2 and 10. As demonstrated in Fig. S3, in each case and each layer, there are a few kernels with substantial changes, much larger than the changes in the rest of the 64² kernels.
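Both kernel-level analyses described in this subsection (clustering the BNNbase kernels' spectra, and ranking kernels by how much TL changes them) only require zero-padding each 5 × 5 kernel to the 128 × 128 input size and taking its 2D FFT (Eqs. 5–6). A minimal NumPy sketch is given below; the array shapes, file names, and the choice of eight clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

N = 128  # kernels are zero-padded to the full 128 x 128 input domain

def padded_fft(W):
    """2D FFT of each 5x5 kernel zero-padded to N x N. W: (64, 64, 5, 5)."""
    padded = np.zeros(W.shape[:2] + (N, N))
    padded[..., :5, :5] = W
    return np.fft.fft2(padded)                     # complex, shape (64, 64, N, N)

W_bnn = np.load("W_layer2_bnn.npy")                # hypothetical file names
W_tlnn = np.load("W_layer2_tlnn.npy")
F_bnn, F_tlnn = padded_fft(W_bnn), padded_fft(W_tlnn)

# (i) Cluster the BNN kernels' spectra (cf. Fig. S2) to reveal the low-pass,
#     high-pass, band-pass, and Gabor-like filter families.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    np.abs(F_bnn).reshape(-1, N * N))

# (ii) Rank kernels by the Frobenius norm of the change in their spectra
#      between BNN_base and the TLNN (cf. Figs. S3 and 4).
change = np.linalg.norm((F_tlnn - F_bnn).reshape(-1, N * N), axis=1)
most_changed = np.argsort(change)[::-1][:4]        # the four largest changes
```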
Fig. 3. The top row shows a schematic of the CNN architecture and its governing equations. Examples of activations g_ℓ^j ∈ ℝ^{128×128} of some of the layers ℓ and channels j are shown as red shading (with σ being the ReLU nonlinear function, the values of these activations are all positive). Note that training a CNN means learning the convolution kernels' weight matrices W_ℓ^{β,j} ∈ ℝ^{5×5} and the biases' constant matrices b_ℓ^j ∈ ℝ^{128×128} (for hidden layers ℓ = 2…10, β ∈ {1, 2…64} and j ∈ {1, 2…64}). See Materials and methods for a detailed discussion of the CNN and its mathematical representation. In the bottom row, the effects of re-training layer 2 versus layer 10 on the Fourier spectrum of the averaged activation of the last hidden layer (ℓ = 10) are compared (note that the output layer ℓ = 11 has a linear activation function). The averaging is done over all channels, denoted by 〈·〉. Shading shows uncertainty, estimated as the 25th–75th percentiles of the averaged activation spectra computed with 20 random input samples.
Fig. 4 shows the spectra of the four most-changed kernels (due to TL) in layers 2 and 10 from BNNbase and TLNNℓ. We see that in all three cases, re-training layer 2 converts a few relatively inactive kernels into clear low-pass filters (one exception is the 4th most-changed kernel in Case 1, discussed later). In contrast, re-training layer 10 turns inactive or complex filters into other complex (often less coherent) filters, though some of them can be identified as band- or high-pass filters. The two panels on the right further show that the kernels learned in TL act as their spectra suggest: the new low-pass filter learned from re-training layer 2 produces an activation g_2^j that is different from that of the BNNbase (for the same input u) only in the large scales, while the most-changed kernel from re-training layer 10 (a high-pass filter) produces an activation g_10^j that is different from that of the BNNbase mainly in the small scales.

We remind the reader of the earlier discussion in this section: TL needs to capture changes in the large scales of the output Π between the base and target systems, and the inability of the re-trained layer 10 to do so is the reason for the ineffectiveness of TLNN10. Based on the above analysis, we can now explain the reason for this ineffectiveness (and the effectiveness of layer 2): layer 10 fails to learn new low-pass filters, which are essential for capturing changes in the large scales, especially at the end of the network right before the linear output layer. In contrast, layer 2 is capable of learning new low-pass filters to capture these changes in the large scales of the base and target systems' outputs. Admittedly, the nonlinearity and the subsequent layers after ℓ = 2 could impact the outcome of a low-pass filter, but it is possible to separate out the impact of the nonlinearity. Fig. 4 and Fig. S1 show the impact of the ReLU nonlinearity by comparing the spectrum of the activation before and after ReLU is applied. In Case 1, where the ReLU function plays an important role in changing the activations' spectra after TL, we find that in addition to low-pass filters, TLNN2 also learns more complex filters, such as the 4th most-changed kernel in Fig. 4, that impact the sign of the linear activations, h_2^j.

The analyses presented so far provide answers to objectives 1–2 from the Introduction. To address objective 3 (develop a general framework to guide TL), we need to understand why layer 10 cannot learn the filters needed for TL in these cases while layer 2 can. This question is investigated next by leveraging recently developed ideas in theoretical ML.

Loss landscapes: sensitivity of kernels to perturbations and re-training data
So far, we have presented post-hoc analyses, investigating changes in the spectra of activations and weights, as well as the learned physics, after a BNNbase has been re-trained to obtain a TLNN.
Fig. 4. The three left columns compare the Fourier spectra |Ŵ_ℓ^{β,j}| of the four convolution kernels that have changed the most between BNNbase and TLNN2 (top row) and TLNN10 (bottom row). The change in each kernel is quantified using the Frobenius norm ‖F(W_ℓ^{β,j}) − F(W̆_ℓ^{β,j})‖_F, where F indicates the Fourier transform (Eq. 5) and ˘· indicates that the weight matrix is from a TLNN (the absence of ˘· in this figure means that the matrix is from a BNNbase). The two panels on the right show examples of how changes in one kernel of layer 2 and one kernel of layer 10 affect the activations' spectra of layer 10 by comparing ĝ_10^j from BNNbase (solid blue) with that from the TLNNℓ (solid red). We also show the activations before the application of the ReLU nonlinearity σ with dashed lines. Note that the inputs to the networks (u) are the same and from the target system. The top panel shows that the newly learned kernel in layer 2 substantially changes the activation at low wavenumbers (k ≤ 20) without affecting the higher wavenumbers, as expected from a low-pass filter. Here, nonlinearity has little impact: the solid and dashed lines coincide. The bottom panel shows that the newly learned kernel in layer 10 only changes the activation at high wavenumbers and that in this case, the ReLU nonlinearity has a contribution.
Here, we present a non-intrusive method for gaining insight into which layers of a BNNbase are the best (or worst) to re-train for a given target system, before performing any actual re-training. This analysis exploits the concept of "loss landscapes" [43, 47, 48] and examines, for a given CNN input u, the sensitivity of the loss function L to perturbations of the weights (and biases) of the layer(s) to be re-trained. Training a deep CNN requires solving a high-dimensional non-convex optimization problem, for which the smoothness of the loss function can be a significant factor in the success of training. Previous studies [48, 43, 47, 49] show that even one- or two-dimensional approximations of the loss landscape can provide meaningful information about how easily a deep neural network, such as a CNN, can be trained. In this study, leveraging recent work in theoretical ML [43], we extend the application of loss landscape analysis to studying TL; see Materials and methods for more details and discussions about computing the loss landscapes.

Fig. 5 (rows 1 and 2) shows the loss landscape calculated for perturbations along two random directions in the parameter space of shallow or deep layers for the BNNbase, with data from the target system as the input. Fig. S4 presents the loss landscapes obtained using a second method (based on perturbations along the eigenvectors of the Hessian of the loss). These loss landscapes provide insight into whether a layer is receptive to change when re-trained with new data during TL. Two important characteristics of these landscapes are their convexity and their magnitude. Notably, the landscapes in row 1 (re-training layer 2, or 2 and 5) are both smooth and of much lower magnitude than those in row 2 (deep layers). For Case 1, we show results for combinations of two layers as this yields better performance than re-training a single layer, and this also demonstrates that the method is robust beyond perturbations of individual layers. This analysis indicates that these shallow BNNbase layers are easier to re-train for these target systems' data, and that the loss function will likely reach a better optimum during TL. This loss landscape analysis is consistent with our previous findings of TLNN2's ability (and TLNN10's inability) to perform well in these TL tasks.
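For concreteness, the first (random-directions) loss-landscape method of Eq. 9 can be sketched in a few lines, assuming a PyTorch model whose re-trained layer is accessible as a module. The normalization of the directions by the 2-norm of θ* and the δ range of [−2, 2] follow Materials and methods; everything else (object names, data shapes) is an illustrative assumption.

```python
import torch

def loss_landscape(model, layer, u_target, pi_target, deltas, seed=0):
    """L(delta1, delta2) of Eq. 9: perturb only `layer`'s weights/biases along
    two random directions, each normalized by ||theta*||_2, and evaluate the
    MSE loss on target-system data."""
    torch.manual_seed(seed)
    theta = [p.detach().clone() for p in layer.parameters()]
    theta_norm = torch.sqrt(sum((p**2).sum() for p in theta))

    def direction():
        v = [torch.randn_like(p) for p in theta]
        v_norm = torch.sqrt(sum((x**2).sum() for x in v))
        return [x * (theta_norm / v_norm) for x in v]

    v1, v2 = direction(), direction()
    mse = torch.nn.MSELoss()
    surface = torch.zeros(len(deltas), len(deltas))
    with torch.no_grad():
        for i, d1 in enumerate(deltas):
            for j, d2 in enumerate(deltas):
                for p, p0, a, b in zip(layer.parameters(), theta, v1, v2):
                    p.copy_(p0 + d1 * a + d2 * b)
                surface[i, j] = mse(model(u_target), pi_target)
        for p, p0 in zip(layer.parameters(), theta):   # restore theta*
            p.copy_(p0)
    return surface

# e.g. surface = loss_landscape(bnn_base, bnn_base.hidden[2], u_target, pi_target,
#                               deltas=torch.linspace(-2.0, 2.0, 21))
```

Smooth, low-magnitude surfaces (as obtained here for the shallow layers) indicate layers that are receptive to re-training; sharp non-convexities or very large values flag layers to avoid.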
Fig. 5. The top two rows present the loss landscape L(δ1 ,δ2 ) computed from Eq. 9. In row 1, the weights and biases of layers 2 and 5 (Case 1) or 2 (Cases 2 and
3) from the BNNbase are perturbed in two random directions by amplitudes δ1 and δ2; see Materials and methods for details. Similarly, in row 2, the deepest
layers are perturbed. Row 3 shows the convergence of the training loss when individual shallow, middle, and deep layers are re-trained for TL. In all
calculations, the inputs are from the target system.
Additionally, Fig. 5 (bottom row) shows how quickly the loss decreases as a function of the number of epochs during re-training of layer 2, 6, or 10 of the BNNbase using the target system's data. For all three cases, TLNN2 converges the fastest. This is a direct consequence of the structure of the loss landscapes shown in rows 1 and 2 of Fig. 5: landscapes obtained from perturbing layer 2 are more favorable for convergence (an absence of pathological non-convexities) compared to the landscapes obtained from perturbing layer 10.

As a final note, we point out that the concept of "spectral bias" [50, 51] from theoretical ML suggests that layer 2, which converges faster, is learning the large scales, while the slow-converging layer 10 is learning the small scales. This is consistent with the conclusions of our earlier analyses of the weights' spectra.

Discussion
In Section "2D turbulence: DNS and LES", we present a number of novel analysis steps, ranging from a) the most intrusive, computationally expensive ones, to gain insight into the learned physics, to b) non-intrusive, inexpensive analyses, which can effectively guide TL for any new problem. For (a), we examine the BNNs' and TLNNs' activations and weights (done after re-training), revealing that the newly learned kernels are meaningful spectral filters, consistent with the physics of the base and target systems and their difference in the spectral space. To the best of our knowledge, this is the first full interpretation of CNNs' kernels in an application for turbulence or weather/climate modeling. For (b), we introduce a novel use of loss landscapes, shedding light on which layers are most receptive to learning the new filters in re-training.

These steps connect the spectral analysis of turbulent flows^d and of CNNs, and further connect them to the most recent advances in analyzing deep NNs. The above analyses show that the shallowest layers are the best to re-train here, and shed light on the learned physics and the inner workings of TL for these three test cases. Admittedly, some or all of these findings, in terms of the learned physics and the best layer(s) to re-train, are likely specific to these three cases, our specific NN architecture, and the SGS modeling application.
Fig. 6. Overview of the framework for guiding and explaining TL onto a new target system. The top row shows the steps of the TL process: acquiring a
large amount of training data from the base system and a small amount from the target system, training a BNNbase using data from the base system, and
re-training it using data from the target system to obtain a TLNN. On the bottom, we present the analyses involved in this framework, listed (left to right)
in the order of when they should be used. The arrows indicate what is needed from each step of the TL process and the corresponding analyses. Here, the
blue line represents data from the target system, the red line represents the trained BNNbase, and the orange line represents the re-trained TLNN.
However, the analysis methods we introduce or employ are all general and can be used for any base–target systems, applications (SGS modeling, data-driven forecasting, or blending training sets), and most CNN architectures.^e Therefore, putting all these analysis steps together, below we propose a general framework for guiding and explaining TL, which we expect to benefit a broad range of applications involving multi-scale, nonlinear dynamical systems.

The framework is shown schematically in Fig. 6. Assuming that we have a large number of training samples from the base system, an accurate BNNbase already trained on these samples, and a small number of re-training samples from the target system, the framework involves the following steps:

1. Compare the spectra of the input and output variables from the base and target systems. The three cases studied here have shown that the change of spatial scales between the base and target systems, particularly in the output variables, significantly impacts which layers are optimal for re-training.
2. Compute the loss landscapes of the BNNbase with the target systems' data as various combinations of layers are chosen for re-training. Re-training layer(s) with favorable landscapes (smooth and small magnitudes) should be the first choices for TL. We further suggest examining the properly clustered weights' spectra of the BNNbase to see if they have clear interpretations as spectral filters.
3. Re-train a TLNN based on the outcome of Step 2. Examine the spectra of the activations from the re-trained layer(s) and the last hidden layer to see if the differences in the spatial scales identified in Step 1 are learned.
4. Examine the spectra of the most-changed kernels between BNNbase and TLNN. Investigate whether the nature of the newly learned kernels (as spectral filters) is consistent with the outcome of Steps 1 and 3 in terms of the spatial scales that need to be learned in TL.

Steps 1–2 are non-intrusive, inexpensive analyses that do not require any re-training, and will effectively guide Step 3, replacing expensive and time-consuming trial-and-error with many combinations of re-training layers. Steps 3–4 provide an explanation for what is learned in TL and act to validate decisions made based on Steps 1–2.

There are a few points about this framework that need to be further clarified. In general, turbulent flows have universal behavior in their smallest scales [52, 53] and vary in large scales due to forcing and geometry. This might seem to suggest that TL will always need to learn changes in large scales between a base and a target turbulent flow. This is not necessarily true, as even in Cases 1–2 here, in which the base and target flows differ in forcing and Re number, there are differences in the small scales of Π too. Furthermore, in the broader applications of TL (e.g. in blending different datasets) and beyond just single-physics turbulent flows, there might be differences between the base and target systems at any scale. Step 1 is intended to identify these differences.

We also emphasize that currently there is no complete theoretical understanding of which layers of a CNN are better at learning which spatial scales. Our findings for Cases 1–3 and some other studies [43, 50] in the ML community suggest that the shallower layers are better at learning large scales. If further work confirms this behavior for a variety of systems and CNN architectures, then Steps 1–2 together would be able to even better guide TL in terms of the best layer(s) to re-train.

It should be noted that in more complex, anisotropic, inhomogeneous systems (e.g. channel flows or ocean circulations), spectral analysis using other basis functions, such as Chebyshev or wavelets [54, 55], might be needed. Moreover, additional modifications of the spectral analysis component of the framework might be needed for some types of NN architectures, e.g. those involving pooling layers, fully connected layers, or other activation functions.
Recent work in the ML literature on spectral analysis of NNs, particularly on developing end-to-end analyses, could be leveraged in addressing these challenges [51, 56].

Aside from items (1)–(3) in the Introduction addressed in this study, another major question about TL is how much re-training data are needed to achieve a certain level of out-of-sample accuracy for the target system. Currently, there is no theoretical framework to answer this question, particularly for data from dynamical systems such as turbulent flows or the climate system. However, a few recent developments in the ML literature for TL error bounds of simple NNs (e.g. shallow or linear) could be leveraged as the starting point [57–59], and combined with extensive empirical explorations, may provide some insight into this critical question.

Finally, we point out that a number of recent studies have proposed improving out-of-distribution generalization via incorporating physics constraints into NNs (e.g. [60, 61]) or via data augmentation (e.g. [62, 63, 64]). The latter approach has shown promising results in image classification tasks, and could potentially be used in applications involving dynamical systems too. Incorporating physics has also shown promising results for specific applications; however, such an approach requires the existence of a physical constraint that is universal (e.g. a scaling law); otherwise, it could potentially deteriorate the performance of the NN. However, the availability of such constraints is very limited. In contrast, TL provides a flexible framework that, beyond improving out-of-distribution generalization, is also broadly useful to blend disparate datasets for training, an important application on its own. Note that the aforementioned approaches can be combined with TL to possibly reduce the amount of re-training data.

To summarize, here we have presented the first full explanation of the physics learned in TL for multi-scale, nonlinear dynamical systems, and a novel general framework to guide and explain TL for such systems. This framework will benefit a broad range of applications in areas such as turbulence modeling and weather/climate prediction. Climate change modeling, which deals with an inherently non-stationary system and also involves combining various observational and model datasets, is an application that particularly needs TL, and can benefit from the framework proposed here.

Materials and methods

Numerical solvers for DNS and LES
We have performed DNS for all six systems used in this study (see Table 1 and below). In DNS, Eqs. 1a–1b are solved using a Fourier–Fourier pseudo-spectral solver with NDNS collocation grid points and second-order Adams–Bashforth and Crank–Nicolson time-integration schemes, with time step Δt_DNS, for the advection and viscous terms, respectively. See Guan et al. [21, 39] for more details on the solvers and these simulations. For the base system in Case 1 (decaying 2D turbulence), following earlier studies [40, 21], the flow is initialized randomly using a vorticity field (ω_ic) with a prescribed power spectrum. Snapshots of (ω, ψ) in this system are obtained from 50–200τ, where τ is the initial eddy-turn-over time: τ = 1/max(ω_ic). For the other five systems (forced 2D turbulence), once the randomly initialized flow reaches statistical equilibrium after a long-term spin-up, we take sequential snapshots of (ω, ψ) that are 1000 Δt_DNS apart, in order to reduce the correlation between samples. We use the filtered and coarse-grained DNS data, referred to as FDNS data (details below), for training the CNN-based data-driven closures for Π and for testing their a priori (offline) and a posteriori (online) performance.

For LES, we solve Eqs. 2–3 employing the same numerical solver used for DNS, but with coarser grid resolutions (N_LES = 128 < N_DNS) and larger time steps (Δt_LES = 10 Δt_DNS). To represent Π, a CNN-based closure that is trained on FDNS data is coupled to the LES solver.

Filtering and coarse-graining: LES equations and FDNS data
Filtering Eqs. 1a–1b yields the governing equations for LES [39, 53, 65]:

∂ω̄/∂t + N(ω̄, ψ̄) = (1/Re) ∇²ω̄ − f̄ − rω̄ + [ N(ω̄, ψ̄) − \overline{N(ω, ψ)} ],   (2)

∇²ψ̄ = −ω̄,   (3)

where the bracketed term in Eq. 2 is the SGS term Π and the overbar denotes filtering and coarse-graining. In LES, only the large-scale structures (ψ̄ and ω̄) are resolved using a coarser grid resolution (compared to DNS). The effects of the structures smaller than the grid spacing are included in the unclosed SGS term Π, which requires a closure in terms of the resolved flow, (ψ̄, ω̄).

To obtain the FDNS data, we use the DNS snapshots of (ψ, ω), which are of size N_DNS × N_DNS, to compute snapshots of ψ̄, ω̄, and Π (defined in Eq. 2), where the overbar represents filtering and coarse-graining. The latter is needed to compute these variables on the LES grid (size: N_LES × N_LES). Here, we use a Gaussian filter and then sharp spectral cutoff coarse-graining [21, 39]. For each system, the FDNS dataset is divided into completely independent training, validation, and testing sets [21, 39].

Cases 1–3: base and target systems
By changing Re, r, mf, and nf, we have created six distinct systems of 2D turbulence, which are grouped into three cases, each with a base and a target system (Table 1). Snapshots of ω and Π, as well as the spectra of ω̄, Π, and KE of these systems, are shown in Fig. 1 to demonstrate the rich variety of fluid flow characteristics among these systems, particularly between each case's base and target systems. Case 1 involves TL from decaying to forced 2D turbulence. From the ω and Π snapshots as well as their spectra shown in Fig. 1, it is clear that the two systems are different at both the large and small scales. The significant differences across all scales make this case the most challenging one, and result in the largest generalization gap, as discussed in the main text.

Case 2 involves TL between two forced 2D turbulence systems: the base system has Re = 10^3 and the target system has a 100× higher Reynolds number (Re = 10^5), making this the largest extrapolation in Re using TL ever reported, to the best of our knowledge. The increase in Re adds more small-scale features in ω̄ (see the spectrum), and changes the spectrum of Π in both large and small scales. Case 3 involves decreasing the forcing wavenumbers of the system. Here, the base system has mf = nf = 25 while the target system has mf = nf = 4. This decrease in forcing wavenumbers, as expected, results in more (less) large-scale (small-scale) structures in the resolved flow; see the spectra of ω̄ and KE. Furthermore, more large-scale structures appear in Π without any noticeable change in the small-scale structures (see the power spectrum of Π).
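The filtering and coarse-graining operations described above (a Gaussian filter followed by a sharp spectral cutoff onto the 128 × 128 LES grid) and the resulting SGS term Π of Eq. 2 can be sketched with NumPy FFTs as below. This is only an illustration for a 2π-periodic square domain: the Gaussian transfer function exp(−k²Δ²/24) and the filter width Δ are common LES conventions assumed here, not necessarily the exact choices of [21, 39].

```python
import numpy as np

def wavenumbers(n):
    """Integer wavenumber grids (kx, ky) for an n x n, 2*pi-periodic domain."""
    k = np.fft.fftfreq(n, d=1.0 / n)
    return np.meshgrid(k, k, indexing="ij")

def filter_coarse_grain(f, n_les=128, delta=2 * np.pi / 128):
    """Gaussian filter (width delta) then sharp spectral cutoff onto n_les x n_les."""
    n = f.shape[0]
    kx, ky = wavenumbers(n)
    fhat = np.fft.fft2(f) * np.exp(-(kx**2 + ky**2) * delta**2 / 24.0)
    c = n_les // 2                                  # retain the lowest wavenumbers
    fhat_c = np.zeros((n_les, n_les), dtype=complex)
    fhat_c[:c, :c], fhat_c[:c, -c:] = fhat[:c, :c], fhat[:c, -c:]
    fhat_c[-c:, :c], fhat_c[-c:, -c:] = fhat[-c:, :c], fhat[-c:, -c:]
    return np.real(np.fft.ifft2(fhat_c)) * (n_les / n) ** 2

def jacobian(omega, psi):
    """N(omega, psi) = psi_y * omega_x - psi_x * omega_y via spectral derivatives."""
    kx, ky = wavenumbers(omega.shape[0])
    d = lambda f, k: np.real(np.fft.ifft2(1j * k * np.fft.fft2(f)))
    return d(psi, ky) * d(omega, kx) - d(psi, kx) * d(omega, ky)

def fdns_sample(omega_dns, n_les=128):
    """Return (Pi, omega_bar, psi_bar) on the LES grid from one DNS snapshot."""
    kx, ky = wavenumbers(omega_dns.shape[0])
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                  # avoid division by zero
    psi_dns = np.real(np.fft.ifft2(np.fft.fft2(omega_dns) / k2))  # solves lap(psi) = -omega
    omega_bar = filter_coarse_grain(omega_dns, n_les)
    psi_bar = filter_coarse_grain(psi_dns, n_les)
    pi = jacobian(omega_bar, psi_bar) - filter_coarse_grain(
        jacobian(omega_dns, psi_dns), n_les)
    return pi, omega_bar, psi_bar
```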
Convolutional neural network and transfer learning
Building on the success of our earlier work [21, 39], to develop non-local data-driven SGS closures for each system, we train a CNN with input u = (ω̄(x, y), ψ̄(x, y)) to predict Π(x, y) (output). These CNNs are built entirely from 11 sequential convolution layers, 9 of which are hidden layers, each with 64² kernels of size 5 × 5 (note that these numbers are hyperparameters that have been optimized for this application to avoid underfitting or overfitting [21, 39]). The outputs of a convolutional layer are called activations. For channel j of layer ℓ, the equation for the activation g_ℓ^j ∈ ℝ^{N_LES×N_LES} is:

g_ℓ^j(u) = σ( Σ_β W_ℓ^{β,j} ⊛ g_{ℓ−1}^β(u) + b_ℓ^j ).   (4)

Note that N_LES = 128 for all systems (Table 1). Here, ⊛ represents spatial convolution and σ(·) = max(0, ·) is the ReLU activation function (which is not present for the linear output layer, ℓ = 11). W_ℓ^{β,j} ∈ ℝ^{5×5} is the weight matrix of a convolution kernel, and b_ℓ^j ∈ ℝ^{128×128} is the regression bias, a constant matrix. We have β ∈ {1, 2…64} and j ∈ {1, 2…64} for all layers, with two exceptions: in the input layer (ℓ = 1), β ∈ {1, 2}, and in the output layer (ℓ = 11), j = 1, as the output is a single channel. The kernels' weights and biases together constitute the NN's trainable parameters, which we collectively refer to as θ ∈ ℝ^p. Note that g_in = g_0 = u and g_out = g_11 = Π. A visualization of these networks as well as examples of activations in the hidden layers are presented in Fig. 3. An important distinction between these CNNs and traditional CNNs is that these do not include any max-pooling layers or dense layers, such that they maintain the dimension of the input through all layers and channels in the network. Our earlier work and a few other studies have found such an architecture to lead to more accurate CNNs for SGS closures [21, 39, 66].

We train these CNNs using the Adam optimizer and a mean-squared-error (MSE) loss function L. For BNNs, all their trainable parameters θ are randomly initialized, and each CNN is trained for 100 epochs using Mtr = 2000 samples from the training set of the base system.^f Note that even when we use Mtr samples from the training set of the target system to train a CNN, we still call it a "BNN" for convenience (e.g. in Fig. 1). Subscripts on BNNs clearly indicate which system provided the Mtr training samples.

To appropriately train and evaluate the networks, for each of the six systems, we have created three independent training, validation, and testing sets from a long DNS dataset. To ensure independence, these subsets are chosen far apart, and pattern correlations between u and between Π of samples are computed and found negligible. The training set is reserved solely for the actual training procedure, and the only metric calculated with this set is the MSE loss (during training) to assess the convergence of the network parameters, θ. The validation set is used to assess both convergence and overfitting during training: alongside the training set, we compute the MSE loss on the validation set after each epoch to ensure that the network's performance is continuing to improve out-of-sample rather than overfitting. The testing set is used to evaluate the CNNs' a priori performance reported in Figs. 2–4. Furthermore, note that the FDNS data used in Figs. 1 and 2 are from the testing set of the corresponding system. No data from LES have been used during the training of any CNN.

To perform TL from a BNN, the weights and biases of the TLNN are initialized with those of the BNN. The layers to re-train are selected (trainable layers) and the remaining weights/biases are frozen (non-trainable layers). The TLNN is then re-trained using standard backpropagation and the same MSE loss function with Mtr/10 samples from the training set of the target system, updating the weights and biases of the trainable layers. The re-training continues until the loss plateaus (for TL, this happens at around 50 epochs), which helps avoid overfitting. Note that based on offline metrics such as the correlation coefficients for Π, we have not found any need for adjusting the hyperparameters, such as the learning rate, or adding additional layers between training a BNN and TLNN.

Spectral analysis of CNNs
The Fourier transform operator F is defined as

·̂ = F(·),   F : ℝ^{128×128} → ℂ^{128×128}.   (5)

To represent convolution as an operation in the spectral space, we first note that we can extend each kernel W_ℓ^{β,j} ∈ ℝ^{5×5} to the full domain of the input by padding it with zeros, as done in practice for faster training [67], to obtain W̃_ℓ^{β,j} ∈ ℝ^{128×128}. Then, the convolution theorem yields

W_ℓ^{β,j} ⊛ g_{ℓ−1}^β = F⁻¹( Ŵ_ℓ^{β,j} ⊙ ĝ_{ℓ−1}^β ),   (6)

where ⊙ is element-wise multiplication.

Next, we define the linear activation h_ℓ^j, which contains all the linear operations in Eq. 4:

h_ℓ^j = Σ_β ( W_ℓ^{β,j} ⊛ g_{ℓ−1}^β ) + b_ℓ^j.   (7)

Despite the nonlinearity of Eq. 4 due to the ReLU function, its Fourier transform can be written analytically. Using Eqs. 6 and 7 and the linearity of the Fourier transform, we obtain

ĝ_ℓ^j = ( Σ_α e^{−i(k_x x_α + k_y y_α)} ) ⊛ ĥ_ℓ^j = ( Σ_α e^{−i(k_x x_α + k_y y_α)} ) ⊛ [ Σ_β ( Ŵ_ℓ^{β,j} ⊙ ĝ_{ℓ−1}^β ) + b̂_ℓ^j ],   (8)

where (x_α, y_α) ∈ {(x, y) | h_ℓ^j(x, y) > 0} and i = √−1. The term with the sum over α is a result of the ReLU function and involves summing over grid points where h_ℓ^j > 0 (note that this term is the Fourier transform of the Heaviside function). Also note that b_ℓ^j is a constant matrix; therefore, b̂_ℓ^j is only non-zero at k_x = k_y = 0 (and is real). See [50, 51, 56] for more information and discussion about Fourier analysis of NNs.

Equation 8 shows that the spectrum of ĝ_ℓ^j depends on the spectrum of ĝ_{ℓ−1}^β, the spectra of the weights Ŵ_ℓ^{β,j} (and the constant biases b̂_ℓ^j), and where h_ℓ^j > 0 in the physical (grid) space. With TL, the weights and biases are updated, which changes their spectra as well as where h_ℓ^j > 0. Understanding the full effects of all these changes on ĝ_ℓ^j is challenging. In Fig. S1, we have examined the spectra of the activations of layers 2 and 10 from BNNbase, TLNN2, and TLNN10 before and after applying the ReLU activation function (i.e., we compare the spectra of ĥ_ℓ^j and ĝ_ℓ^j). This analysis shows that in all three cases, linear changes due to updating ĥ_ℓ^j substantially alter the spectra of the activations, while nonlinear changes only play a significant role in Case 1. These results and Eq. 8 suggest that a deeper insight into TL might be obtained by investigating Ŵ_ℓ^{β,j} and how they change from BNNbase to TLNN.
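As a concrete (but hypothetical) reference implementation of the CNN and TL set-up described in the two preceding subsections, the sketch below builds an 11-layer, 5 × 5-kernel, pooling-free CNN in PyTorch and freezes all layers except the chosen ones before re-training. Details that the text leaves open, such as the padding mode (zero vs periodic), whether the input layer is followed by ReLU, and the learning rate in the comment, are assumptions here.

```python
import torch
import torch.nn as nn

class SGSCNN(nn.Module):
    """Sketch of the closure CNN described above: 11 sequential 5x5 convolutions
    (2 -> 64 -> ... -> 64 -> 1 channels), ReLU after every layer except the
    linear output layer, and no pooling or dense layers, so the 128 x 128 input
    dimension is preserved throughout."""
    def __init__(self, channels=64, n_hidden=9):
        super().__init__()
        layers = [nn.Conv2d(2, channels, 5, padding=2), nn.ReLU()]       # layer 1
        for _ in range(n_hidden):                                        # layers 2-10
            layers += [nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 5, padding=2)]                 # layer 11 (linear)
        self.net = nn.Sequential(*layers)

    def forward(self, u):            # u: (batch, 2, 128, 128) = (omega_bar, psi_bar)
        return self.net(u)           # Pi: (batch, 1, 128, 128)

def make_tlnn(bnn, layers_to_retrain=(2,)):
    """Initialize a TLNN from the BNN and freeze every convolution layer except
    the chosen ones (layer indices 1-11, as in the text)."""
    tlnn = SGSCNN()
    tlnn.load_state_dict(bnn.state_dict())
    convs = [m for m in tlnn.net if isinstance(m, nn.Conv2d)]
    for i, conv in enumerate(convs, start=1):
        for p in conv.parameters():
            p.requires_grad = i in layers_to_retrain
    return tlnn

# Re-training uses the same Adam + MSE setup, with Mtr/10 target-system samples:
# tlnn = make_tlnn(bnn_base, layers_to_retrain=(2,))
# opt = torch.optim.Adam((p for p in tlnn.parameters() if p.requires_grad), lr=1e-4)
```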
Calculating the loss landscape samples are chosen far apart to be weakly correlated, requiring
Let us represent a CNN with input u and trainable parameters θ as a long DNS dataset (two million ΔtDNS ). See Guan et al. [39] for fur
C(u, θ). The MSE loss function of this CNN is a function of the out ther discussions about the big versus small training sets.
put: L(C). The concept of loss landscape (of L) has received much
attention in recent years and is widely used to study the training Acknowledgments
phase of NNs [48, 47, 49]. Below, leveraging recent work in theoret We thank Fabrizio Falasca and Laure Zanna for insightful discus
Downloaded from [Link] by Lib4RI - Library of Eawag, Empa, PSI, WSL user on 25 June 2025
ical ML [43], we compute the loss landscape to study the re-training sions. We are grateful to three anonymous reviewers for helpful
phase of NNs in order to gain insight into TL. comments and suggestions.
Suppose that θℓ ∈ R p are all the trainable parameters of a
∗
BNNbase from all layers ℓ. We define θ∗L ∈ R p as the subset of pa
rameters that are updated in TL, i.e. the weights and biases of Supplementary material
the re-trained layer(s), L. Next, we follow two methodologies for
Supplementary material is available at PNAS Nexus online.
constructing loss landscapes. In the first method, we follow Li
et al. [48] and select two random direction vectors v1 , v2 ∈ R p∗
and normalize them with the 2-norm of θ*. In the second method, Funding
we follow Yao et al. [68] and find the eigenvectors of the Hessian of
L(C) computed with respect to θ*L. The first two eigenvectors with This work was supported by an award from the ONR Young
largest positive eigenvalues are chosen as v1 and v2. Investigator Program (N00014-20-1-2722), a grant from the NSF
Next, in both methods, we perturb θ* along directions v1 and v2 CSSI program (OAC-2005123), and by the generosity of Eric and
by amplitudes δ1 and δ2, respectively (δ1, δ2 ∈ [ − 2, 2] for method 1, Wendy Schmidt by recommendation of the Schmidt Futures pro
[ − 1, 1] for method 2). Finally, we compute gram. The authors also benefited form discussions at the KITP
Program “Machine Learning and the Physics of Climate” sup
L(δ1 ,δ2 ) = L(C(utarget , [θℓ≠L θ∗L + δ1 v1 + δ2 v2 ])) (9) ported by NSF grant PHY-1748958. Computational resources
were provided by NSF XSEDE (allocation ATM170020) and
to generate a 2D approximation of the loss landscape and plot the
NCAR’s CISL (allocation URIC0004).
surface as a function of δ1 and δ2. Note that the input u is from the
target system. Loss landscapes from the first (second) method are
shown in Fig. 5 (Fig. S4).
In the context of TL, the shape of the loss landscape indicates how receptive the re-trained layers, L, are to change given the new re-training samples from the target system. In practice, a shallow, convex landscape suggests that the network is in a favorable region of parameter space, where gradient descent will converge easily. Deviations from this, in the form of pathological non-convexities or extremely large loss magnitudes, can cause problems during training and prevent the network from converging to a useful optimum. See Li et al. [48] and Krishnapriyan et al. [47] for further discussions on the interpretation of loss landscapes in the common setting where, in Eq. 9, u is from the base system and θ* represents parameters that are still changing during the epochs of training.
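Method 2 replaces the random directions with the two leading Hessian eigenvectors of the loss with respect to θ*_L. The paper follows the PyHessian approach of Yao et al. [68]; the sketch below instead approximates the same quantity with explicit Hessian-vector products and deflated power iteration, again using the placeholder names `model`, `layer`, `u_target`, and `Pi_target`. It illustrates the idea rather than reproducing the authors' implementation.

```python
# Minimal sketch of method 2: estimate the two leading Hessian eigenvectors of
# the loss with respect to the re-trained layer's parameters. The paper uses the
# PyHessian approach of Yao et al. [68]; here the same quantity is approximated
# with explicit Hessian-vector products and deflated power iteration.
import torch

def top_hessian_directions(model, layer, u_target, Pi_target, n_top=2, n_iter=50):
    criterion = torch.nn.MSELoss()
    params = [p for name, p in model.named_parameters() if name.startswith(layer)]
    loss = criterion(model(u_target), Pi_target)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(vecs):
        # Hessian-vector product via double backprop: d/dtheta (grad . v).
        dot = sum((g * v).sum() for g, v in zip(grads, vecs))
        return torch.autograd.grad(dot, params, retain_graph=True)

    eigvals, eigvecs = [], []
    for _ in range(n_top):
        v = [torch.randn_like(p) for p in params]
        for _ in range(n_iter):
            # Project out previously found eigenvectors (deflation).
            for u in eigvecs:
                coeff = sum((a * b).sum() for a, b in zip(v, u))
                v = [a - coeff * b for a, b in zip(v, u)]
            Hv = hvp(v)
            norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
            v = [h / norm for h in Hv]
        Hv = hvp(v)
        # Power iteration finds the largest-|eigenvalue| directions; near a
        # trained optimum these are typically the largest positive ones.
        eigvals.append(sum((a * b).sum() for a, b in zip(v, Hv)).item())
        eigvecs.append([a.detach() for a in v])
    return eigvals, eigvecs   # leading eigenvectors serve as v1, v2 in Eq. 9
```

The returned directions can then be passed to the grid evaluation above in place of the random v1 and v2, with δ1, δ2 ∈ [−1, 1].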
Notes

a. Throughout this paper, we use "out-of-distribution" to indicate cases in which the training and testing datasets have different distributions. Furthermore, we use "out-of-sample" for accuracy computed using samples from a testing set that is completely independent from the training set but has the same distribution.
b. Following the turbulence and climate literature [21, 65, 69], we use the terms "a posteriori" and "online" to refer to experiments/tests involving the data-driven closure coupled to the LES numerical solver; "a priori" and "offline" refer to experiments/tests involving the closure (e.g. the trained CNN) alone.
c. Whether the training FDNS data are from the base or target system or both is clearly explained for each analysis.
d. Spectral analysis has been the cornerstone of understanding turbulence physics since the pioneering work of Kolmogorov [52].
e. The weights' spectra analysis might have to be further modified for networks that involve dimension changes, e.g. via pooling layers. See the Discussions.
f. While M_tr = 2000 might seem like a small number of training samples, we are in fact here using a big training set, because these … further discussions about the big versus small training sets.

Acknowledgments

We thank Fabrizio Falasca and Laure Zanna for insightful discussions. We are grateful to three anonymous reviewers for helpful comments and suggestions.

Supplementary material

Supplementary material is available at PNAS Nexus online.

Funding

This work was supported by an award from the ONR Young Investigator Program (N00014-20-1-2722), a grant from the NSF CSSI program (OAC-2005123), and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. The authors also benefited from discussions at the KITP Program "Machine Learning and the Physics of Climate" supported by NSF grant PHY-1748958. Computational resources were provided by NSF XSEDE (allocation ATM170020) and NCAR's CISL (allocation URIC0004).

Authors Contributions

All authors designed the research and wrote the paper. A.S., A.C., and Y.G. contributed to the design of new analytic tools. A.S. performed the research and analyzed the data.

Data availability

The data used for this work are available at [Link] record/6621142. Codes used for transfer learning, testing, and analysis are available at [Link] TL_for_SGS_Models.

References

1 Beck A, Flad D, Munz CD. 2019. Deep neural networks for data-driven LES closure models. J Comput Phys. 398:108910.
2 Bolton T, Zanna L. 2019. Applications of deep learning to ocean data inference and subgrid parameterization. J Adv Model Earth Syst. 11(1):376–399.
3 Brenowitz ND, Bretherton CS. 2018. Prognostic validation of a neural network unified physics parameterization. Geophys Res Lett. 45(12):6289–6298.
4 Brunton SL, Noack BR, Koumoutsakos P. 2020. Machine learning for fluid mechanics. Annu Rev Fluid Mech. 52:477–508.
5 Ham Y-G, Kim J-H, Luo J-J. 2019. Deep learning for multi-year ENSO forecasts. Nature. 573(7775):568–572.
6 Han J, Jentzen A, Weinan E. 2018. Solving high-dimensional partial differential equations using deep learning. Proc Natl Acad Sci USA. 115(34):8505–8510.
7 Kochkov D, et al. 2021. Machine learning–accelerated computational fluid dynamics. Proc Natl Acad Sci USA. 118(21):e2101784118.
8 Novati G, de Laroussilhe HL, Koumoutsakos P. 2021. Automating turbulence modelling by multi-agent reinforcement learning. Nat Mach Intell. 3(1):87–96.
9 Pathak J, et al. 2022. FourCastNet: a global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, arXiv:2202.11214, preprint: not peer reviewed.
10 Raissi M, Perdikaris P, Karniadakis GE. 2019. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J Comput Phys. 378:686–707.
11 Rasp S, Pritchard MS, Gentine P. 2018. Deep learning to represent subgrid processes in climate models. Proc Natl Acad Sci USA. 115(39):9684–9689.
12 Schneider T, Lan S, Stuart A, Teixeira J. 2017. Earth system modeling 2.0: a blueprint for models that learn from observations and targeted high-resolution simulations. Geophys Res Lett. 44(24):12–396.
13 Weyn JA, Durran DR, Caruana R. 2020. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. J Adv Model Earth Syst. 12(9):e2020MS002109.
14 Yuval J, O'Gorman PA. 2020. Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions. Nat Commun. 11(1):1–10.
15 Nagarajan V, Andreassen A, Neyshabur B. 2020. Understanding the failure modes of out-of-distribution generalization. arXiv, arXiv:2010.15775, preprint: not peer reviewed.
16 Yosinski J, Clune J, Bengio Y, Lipson H. 2014. How transferable are features in deep neural networks? arXiv, arXiv:1411.1792, preprint: not peer reviewed.
17 Beucler T, et al. 2021. Enforcing analytic constraints in neural networks emulating physical systems. Phys Rev Lett. 126(9):098302.
18 Chattopadhyay A, Subel A, Hassanzadeh P. 2020. Data-driven super-parameterization using deep learning: experimentation with multiscale Lorenz 96 systems and transfer learning. J Adv Model Earth Syst. 12(11):e2020MS002084.
19 Chung WT, Mishra AA, Ihme M. 2021. Interpretable data-driven methods for subgrid-scale closure in LES for transcritical LOX/GCH4 combustion. Combust Flame. 239:111758.
20 Frezat H, Balarac G, Le Sommer J, Fablet R, Lguensat R. 2021. Physical invariance in neural networks for subgrid-scale scalar flux modeling. Phys Rev Fluids. 6(2):024607.
21 Guan Y, Chattopadhyay A, Subel A, Hassanzadeh P. 2022. Stable a posteriori LES of 2D turbulence using convolutional neural networks: backscattering analysis and generalization to higher Re via transfer learning. J Comput Phys. 458:111090.
22 Subel A, Chattopadhyay A, Guan Y, Hassanzadeh P. 2021. Data-driven subgrid-scale modeling of forced Burgers turbulence using deep learning with generalization to higher Reynolds numbers via transfer learning. Phys Fluids. 33(3):031702.
23 Taghizadeh S, Witherden FD, Girimaji SS. 2020. Turbulence closure modeling with data-driven techniques: physical compatibility and consistency considerations. New J Phys. 22(9):093023.
24 Tan C, et al. 2018. A survey on deep transfer learning. In: International Conference on Artificial Neural Networks. Springer. p. 270–279.
25 Zhuang F, et al. 2020. A comprehensive survey on transfer learning. Proc IEEE. 109(1):43–76.
26 Goswami S, Kontolati K, Shields MD, Karniadakis GE. 2022. Deep transfer operator learning for partial differential equations under conditional shift. Nat Mach Intell. 4:1155–1164.
27 Guastoni L, et al. 2021. Convolutional-network models to predict wall-bounded turbulence from wall quantities. J Fluid Mech. 928:A27.
28 Inubushi M, Goto S. 2020. Transfer learning for nonlinear dynamics and its application to fluid turbulence. Phys Rev E. 102(4):043301.
29 Yousif MZ, Yu L, Lim H-C. 2021. High-fidelity reconstruction of turbulent flow from spatially limited data using enhanced super-resolution generative adversarial network. Phys Fluids. 33(12):125119.
30 Chattopadhyay A, Pathak J, Nabizadeh E, Bhimji W, Hassanzadeh P. 2022. Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence. Environ Data Sci. 2:E1.
31 Mondal S, Chattopadhyay A, Mukhopadhyay A, Ray A. 2021. Transfer learning of deep neural networks for predicting thermoacoustic instabilities in combustion systems. Energy and AI. 5:100085.
32 Rasp S, Thuerey N. 2021. Data-driven medium-range weather prediction with a ResNet pretrained on climate simulations: a new model for WeatherBench. J Adv Model Earth Syst. 13(2):e2020MS002405.
33 Hu J, et al. 2021. Deep residual convolutional neural network combining dropout and transfer learning for ENSO forecasting. Geophys Res Lett. 48(24):e2021GL093531.
34 Chakraborty S. 2021. Transfer learning based multi-fidelity physics informed deep neural network. J Comput Phys. 426:109942.
35 Karniadakis GE, et al. 2021. Physics-informed machine learning. Nat Rev Phys. 3(6):422–440.
36 Hussain M, Bird JJ, Faria DR. 2018. A study on CNN transfer learning for image classification. In: UK Workshop on Computational Intelligence. Springer. p. 191–202.
37 Talo M, Baran Baloglu U, Yıldırım Ö, Acharya UR. 2019. Application of deep transfer learning for automated brain abnormality classification using MR images. Cogn Syst Res. 54:176–188.
38 Zeiler MD, Fergus R. 2014. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer. p. 818–833.
39 Guan Y, Subel A, Chattopadhyay A, Hassanzadeh P. 2023. Learning physics-constrained subgrid-scale closures in the small-data regime for stable and accurate LES. Physica D. 443:133568.
40 Maulik R, San O, Rasheed A, Vedula P. 2019. Subgrid modelling for two-dimensional turbulence using neural networks. J Fluid Mech. 858:122–144.
41 Page J, Brenner MP, Kerswell RR. 2021. Revealing the state space of turbulence using machine learning. Phys Rev Fluids. 6(3):034402.
42 Pawar S, San O, Rasheed A, Vedula P. 2023. Frame invariant neural network closures for Kraichnan turbulence. Physica A Stat Mech Appl. 609:128327.
43 Neyshabur B, Sedghi H, Zhang C. 2021. What is being transferred in transfer learning? arXiv, arXiv:2008.11687, preprint: not peer reviewed.
44 Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. Cambridge (MA): MIT Press.
45 Olshausen BA, Field DJ. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 381(6583):607–609.
46 Chizat L, Oyallon E, Bach F. 2019. On lazy training in differentiable programming. Adv Neural Inf Process Syst. 32:2937–2947.
47 Krishnapriyan A, Gholami A, Zhe S, Kirby R, Mahoney MW. 2021. Characterizing possible failure modes in physics-informed neural networks. Adv Neural Inf Process Syst. 34.
48 Li H, Xu Z, Taylor G, Studer C, Goldstein T. 2018. Visualizing the loss landscape of neural nets. arXiv, arXiv:1712.09913, preprint: not peer reviewed.
49 Mojgani R, Balajewicz M, Hassanzadeh P. 2023. Kolmogorov n-width and Lagrangian physics-informed neural networks: a causality-conforming manifold for convection-dominated PDEs. Comput Methods Appl Mech Eng. 404:115810.
50 Rahaman N, et al. 2019. On the spectral bias of neural networks. In: International Conference on Machine Learning. PMLR. p. 5301–5310.
51 Xu ZQJ, Zhang Y, Luo T. 2022. Overview frequency principle/spectral bias in deep learning. arXiv, arXiv:2201.07395, preprint: not peer reviewed.
52 Kolmogorov A. 1941. The local structure of turbulence in incompressible viscous fluid for very large Reynolds numbers. Cr Acad Sci URSS. 30:301–305.
53 Pope SB. 2001. Turbulent flows. Cambridge: Cambridge University Press.
54 Bruna J, Zaremba W, Szlam A, LeCun Y. 2013. Spectral networks and locally connected networks on graphs. arXiv, arXiv:1312.6203, preprint: not peer reviewed.
55 Ha W, Singh C, Lanusse F, Upadhyayula S, Yu B. 2021. Adaptive wavelet distillation from neural networks through interpretations. Adv Neural Inf Process Syst. 34.
56 Xu ZQJ, Zhang Y, Luo T, Xiao Y, Ma Z. 2019. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv, arXiv:1901.06523, preprint: not peer reviewed.
57 Lampinen AK, Ganguli S. 2018. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv, arXiv:1809.10374, preprint: not peer reviewed.
58 Kalan MM, Fabian Z, Avestimehr S, Soltanolkotabi M. 2020. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Adv Neural Inf Process Syst. 33:1959–1969.
59 Wu X, Manton JH, Aickelin U, Zhu J. 2022. An information-theoretic analysis for transfer learning: error bounds and applications. arXiv, arXiv:2207.05377, preprint: not peer reviewed.
60 Beucler T, et al. 2021. Climate-invariant machine learning. arXiv, arXiv:2112.08440, preprint: not peer reviewed.
61 Kashinath K, et al. 2021. Physics-informed machine learning: case studies for weather and climate modelling. Philos Trans R Soc A. 379(2194):20200093.
62 Erichson NB, et al. 2022. NoisyMix: boosting robustness by combining data augmentations, stability training, and noise injections. arXiv, arXiv:2202.01263, preprint: not peer reviewed.
63 Salman H, Ilyas A, Engstrom L, Kapoor A, Madry A. 2020. Do adversarially robust ImageNet models transfer better? Adv Neural Inf Process Syst. 33:3533–3545.
64 Utrera F, Kravitz E, Erichson NB, Khanna R, Mahoney MW. 2020. Adversarially-trained deep nets transfer better: illustration on image classification. arXiv, arXiv:2007.05869, preprint: not peer reviewed.
65 Sagaut P. 2006. Large eddy simulation for incompressible flows: an introduction. New York: Springer Science & Business Media.
66 Zanna L, Bolton T. 2020. Data-driven equation discovery of ocean mesoscale closures. Geophys Res Lett. 47(17):e2020GL088376.
67 Mathieu M, Henaff M, LeCun Y. 2013. Fast training of convolutional networks through FFTs. arXiv, arXiv:1312.5851, preprint: not peer reviewed.
68 Yao Z, Gholami A, Keutzer K, Mahoney MW. 2020. PyHessian: neural networks through the lens of the Hessian. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE. p. 581–590.
69 Frezat H, Le Sommer J, Fablet R, Balarac G, Lguensat R. 2022. A posteriori learning for quasi-geostrophic turbulence parametrization. J Adv Model Earth Syst. 14:e2022MS003124.