
Bug fix to use correct rho value in cal_titau subroutines #1344

Closed
rsarthur wants to merge 2 commits into wrf-model:release-v4.2.2 from rsarthur:m_opt_rho_fix

Conversation

@rsarthur (Contributor) commented Dec 23, 2020

TYPE: bug fix

KEYWORDS: rho, sfs_opt, m_opt, cal_titau

SOURCE: Robert Arthur and Jeff Mirocha (LLNL)

DESCRIPTION OF CHANGES:
Problem:
In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong.

Solution:
A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the cell-centered rho variable.
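
For background, the interpolation involved can be sketched in a few lines. This is an illustrative numpy sketch, not the WRF Fortran: the tau12/tau21 stress components live at horizontal cell corners of the staggered grid, so a cell-centered density must be averaged from the four surrounding mass points. The function name and the exact four-point stencil are assumptions for illustration.

```python
import numpy as np

def rho_at_corner(rho):
    """Average cell-centered density to horizontal cell corners.

    rho: 2D array of cell-centered values, shape (nx, ny).
    Returns an (nx-1, ny-1) array at the interior corner points,
    where each corner value is the mean of its four neighbors.
    This mirrors the idea behind the rhoavg variable in the fix,
    but the stencil here is an illustrative assumption, not the
    actual WRF code.
    """
    return 0.25 * (rho[:-1, :-1] + rho[1:, :-1] +
                   rho[:-1, 1:] + rho[1:, 1:])

# Using the cell-centered value directly (the bug) vs. the corner
# average (the fix) differs whenever rho varies in space:
rho = np.array([[1.0, 2.0],
                [3.0, 4.0]])
print(rho_at_corner(rho))  # [[2.5]]
```

A uniform-density field would mask the bug entirely, which is one reason such errors survive idealized testing.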

ISSUE: This is a follow-on from issue #1214 by @matzegoebel, which addressed the surface stress output from subroutine vertical_diffusion_2 when sfs_opt .gt. 0 or m_opt .eq. 1, but did not address the cal_titau subroutines.

LIST OF MODIFIED FILES: dyn_em/module_diffusion_em.F

TESTS CONDUCTED:
For this relatively simple fix, no tests were conducted beyond confirming that the code compiles.

RELEASE NOTE: Bug fix to use correct rho value in cal_titau subroutines

@davegill (Contributor)

@rsarthur @dudhia @weiwangncar
This looks to be causing bit-wise differences between serial and MPI jobs:

Which comparisons are not bit-for-bit: 
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1
    output_7:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss8_06 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss8_06 status = 1

Just looking at the code quickly, I do not see the troubles, but certainly anytime horizontal differences are involved, communicated boundaries are likely a problem.

To debug something like this, I would run a single time step with an MPI code that is built without optimization. Compare the results of 1 processor with 4 processors.
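
The single-time-step comparison described above can be sketched generically. This Python sketch uses mock arrays rather than WRF's diffwrf tool or real wrfout files; the dict-of-fields layout and variable names are assumptions for illustration:

```python
import numpy as np

def bit_for_bit(fields_a, fields_b):
    """Report which fields differ bitwise between two runs.

    fields_a / fields_b: dicts mapping field name -> ndarray,
    e.g. as loaded from a 1-processor and a 4-processor output file.
    Bitwise identity is checked with array_equal, not a tolerance:
    for this kind of regression test, any difference is a failure.
    """
    diffs = []
    for name in sorted(fields_a):
        if not np.array_equal(fields_a[name], fields_b[name]):
            diffs.append(name)
    return diffs

# Two mock "runs": U matches exactly, W differs by one ulp.
run1 = {"U": np.ones((3, 3)), "W": np.full((3, 3), 0.1)}
run4 = {"U": np.ones((3, 3)), "W": np.full((3, 3), np.nextafter(0.1, 1.0))}
print(bit_for_bit(run1, run4))  # ['W']
```

A one-ulp discrepancy like the mock W field is exactly the scale of difference an unoptimized serial-vs-MPI comparison is designed to expose.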

@davegill davegill changed the base branch from master to release-v4.2.2 December 23, 2020 20:28
@rsarthur (Contributor, Author)

@davegill Thanks for letting me know, I will take a look. Do you know what the sfs_opt and m_opt values were for the test that showed the issue?

@davegill (Contributor)

@rsarthur
The sfs_opt and km_opt settings were different for these two runs. Nothing is ever easy on a supposedly easy PR, especially right before Christmas.

Take a look at
https://github.com/davegill/SCRIPTS

You want the Namelists/weekly/em_quarter_ss and em_quarter_ss8 directories.

Inside, you will find the files namelist.input.04 (em_quarter_ss, 32-bit precision) and namelist.input_06 (em_quarter_ss8, double precision).

Make sure you turn OFF optimization for these bit-for-bit tests (clean -a, configure -d, compile em_quarter_ss). For a real*8 build, use configure -d -r8.

Let us know what you find.

@dudhia (Collaborator) commented Dec 23, 2020 via email

@davegill (Contributor)

@rsarthur @dudhia
I also cannot find why there are bit-wise diffs with these mods.

@rsarthur (Contributor, Author)

@davegill @dudhia Thanks for taking a look and for the suggestions. I may wait to run any tests until after the holidays. Will let you know when I do.

@rsarthur (Contributor, Author)

@davegill I was able to run the tests you suggested. I compared em_quarter_ss with 1 and 4 processors, and the same for em_quarter_ss8. I am not seeing any differences in the output for either case. I used diffwrf compiled with the respective wrf executable for the ss and ss8 cases.

It is possible that HALO_EM_PHYS_DIFFUSION should include rho, but that should have caused problems prior to this specific code change, since (as @dudhia mentioned) rho was already being used in the same way.

@davegill (Contributor)

@rsarthur
Yes, let's pick this back up after the holidays. A sanity check is to make a trivial change (such as a space in one of your mods), to trigger another test.

@davegill (Contributor)

@rsarthur @dudhia
Folks,
I verified that those b4b errors are indeed reproducible:

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 5e397befce349b39968da8fb8bc8d69e972f8df0, requested by: rsarthur for PR: https://github.com/wrf-model/WRF/pull/1344. For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    =======================|==========|==========|=======
    Number of Tests        |    19    |    18    |
    Number of Builds       |    48    |    46    |
    Number of Simulations  |   166    |   164    |    0
    Number of Comparisons  |   105    |   104    |    2

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1
    output_7:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss8_06 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss8_06 status = 1

@rsarthur (Contributor, Author) commented Jan 4, 2021

Hi @davegill and @dudhia,
I am back online but am not quite sure how to proceed, especially since I was unable to reproduce the difference on my end. Please let me know if you have any other suggestions.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
I reproduced these diffs from inside a Docker container.

  1. For some docker assistance, take a look at https://github.com/davegill/wrf-coop/blob/master/README_user.md
  2. Build the em_qss code with MPI (./compile em_quarter_ss), using namelist.input from /wrf/Namelists/weekly/em_quarter_ss/namelist.input.04
  3. Compare 1 processor results with 4 processor results, using a tool such as WRF/external/io_netcdf/diffwrf.
  4. Once you see which fields are different, then use ncview to try to identify a pattern: boundary, initialization, communication, etc.
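
Step 4 above, spotting a boundary or corner pattern, can also be mechanized. A minimal sketch in plain numpy (not a WRF tool; the halo width of 1 is an illustrative assumption) that classifies where two 2D fields disagree:

```python
import numpy as np

def diff_locations(a, b):
    """Return the (i, j) indices where two 2D fields differ bitwise."""
    return np.argwhere(a != b)

def looks_like_corners(idx, shape, halo=1):
    """Heuristic: do all differing points sit within `halo` points of
    both a horizontal and a vertical domain edge, i.e. in the corners?
    """
    ni, nj = shape
    near_i = (idx[:, 0] < halo) | (idx[:, 0] >= ni - halo)
    near_j = (idx[:, 1] < halo) | (idx[:, 1] >= nj - halo)
    return bool(np.all(near_i & near_j))

# Mock fields that differ only in one corner point:
a = np.zeros((8, 8))
b = a.copy()
b[0, 7] += 1e-12
idx = diff_locations(a, b)
print(idx.tolist(), looks_like_corners(idx, a.shape))  # [[0, 7]] True
```

A corner-only pattern like this points at diagonal-neighbor (corner halo) communication rather than at the interior stencil.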

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill
I believe I did this based on your suggestions above, for both em_quarter_ss and em_quarter_ss8. diffwrf showed no differences between the 1 and 4 proc runs.

I am not sure why we are getting different results. Maybe it is because I only ran for 1 timestep. Would you recommend that I try again for more than 1 timestep? Or, could you share the diffwrf output for the runs that are giving you issues? Thanks.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
Well then.
Let me try again. I hate these articles that get submitted to the Journal of Irreproducible Results.

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill
OK, thank you. Note my edit to the previous comment, though: I only ran for 1 timestep, so that could be the issue. Let me know if that changes things.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
I have a 1 time step namelist that shows diffs.

The diffs are in your mods (or are exposed by your mods, honestly). If I use the previous repo code that you branched from, I get bitwise identical results between differing number of MPI processes.

git checkout f311cd5e136631ebf3e

If I use your code, I get bit-wise diffs in the 4 corners, when comparing the differing numbers of MPI processes.

[wrfuser@07e8670e1170 em_quarter_ss]$ git branch
* m_opt_rho_fix
  master

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 8 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI8
 starting wrf task            4  of            8
 starting wrf task            2  of            8
 starting wrf task            6  of            8
 starting wrf task            0  of            8
 starting wrf task            1  of            8
 starting wrf task            3  of            8
 starting wrf task            7  of            8
 starting wrf task            5  of            8

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 3 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI3
 starting wrf task            1  of            3
 starting wrf task            0  of            3
 starting wrf task            2  of            3

[wrfuser@07e8670e1170 em_quarter_ss]$ ../../external/io_netcdf/diffwrf wrfout_d01_0001-01-01_00:00:00_MPI*
 Just plot  F
Diffing wrfout_d01_0001-01-01_00:00:00_MPI3 wrfout_d01_0001-01-01_00:00:00_MPI8
 Next Time 0001-01-01_00:00:00
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
 Next Time 0001-01-01_00:00:12
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
         U      1547    3   0.2350595367E+02   0.2350595368E+02   9   0.2985E-04   0.8158E-04
         V      2046    3   0.3400045178E+01   0.3400045176E+01   9   0.2973E-04   0.6012E-03
         W      3332    3   0.3461880543E-01   0.3461873113E-01   5   0.5524E-04   0.1863E-02
        PH      2129    3   0.5000753094E+03   0.5000753217E+03   7   0.7080E-03   0.3194E-04
         T       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
       THM       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
        MU        56    2   0.4388289142E+03   0.4388289142E+03  10   0.2492E-02   0.8765E-04
         P      1891    3   0.1184337757E+03   0.1184337754E+03   8   0.1062E-02   0.1777E-03
    QVAPOR       520    3   0.4564871386E-02   0.4564871385E-02  10   0.1237E-08   0.6452E-05

I generated the file to plot using only the fields that show a difference:

ncdiff --variable U,V,W,PH,T,THM,MU,P,QVAPOR  wrfout_d01_0001-01-01_00:00:00_MPI[38] diffmpi38.nc

The following diffs are after a single time step (the initial time is identical, as is usually the case), comparing MPI tasks 3 vs 8. The other fields look similarly structured, though with ncview I had to rescale to get the values viewable for the specific level. The PH diff shows up at the first level above the surface (ph=0 is a definition at the model surface).

[ncview screenshots of the MU, U, and QVAPOR difference fields, taken Jan 7, 2021]

@davegill (Contributor) commented Jan 8, 2021

Here's the namelist for the QSS case (obviously without the .txt extension).
namelist.input.txt

The configure option "-d" is usually required to get identical results.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
In your local sandbox, see if only one of the three components is the problem: try the 12_21, 13_31, and 23_32 changes separately, keeping the original code for the other two. The issue is in the corners, and this is a single time step, so a print statement in one of the corners could be useful.

We were "lucky" (for some definition of lucky) to find diffs by changing MPI ranks, so only a single MPI build is required to track this down.

Keep me in the loop with questions you have.

@matzegoebel (Contributor)

Thanks for publishing this PR. I also had it on my list but hadn't gotten around to it. It's really strange that the check fails since, as you say, you only use the same values of rho that were already used before. These values are made available by the inclusion of HALO_EM_TKE_E and PERIOD_BDY_EM_PHY_BC and by the routine phy_bc in first_rk_step_part2. So I don't know what is causing the problem here.

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill I'm on it!
@matzegoebel You're welcome. Hopefully we can get this figured out soon.

@rsarthur (Contributor, Author) commented Jan 9, 2021

@davegill
Using your suggestion, I was able to isolate the issue to the cal_titau_12_21 subroutine; 13_31 and 23_32 appear to be fine. This now makes sense to me, because 12_21 needs corner ghost points for rho. Maybe this makes the fix obvious to you? I am not quite familiar enough with WRF's MPI exchange code to know immediately what the fix is, but hopefully we are close. I'm signing off for the weekend but will check back in next week.
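
The corner-ghost-point issue identified here can be illustrated abstractly. This sketch (plain numpy, with an assumed halo width of 1 and the same assumed 4-point corner stencil as above) shows that averaging to a cell corner touches the diagonal neighbor's value, which a halo exchange that fills only edge ghost cells, not corner ones, would leave stale:

```python
import numpy as np

# Global field on a 4x4 grid, conceptually split into four 2x2 tiles.
glob = np.arange(16, dtype=float).reshape(4, 4)

# Local view for one tile, padded with a 1-point halo. Edge halos are
# filled from the two adjacent tiles; the diagonal (corner) ghost cell
# is left as NaN, mimicking a halo exchange without corner points.
tile = np.full((3, 3), np.nan)
tile[:2, :2] = glob[:2, :2]   # points owned by this tile
tile[2, :2] = glob[2, :2]     # edge halo from one neighbor
tile[:2, 2] = glob[:2, 2]     # edge halo from the other neighbor
# tile[2, 2] (the corner ghost point) is never filled.

# 4-point average to the corner shared by all four tiles:
corner_avg = 0.25 * (tile[1, 1] + tile[2, 1] + tile[1, 2] + tile[2, 2])
print(np.isnan(corner_avg))  # True: the stencil touched the stale corner
```

This is consistent with diffs appearing only at the four domain corners: everywhere else, the 4-point stencil is satisfied by edge halos alone.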

@davegill (Contributor) commented Jan 9, 2021

@rsarthur
Great! I'll take a look.

@matzegoebel (Contributor)

I can imagine that the failing test has to do with PR #1314, where I added the corner points that were previously missing in set_physical_bc3d. But anyway, it's strange that the test did not fail previously.

@rsarthur (Contributor, Author) commented Feb 4, 2021

@matzegoebel that makes sense. I see that PR #1314 was merged into the 4.2.2 release. @davegill, is it possible to retest this PR in the latest version of the code to see if the issue is fixed? Do you need anything from me to do that?

@davegill (Contributor) commented Feb 5, 2021

@rsarthur
You can grab the latest develop branch, add these mods to a new branch, and push a new PR.

@rsarthur (Contributor, Author) commented Feb 9, 2021

This PR has been moved to #1396 to take advantage of an update to corner point boundary conditions (#1314) included in the v4.2.2 release.

@rsarthur rsarthur deleted the m_opt_rho_fix branch February 12, 2021 20:01