
Bug fix to use correct rho value in cal_titau subroutines #1344

Closed
rsarthur wants to merge 2 commits into wrf-model:release-v4.2.2 from rsarthur:m_opt_rho_fix

Conversation

@rsarthur (Contributor) commented Dec 23, 2020

TYPE: bug fix

KEYWORDS: rho, sfs_opt, m_opt, cal_titau

SOURCE: Robert Arthur and Jeff Mirocha (LLNL)

DESCRIPTION OF CHANGES:
Problem:
In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong.

Solution:
A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the cell-centered rho variable.
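
For background, the interpolation involved can be sketched in a few lines. This is an illustrative numpy sketch, not the WRF Fortran: the tau12/tau21 stress components live at horizontal cell corners of the staggered grid, so a cell-centered density must be averaged from the four surrounding mass points. The function name and the exact four-point stencil are assumptions for illustration.

```python
import numpy as np

def rho_at_corner(rho):
    """Average cell-centered density to horizontal cell corners.

    rho: 2D array of cell-centered values, shape (nx, ny).
    Returns an (nx-1, ny-1) array at the interior corner points,
    where each corner value is the mean of its four neighbors.
    This mirrors the idea behind the rhoavg variable in the fix,
    but the stencil here is an illustrative assumption, not the
    actual WRF code.
    """
    return 0.25 * (rho[:-1, :-1] + rho[1:, :-1] +
                   rho[:-1, 1:] + rho[1:, 1:])

# Using the cell-centered value directly (the bug) vs. the corner
# average (the fix) differs whenever rho varies in space:
rho = np.array([[1.0, 2.0],
                [3.0, 4.0]])
print(rho_at_corner(rho))  # [[2.5]]
```

A uniform-density field would mask the bug entirely, which is one reason such errors survive idealized testing.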

ISSUE: This is a follow-on from issue #1214 by @matzegoebel, which addressed the surface stress output from subroutine vertical_diffusion_2 when sfs_opt .gt. 0 or m_opt .eq. 1, but did not address the cal_titau subroutines.

LIST OF MODIFIED FILES: dyn_em/module_diffusion_em.F

TESTS CONDUCTED:
For this relatively simple fix, no tests were conducted beyond confirming that the code compiles.

RELEASE NOTE: Bug fix to use correct rho value in cal_titau subroutines

@davegill (Contributor)

@rsarthur @dudhia @weiwangncar
This looks to be causing bit-wise differences between serial and MPI jobs:

Which comparisons are not bit-for-bit: 
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1
    output_7:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss8_06 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss8_06 status = 1

Just looking at the code quickly, I do not see the troubles, but certainly anytime horizontal differences are involved, communicated boundaries are likely a problem.

To debug something like this, I would run a single time step with an MPI code that is built without optimization. Compare the results of 1 processor with 4 processors.
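
The single-time-step comparison described above can be sketched generically. This Python sketch uses mock arrays rather than WRF's diffwrf tool or real wrfout files; the dict-of-fields layout and variable names are assumptions for illustration:

```python
import numpy as np

def bit_for_bit(fields_a, fields_b):
    """Report which fields differ bitwise between two runs.

    fields_a / fields_b: dicts mapping field name -> ndarray,
    e.g. as loaded from a 1-processor and a 4-processor output file.
    Bitwise identity is checked with array_equal, not a tolerance:
    for this kind of regression test, any difference is a failure.
    """
    diffs = []
    for name in sorted(fields_a):
        if not np.array_equal(fields_a[name], fields_b[name]):
            diffs.append(name)
    return diffs

# Two mock "runs": U matches exactly, W differs by one ulp.
run1 = {"U": np.ones((3, 3)), "W": np.full((3, 3), 0.1)}
run4 = {"U": np.ones((3, 3)), "W": np.full((3, 3), np.nextafter(0.1, 1.0))}
print(bit_for_bit(run1, run4))  # ['W']
```

A one-ulp discrepancy like the mock W field is exactly the scale of difference an unoptimized serial-vs-MPI comparison is designed to expose.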

@davegill davegill changed the base branch from master to release-v4.2.2 December 23, 2020 20:28
@rsarthur (Contributor, Author)

@davegill Thanks for letting me know, I will take a look. Do you know what the sfs_opt and m_opt values were for the test that showed the issue?

@davegill (Contributor)

@rsarthur
The sfs_opt and km_opt settings were different for these two runs. Nothing is ever easy on a supposedly easy PR, especially right before Christmas.

Take a look at
https://github.com/davegill/SCRIPTS

You want the Namelists/weekly/em_quarter_ss and em_quarter_ss8 directories.

Inside, you will find the files namelist.input.04 (em_quarter_ss, 32-bit precision) and namelist.input_06 (em_quarter_ss8, double precision).

Make sure you turn OFF optimization for these bit-for-bit tests (clean -a, configure -d, compile em_quarter_ss). For a real*8 build, use configure -d -r8.

Let us know what you find.

@dudhia (Collaborator) commented Dec 23, 2020 via email

@davegill (Contributor)

@rsarthur @dudhia
I also cannot find why there are bit-wise diffs with these mods.

@rsarthur (Contributor, Author)

@davegill @dudhia Thanks for taking a look and for the suggestions. I may wait to run any tests until after the holidays. Will let you know when I do.

@rsarthur (Contributor, Author)

@davegill I was able to run the tests you suggested. I compared em_quarter_ss with 1 and 4 processors, and the same for em_quarter_ss8. I am not seeing any differences in the output for either case. I used diffwrf compiled with the respective wrf executable for the ss and ss8 cases.

It is possible that HALO_EM_PHYS_DIFFUSION should include rho, but that should have caused problems prior to this specific code change, since (as @dudhia mentioned) rho was already being used in the same way.

@davegill (Contributor)

@rsarthur
Yes, let's pick this back up after the holidays. A sanity check is to make a trivial change (such as a space in one of your mods), to trigger another test.

@davegill (Contributor)

@rsarthur @dudhia
Folks,
I verified that those b4b errors are indeed reproducible:

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 5e397befce349b39968da8fb8bc8d69e972f8df0, requested by: rsarthur for PR: https://github.com/wrf-model/WRF/pull/1344. For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    =======================|==========|==========|=======
    Number of Tests        |    19    |    18    |
    Number of Builds       |    48    |    46    |
    Number of Simulations  |   166    |   164    |    0
    Number of Comparisons  |   105    |   104    |    2

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1
    output_7:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss8_06 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss8_06 status = 1

@rsarthur (Contributor, Author) commented Jan 4, 2021

Hi @davegill and @dudhia,
I am back online but am not quite sure how to proceed, especially since I was unable to reproduce the difference on my end. Please let me know if you have any other suggestions.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
I reproduced these diffs from inside a Docker container.

  1. For some docker assistance, take a look at https://github.com/davegill/wrf-coop/blob/master/README_user.md
  2. Build the em_qss code with MPI (./compile em_quarter_ss), using namelist.input from /wrf/Namelists/weekly/em_quarter_ss/namelist.input.04
  3. Compare 1 processor results with 4 processor results, using a tool such as WRF/external/io_netcdf/diffwrf.
  4. Once you see which fields are different, then use ncview to try to identify a pattern: boundary, initialization, communication, etc.
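
Step 4 above, spotting a boundary or corner pattern, can also be mechanized. A minimal sketch in plain numpy (not a WRF tool; the halo width of 1 is an illustrative assumption) that classifies where two 2D fields disagree:

```python
import numpy as np

def diff_locations(a, b):
    """Return the (i, j) indices where two 2D fields differ bitwise."""
    return np.argwhere(a != b)

def looks_like_corners(idx, shape, halo=1):
    """Heuristic: do all differing points sit within `halo` points of
    both a horizontal and a vertical domain edge, i.e. in the corners?
    """
    ni, nj = shape
    near_i = (idx[:, 0] < halo) | (idx[:, 0] >= ni - halo)
    near_j = (idx[:, 1] < halo) | (idx[:, 1] >= nj - halo)
    return bool(np.all(near_i & near_j))

# Mock fields that differ only in one corner point:
a = np.zeros((8, 8))
b = a.copy()
b[0, 7] += 1e-12
idx = diff_locations(a, b)
print(idx.tolist(), looks_like_corners(idx, a.shape))  # [[0, 7]] True
```

A corner-only pattern like this points at diagonal-neighbor (corner halo) communication rather than at the interior stencil.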

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill
I believe I did this based on your suggestions above, for both em_quarter_ss and em_quarter_ss8. diffwrf showed no differences between the 1 and 4 proc runs.

I am not sure why we are getting different results. Maybe it is because I only ran for 1 timestep. Would you recommend that I try again for more than 1 timestep? Or, could you share the diffwrf output for the runs that are giving you issues? Thanks.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
Well then.
Let me try again. I hate these articles that get submitted to the Journal of Irreproducible Results.

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill
OK, thank you. Note my edit to the previous comment, though: I only ran for 1 timestep, so that could be the issue. Let me know if that changes things.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
I have a 1 time step namelist that shows diffs.

The diffs are in your mods (or are exposed by your mods, honestly). If I use the previous repo code that you branched from, I get bitwise identical results between differing number of MPI processes.

git checkout f311cd5e136631ebf3e

If I use your code, I get bit-wise diffs in the 4 corners, when comparing the differing numbers of MPI processes.

[wrfuser@07e8670e1170 em_quarter_ss]$ git branch
* m_opt_rho_fix
  master

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 8 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI8
 starting wrf task            4  of            8
 starting wrf task            2  of            8
 starting wrf task            6  of            8
 starting wrf task            0  of            8
 starting wrf task            1  of            8
 starting wrf task            3  of            8
 starting wrf task            7  of            8
 starting wrf task            5  of            8

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 3 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI3
 starting wrf task            1  of            3
 starting wrf task            0  of            3
 starting wrf task            2  of            3

[wrfuser@07e8670e1170 em_quarter_ss]$ ../../external/io_netcdf/diffwrf wrfout_d01_0001-01-01_00:00:00_MPI*
 Just plot  F
Diffing wrfout_d01_0001-01-01_00:00:00_MPI3 wrfout_d01_0001-01-01_00:00:00_MPI8
 Next Time 0001-01-01_00:00:00
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
 Next Time 0001-01-01_00:00:12
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
         U      1547    3   0.2350595367E+02   0.2350595368E+02   9   0.2985E-04   0.8158E-04
         V      2046    3   0.3400045178E+01   0.3400045176E+01   9   0.2973E-04   0.6012E-03
         W      3332    3   0.3461880543E-01   0.3461873113E-01   5   0.5524E-04   0.1863E-02
        PH      2129    3   0.5000753094E+03   0.5000753217E+03   7   0.7080E-03   0.3194E-04
         T       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
       THM       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
        MU        56    2   0.4388289142E+03   0.4388289142E+03  10   0.2492E-02   0.8765E-04
         P      1891    3   0.1184337757E+03   0.1184337754E+03   8   0.1062E-02   0.1777E-03
    QVAPOR       520    3   0.4564871386E-02   0.4564871385E-02  10   0.1237E-08   0.6452E-05

I generated the file to plot using only the fields that show a difference:

ncdiff --variable U,V,W,PH,T,THM,MU,P,QVAPOR  wrfout_d01_0001-01-01_00:00:00_MPI[38] diffmpi38.nc

The following diffs are after a single time step (the initial time is identical, as is usually the case), comparing MPI tasks 3 vs 8. The other fields look similarly structured, though with ncview I had to rescale to get the values viewable for the specific level. The PH diff shows up at the first level above the surface (ph=0 is a definition at the model surface).

[ncview screenshots of the MU, U, and QVAPOR difference fields, taken Jan 7, 2021]

@davegill (Contributor) commented Jan 8, 2021

Here's the namelist for the QSS case (obviously without the .txt extension).
namelist.input.txt

The configure option "-d" is usually required to get identical results.

@davegill (Contributor) commented Jan 8, 2021

@rsarthur
In your local sandbox, see if only one of the three components is the problem: try the 12_21, 13_31, and 23_32 changes separately, keeping the original code for the other two. The issue is in the corners, and this is a single time step, so a print statement in one of the corners could be useful.

We were "lucky" (for some definition of lucky) to find diffs by changing MPI ranks, so only a single MPI build is required to track this down.

Keep me in the loop with questions you have.

@matzegoebel (Contributor)

Thanks for publishing this PR. I also had it on my list but hadn't gotten around to it. It's really strange that the check fails since, as you say, you only use the same values of rho that were already used before. These values are made available by the inclusion of HALO_EM_TKE_E and PERIOD_BDY_EM_PHY_BC and by the routine phy_bc in first_rk_step_part2. So I don't know what is causing the problem here.

@rsarthur (Contributor, Author) commented Jan 8, 2021

@davegill I'm on it!
@matzegoebel You're welcome. Hopefully we can get this figured out soon.

@rsarthur (Contributor, Author) commented Jan 9, 2021

@davegill
Using your suggestion, I was able to isolate the issue to the cal_titau_12_21 subroutine; 13_31 and 23_32 appear to be fine. This now makes sense to me, because 12_21 needs corner ghost points for rho. Maybe this makes the fix obvious to you? I am not quite familiar enough with WRF's MPI exchange code to know immediately what the fix is, but hopefully we are close. I'm signing off for the weekend but will check back in next week.
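
The corner-ghost-point issue identified here can be illustrated abstractly. This sketch (plain numpy, with an assumed halo width of 1 and the same assumed 4-point corner stencil as above) shows that averaging to a cell corner touches the diagonal neighbor's value, which a halo exchange that fills only edge ghost cells, not corner ones, would leave stale:

```python
import numpy as np

# Global field on a 4x4 grid, conceptually split into four 2x2 tiles.
glob = np.arange(16, dtype=float).reshape(4, 4)

# Local view for one tile, padded with a 1-point halo. Edge halos are
# filled from the two adjacent tiles; the diagonal (corner) ghost cell
# is left as NaN, mimicking a halo exchange without corner points.
tile = np.full((3, 3), np.nan)
tile[:2, :2] = glob[:2, :2]   # points owned by this tile
tile[2, :2] = glob[2, :2]     # edge halo from one neighbor
tile[:2, 2] = glob[:2, 2]     # edge halo from the other neighbor
# tile[2, 2] (the corner ghost point) is never filled.

# 4-point average to the corner shared by all four tiles:
corner_avg = 0.25 * (tile[1, 1] + tile[2, 1] + tile[1, 2] + tile[2, 2])
print(np.isnan(corner_avg))  # True: the stencil touched the stale corner
```

This is consistent with diffs appearing only at the four domain corners: everywhere else, the 4-point stencil is satisfied by edge halos alone.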

@davegill (Contributor) commented Jan 9, 2021

@rsarthur
Great! I'll take a look.

@matzegoebel (Contributor)

I can imagine that the failing test has to do with PR #1314, where I added the corner points that were previously missing in set_physical_bc3d. But anyway, it's strange that the test did not fail previously.

@rsarthur (Contributor, Author) commented Feb 4, 2021

@matzegoebel that makes sense. I see that PR #1314 was merged into the 4.2.2 release. @davegill, is it possible to retest this PR in the latest version of the code to see if the issue is fixed? Do you need anything from me to do that?

@davegill (Contributor) commented Feb 5, 2021

@rsarthur
You can grab the latest develop branch, add these mods to a new branch, and push a new PR.

@rsarthur (Contributor, Author) commented Feb 9, 2021

This PR has been moved to #1396 to take advantage of an update to corner point boundary conditions (#1314) included in the v4.2.2 release.

@rsarthur rsarthur deleted the m_opt_rho_fix branch February 12, 2021 20:01