[rrfs-mpas-jedi] err_exit did not kill a job immediately #953

@guoqing-noaa

In exrrfs_jedivar.sh, we have the following script:

  source prep_step
  ${cpreq} "${EXECrrfs}"/mpasjedi_variational.x .
  ${MPI_RUN_CMD} ./mpasjedi_variational.x jedivar.yaml log.out
  # check the status
  export err=$?
  err_chk

One would expect that if jedivar fails, the job is cancelled immediately.
However, scancel sometimes does not kill a job right away. This lets the ex-script run to the end and return a zero exit code to the calling J-job, so the error is not reported correctly.

Here is what was recorded in a log file:

srun: Terminating StepId=211168429.0
slurmstepd: error: *** STEP 211168429.0 ON c6n0191 CANCELLED AT 2025-09-19T12:49:11 ***
srun: error: c6n0363: task 21: Segmentation fault
srun: error: c6n0365: tasks 40-41,45-46,48,55,58: Terminated
srun: error: c6n0191: tasks 0,5-6,11,15,18-19: Terminated
srun: error: c6n1748: tasks 60,65,68,70,72,74,77-79: Terminated
srun: error: c6n0363: tasks 20,26-27,33-37,39: Terminated
srun: error: c6n0363: tasks 22-24,28-32,38: Terminated
srun: error: c6n1748: tasks 61-64,66-67,69,71,73,75-76: Terminated
srun: error: c6n0191: tasks 1-4,7-10,12-14,16: Terminated
srun: error: c6n0365: tasks 42-44,47,49-54,56-57,59: Terminated
srun: error: c6n0191: task 17: Segmentation fault (core dumped)
srun: Force Terminated StepId=211168429.0
+ exrrfs_jedivar.sh[143]: export err=143
+ exrrfs_jedivar.sh[143]: err=143
+ exrrfs_jedivar.sh[144]: err_chk

-------------------------------------------------------------
-- FATAL ERROR: Job jedivar_00 failed RETURN CODE 143
-- ABNORMAL EXIT at Fri 19 Sep 2025 12:49:19 PM EDT on c6n0191
-------------------------------------------------------------
...
Job jedivar_00 failed RETURN CODE 143
/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00
total 9990223
lrwxrwxrwx 1 Guoqing.Ge arfs-gsl         131 Sep 19 12:47 CAM_ABS_DATA.DBL -> /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/rrfs-workflow.20250919/fix/physics/convection_permitting/CAM_ABS_DATA.DBL
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96675 Sep 19 12:49 log.out.000020
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96673 Sep 19 12:49 log.out.000028
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96672 Sep 19 12:49 log.out.000016
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96673 Sep 19 12:49 log.out.000008
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96673 Sep 19 12:49 log.out.000038
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96673 Sep 19 12:49 log.out.000064
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96679 Sep 19 12:49 log.out.000027
-rw-r--r-- 1 Guoqing.Ge arfs-gsl       96679 Sep 19 12:49 log.out.000040
-rw------- 1 Guoqing.Ge arfs-gsl 10385940480 Sep 19 12:49 core
cat: /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/OUTPUT.924954: No such file or directory
+ exrrfs_jedivar.sh[146]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/init.nc /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det/init.2024-05-06_00.00.00.nc
+ exrrfs_jedivar.sh[147]: cp '/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jdiag*' /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
cp: cannot stat '/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jdiag*': No such file or directory
+ exrrfs_jedivar.sh[148]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jedivar_old001.yaml /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jedivar.yaml /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
+ exrrfs_jedivar.sh[149]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/log.out /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
++ exrrfs_jedivar.sh[159]: shopt -p nullglob
+ exrrfs_jedivar.sh[159]: nullglob_save='shopt -u nullglob'
+ exrrfs_jedivar.sh[160]: shopt -s nullglob
+ exrrfs_jedivar.sh[170]: satbias_list=(data/satbias_out/*satbias*.nc)
+ exrrfs_jedivar.sh[171]: ((  0 > 0  ))
+ exrrfs_jedivar.sh[175]: eval 'shopt -u nullglob'
++ exrrfs_jedivar.sh[175]: shopt -u nullglob
+ exrrfs_jedivar.sh[178]: exit 0
+ JRRFS_JEDIVAR[50]: export err=0
+ JRRFS_JEDIVAR[50]: err=0
+ JRRFS_JEDIVAR[50]: err_chk
 completed cleanly
+ JRRFS_JEDIVAR[52]: [[ -e /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/OUTPUT.924954 ]]
+ JRRFS_JEDIVAR[59]: [[ NO == \N\O ]]
+ JRRFS_JEDIVAR[59]: rm -rf /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00
+ JRRFS_JEDIVAR[61]: date
Fri 19 Sep 2025 12:49:21 PM EDT
+ JRRFS_JEDIVAR[62]: echo 'JOB jedivar_00 HAS COMPLETED NORMALLY!'
JOB jedivar_00 HAS COMPLETED NORMALLY!
+ JRRFS_JEDIVAR[63]: exit 0
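
One possible guard, sketched here under assumptions rather than taken from the workflow: since the log shows the ex-script continuing after err_chk reported the failure, the script could exit with the failing code itself instead of relying only on the cancellation taking effect. In the sketch below, `fake_err_chk` and `run_guarded_step` are hypothetical names used only for illustration. `fake_err_chk` is a stand-in that prints the error and returns, which mimics a cancellation request that has not yet been acted on; it is not the real err_chk.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for err_chk (NOT the real utility): it reports the
# failure and then returns, mimicking a cancel request that is delayed.
fake_err_chk() {
  if (( err != 0 )); then
    echo "FATAL ERROR: RETURN CODE ${err}" >&2
  fi
  return 0
}

# Hypothetical wrapper that plays the role of the ex-script body.
# It runs in a subshell so that "exit" ends only this simulated script.
run_guarded_step() {
  (
    "$@"               # stands in for the ${MPI_RUN_CMD} line
    export err=$?      # same pattern as in exrrfs_jedivar.sh
    fake_err_chk
    # Guard: if the step failed, stop here with the failing code,
    # even when the cancellation has not killed the job yet.
    if (( err != 0 )); then
      exit "${err}"
    fi
    echo "post-processing would run here"
    exit 0
  )
}
```

With this guard, a call such as `run_guarded_step bash -c 'exit 143'` returns 143 and never reaches the post-processing line, so the calling J-job would see a non-zero status even if the job is still alive.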
