[rrfs-mpas-jedi] err_exit did not kill a job immediately #953
Open
Description
In exrrfs_jedivar.sh, we have the following script:
source prep_step
${cpreq} "${EXECrrfs}"/mpasjedi_variational.x .
${MPI_RUN_CMD} ./mpasjedi_variational.x jedivar.yaml log.out
# check the status
export err=$?
err_chk
One would expect the job to be cancelled immediately if jedivar fails.
However, scancel sometimes does not kill the job immediately. This allows the ex-script to run to the end and return a zero exit code to the calling J-job, so the error is not reported correctly.
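One possible way to close this gap is to exit explicitly right after the status check instead of relying on the cancellation alone. The sketch below is only an illustration under stated assumptions: `err_chk_delayed`, `check_and_stop`, and `failed_step` are hypothetical names invented here, and `err_chk_delayed` is a stand-in that imitates a delayed scancel, not the real `err_chk`.

```shell
# Hypothetical stand-in for err_chk: it reports the failure and requests
# cancellation, but returns control, imitating a delayed scancel.
err_chk_delayed() {
  if [ "${err:-0}" -ne 0 ]; then
    echo "FATAL ERROR: RETURN CODE ${err} (cancel requested)" >&2
  fi
  return 0
}

# Hypothetical guard: run the check, then exit explicitly on failure so
# that no later step of the ex-script can run.
check_and_stop() {
  err_chk_delayed
  if [ "${err:-0}" -ne 0 ]; then
    exit "${err}"
  fi
}

# Stand-in for the failed MPI run (143 = 128 + SIGTERM).
failed_step() { return 143; }

# Simulate the ex-script in a subshell so this demo keeps running.
status=0
(
  failed_step
  export err=$?
  check_and_stop
  echo "copying outputs"   # not reached when err is non-zero
  exit 0
) || status=$?

echo "ex-script exit status: ${status}"
```

With the guard in place the simulated ex-script returns 143 rather than 0, so the caller sees the failure. In exrrfs_jedivar.sh this would amount to an explicit non-zero exit after err_chk. Whether that belongs in the ex-script or inside the error-handling utility itself is a design choice left open here.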
Here is what was recorded in a log file:
srun: Terminating StepId=211168429.0
slurmstepd: error: *** STEP 211168429.0 ON c6n0191 CANCELLED AT 2025-09-19T12:49:11 ***
srun: error: c6n0363: task 21: Segmentation fault
srun: error: c6n0365: tasks 40-41,45-46,48,55,58: Terminated
srun: error: c6n0191: tasks 0,5-6,11,15,18-19: Terminated
srun: error: c6n1748: tasks 60,65,68,70,72,74,77-79: Terminated
srun: error: c6n0363: tasks 20,26-27,33-37,39: Terminated
srun: error: c6n0363: tasks 22-24,28-32,38: Terminated
srun: error: c6n1748: tasks 61-64,66-67,69,71,73,75-76: Terminated
srun: error: c6n0191: tasks 1-4,7-10,12-14,16: Terminated
srun: error: c6n0365: tasks 42-44,47,49-54,56-57,59: Terminated
srun: error: c6n0191: task 17: Segmentation fault (core dumped)
srun: Force Terminated StepId=211168429.0
+ exrrfs_jedivar.sh[143]: export err=143
+ exrrfs_jedivar.sh[143]: err=143
+ exrrfs_jedivar.sh[144]: err_chk
-------------------------------------------------------------
-- FATAL ERROR: Job jedivar_00 failed RETURN CODE 143
-- ABNORMAL EXIT at Fri 19 Sep 2025 12:49:19 PM EDT on c6n0191
-------------------------------------------------------------
...
Job jedivar_00 failed RETURN CODE 143
/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00
total 9990223
lrwxrwxrwx 1 Guoqing.Ge arfs-gsl 131 Sep 19 12:47 CAM_ABS_DATA.DBL -> /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/rrfs-workflow.20250919/fix/physics/convection_permitting/CAM_ABS_DATA.DBL
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96675 Sep 19 12:49 log.out.000020
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96673 Sep 19 12:49 log.out.000028
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96672 Sep 19 12:49 log.out.000016
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96673 Sep 19 12:49 log.out.000008
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96673 Sep 19 12:49 log.out.000038
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96673 Sep 19 12:49 log.out.000064
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96679 Sep 19 12:49 log.out.000027
-rw-r--r-- 1 Guoqing.Ge arfs-gsl 96679 Sep 19 12:49 log.out.000040
-rw------- 1 Guoqing.Ge arfs-gsl 10385940480 Sep 19 12:49 core
cat: /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/OUTPUT.924954: No such file or directory
+ exrrfs_jedivar.sh[146]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/init.nc /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det/init.2024-05-06_00.00.00.nc
+ exrrfs_jedivar.sh[147]: cp '/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jdiag*' /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
cp: cannot stat '/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jdiag*': No such file or directory
+ exrrfs_jedivar.sh[148]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jedivar_old001.yaml /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/jedivar.yaml /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
+ exrrfs_jedivar.sh[149]: cp /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/log.out /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/com/rrfs/v2.1.1/rrfs.20240506/00/jedivar/det
++ exrrfs_jedivar.sh[159]: shopt -p nullglob
+ exrrfs_jedivar.sh[159]: nullglob_save='shopt -u nullglob'
+ exrrfs_jedivar.sh[160]: shopt -s nullglob
+ exrrfs_jedivar.sh[170]: satbias_list=(data/satbias_out/*satbias*.nc)
+ exrrfs_jedivar.sh[171]: (( 0 > 0 ))
+ exrrfs_jedivar.sh[175]: eval 'shopt -u nullglob'
++ exrrfs_jedivar.sh[175]: shopt -u nullglob
+ exrrfs_jedivar.sh[178]: exit 0
+ JRRFS_JEDIVAR[50]: export err=0
+ JRRFS_JEDIVAR[50]: err=0
+ JRRFS_JEDIVAR[50]: err_chk
completed cleanly
+ JRRFS_JEDIVAR[52]: [[ -e /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00/OUTPUT.924954 ]]
+ JRRFS_JEDIVAR[59]: [[ NO == \N\O ]]
+ JRRFS_JEDIVAR[59]: rm -rf /gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/133coldDA/stmp/20240506/rrfs_jedivar_00_v2.1.1/det/jedivar_00
+ JRRFS_JEDIVAR[61]: date
Fri 19 Sep 2025 12:49:21 PM EDT
+ JRRFS_JEDIVAR[62]: echo 'JOB jedivar_00 HAS COMPLETED NORMALLY!'
JOB jedivar_00 HAS COMPLETED NORMALLY!
+ JRRFS_JEDIVAR[63]: exit 0