Dear Community,
I am working together with Mingyue to solve the problem she mentioned before. We successfully compiled COSMOv5 and got it running on Levante. Everything seems to work except for writing the NetCDF restart files. Creating the NetCDF output and every other step works without any problems. The CCLM job runs almost to the end and crashes when writing the restart file. In a test, writing the restart file as binary worked, but this is not really an option for us. Switching to COSMOv6 is also only a last resort, because we want to continue our 2500-year simulation (almost 1000 years already done) with as few changes as possible.
0: OPEN: ncdf-file:
0: /work/bb1201/b381165_bb1152//mythos500bc_CCLM/restarts/lrfd7476030100o.nc
0: ------------------------------------------------------------
0: * PROGRAM TERMINATED BECAUSE OF ERRORS DETECTED
0: * IN ROUTINE: output_data
0: *
0: * ERROR CODE is 2035
0: * Error writing netcdf 2D variable
0: ------------------------------------------------------------
0: --------------------------------------------------------------------------
0: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
0: with errorcode 2035.
0:
0: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
0: You may or may not see output from other processes, depending on
0: exactly when Open MPI kills them.
0: --------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
0: slurmstepd: error: *** STEP 1344625.0 ON l30116 CANCELLED AT 2022-08-03T09:49:15 ***
srun: error: l30116: task 0: Exited with exit code 243
srun: launch/slurm: _step_signal: Terminating StepId=1344625.0
srun: error: l30116: tasks 1-143: Killed
srun: error: l30117: tasks 144-287: Killed
Does anyone know how writing the restart file differs from writing the regular output? Or has anyone dealt with a similar problem?
We would be very grateful for any suggestions or help.
Thank you very much and kind regards,
Eva