CCLM in levante at DKRZ



  @evanowatzki in #a83f01d


Dear Community,

I am working with Mingyue on the problem she mentioned before. We successfully compiled COSMOv5 and got it running on Levante. Everything seems to work except for the writing of NetCDF restart files: the regular NetCDF output and every other step work without problems, but the CCLM job runs almost to the end and then crashes when writing the restart file. In a test, writing the restart file as binary worked, but this is not really an option for us. Switching to COSMOv6 is only a last resort, because we want to continue our 2500-year simulation (almost 1000 years already done) with as few changes as possible.

The job log ends with:

0: OPEN: ncdf-file:
0: /work/bb1201/b381165_bb1152//mythos500bc_CCLM/restarts/lrfd7476030100o.nc
0: ------------------------------
0: * PROGRAM TERMINATED BECAUSE OF ERRORS DETECTED
0: * IN ROUTINE : output_data
0: *
0: * ERROR CODE is 2035
0: * Error writing netcdf 2D variable
0: ------------------------------
0: -------------------------------------
0: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
0: with errorcode 2035.
0:
0: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
0: You may or may not see output from other processes, depending on
0: exactly when Open MPI kills them.
0: -------------------------------------
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
0: slurmstepd: error: *** STEP 1344625.0 ON l30116 CANCELLED AT 2022-08-03T09:49:15 ***
srun: error: l30116: task 0: Exited with exit code 243
srun: launch/slurm: _step_signal: Terminating StepId=1344625.0
srun: error: l30116: tasks 1-143: Killed
srun: error: l30117: tasks 144-287: Killed
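
In case it helps anyone comparing several failed job logs, the abort block above can be pulled out mechanically. The following is only a minimal sketch in Python, assuming the log keeps the rank-prefixed layout shown above; the names `parse_abort`, `AbortInfo` and `SAMPLE` are made up for illustration and are not part of CCLM or COSMO.

```python
import re
from typing import NamedTuple, Optional


class AbortInfo(NamedTuple):
    """Fields recovered from a model abort block (hypothetical helper type)."""
    routine: Optional[str]
    error_code: Optional[int]
    message: Optional[str]
    exit_code: Optional[int]


# Patterns follow the wording of the log quoted in this post.
_ROUTINE = re.compile(r"IN ROUTINE\s*:\s*(\w+)")
_CODE = re.compile(r"ERROR CODE is\s+(\d+)")
# The message is the text after the error code, up to the next
# rank prefix ("0: ") or line end; works for wrapped and flattened logs.
_MESSAGE = re.compile(
    r"ERROR CODE is\s+\d+\s*(?:\d+:)?\s*\*\s*([^\n]*?)\s*(?=\n|\s+\d+:\s|$)"
)
_EXIT = re.compile(r"Exited with exit code\s+(\d+)")


def _first(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def parse_abort(log_text: str) -> AbortInfo:
    """Extract routine, error code, message and srun exit code from a log."""
    code = _first(_CODE, log_text)
    exit_code = _first(_EXIT, log_text)
    return AbortInfo(
        routine=_first(_ROUTINE, log_text),
        error_code=int(code) if code is not None else None,
        message=_first(_MESSAGE, log_text),
        exit_code=int(exit_code) if exit_code is not None else None,
    )


# Excerpt of the log from this post, used as sample input.
SAMPLE = """\
0: * PROGRAM TERMINATED BECAUSE OF ERRORS DETECTED
0: * IN ROUTINE : output_data
0: *
0: * ERROR CODE is 2035
0: * Error writing netcdf 2D variable
0: ------------------------------
srun: error: l30116: task 0: Exited with exit code 243
"""

if __name__ == "__main__":
    info = parse_abort(SAMPLE)
    print(info)
    # Process exit statuses are truncated to 8 bits, so compare modulo 256.
    if info.error_code is not None and info.exit_code is not None:
        print("exit code matches error code mod 256:",
              info.error_code % 256 == info.exit_code)
```

One thing this makes visible: the exit code 243 reported by srun equals 2035 modulo 256, so it appears to be just the model's own error code truncated to 8 bits and carries no additional information about the cause.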

Does anyone know how writing the restart file differs from writing the regular output? Or has anyone had to deal with a similar problem?
We would be very grateful for any suggestions or help.

Thank you very much and kind regards,
Eva