Memory problem – in #9: CCLM

  @cemreyürük in #fb963cf

Dear all,

I have been working with the cclm starter package version 1.5, which contains int2lm_131101_2.00_clm2 and cosmo_131108_5.00_clm2, on two different machines (Intel(R) Xeon(R) CPU E7540 2.00GHz and Intel(R) Xeon(R) CPU E5-4620 v2 2.60GHz) without any problems. However, in order to use a larger number of nodes, I decided to use the National High Performance Computing Center of Turkey (UHeM), which runs SLURM. I adapted the run scripts to SLURM batch commands, but now I have trouble running cclm. It starts successfully but stops after several simulated years without giving any error (only the message below). I talked with an administrator at UHeM, and he said that the system kills cclm.exe because it tries to use the entire RAM of the machine. I am wondering whether there is a bug related to a memory leak, or whether there is a way to limit the memory usage. Do you have any suggestions for fixing this problem?

P.S. I have already tried the new version of the starter package that has been adapted to SLURM, and it did not solve the problem.

Thank you very much.

—————————————————————————————————————
mpirun noticed that process rank 9 with PID 14935 on node m003 exited on signal 9 (Killed).
—————————————————————————————————————

  @redc_migration in #95a612c

Dear Cemre,

It sounds like strange behaviour of your computer system, since the error occurs only after such a long runtime. It is a common phenomenon that the model crashes without a proper error message when a memory limit is exceeded, but I do not expect that you exceed the physical memory limit of the system. Could you make sure that you use the compute nodes exclusively (not shared) and that you allocate enough memory in your SLURM settings? It would also help if you could provide some more information on the domain size and your current SLURM settings.
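
For illustration, the exclusive-node request and the per-node memory limit could look like this in the job header (a sketch only; the node count and the memory value are placeholders and have to be adapted to the UHeM system):

#SBATCH --nodes=2          ### number of compute nodes
#SBATCH --exclusive        ### do not share the nodes with other jobs
#SBATCH --mem=10000        ### memory per node in MByte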

With best regards,
Markus

  @cemreyürük in #6fbd0f4

Dear Markus, thank you very much. You can find the run script and log files in the attachment; they belong to the large domain with 0.44° resolution. I also tried to run cclm for nested domains with 0.11° and 0.0275° resolution, respectively. The 0.11° run stops after approximately 6 months, and the run with the finest resolution (0.0275°) stops after 2 months. Maybe I am making a mistake in my SLURM settings.

Best regards,
Cemre

  @redc_migration in #5b4ca66

Dear Cemre,

As far as I can see, the run script and log files look okay.

As a first step, I suggest making sure that the job requests enough memory and wall-clock time on your system:

#SBATCH --time=10:00:00   ### HH:MM:SS
#SBATCH --mem=10000       ### memory per node in MByte

In addition, I recommend running cclm with "srun" instead of "mpirun". srun is part of SLURM and may give some more error information.

Can you also try to run the model on 4 nodes? With more nodes the memory load per node is lower, so the crash should then occur much later in model time.
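
Putting this together, a minimal job script might look as follows (only a sketch; the job name, node and task counts, and the executable path are placeholders and need to be adapted to your setup):

#!/bin/bash
#SBATCH --job-name=cclm_044        ### placeholder job name
#SBATCH --nodes=2                  ### number of compute nodes
#SBATCH --ntasks-per-node=16       ### MPI tasks per node
#SBATCH --time=10:00:00            ### HH:MM:SS
#SBATCH --mem=10000                ### memory per node in MByte

### launch with srun instead of mpirun; srun takes the task count
### from the SBATCH directives above
srun ./cclm.exe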

Best regards
Markus

  @cemreyürük in #f8bd3ca

Dear Markus,

Following your suggestion, I added the "#SBATCH --time=10:00:00" and "#SBATCH --mem=10000" directives to my script (for the 0.44° resolution), but I did not change mpirun to srun. The job was terminated because of the memory limit (cclm.866.err).

In addition, I made another test run (for the 0.0275° resolution) with the SBATCH directives added and mpirun changed to srun, but I got a different error. I do not know whether "srun" needs extra settings; I will check that later, since the memory problem has priority.

Besides, when I increase the number of nodes or use a machine with more memory (256 GB RAM), the time until the crash gets longer as well. However, the job eventually consumes the full 256 GB of RAM plus 13 GB of swap and is then terminated by the oom-killer.

Best regards,
Cemre

  @redc_migration in #b9c4109

Dear Cemre,

Memory usage of the 0.44 deg CCLM job should be < 2 GB, so this is a serious problem.
Probably the memory consumption increases with every time loop. Maybe you can verify this with an interactive job submission (without sbatch) on 2 CPUs or so. Do not forget to give the number of cores in the mpirun call in that case. Monitor the memory use with "top".
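
One possible way to do this (a sketch only, assuming that interactive shells on the compute nodes are allowed, here via "srun --pty", and that mpirun can be started inside such a shell; the core count, time limit, and executable path are placeholders):

### request an interactive shell on a compute node
srun --nodes=1 --ntasks=2 --time=02:00:00 --pty bash

### inside the interactive shell: start the model on 2 cores in the background
mpirun -np 2 ./cclm.exe &

### watch whether the resident memory (RES column) of the cclm processes grows
top -u $USER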

I will try to reproduce that behaviour with your configuration on the DKRZ machine next week. Maybe you could also try the Intel compiler, if it is available on your system.

In general, for long-term simulations it is better to use the chain mode (http://redc.clm-community.eu/projects/cclm-sp/wiki/SUBCHAIN).

Best regards,
Markus
