Memory problem

  @cemreyürük in #fb963cf

Dear all,

I have been working with the cclm starter package version 1.5, which contains int2lm_131101_2.00_clm2 and cosmo_131108_5.00_clm2, on two different machines (Intel Xeon CPU E7540 2.00 GHz and Intel Xeon CPU E5-4620 v2 2.60 GHz) without any problems. However, in order to be able to use a larger number of nodes, I decided to use the National High Performance Computing Center of Turkey (UHeM), which runs SLURM. I then tried to adapt the run scripts to SLURM batch commands, but now I have trouble running cclm. In fact, it runs successfully but stops after simulating several years without giving any error (only the message below). I talked to an administrator at UHeM, who said that the system kills cclm.exe because it tries to use all of the machine's RAM. I am wondering whether there is a bug causing a memory leak, or whether there is a way to limit the memory usage. Do you have any suggestions to fix this problem?

P.S. I have already tried the new version of the starter package that is adapted to SLURM, and it didn't solve the problem.

Thank you very much.

—————————————————————————————————————
mpirun noticed that process rank 9 with PID 14935 on node m003 exited on signal 9 (Killed).
—————————————————————————————————————

  @redc_migration in #95a612c

Dear Cemre,

It sounds like strange behaviour of your computer system, since the error occurs after such a long runtime. It is a common phenomenon that the model crashes without an appropriate error message when a memory limit is exceeded. I do not expect that you exceed the "physical" memory limit of the system. Could you make sure that you use the compute nodes exclusively (not shared) and that you allocate enough memory in your SLURM settings? It would help if you could provide some more information on the domain size and your current SLURM settings.
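
For reference, a minimal sketch of the SLURM directives this touches on; the node and task counts are placeholder values and need to be adapted to your job:

#SBATCH --nodes=2               ### number of compute nodes (example value)
#SBATCH --ntasks-per-node=16    ### MPI tasks per node (example value)
#SBATCH --exclusive             ### do not share the nodes with other jobs
#SBATCH --mem=0                 ### on many SLURM installations, 0 requests all memory of each node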

With best regards,
Markus

  @cemreyürük in #6fbd0f4

Dear Markus, thank you very much. You can find attached the run script and log files, which belong to the big domain with 0.44° resolution. I also tried to run cclm for nested domains with 0.11° and 0.0275° resolution, respectively. The run with 0.11° stops after approximately 6 months, and the run with the finest resolution (0.0275°) stops after 2 months. Maybe I am making a mistake in my SLURM settings.

Best regards,
Cemre

  @redc_migration in #5b4ca66

Dear Cemre,

As far as I can see, the run script and log files look okay.

As a first step, I suggest making sure that the job requests enough memory and wall-clock time on your system:

#SBATCH --time=10:00:00    ### HH:MM:SS
#SBATCH --mem=10000        ### memory per node in MByte

In addition, I recommend running cclm with "srun" instead of "mpirun". srun is part of SLURM and may give some more error information.
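
As a rough sketch (the task count is only an example and the executable name is taken from your log), the relevant part of the batch script could look like this:

#!/bin/bash
#SBATCH --ntasks=20        ### total number of MPI tasks (example value)
#SBATCH --time=10:00:00    ### HH:MM:SS
#SBATCH --mem=10000        ### memory per node in MByte

### srun starts one task per allocated slot, so no "-np" argument is needed
srun ./cclm.exe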

Can you try to run the model with 4 nodes as well? The crash should then occur at a much later model time.

Best regards
Markus

  @cemreyürük in #f8bd3ca

Dear Markus,

Following your suggestion, I added the "#SBATCH --time=10:00:00" and "#SBATCH --mem=10000" commands to my script (for the 0.44° resolution), but I didn't change mpirun to srun. The job was terminated because of the memory limit (cclm.866.err).

In addition, I made another test run (for the 0.0275° resolution) by adding the SBATCH commands and changing mpirun to srun, but I got a different error. I don't know if "srun" needs extra settings; I will check that later, since the memory problem has priority.

Besides, when I increase the number of nodes or when I use a machine with more memory (256 GB RAM), the time until the crash gets longer as well. However, the model eventually consumes the full 256 GB of RAM plus 13 GB of swap and is terminated by the oom-killer.
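
For reference, a possible way to confirm the oom-killer termination and see the peak memory of the job (this assumes that SLURM accounting is enabled on the cluster; the job ID is just taken from the error file name):

### peak resident memory and exit state of the finished job
sacct -j 866 --format=JobID,State,ExitCode,MaxRSS,MaxRSSNode

### the kernel log on the compute node shows the oom-killer entry
dmesg | grep -i "killed process"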

Best regards,
Cemre

  @redc_migration in #b9c4109

Dear Cemre,

Memory usage of the 0.44° CCLM job should be < 2 GB, so this is a serious problem.
The memory consumption probably increases with every time loop. Maybe you can verify this with an interactive job submission (without sbatch) using 2 CPUs or so. Don't forget to specify the number of cores in the mpirun call in that case. Monitor the memory use with "top".
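
A small sketch of how such a check could be scripted instead of watching "top" by hand (the process name, log file name, and interval are assumptions; ps reports the resident set size in KiB):

#!/bin/bash
### append a timestamped memory snapshot of all cclm.exe processes every 60 s
while true; do
    date +%H:%M:%S
    ps -C cclm.exe -o pid=,rss=,comm=
    sleep 60
done >> cclm_memory.log

If the RSS values grow steadily from one snapshot to the next, that would confirm a leak per time loop.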

I will try to reproduce that behaviour with your configuration on the DKRZ machine next week. Maybe you could try the Intel compiler if it is available on your system.

In general, for long-term simulations it is better to use the chain mode (http://redc.clm-community.eu/projects/cclm-sp/wiki/SUBCHAIN).

Best regards,
Markus