Long term simulation – in #12: CCLM Starter Package Support

Hello,
I am attempting to run a long-term CORDEX simulation using 11 nodes with 32 processors per node. The job runs successfully but periodically stops during the post-processing stage. This may indicate a conflict caused by the CCLM job not yet being released from the queue. However, I suspect the problem is more complex, since I get the same result when I resubmit the post job several hours later. Still, after repeating this several times without any change, I am eventually able to continue the run.
I have experienced this problem several times already. Please let me know your recommendations. I should add that I did not have (or did not notice) this problem in my previous runs with a smaller number of nodes.
Simon

  @redc_migration in #e350e2f

Running the post-processing before the CCLM job has been released can indeed lead to problems. I experienced such a problem on Blizzard and therefore added the following to the post-processing script:
sleep 60 # to avoid conflict if CCLM job is not yet released from the queue (may not be relevant on all systems)
You will find this line in the template scripts.
However, if I understand you correctly, you re-submitted the post-processing job individually hours later and it failed again. That is really strange. It may be due to some problem in your computing system, which can be complex to track down. A brute-force approach to narrow it down to the line where it happens is to insert an "echo test nn" (with nn an incrementing number) after each line of the script.
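As an alternative to inserting an echo after every line, Bash's xtrace option prints each command before executing it, so the last traced line in the job log shows where the script stopped. A minimal sketch, with placeholder steps standing in for the real CCLM post-processing commands (and a shortened sleep; the template uses 60):

```shell
#!/bin/bash
# Sketch: locate where a script stops without editing every line.
# The "last_step" assignments are placeholders for the real commands.

sleep 1                     # 60 in the real template, to let the CCLM job leave the queue
PS4='+ line ${LINENO}: '    # prefix each traced command with its line number
set -x                      # start tracing: every command is echoed to stderr
last_step="prepare output directories"
last_step="run time series extraction"
set +x                      # stop tracing
echo "reached: ${last_step}"
```

In the batch log, the last "+ line NN:" entry before the job hangs points directly at the offending command.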

  @burkhardtrockel in #310b87e

I did, of course, see the sleep 60 line you mentioned.
I have even increased the value to 240, but that does not help.
In my runs, when the post job stops, DATE1 is equal to DATE2, while in the runs that end successfully the two values differ. I do not yet know why this happens, but as a temporary brute-force workaround I have commented out the line let "SEC_CHECK=DATE2-DATE1" and set SEC_CHECK=1 instead.
The job runs now, but there may be other problems due to this change.
If I understand the code correctly, the check is just there to make sure there were enough data files, so my correction will probably work fine on our machine?
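For readers without the script at hand, the check being discussed is presumably of this shape. This is a guess at the surrounding logic based on the thread, not the actual CCLM template:

```shell
#!/bin/bash
# Guessed sketch of the post-processing timing check discussed above.
DATE1=$(date +%s)              # timestamp before checking the output files
# ... the data-file check would run here; if it finishes within the same
#     second, DATE2 ends up equal to DATE1, as in the failing runs ...
DATE2=$(date +%s)

# Original line:
#   let "SEC_CHECK=DATE2-DATE1"
# Simon's workaround:
SEC_CHECK=1
echo "check took ${SEC_CHECK}s (informational only)"
```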

  @redc_migration in #e22a7fb

The line let "SEC_CHECK=DATE2-DATE1" measures how long the checking process takes. This is for information only and is not necessary for the post-processing of the data, so setting SEC_CHECK=1 does not matter.
I have occasionally experienced problems with a similar command in another script. You may try the following instead of the let command:
SEC_CHECK=$(python -c "print(${DATE2}-${DATE1})")
Anyway, just setting SEC_CHECK=1 is fine if you do not need the timing information.
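One hypothesis, not confirmed in the thread, for why the stop coincides with DATE1 being equal to DATE2: let returns a non-zero exit status whenever its expression evaluates to 0, so if the script runs under set -e it aborts exactly when the two timestamps match. A plain arithmetic assignment avoids this, since an assignment with arithmetic expansion carries exit status 0 even when the result is 0:

```shell
#!/bin/bash
set -e                          # abort on any non-zero exit status, as many batch scripts do

DATE1=100; DATE2=100            # equal timestamps, as in the failing runs

# let "SEC_CHECK=DATE2-DATE1"   # result 0 => let exits with status 1 => script dies here

SEC_CHECK=$(( DATE2 - DATE1 ))  # assignment form: exit status 0 even when the result is 0
echo "SEC_CHECK=${SEC_CHECK}"   # the script continues past this point
```

If the post-processing template does run with set -e (or an equivalent batch-system setting), this would explain why the failures appear only when the check completes within a single second.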

  @burkhardtrockel in #5e4efb2
