Long term simulation – in #12: CCLM Starter Package Support

Hello,
I am attempting to run a long-term CORDEX simulation using 11 nodes with 32 processors per node. The job runs successfully but periodically stops during the post-processing stage. This may indicate a conflict caused by the CCLM job not yet being released from the queue. However, I suspect the problem is more complex, since I get the same result when I resubmit the post job several hours later. Still, after repeating this several times without any change, I am eventually able to continue the run.
I have experienced this problem several times already. Please let me know your recommendations. I should add that I did not have (or did not notice) this problem in my previous runs with a smaller number of nodes.
Simon

  @redc_migration in #e350e2f

Running the post-processing before the CCLM job has been released can indeed lead to problems. I experienced such a problem on Blizzard and therefore added the following to the post-processing script:
sleep 60 # to avoid conflict if CCLM job is not yet released from the queue (may not be relevant on all systems)
You will find this line in the template scripts.
However, if I understand you correctly, you re-submitted the post-processing job individually hours later and it failed again. That is really strange. It may be due to some problem in your computing system, which can be complex to track down. A brute-force approach to narrow it down to the line where it happens is to insert an "echo test nn" (with nn an incrementing number) after each line of the script.
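As an alternative to inserting an echo after every line, Bash's xtrace option prints each command before executing it, so the last traced line in the job log shows where the script stopped. A minimal sketch, with placeholder steps standing in for the real CCLM post-processing commands (and a shortened sleep; the template uses 60):

```shell
#!/bin/bash
# Sketch: locate where a script stops without editing every line.
# The "last_step" assignments are placeholders for the real commands.

sleep 1                     # 60 in the real template, to let the CCLM job leave the queue
PS4='+ line ${LINENO}: '    # prefix each traced command with its line number
set -x                      # start tracing: every command is echoed to stderr
last_step="prepare output directories"
last_step="run time series extraction"
set +x                      # stop tracing
echo "reached: ${last_step}"
```

In the batch log, the last "+ line NN:" entry before the job hangs points directly at the offending command.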

  @burkhardtrockel in #310b87e

I did, of course, see the sleep 60 line you mentioned.
I have even increased the value to 240, but that does not help.
In my runs, when the post job stops, DATE1 is equal to DATE2, while in the runs that end successfully the two values differ. I do not yet know why this happens, but as a temporary brute-force workaround I have commented out the line let "SEC_CHECK=DATE2-DATE1" and set SEC_CHECK=1 instead.
The job runs now, but there may be other problems due to this change.
If I understand the code correctly, the check is just there to make sure there were enough data files, so my correction will probably work fine on our machine?
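For readers without the script at hand, the check being discussed is presumably of this shape. This is a guess at the surrounding logic based on the thread, not the actual CCLM template:

```shell
#!/bin/bash
# Guessed sketch of the post-processing timing check discussed above.
DATE1=$(date +%s)              # timestamp before checking the output files
# ... the data-file check would run here; if it finishes within the same
#     second, DATE2 ends up equal to DATE1, as in the failing runs ...
DATE2=$(date +%s)

# Original line:
#   let "SEC_CHECK=DATE2-DATE1"
# Simon's workaround:
SEC_CHECK=1
echo "check took ${SEC_CHECK}s (informational only)"
```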

  @redc_migration in #e22a7fb

The line let "SEC_CHECK=DATE2-DATE1" measures how long the checking process takes. This is for information only and is not necessary for the post-processing of the data, so setting SEC_CHECK=1 does not matter.
I have occasionally experienced problems with a similar command in another script. You may try the following instead of the let command:
SEC_CHECK=$(python -c "print(${DATE2}-${DATE1})")
Anyway, just setting SEC_CHECK=1 is fine if you do not need the timing information.
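One hypothesis, not confirmed in the thread, for why the stop coincides with DATE1 being equal to DATE2: let returns a non-zero exit status whenever its expression evaluates to 0, so if the script runs under set -e it aborts exactly when the two timestamps match. A plain arithmetic assignment avoids this, since an assignment with arithmetic expansion carries exit status 0 even when the result is 0:

```shell
#!/bin/bash
set -e                          # abort on any non-zero exit status, as many batch scripts do

DATE1=100; DATE2=100            # equal timestamps, as in the failing runs

# let "SEC_CHECK=DATE2-DATE1"   # result 0 => let exits with status 1 => script dies here

SEC_CHECK=$(( DATE2 - DATE1 ))  # assignment form: exit status 0 even when the result is 0
echo "SEC_CHECK=${SEC_CHECK}"   # the script continues past this point
```

If the post-processing template does run with set -e (or an equivalent batch-system setting), this would explain why the failures appear only when the check completes within a single second.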

  @burkhardtrockel in #5e4efb2
