Restarting finished job – in #9: CCLM

  @redc_migration in #37b2e3c

Hi everybody,
I have finished a 5-year simulation, and the output files in the SCRATCH directory have been removed. Now I want to continue the experiment for another year. Please let me know whether this is possible and what steps are required.
Kind regards, Simon

  @burkhardtrockel in #9244eff

If you use the latest subchain version you can create the directory structure with

subchain create

otherwise you have to do this by hand. In that case, look at the section
# create the job directory structure
in the subchain script, where the directories are created.
I assume you want to perform a warm start, i.e. prolong the run for another year. In that case, do not change YDATE_START; just adapt YDATE_STOP.
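
As a minimal sketch of the YDATE_STOP change described above, the following shell helper shifts a YYYYMMDDHH date string by a whole number of years. The function name add_years is hypothetical and not part of subchain; it only assumes the date format seen in this thread and does not handle 29 February.

```sh
# Hypothetical helper (not part of subchain): shift a YYYYMMDDHH date
# by a whole number of years. Assumes a four-digit year.
add_years() {
    date_in=$1
    years=$2
    year=${date_in%??????}   # first four characters: YYYY
    rest=${date_in#????}     # remaining six characters: MMDDHH
    printf '%04d%s\n' "$((year + years))" "$rest"
}

# Example: a run that ended at 1994010100, prolonged by one year.
add_years 1994010100 1   # prints 1995010100
```

The printed value would then be entered as YDATE_STOP in the subchain script, while YDATE_START stays untouched.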

  @redc_migration in #5562dfb

Thanks very much.
I apparently did something wrong.

I use subchain version 1.3.4 (is it the latest?).
I ran ./subchain create and now have the ….chain/scratch/… directory (which is empty).
The date.log, however, contains a date which is equal to YDATE_START.
After that I attempted to submit a restart job with ./subchain cclm DATE,
where DATE is the one from the original date.log file, i.e. not the one created by ./subchain create. But the job stopped: first cclm, and then prep and int2lm.

The subchain date.log, the log files, and the files from the jobs directory (tarred) are attached.
Please have a look.
Simon

  @burkhardtrockel in #1015893

1.3.4 is the latest released subchain version.
You wrote:
"The date.log however contains the date which is equal to YDATE_START."
but actually in your subchain YDATE_START=1989010100 and in date.log it is 1994010100, which is OK.
I guess you have to create the input data for cclm first. In your case, please run
./subchain prep 1994010100
If everything goes well, this job should call int2lm and later cclm automatically.
By the way, calling ./subchain cclm always takes the date from date.log. A second argument is ignored.

  @redc_migration in #374f749

The ./subchain prep 1994010100 job did call int2lm but stopped after it. I attach the resulting log file.
The last thing it did was create two directories, 1994_01 and 1994_02, in ..scratch/output/int2lm.
Any hints, please?

  @burkhardtrockel in #6c6f4d7

Beate just found an error in the subchain script. When you call subchain create, the following command at around line 95 should not be executed:

  echo ${YDATE_START} ${YDATE_START} > ${PFDIR}/${EXPID}/date.log

This is only for a cold start. Please check whether you overwrote date.log when you used subchain create.
For a warm start, date.log should contain 1994010100 in your case.
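
The date.log check suggested above can be sketched as a small shell function. It assumes, as in this thread, that date.log holds two dates on one line and that a first date equal to YDATE_START indicates a cold start. The function name start_mode and the way the file path is passed in are illustrative assumptions, not subchain features.

```sh
# Hypothetical check (not part of subchain): report whether date.log
# still points at a warm start or has been reset to YDATE_START.
start_mode() {
    log_file=$1      # path to date.log
    ydate_start=$2   # value of YDATE_START in the subchain script
    [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 2; }
    # date.log holds two dates on one line; only the first is compared.
    read -r current _second < "$log_file" || true
    if [ "$current" = "$ydate_start" ]; then
        echo "cold"
    else
        echo "warm"
    fi
}

# Usage (variables as they appear in the subchain script):
#   start_mode "${PFDIR}/${EXPID}/date.log" "${YDATE_START}"
```

If this reports "cold" for a run that was meant to be continued, date.log was probably overwritten and would need to be set back to the restart date by hand.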

  @redc_migration in #2150457

Do you mean that in my case (warm start) the line has to be commented out, but must be present for a cold start?
With this correction I tried submitting ./subchain cclm 1994010100 (with 1994010100 1994010100 in date.log; why twice, in fact?), but this didn't work.
Should I perhaps try submitting subchain prep for an earlier date, like ./subchain prep 1994010100 or ./subchain prep 1993120100?

  @burkhardtrockel in #3b4a7d7

If date.log contains 1994010100 1994010100, then ./subchain prep 1994010100 should work and start the chain again. Otherwise you mixed something up in the chain.
The two dates in date.log only matter when running sub-monthly chunks. This is not the case in your run, so just leave it as it is.

  @redc_migration in #0f4710f

Thank you. It doesn't work. It may be my mistake, of course, but I do not think I changed anything in the scripts except what you suggested in the subchain (I commented out the line echo ${YDATE_START} ${YDATE_START} > ${PFDIR}/${EXPID}/date.log). [By the way, I work with cclm-sp_1.4 and not with 1.3.4. Should I try restarting the job using 1.3.4?]
To summarize:
my date.log is as follows: 1994010100 1994010100
the job ./subchain prep 1994010100 starts successfully and calls int2lm but doesn't call cclm.
I tried submitting ./subchain cclm 1994010100 after that (and also before), but it terminates with ERROR CODE 2014 in ROUTINE organize_input
after attempting to open ncdf file lbff**000000.nc
No such file or directory

================================

However, I have many times successfully restarted my jobs from the most recent time step (i.e. when the experiment was not yet finished, the data in the /scratch directory had not been removed, and the last created files were still there). Could it be that restarting is possible only from the last time step? Or should one, in principle, be able to restart a job from any point in time (and if so, where are the input data supposed to come from)?
Please kindly clarify.

  @burkhardtrockel in #90a58b9

I just ran a test myself and it worked fine.
Please run ./subchain prep 1994010100 again and, if it does not work, attach the log files for prep, int2lm and cclm that were produced by the job.

  @redc_migration in #636fc49

Please see the attached log files (except for cclm, since it has not started). Also attached are my subchain, all jobs, and the output of ls -l for the restarts directory. Many thanks indeed for your help.

  @redc_migration in #db1575c

Sorry, the subchain is attached here.

  @burkhardtrockel in #55e99e3

The prep and int2lm jobs you provided already created the data for 199402.
Please check whether the directory
/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b3001/output/int2lm/1994_01/
contains the laf1994010100.nc file and all necessary lbfd199401mmddhh.nc files.
If these are available, run ./subchain cclm and attach the resulting .job and joblog files to your reply.
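
The file check requested above could be scripted roughly as follows. This is only a sketch: the function name check_cclm_input is hypothetical, the file name patterns are taken from this thread (laf<date>.nc and lbfd<date>.nc), and the function merely counts boundary files without verifying that every required time step is present.

```sh
# Hypothetical pre-flight check (not part of subchain): verify that the
# initial file and at least one boundary file exist for a given month.
check_cclm_input() {
    dir=$1              # e.g. .../output/int2lm/1994_01
    start=$2            # YYYYMMDDHH, e.g. 1994010100
    month=${start%????} # YYYYMM
    if [ ! -f "$dir/laf${start}.nc" ]; then
        echo "missing laf${start}.nc"
        return 1
    fi
    count=0
    for f in "$dir"/lbfd"${month}"*.nc; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "missing lbfd${month}*.nc"
        return 1
    fi
    echo "found laf${start}.nc and ${count} lbfd file(s)"
}

# Usage: check_cclm_input <path to the 1994_01 directory> 1994010100
```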

  @redc_migration in #4460aba

I have submitted the job and it now runs without any problem. So the problem seems to be solved, though I do not really understand how. Thanks very much anyway.

  @redc_migration in #7817850

I finally understood how I managed to get my job running. I clearly made a mistake. As I see now, the cclm.job.tmpl file in the /templates directory contains ydirini=@{YDIRINI}/' and not ydirini='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/work/b3001/restarts'',
This means that by submitting ./subchain cclm 1994010100 I actually used a cold start and not the warm one I wanted.
Sorry for yesterday's misleading information.

So my problem apparently remains unsolved. Following your earlier recommendation, I have repeated all my previous actions on another job, b2001. Please find attached a tar file with information on the files in /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_01/ and /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_02/,
as well as the resulting .job and joblog files.

  @burkhardtrockel in #5bb7660

You are still mixing something up in your subchain script.
In cclmb2001.job one can read

  ydirini='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/work/b2001/restarts'',
  ydirbd='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/input/cclm/1994_01/',

There is one ' too many in ydirini.
Maybe this causes the error in cclm-b2.o1032872:
 OPEN: bina-file: 
 /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/work/b2001/restarts/lrfd199
 4010100o
  *** Restart: A default set for refatm parameters is used:            2
 CLOSING bina FILE
 OPEN: ncdf-file: lbff**000000.nc
 No such file or directory

Please attach the YUSPECIF and subchain files next time. They help in understanding the problem.
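
A stray quote like the one in ydirini can be spotted mechanically. The sketch below, whose function name find_unbalanced_quotes is hypothetical, lists the lines of a job file that contain an odd number of single quotes. It assumes that every namelist string opens and closes on the same line.

```sh
# Hypothetical helper (not part of subchain): print each line of a file
# that contains an odd number of single quotes, prefixed by its line number.
find_unbalanced_quotes() {
    # With ' as the field separator, NF - 1 is the number of quotes.
    awk -F"'" 'NF > 1 && (NF - 1) % 2 == 1 { print FNR ": " $0 }' "$1"
}

# Usage: find_unbalanced_quotes cclmb2001.job
```

A doubled quote used as an escape inside a string keeps the count even, so only genuinely unbalanced lines such as the ydirini entry above are reported.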