checkpoint_techniques_on_compute_canada_clusters

====== Differences ======

This shows you the differences between two versions of the page.

checkpoint_techniques_on_compute_canada_clusters [2015/03/27 21:12]
132.216.122.26 [Automatic checkpoints]
checkpoint_techniques_on_compute_canada_clusters [2016/11/03 17:23] (current)
Line 1: Line 1:
-These are the notes for the Checkpoint Techniques workshop I attended on March 26th, 2015. Might be useful for people who want to learn how to code this on their own programs. Please don't hesitate to edit this page if you feel I left something out, you want to add something on your own or my English sounds funny.
+These are the notes for the Checkpoint Techniques workshop I attended on March 26th, 2015 (the workshop materials can be found [[http://www.hpc.mcgill.ca/index.php/training#chkpt|here]].) Might be useful for people who want to learn how to code this on their own programs. Please don't hesitate to edit this page if you feel I left something out, you want to add something on your own or my English sounds funny.
  
 ===== Random stuff ===== ===== Random stuff =====
  
-  * Maximum ''wall'' time on ''guillimin'' is 30 days. You can reach this limit without the need to ask any administrator (someone told me the limit was 48 hours before, but now it does not seem to be the case.) Also, it cannot be raised, as the scheduler will kill jobs longer than that. If you want to run something that will take longer, checkpointing comes in handy.
+  * Maximum ''wall'' time on ''guillimin'' is 30 days. You can reach this limit without the need to ask any administrator (someone told me the limit used to be 48 hours, but now it does not seem to be the case.) Also, it cannot be raised higher, as the scheduler will kill jobs longer than that. If you want to run something that will take longer, checkpointing comes in handy.
  
   * There is a [[https://wiki.calculquebec.ca/w/Accueil|Calcul Québec wiki]].   * There is a [[https://wiki.calculquebec.ca/w/Accueil|Calcul Québec wiki]].
Line 45: Line 45:
 # 7779 by default, but if there are several DMTCP schedulers running on  # 7779 by default, but if there are several DMTCP schedulers running on 
 # the same node we will have problems. The best solution is to assign the # the same node we will have problems. The best solution is to assign the
-# port number manually.
+# port number manually. Also, if PORT=0, a random unused port will be
+# chosen, which is probably better.
 PORT=7745 PORT=7745
  
Line 78: Line 79:
  
 # New version of this script. Now we use DMTCP to launch # New version of this script. Now we use DMTCP to launch
-# the scripts (and gnu-parallel).
+# the scripts.
  
 def chunks(l, n): def chunks(l, n):
Line 110: Line 111:
         id = id + 1         id = id + 1
         jobname = "​esmglobal_%02d"​ % id         jobname = "​esmglobal_%02d"​ % id
-        gnuparcommand = "​parallel -j %d --xapply rundmtcpjob ::: %s ::: %s" % \ 
-                        (len(batch),​ " "​.join(batch),​ \ 
-                         "​ "​.join([str(x) for x in ports])) 
- 
         btemp = """#​!/​bin/​bash         btemp = """#​!/​bin/​bash
 #PBS -A eim-670-aa #PBS -A eim-670-aa
Line 152: Line 149:
  
 cd /​home/​jmateos/​code/​devmodel/​devmodelR cd /​home/​jmateos/​code/​devmodel/​devmodelR
- 
-export -f rundmtcpjob 
- 
-%s 
- 
-# wait # with parallel it is not necessary 
  
 """​ % (len(batch),​ jobname, \ """​ % (len(batch),​ jobname, \
        ​optdir + '/'​ + jobname + '​.log',​ \        ​optdir + '/'​ + jobname + '​.log',​ \
-       optdir + '/' + jobname + '.err', \
-       gnuparcommand)
+       optdir + '/' + jobname + '.err')
  
         jobsfile = jobname + '​.sh'​         jobsfile = jobname + '​.sh'​
         f = open(jobsfile,​ '​w'​)         f = open(jobsfile,​ '​w'​)
         f.write(btemp)         f.write(btemp)
+        for i in range(len(batch)):
+            line = "rundmtcpjob %s %d &\n" % (batch[i], ports[i])
+            f.write(line)
+        f.write("wait\n")
         f.close()         f.close()
         os.chmod(jobsfile,​ 0755)         os.chmod(jobsfile,​ 0755)
     # end for loop     # end for loop
- 
 </​file>​ </​file>​
  
-This script also uses GNU Parallel, so you might need to add ''module add gnu-parallel'' to your ''.bash_profile'' (or to the script itself) before sending it to the clusters.
+In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP.
+
+**Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)**
  
-In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP
+**Update 2: they did not reply.**
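
As a minimal sketch of the launch lines the newer revision writes (one backgrounded ''rundmtcpjob'' per input file, followed by a single ''wait''), the loop can be pulled into a small function. The name ''build_launch_lines'' and the example file names are hypothetical and not part of the original script; only the format string is taken from the diff. It just builds the text and does nothing about the restart problem described in the page.

```python
# Sketch only: build_launch_lines is a hypothetical helper, not in the original script.
def build_launch_lines(batch, ports):
    """Return shell lines that start one rundmtcpjob per input in the background.

    batch: list of input file names handled by one generated .sh file
    ports: list of DMTCP coordinator ports, one per input file
    """
    if len(batch) != len(ports):
        raise ValueError("need exactly one port per input file")
    # Same format string as the revision: "<command> <input> <port> &"
    lines = ["rundmtcpjob %s %d &\n" % (name, port)
             for name, port in zip(batch, ports)]
    # A single wait keeps the batch script alive until every background job ends.
    lines.append("wait\n")
    return lines


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    print("".join(build_launch_lines(["a.R", "b.R"], [7745, 7746])), end="")
```

Writing the result with ''f.writelines(build_launch_lines(batch, ports))'' should produce the same lines as the loop in the revision, with the added check that every input has a port.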
checkpoint_techniques_on_compute_canada_clusters.1427490764.txt.gz · Last modified: 2016/11/03 17:23 (external edit)