checkpoint_techniques_on_compute_canada_clusters

====== Differences ======

This shows you the differences between two versions of the page.

checkpoint_techniques_on_compute_canada_clusters [2015/03/30 18:54]
132.216.122.26
checkpoint_techniques_on_compute_canada_clusters [2016/11/03 17:23] (current)
Line 45: Line 45:
 # 7779 by default, but if there are several DMTCP schedulers running on
 # the same node we will have problems. The best solution is to assign the
-# port number manually.
+# port number manually. Also, if PORT=0, a random unused port will be
+# chosen, which is probably better.
 PORT=7745
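
The ''PORT=0'' remark in the added comment matches a general operating-system convention: a TCP socket bound to port 0 is given a free port by the kernel. The following is a minimal Python sketch of that convention only, not DMTCP code; ''find_unused_port'' is a hypothetical helper name introduced here for illustration.

```python
import socket


def find_unused_port(host=""):
    """Ask the operating system for a TCP port that is unused right now.

    Binding to port 0 lets the kernel pick a free port; the chosen
    number is then read back with getsockname().
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


# The socket is closed when the helper returns, so another process could
# still grab the port before it is reused. Treat the value as a hint only.
port = find_unused_port()
print("PORT=%d" % port)
```

Because the port is released before anything else uses it, a sketch like this only reduces the chance of a collision between jobs on the same node; it does not remove it.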
  
Line 78: Line 79:
  
 # New version of this script. Now we use DMTCP to launch
-# the scripts (and gnu-parallel).
+# the scripts.
 
 def chunks(l, n):
Line 165: Line 166:
 </file>
  
-In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP.
+In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP.
+
+**Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)**
+
+**Update 2: they did not reply.**
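
The chunking step described in the paragraph above (splitting the input files into groups of ''NPROCS'', one group per generated ''.sh'' file) can be sketched as follows. Only the name ''chunks'' and the role of ''NPROCS'' come from the page; the function body, the value of ''NPROCS'', the input file names and the ''job_N.sh'' naming are assumptions made for illustration, and this is not the page's actual script.

```python
def chunks(items, n):
    """Yield successive slices of `items`, each holding at most n elements."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(0, len(items), n):
        yield items[start:start + n]


NPROCS = 4  # assumed number of processes per node
input_files = ["input_%02d.dat" % i for i in range(10)]  # hypothetical names

# One group of input files per generated job script.
groups = list(chunks(input_files, NPROCS))
for index, group in enumerate(groups):
    print("job_%d.sh -> %s" % (index, ", ".join(group)))
```

With ten inputs and ''NPROCS = 4'' this gives three groups, the last one holding the two leftover files, so the final job script would use fewer processes than the others.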
checkpoint_techniques_on_compute_canada_clusters.1427741691.txt.gz · Last modified: 2016/11/03 17:23 (external edit)