====== Differences ======
This shows the differences between two versions of the page.
checkpoint_techniques_on_compute_canada_clusters [2015/03/30 18:54] 132.216.122.26 |
checkpoint_techniques_on_compute_canada_clusters [2016/11/03 17:23] (current) |
||
---|---|---|---|
Line 45: | Line 45: | ||
# 7779 by default, but if there are several DMTCP schedulers running on | # 7779 by default, but if there are several DMTCP schedulers running on | ||
# the same node we will have problems. The best solution is to assign the | # the same node we will have problems. The best solution is to assign the | ||
- | # port number manually. | + | # port number manually. Also, if PORT=0, a random unused port will be |
+ | # chosen, which is probably better. | ||
PORT=7745 | PORT=7745 | ||
Line 78: | Line 79: | ||
# New version of this script. Now we use DMTCP to launch | # New version of this script. Now we use DMTCP to launch | ||
- | # the scripts (and gnu-parallel). | + | # the scripts. |
def chunks(l, n): | def chunks(l, n): | ||
Line 165: | Line 166: | ||
</file> | </file> | ||
- | In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP. | + | In the end, this script generates a bunch of ''.sh'' files (as many as needed), splits the input files in chunks of ''NPROCS'' within the same node and makes the scripts executable. The only thing needed afterwards is to check that all the parameters are correct and ''qsub'' them. It will use the personal project space (hardcoded, you will need to change this) to create task subdirectories for DMTCP. |
+ | |||
+ | **Currently this does not work as expected: for some unknown reason, only two jobs, apparently chosen at random, get restarted. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me).** | ||
+ | |||
+ | **Update 2: they did not reply.** |
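
The chunking step described above (splitting the input files into groups of ''NPROCS'', one group per generated ''.sh'' file) can be sketched roughly as follows. This is only an illustration: the body of ''chunks'' is not shown in this diff, so the implementation here is an assumption, and the file names, the value of ''NPROCS'' and the ''task_N.sh'' naming are hypothetical.

```python
def chunks(items, n):
    """Yield successive slices of at most n items.

    Assumed implementation; the wiki script only shows the
    signature chunks(l, n), not its body.
    """
    for start in range(0, len(items), n):
        yield items[start:start + n]


# Hypothetical values, for illustration only.
NPROCS = 4
input_files = ["input_%d.dat" % i for i in range(10)]

# One group per generated job script; the last group may be shorter.
groups = list(chunks(input_files, NPROCS))

for index, group in enumerate(groups):
    # A real generator would write a .sh file here and make it executable.
    print("task_%d.sh would process: %s" % (index, " ".join(group)))
```

With ten input files and ''NPROCS = 4'', this gives three groups, the last holding the two leftover files, so three job scripts would need to be checked and submitted with ''qsub''.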