====== Differences ======
This shows you the differences between two versions of the page.
checkpoint_techniques_on_compute_canada_clusters [2015/03/30 20:47] 132.216.122.26 [Automatic checkpoints] |
checkpoint_techniques_on_compute_canada_clusters [2016/11/03 17:23] (current) |
||
---|---|---|---|
Line 45: | Line 45: | ||
# 7779 by default, but if there are several DMTCP schedulers running on | # 7779 by default, but if there are several DMTCP schedulers running on | ||
# the same node we will have problems. The best solution is to assign the | # the same node we will have problems. The best solution is to assign the | ||
- | # port number manually. | + | # port number manually. Also, if PORT=0, a random unused port will be |
+ | # chosen, which is probably better. | ||
PORT=7745 | PORT=7745 | ||
Line 78: | Line 79: | ||
# New version of this script. Now we use DMTCP to launch | # New version of this script. Now we use DMTCP to launch | ||
- | # the scripts (and gnu-parallel). | + | # the scripts. |
def chunks(l, n): | def chunks(l, n): | ||
Line 168: | Line 169: | ||
**Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)** | **Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)** | ||
+ | |||
+ | **Update 2: they did not reply.** |
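
The comment added in the first hunk says that with PORT=0 a random unused port is chosen. As a rough illustration of the general operating-system mechanism behind that kind of behaviour (this is not DMTCP's code, and the helper name `find_free_port` is hypothetical), binding a TCP socket to port 0 asks the kernel to pick a free port, which can then be read back:

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Return a TCP port that was unused at the time of the call.

    Hypothetical helper for illustration only: binding to port 0 lets
    the operating system choose any free port, which is then read back
    with getsockname().
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))          # port 0 = "let the OS choose"
        return sock.getsockname()[1]  # the port number actually assigned


if __name__ == "__main__":
    # Print a candidate port, e.g. to use instead of a hard-coded PORT value.
    print(find_free_port())
```

The port is released as soon as the helper returns, so another process could in principle take it before it is reused. Letting the scheduler itself pick the port, as the comment suggests, avoids that window.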