====== Checkpoint Techniques on Compute Canada Clusters ======

These are my notes from the Checkpoint Techniques workshop I attended on March 26th, 2015 (the workshop materials can be found [[http://www.hpc.mcgill.ca/index.php/training#chkpt|here]]). They might be useful for people who want to implement checkpointing in their own programs. Please don't hesitate to edit this page if you feel I left something out, if you want to add something of your own, or if my English sounds funny.

===== Random stuff =====

  * The maximum ''walltime'' on ''guillimin'' is 30 days. You can reach this limit without having to ask any administrator (someone told me the limit used to be 48 hours, but that no longer seems to be the case). It cannot be raised any higher, though: the scheduler will kill jobs that run longer than that. If you want to run something that takes longer, checkpointing comes in handy.
  * There is a [[https://wiki.calculquebec.ca/w/Accueil|Calcul Québec wiki]].

===== The problem =====

Sometimes we want to run very long jobs on ''guillimin'' or any other cluster in the Compute Canada / Calcul Québec network. Say, 30 days. If something happens on the 29th day and the system crashes, we have to re-run everything from the start. This is far from optimal.

===== The solution: checkpoints =====

Checkpointing simply means saving intermediate states of your execution in a way that allows you to restart from that point, and not from the very beginning, if something happens. Let's distinguish between two checkpointing techniques.

==== Manual checkpoints ====

Manual checkpoints require you to modify your code so that it saves the system state every N iterations (for example) and, upon start, checks whether there is a previously saved state that can be used as the initial seed. In some cases this might not be possible, depending on your particular job (for instance, a very long optimization process that depends on some internal hysteresis data that is not offered as output). The file format used for this task is up to you. In my case, I save intermediate parameter values in ''.Rdata'' format, as I am using R. The same goes for Matlab (''.mat''). If you are using Python, you might want to check out [[https://docs.python.org/2/library/pickle.html|Pickle]], if you don't know it already; see the sketch below. If you are writing parallel programs, you might be interested in HDF5, which allows you to read and write data in parallel environments. Check [[http://www.hdfgroup.org/HDF5|this webpage]] for more information. These libraries are available on ''guillimin'' simply by running ''module add HDF5/1.8.7-gcc'' (or ''module add HDF5/1.8.7-intel'', depending on the compiler you are using).
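As an illustration, here is a minimal sketch of the manual pattern in Python using Pickle. It is not from the workshop: the file name, the saving interval and the contents of ''state'' are made up, and the important part is the load-if-present / save-every-N structure.

<file python manual_checkpoint.py>
#!/usr/bin/env python
# Minimal manual-checkpointing sketch. CHECKPOINT, SAVE_EVERY and the
# contents of `state` are illustrative only.
import os
import pickle

CHECKPOINT = 'state.pkl'
SAVE_EVERY = 100
TOTAL_ITERATIONS = 10000

# Upon start, check whether there is a previously saved state that can
# be used as the initial seed.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, 'rb') as f:
        state = pickle.load(f)
else:
    state = {'iteration': 0, 'params': [0.0, 0.0]}

while state['iteration'] < TOTAL_ITERATIONS:
    # ... one step of the real computation would go here ...
    state['iteration'] += 1

    # Save the state every SAVE_EVERY iterations. Writing to a temporary
    # file and renaming it makes the update atomic, so a crash during
    # the write cannot destroy the last good checkpoint.
    if state['iteration'] % SAVE_EVERY == 0:
        with open(CHECKPOINT + '.tmp', 'wb') as f:
            pickle.dump(state, f)
        os.rename(CHECKPOINT + '.tmp', CHECKPOINT)
</file>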
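If you go the HDF5 route from Python, the same idea could look like the sketch below. Note my assumptions: it uses the [[http://www.h5py.org|h5py]] bindings and NumPy, which the workshop did not cover, and it is purely serial; parallel reads and writes additionally require an MPI-enabled HDF5 build.

<file python hdf5_checkpoint.py>
#!/usr/bin/env python
# HDF5 checkpointing sketch using h5py (my assumption; the workshop only
# pointed to the HDF5 library itself). All names are illustrative.
import numpy as np
import h5py

params = np.zeros(1000)   # some intermediate parameter values
iteration = 4200          # how far the computation has progressed

# Overwrite the checkpoint file with the current state.
with h5py.File('checkpoint.h5', 'w') as f:
    f.create_dataset('params', data=params)
    f.attrs['iteration'] = iteration

# Upon restart, read the saved state back.
with h5py.File('checkpoint.h5', 'r') as f:
    params = f['params'][...]
    iteration = f.attrs['iteration']
</file>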
==== Automatic checkpoints ====

This approach might be far more interesting, as it does not require the programmer to change the code. It works very well for simple single-node, single-core, single-everything programs, and might be a little problematic for more complex tasks. There are several automatic checkpointing tools; the workshop mainly discussed [[http://dmtcp.sourceforge.net/|DMTCP]], which is available on ''guillimin'' (''module add DMTCP''). [[http://criu.org|CRIU]] might become available at some point in the future. The idea behind DMTCP is that you don't run your program directly, but let DMTCP control it for you. It will then create another script that you have to run if you need to re-run your job: that script will continue the process from a previously saved state.

This is a sample submission script, slightly modified from the version used at the workshop:

<file bash dmtcp.sh>
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00:00
#PBS -N example_job

cd $PBS_O_WORKDIR

# Add the module, just in case
module add DMTCP

# DMTCP communicates with our tasks through a socket. It uses port number
# 7779 by default, but if there are several DMTCP coordinators running on
# the same node we will have problems. The best solution is to assign the
# port number manually. Also, if PORT=0, a random unused port will be
# chosen, which is probably better.
PORT=7745

# Check if there is a restart script from a previous run, and use it if
# so. If not, just tell dmtcp_launch to run our program for us.
if [ -e "dmtcp_restart_script.sh" ]
then
    # We launch the restart script with the port number and also with
    # the name of the current node. If the job ran on node n1 the first
    # time but upon restart we are assigned node n2, it would fail
    # unless we add that switch.
    ./dmtcp_restart_script.sh -p ${PORT} -h $(hostname)
else
    # The -i switch tells dmtcp_launch the time in seconds between
    # checkpoints. 60 is probably too small, so adjust it accordingly.
    dmtcp_launch -i 60 -p ${PORT} ./my_job
fi
</file>

That's it: we use this very script both to launch our job and to re-launch it if we need to. DMTCP will regularly create process dumps and link ''dmtcp_restart_script.sh'' (created in the directory from which you run your job) to the relevant copy.
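In practice, the cycle is: submit the script once with ''qsub dmtcp.sh''; if the job is killed (walltime exceeded, node crash, ...), just submit exactly the same script again. On the second run, the ''if'' branch finds ''dmtcp_restart_script.sh'' and resumes from the last dump instead of starting from scratch.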
=== What if I want to launch N jobs using DMTCP? Is that possible? ===

Yes. You just have to remember to assign each job a different port number and to run every job from its own subdirectory to avoid conflicts. Consider my particular case: I need to run several hundred R programs, and each generated script will run a batch of them on one node. This is the Python code I have used to generate the bash scripts that I then send to ''qsub'':

<file python generatordmtcp.py>
#!/usr/bin/env python
import os

# New version of this script. Now we use DMTCP to launch
# the scripts.

def chunks(l, n):
    """ Yield successive n-sized chunks from l. """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

## MAIN ##
if (__name__ == "__main__"):
    NPROCS = 12

    # Get the list of scripts to run. They are files with both
    # 'modelimp1_global' and 'CLUSTER' in their names. Each
    # script is one job to launch.
    scripts = os.listdir('.')
    scripts = filter(lambda x: x.find('CLUSTER') != -1, scripts)
    scripts = filter(lambda x: x.find('modelimp1_global') != -1, scripts)
    scripts.sort()

    id = 0

    # We'll save temporary results in the project directory, so we
    # don't have to worry about quotas on the scratch one. We might
    # need these data for several weeks.
    optdir = "/gs/project/eim-670-aa/jmateos/esmglobalfit"

    # Port list for DMTCP: one distinct port per job within a node.
    ports = range(7701, 7701 + NPROCS)

    ## MAIN LOOP ##
    for batch in chunks(scripts, NPROCS):
        id = id + 1
        jobname = "esmglobal_%02d" % id
        btemp = """#!/bin/bash
#PBS -A eim-670-aa
#PBS -l nodes=1:ppn=%d
#PBS -l walltime=00:30:00
#PBS -V
#PBS -N %s
#PBS -o %s
#PBS -e %s

function rundmtcpjob () {
    jobfile=$1
    port=$2
    jobname=$(basename ${jobfile} .R)
    optdir=/gs/project/eim-670-aa/jmateos/esmglobalfit

    # Create a job directory within ${optdir} and copy all files there.
    # If it already exists, the script has probably run once already,
    # so don't do anything.
    scdir=${optdir}/${jobname}
    if [ ! -e ${scdir} ]
    then
        mkdir ${scdir}
        cp -va * ${scdir}
    fi

    # Move to ${scdir} and run the script using dmtcp_launch, as in the
    # workshop. That directory will also hold the temporary files.
    cd ${scdir}
    if [ -e "dmtcp_restart_script.sh" ]
    then
        ./dmtcp_restart_script.sh -p ${port} -h $(hostname)
    else
        dmtcp_launch -i 86400 -p ${port} R CMD BATCH ${jobfile}
    fi
}

cd /home/jmateos/code/devmodel/devmodelR
""" % (len(batch), jobname, \
       optdir + '/' + jobname + '.log', \
       optdir + '/' + jobname + '.err')

        jobsfile = jobname + '.sh'
        f = open(jobsfile, 'w')
        f.write(btemp)
        for i in range(len(batch)):
            line = "rundmtcpjob %s %d &\n" % (batch[i], ports[i])
            f.write(line)
        f.write("wait\n")
        f.close()
        os.chmod(jobsfile, 0755)
    # end for loop
</file>

In the end, this script generates as many ''.sh'' files as needed, splitting the input scripts into chunks of ''NPROCS'' per node, and makes the generated files executable. The only thing needed afterwards is to check that all the parameters are correct and to ''qsub'' them. It uses my personal project space (hardcoded, so you will need to change it) to create the per-task subdirectories for DMTCP.

**Currently this is not working as expected: for some unknown reason, only 2 random jobs get restarted. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me).**

**Update 2: they did not reply.**
