Just some questions I had while using this cluster system.
The maximum run time for a job is 30 days. You can run jobs for up to that long without asking an administrator for permission. Keep in mind, though, that this limit cannot be extended.
If you are going to run very long jobs, please consult the checkpointing manual, which contains notes from a Calcul Québec workshop.
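As a rough sketch, assuming the run time is requested through the standard Torque/PBS walltime resource in the script header (the exact line is my assumption, not something taken from the Guillimin documentation), 30 days works out to 30 x 24 = 720 hours:

```shell
# Assumption: walltime is written as HH:MM:SS, so 30 days * 24 h = 720 h
#PBS -l walltime=720:00:00
```

Asking for less than the maximum is usually a good idea when the job allows it, since shorter jobs tend to be easier for the scheduler to place.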
This happened to me once: one of my jobs showed up under a group that is not ours. See this:
    $ showq -r -u jmateos
    ...
    30486658 R gm- 93.14 1.2 no jmateos eim-670- sw-2r14-n14 12 1:16:49:28 Thu Apr 16 03:51:21
    30486659 R gm- 88.11 1.3 no jmateos eim-670- sw-2r15-n70 12 1:18:30:22 Thu Apr 16 05:32:15
    30621080 R gm- 91.52 1.0 no jmateos atlaspt sw-2r14-n13 12 3:03:19:00 Thu Apr 16 14:20:53
    ...
Job 30621080 seems to be running under the allocation group atlaspt and not one of our own. I contacted Guillimin support and they told me this:
This is an issue with our scheduler software that is only a cosmetic problem. The job 30621080 will be correctly charged to the account eim-670-aa. The 'group' parameter in the scheduler is distinct from one called 'account' which is the one used for accounting purposes. The group parameter is correct on the worker nodes, but gets confused at some point on the scheduler node. However, it will not affect your jobs.
So don't worry too much about this.
To get e-mail notifications about your jobs, you can use the -m and -M switches in your script header. Example:
    ...
    #PBS -m abe
    #PBS -M firstname.lastname@example.org
    ...
The -m abe option instructs the scheduler to send you an e-mail when your job starts (b), finishes (e) or aborts due to an error (a), and -M sets the address the messages are sent to. You can select any combination of these letters.
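For instance, here is a sketch of a header that only sends mail when the job ends or aborts and skips the start notification. The address is the same placeholder as in the example above, and the comment lines are only there for explanation:

```shell
# Mail only on abort (a) and end (e); no message when the job begins
#PBS -m ae
# Placeholder address; replace it with your own
#PBS -M firstname.lastname@example.org
```

This can be handy when submitting many jobs at once, since it roughly cuts the number of messages per job from two to one.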