Using Torque, when I submit a job with qsub with particular arguments, three things happen once the job finishes: 1) I get a file.eXXXX file containing the stderr of the process; 2) I get a file.oXXXX file containing the stdout of the process; 3) I receive an email with information such as allocation and exit status.

I'd like to have this status information in a file next to the .oXXXX and .eXXXX files, because it is too difficult to correlate hundreds of emails with hundreds of job output files, especially several days later. I can't find such a capability built in. However, I noticed that I can use "qstat -f job-id" to get information quite similar to what's in the email. What I don't see in the documentation is how long after the job ends I can still run qstat and get that information.

I thought about launching job A with qsub and then using its job ID to launch a dependent job B (qsub -W depend=...) that runs "qstat -f" on the ID of A, passing that ID to B via an environment variable. However, I don't know how long after A finishes job B will run. Also, if job B does not run on the same node as A, will qstat be able to find the correct information?

My idea seems convoluted. Isn't there an easier/better way of doing this?

I don't think this can be done by installing some sort of email monitor, because I read my email on a completely different machine which does not have access to the compute cluster.

If you just need to know whether the job finished normally or not, you could add an echo Success at the end of the job script and check for that line in the .oXXXXXX file. - Dmitri Chubarov
Yes of course I can examine the output file to figure out whether it finished successfully, but it would be nice to correlate the information in the email with the actual job output, for example the resources. It might be nice to examine how much time was used as opposed to how much was requested. I.e., there's potentially useful information in the email (on the email reading host) which cannot be easily correlated with the output files on the compute host. - Jim Newton
At the conclusion of a job, while it is still running, the output of qstat -f already contains the resources_used parameters. Something like qstat -f $PBS_JOBID | grep resources_used will likely work when executed as the last line of the job script. - Dmitri Chubarov
That's a good idea, Dmitri. It partially solves the problem. I can take the output of qstat -f $PBS_JOBID, find the line prefixed by "Output_Path =" to get the name of the output file (warning: first unwrap the lines, because qstat -f annoyingly wraps long lines), replace the .oXXXX extension with .sXXXX, and then dump the output of qstat -f to that file. This takes care of the case where the job finishes successfully. However, it does not help if the job fails. In that case I still only get an uncorrelated email. - Jim Newton
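
The steps in the last comment could be sketched roughly as follows. This is only a sketch: the function names unwrap_qstat and status_file_for are made up for illustration, and it assumes that continuation lines in qstat -f output start with a tab and that Output_Path has the form host:/path/name.oXXXX. Check both against the output on your own cluster.

```shell
#!/bin/sh
# Join the continuation lines of wrapped "qstat -f" output.
# Assumption: each continuation line starts with a tab character.
unwrap_qstat() {
    awk '
        /^\t/ { sub(/^\t/, ""); buf = buf $0; next }   # continuation: append to current line
        { if (NR > 1) print buf; buf = $0 }            # new line: flush the previous one
        END { if (NR > 0) print buf }
    '
}

# Derive the .sXXXX file name from already-unwrapped qstat -f text ($1).
# Assumption: Output_Path looks like "host:/path/name.oXXXX".
status_file_for() {
    out=$(printf '%s\n' "$1" | sed -n 's/^ *Output_Path = //p')
    out=${out#*:}                                      # drop the leading "host:" part
    printf '%s\n' "$out" | sed 's/\.o\([0-9][0-9]*\)$/.s\1/'
}

# Only attempt the dump inside a PBS job where qstat is available.
if [ -n "${PBS_JOBID:-}" ] && command -v qstat >/dev/null 2>&1; then
    info=$(qstat -f "$PBS_JOBID" | unwrap_qstat)
    target=$(status_file_for "$info")
    [ -n "$target" ] && printf '%s\n' "$info" > "$target"
fi
```

Placed at the end of the job script, this writes the unwrapped qstat -f output next to the .oXXXX file. Outside a PBS job the final block does nothing.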

1 Answer

To handle the case of getting the output even if the job fails, include something like this near the top of the batch script after the PBS preamble:

trap "qstat -f $PBS_JOBID | grep resources_used" EXIT

This ensures that the command in quotes is executed whenever the script exits, whatever the reason. Because PBS first sends your job a SIGTERM signal and only resorts to SIGKILL if your script doesn't exit in response to the former, the trap should always get a chance to run. This also means you can remove the qstat command at the end of the script; the trap will be hit there too.

sh, bash, and derivatives support trap, but csh and derivatives do not.
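
As a slightly fuller sketch of the same idea, the trap can call a small function that also records the script's exit status and writes everything to a file. The names STATUS_FILE and write_status are hypothetical, and placing the file under $PBS_O_WORKDIR is just one possible choice; adjust it to wherever your .oXXXX files end up.

```shell
#!/bin/sh
#PBS -N example

# Hypothetical status file location; falls back to the current directory
# when the script is run outside of PBS.
STATUS_FILE="${PBS_O_WORKDIR:-.}/status.${PBS_JOBID:-nojob}"

write_status() {
    rc=$?    # exit status of the script at the moment the trap fires
    {
        echo "exit_status = $rc"
        # Only query the server inside a PBS job where qstat is available.
        if [ -n "${PBS_JOBID:-}" ] && command -v qstat >/dev/null 2>&1; then
            qstat -f "$PBS_JOBID" | grep resources_used
        fi
    } > "$STATUS_FILE"
}

# Run write_status whenever the script exits.
trap write_status EXIT

# ... the actual work of the job goes here ...
```

Using a function rather than a quoted one-liner keeps the trap readable and makes it easy to add more fields later.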