
I know I asked the same question earlier, in this link:

Setting SGE for running an executable with different input files on different nodes

Like I said in that thread, I worked with this kind of thing on a SLURM system before without any issues, because everything is wrapped into one submission script. Adapting from the previous question linked above, here is my approach on SGE (I know this is bad practice, but I really couldn't think of a better way...)

The job is chained through 4+N scripts: run.sh, submitSerial.sh, wrap.sh, temp.sh, and job{1-N}.sh

run.sh: The main job script

#!/bin/bash

...some stuff...
...create N directories to run N input files in parallel (as in the previous question)...
...generate wrap.sh and job{1-N}.sh...

...parameters definition...

for (( i=0; i<=M; i++ ))
do
   ...generate submitSerial.sh...
   sh submitSerial.sh
   ...initialize boolean flag...
   while [ "$flag" = true ]
   do
      sh wrap.sh
      ...run an executable to determine the flag status...
   done
done

...some cleanup...

submitSerial.sh and temp.sh: I need to run this executable in serial first, and I want the cluster to wait until it is done before proceeding to the next step in run.sh. Since run.sh is not in the cluster environment (i.e. it has no Grid Engine parameters) but lives only on a login node, it generates temp.sh and submits the serial job through qsub right away. Since I don't know how to check whether a qsub job is done, I had to do it the foolish way below. Is there a better way to check?

#!/bin/bash

echo "#!/bin/bash" >> $workDir/temp.sh
echo >> $workDir/temp.sh
echo "#$ -N serialForce" >> $workDir/temp.sh
echo "#$ -q batch.q" >> $workDir/temp.sh
echo "#$ -l h_rt=0:10:00" >> $workDir/temp.sh
echo "#$ -pe orte 120" >> $workDir/temp.sh
echo "#$ -wd /path/to/working/dir/" >> $workDir/temp.sh
echo "#$ -j y" >> $workDir/temp.sh
echo "#$ -S /bin/bash" >> $workDir/temp.sh
echo "#$ -V" >> $workDir/temp.sh
echo >> $workDir/temp.sh
echo "mpirun -np 120 nwchem-6.5 $workDir/step0_1.nw" >> $workDir/temp.sh

qsub $workDir/temp.sh

# crude wait: poll every second until qstat reports no jobs left at all
# (note: this waits on ALL of my queued jobs, not just the one above)
while true
do
   qstat > $workDir/temp
   if [ -s $workDir/temp ]
   then
      sleep 1
   else
      rm $workDir/temp
      break
   fi
done

rm $workDir/temp.sh
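
A simpler way to wait on one specific job, assuming your Grid Engine build supports it, is qsub -sync y, which blocks until that job finishes and hands back its exit status. A minimal sketch of what the qsub-plus-polling section above could become:

# sketch only: -sync y makes qsub block until the job completes,
# so no polling loop is needed; $? is the job's exit status
qsub -sync y $workDir/temp.sh
if [ $? -ne 0 ]
then
   echo "serial step failed" >&2
   exit 1
fi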

wrap.sh and job{1-N}.sh: These were generated at the beginning of run.sh. This is the part that was my question last time, and I use the same sleep loop to poll the qsub status here as well:

#!/bin/bash

for i in {1..10}
do
   qsub $workDir/wd$i/job$i.sh
done

# same crude wait as in submitSerial.sh: poll until the queue is empty
while true
do
   qstat > $workDir/temp
   if [ -s $workDir/temp ]
   then
      sleep 1
   else
      rm $workDir/temp
      break
   fi
done

for j in {1..10}
do
   rm $workDir/wd$j/*
done
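
For what it's worth, a variant of wrap.sh that waits only on the ten jobs it actually submitted (again assuming -sync y is available) is to background each blocking qsub and then wait:

#!/bin/bash

# sketch only: each qsub -sync y blocks until its own job ends, and
# 'wait' returns once all ten backgrounded qsub processes have returned
for i in {1..10}
do
   qsub -sync y $workDir/wd$i/job$i.sh &
done
wait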

The problem with this approach is that once I run run.sh, I can't put it in the background, and with the separate qsubs there is a potential problem if the cluster is full. Is there a solution with only one qsub, like the SLURM approach? I just want to submit the job and wait until it's done, rather than have the script fire off multiple qsub jobs without knowing whether any of them dies in the middle (and I never have an idea where it dies).

Please help me with this! Your help is highly appreciated! Thank you very much in advance!


1 Answer


Can you please be more specific and clear about what you are having issues with? It would appear that the last question you refer to largely addresses the wrap.sh and jobN.sh scripts, i.e. use job arrays.
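
For example, a single array job replaces the ten separate qsub calls in wrap.sh. A sketch, assuming the wd1..wd10 layout from your question and that workDir is exported into the job environment (here via -V):

#!/bin/bash
#$ -N forceArray
#$ -t 1-10
#$ -j y
#$ -S /bin/bash
#$ -V

# Grid Engine sets SGE_TASK_ID to 1..10, one value per array task
cd $workDir/wd$SGE_TASK_ID
sh job$SGE_TASK_ID.sh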

To address your other concern, i.e. how to check/wait for a job to complete, see below.

To have a job wait for another job to complete, use the qsub option -hold_jid. To apply this to multiple jobs, each dependent on the previous one completing, my first thought would be a for loop. E.g.:

# -terse makes qsub print only the job ID, which we capture for -hold_jid
holdid=$(echo "<some code for job 1>" | qsub -terse)
for jobn in {1..99}
do
   # each submission is held until the previous job completes
   holdid=$(echo "<some code for jobn>" | qsub -terse -hold_jid ${holdid})
done
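
Combining the two ideas, the whole wrap.sh step could shrink to a single submission that run.sh simply blocks on. A sketch, where jobArray.sh is the hypothetical array script above and -sync y support depends on your Grid Engine build:

# submits all 10 tasks at once and blocks until every task finishes;
# a non-zero exit status means at least one task failed
qsub -sync y -t 1-10 $workDir/jobArray.sh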

I am happy to edit this reply to help you out further.