
I have written a bash script to continually run jobs to generate large quantities of simulation data.

Essentially, once the script is started it should keep launching background processes to generate data, subject to the constraint that no more than 32 background jobs run simultaneously. This is required to prevent the processes from gobbling up all available RAM and stalling the server.

My idea was to launch bash functions in the background and store the PIDs of those jobs. Then, after 32 jobs have been launched, use the wait command to wait until all of those jobs have finished executing.

I think wait is the correct tool here: as long as the process exists when wait is called (which it will, because the simulations take 6 hours to run), wait will detect the process exiting.

This seems like a better option than just polling processes and checking for the existence of a particular PID as PIDs are recycled, and another process could have been started after ours finished with the same PID. (Just by random chance, if we are unlucky.)

However, the wait method has drawbacks. If the processes do not exit in the order they were launched, wait will be called for a PID which no longer exists, unless a new process has re-used the PID we recorded earlier. In addition, if one job takes significantly longer than the others (again by chance), we will be waiting for that one job to end while there is room on the system for another 31 jobs, which cannot be launched until that final PID exits.

This is probably becoming a bit hard to visualize so let me add some code...

I am using a while loop as the foundation of this "algorithm":

c=0 # count total number of jobs launched (will not really use this here)
PIDS=() # keep an array of PIDs

# maximum number of simultaneous jobs and counter
BATCH_SIZE=32
BATCH_COUNT=0

# just start looping
while true

    # edit: forgot to add this initially
    # just check to see if a job has been run using file existence
    if [ ! -e "$FILE_NAME_1" ]
    then

        # obvious
        if [ "$BATCH_COUNT" -lt "$BATCH_SIZE" ]
        then

            (( BATCH_COUNT += 1 ))

            # this is used elsewhere to keep track of whether a job has been executed (the file existence is a flag)    
            touch "$FILE_NAME_1"
            # call background job, parallel_job_run is a bash function
            # the trailing & runs it in the background so that $! is set
            parallel_job_run $has_some_arguments_but_not_relevent &
            # get PID
            PID=$!
            echo "[ JOB ] : Launched job as PID=$PID"
            PIDS+=("$PID")

            # count total number of jobs
            ((c=c+1))
        fi

    else
        # increment file name to use as that file already exists
        # the "files" are for input/output
        # the details are not particularly important
        : # placeholder no-op, bash does not allow an empty else branch
    fi

    true # prevent exit

# the following is a problem 
do      
    if (( BATCH_COUNT < BATCH_SIZE ))
    then
        continue
    else
        # collect launched jobs
        # this does not collect jobs in the order that they finish
        # it will first wait for the first PID in the array to exit
        # however this job may be the last to finish, in which case
        # wait will be called with other array values with PIDs which
        # have already exited, and hence it is undefined behaviour
        # as to whether we wait for a PID which doesn't exist (no problem)
        # or a new process may have started which re-uses our PID
        # and therefore we are waiting for someone else's process
        # to finish which is nothing to do with our own jobs!
        # we could be waiting for the PID of someone else's tty login
        # for example!
        for pid in "${PIDS[@]}"
        do
            wait "$pid" || echo "failed job PID=$pid"
            (( BATCH_COUNT -= 1 ))
        done
        # forget the collected PIDs so the next batch starts with an empty list
        PIDS=()
    fi

done 

Hopefully the combination of the explanation above and the comments in the code makes it clear what I am attempting to do.

My other idea was to replace the for loop at the end with another loop which continually checks whether each of the PIDs still exists (polling). This could be combined with sleep 1 to prevent CPU hogging. However, the problem with this is the same as before: our process may exit, releasing its PID, and another process may then be started which acquires that PID. The advantage of this method is that we will never wait more than about 1 second before a new process is launched when a previous one exits.
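A minimal sketch of that polling idea, assuming bash. Instead of probing raw PIDs it asks the shell for its own running background jobs with jobs -rp, which should sidestep the PID-reuse concern because only this shell's children are listed. The names MAX_JOBS, TOTAL_JOBS, POLL_SECONDS and peak, and the body of parallel_job_run, are stand-ins for illustration and not part of the real script.

```bash
#!/usr/bin/env bash
# Sketch only: poll the shell's own job table rather than raw PIDs.

MAX_JOBS=3         # the real script would use 32
TOTAL_JOBS=7       # hypothetical total number of jobs
POLL_SECONDS=0.2   # the idea above uses 1 second

# Hypothetical stand-in for the real six-hour simulation.
parallel_job_run() { sleep 0.5; }

launched=0
peak=0             # highest number of simultaneous jobs observed
while (( launched < TOTAL_JOBS )); do
    # Count this shell's background jobs that are still running.
    running=$(jobs -rp | wc -l)
    if (( running < MAX_JOBS )); then
        parallel_job_run &
        launched=$(( launched + 1 ))
        if (( running + 1 > peak )); then
            peak=$(( running + 1 ))
        fi
    else
        # Every slot is busy: sleep briefly, then look again.
        sleep "$POLL_SECONDS"
    fi
done

wait   # collect whatever is still running at the end
echo "launched $launched jobs, peak concurrency $peak"
```

A new job starts at most one polling interval after a slot frees up, at the cost of waking the script periodically.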

Can anyone give me any advice on how to proceed with the problems I am having here?

I will continually update this question today - for example by adding new information if I find any and by formatting it / rewording sections to make it clearer.


1 Answer


If you use the -n option with wait (available in bash 4.3 and later), it waits for the next job to finish, whichever one that is, rather than for a specific PID. So, that could be one solution.
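As a rough sketch of how wait -n could be used for this kind of throttling (assuming bash 4.3 or later): the loop below fills every slot, then calls wait -n before each further launch so that a new job starts as soon as any running one exits. MAX_JOBS, TOTAL_JOBS, out_dir and the body of parallel_job_run are hypothetical stand-ins, not the asker's real values.

```bash
#!/usr/bin/env bash
# Sketch only: throttle background jobs with `wait -n` (needs bash 4.3+).

MAX_JOBS=4             # the question uses 32; kept small for a quick demo
TOTAL_JOBS=10          # hypothetical total number of jobs
out_dir=$(mktemp -d)   # hypothetical scratch directory for the demo

# Hypothetical stand-in for the real simulation: sleep briefly, then
# leave a marker file so completed jobs can be counted afterwards.
parallel_job_run() {
    sleep "0.$(( $1 % 3 + 1 ))"
    touch "$out_dir/job_$1.done"
}

launched=0
for (( i = 0; i < TOTAL_JOBS; i++ )); do
    # Once every slot is taken, block until any one job exits.
    if (( launched >= MAX_JOBS )); then
        wait -n || echo "a job exited with a non-zero status"
    fi
    parallel_job_run "$i" &
    launched=$(( launched + 1 ))
done

wait   # collect the jobs still running at the end

# Count the marker files left behind by completed jobs.
shopt -s nullglob
done_files=("$out_dir"/job_*.done)
finished=${#done_files[@]}
rm -rf "$out_dir"
echo "finished $finished of $TOTAL_JOBS jobs"
```

Because wait -n returns as soon as any job exits, one slow job no longer holds up the other slots, and no PID bookkeeping is needed.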

Also, Linux does not recycle PIDs immediately, as you seem to imply. It assigns the next available PID to each new process in increasing order, and starts again from the beginning only after it has reached the maximum available PID.
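On the PID-reuse worry more generally, a small experiment may help. As far as I know, bash's wait only accepts PIDs of the shell's own children and remembers the exit status of a child that has already finished, so it should not end up waiting on an unrelated process. The sketch below checks both behaviours; the variable names are made up for the demo.

```bash
#!/usr/bin/env bash
# Sketch only: how `wait` treats an exited child and a foreign PID.

# 1) A child that has already exited: bash still reports its exit status.
( exit 3 ) &
child_pid=$!
sleep 0.5                  # give the child time to exit before waiting
wait "$child_pid"
exited_child_status=$?

# 2) A PID that is not a child of this shell: wait returns at once
#    with an error status instead of blocking on someone else's process.
wait 1 2>/dev/null         # PID 1 is never a child of this script
foreign_pid_status=$?

echo "exited child: $exited_child_status, foreign PID: $foreign_pid_status"
```

In bash the first status should be the child's own exit code and the second should be 127, the value wait uses for a PID it does not know as a child.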