5 votes

I'm working with a long-running parfor loop in MATLAB.

parfor iter=1:1000
   chunk_of_work(iter);
end

There are generally about 2-3 timing outliers per run. That is to say, for every 1000 chunks of work performed, there are 2-3 that take about 100 times longer than the rest. As the loop nears completion, the workers that evaluated the outliers continue to run while the rest of the workers have no computational load.

This is consistent with the parfor loop distributing work statically. That contradicts the documentation for the Parallel Computing Toolbox, which says:

"Work distribution is dynamic. Instead of being allocated a fixed iteration range, the workers are allocated a new iteration only after they finish processing their current iteration, which results in an even work load distribution."

Any ideas about what's going on?


4 Answers

5 votes

I think the doc you quote gives a pretty good description of what is considered a static allocation of work: each worker "being allocated a fixed iteration range". For 4 workers, this would mean the first being assigned iter 1:250, the second iter 251:500, and so on; or 1:4:1000 for the first, 2:4:1000 for the second, and so on.

You did not say exactly what you observe, but what you describe is entirely consistent with dynamic workload distribution: first, the four (example) workers work on one iter each; the first one to finish picks up a fifth; the next one done (which may well be the same worker, if three of the first four take somewhat longer) picks up a sixth; and so on. Now if your outliers are numbers 20, 850 and 900 in the order in which MATLAB chooses to process the loop iterations, and each takes 100 times as long as the rest, this only means that iterations 21 through 320 will be handled by three of the four workers while the fourth is busy with number 20 (by iteration 320 that worker will be done, assuming a roughly even distribution of non-outlier computation time). The worker assigned the 850th iteration will, however, continue to run even after another worker has finished #1000, and the same goes for #900. In fact, if there were about 1100 iterations, the worker on #900 should finish at roughly the same time as the others.

[edited, as the original wording implied MATLAB would still assign the iterations of the parfor loop in order from 1 to 1000, which should not be assumed]

Long story short: unless you find a way to process your outliers first (which of course requires you to know a priori which ones the outliers are, and to find a way to make MATLAB start the parfor loop processing with these), dynamic workload distribution alone cannot avoid the effect you observe.
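To make the arithmetic above concrete, here is a toy simulation of greedy dynamic scheduling. This is my own sketch, not MATLAB's actual scheduler: 4 workers, 1000 unit-cost iterations, and the three example outlier positions from above costing 100 units each.

t = ones(1, 1000);            % unit cost per iteration
t([20 850 900]) = 100;        % example outlier positions from above
freeAt = zeros(1, 4);         % time at which each of 4 workers next becomes free
for k = 1:1000                % iterations handed out one at a time, in order
    [tFree, w] = min(freeAt); % the first worker to free up takes iteration k
    freeAt(w) = tFree + t(k);
end
makespan = max(freeAt)        % ~353: when the last worker finishes
balanced = sum(t) / 4         % ~324: perfectly balanced lower bound

Even with fully dynamic hand-out, the workers that draw the two late outliers (iterations 850 and 900 in processing order) are still running for tens of time units after the others have drained the queue, which is exactly the idle tail you observe.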

Addition: I think, however, that your observation that "as the loop nears completion, the worker*s* that evaluated the outliers continue to run" seems to imply at least one of the following:

  1. The outliers somehow are among the last iterations MATLAB starts to process
  2. You have many workers, in the order of magnitude of the number of iterations
  3. Your estimate of the number of outliers (2-3) or your estimate of their computation time penalty (factor 100) is too low
3 votes

The work distribution in PARFOR is somewhat deterministic. You can observe precisely what's going on by having each worker log to disk how things go, but basically it turns out that PARFOR divides your loop up into chunks in a deterministic way yet farms them out dynamically. Unfortunately, there's currently no way to control that chunking.

However, if you cannot predict which of your 1000 cases are going to be outliers, it's hard to imagine an efficient scheme for distributing the work.

If you can predict your outliers, you might be able to take advantage of the fact that, roughly speaking, PARFOR executes loop iterations in reverse order, so putting them at the "end" of the loop means work starts on them immediately.
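As a sketch of that trick, assuming you can compute the list of slow iterations ahead of time (outlierIdx is hypothetical, reusing the example positions from the answer above), you could permute the iteration order so the outliers sit at the end:

outlierIdx = [20 850 900];          % hypothetical: iterations known to be slow
normalIdx = setdiff(1:1000, outlierIdx);
order = [normalIdx outlierIdx];     % slow iterations at the "end" of the loop
parfor k = 1:1000
    chunk_of_work(order(k));        % if iterations run roughly in reverse, these start first
end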

3 votes

The problem you face is well described in @arne.b's answer; I have nothing to add to that.

But the Parallel Computing Toolbox does contain functions for decomposing a job into tasks for independent execution. From your question alone it's not possible to tell whether this approach suits your application, but if it does, the general strategy is to break the job into tasks of some size, have each worker tackle a task, and, when one finishes, go back to the stack of unfinished tasks and start on another.

You might be able to decompose your problem so that one task replaces one loop iteration (lots of tasks, lots of overhead in managing the computation, but the best load balancing) or so that one task replaces N loop iterations (fewer tasks, less overhead, poorer load balancing). Jobs and tasks are also a little trickier to implement than parfor.
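For example, here is a minimal sketch of the one-task-per-iteration variant using createJob/createTask, assuming your default cluster profile is set up and with chunk_of_work as the placeholder from the question:

c = parcluster();                   % cluster from the default profile
job = createJob(c);
for iter = 1:1000
    createTask(job, @chunk_of_work, 0, {iter});  % one task per iteration, no outputs
end
submit(job);                        % workers pull tasks as they become free
wait(job);
delete(job);                        % clean up the job when finished

If chunk_of_work returned something, you would request outputs in createTask and collect them with fetchOutputs(job) after the wait.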

0 votes

As an alternative to PARFOR, in R2013b and later, you can use PARFEVAL and divide up the work any way you see fit. You could even cancel the 'timing outliers' once you've got sufficient results, if that's appropriate. There is, of course, overhead when dividing up your existing loop into 1000 individual remote PARFEVAL calls. Perhaps that's a problem, perhaps not. Here's the sort of thing I'm imagining:

for idx = 1:1000
    futures(idx) = parfeval(@chunk_of_work, 1, idx);
end
done = false; numComplete = 0;
timer = tic();
while ~done
    [idx, result] = fetchNext(futures, 10); % wait up to 10 seconds
    if ~isempty(idx)
        numComplete = numComplete + 1;
        % stash result
    end
    done = (numComplete == 1000) || (toc(timer) > 100);
end
% cancel outstanding work, has no effect on completed futures
cancel(futures);