
I have obtained the task stream using distributed computing in Dask for different numbers of workers. I can observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computations per worker), I observe a similar trend. Can anyone suggest how to reduce the white space?

PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.

Image for: No. of workers = 16

Image for: No. of workers = 32

Image for: No. of workers = 64

I'd recommend generating a performance report: distributed.dask.org/en/latest/… In the report you can observe the administrative charts for the workers and the scheduler. Perhaps something will jump out and help explain why no work is happening (long writes to disk, serialization, etc.). If you don't see anything, post the HTML to raw.githack.com and paste the link here. – quasiben
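Following the comment above, a minimal sketch of generating such a report with `distributed`'s `performance_report` context manager. The array workload here is only a stand-in for the real computation:

```python
from dask.distributed import Client, performance_report
import dask.array as da

# Connect to a cluster; processes=False keeps this example lightweight.
client = Client(processes=False)

# Everything computed inside the block is captured in the HTML report,
# including the task stream and worker/scheduler administrative charts.
with performance_report(filename="dask-report.html"):
    # Stand-in workload: replace with your actual computation.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    x.sum().compute()

client.close()
```

Opening `dask-report.html` in a browser then shows the same diagnostics as the live dashboard, but saved for sharing.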

1 Answer


As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.

This can happen for many reasons. I'll list a few below:

  1. Very short tasks (sub millisecond)
  2. Algorithms that are not very parallelizable
  3. Objects in the task graph that are expensive to serialize
  4. ...
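As an illustration of the first cause: scheduler overhead is roughly constant per task, so sub-millisecond tasks spend more time on overhead than on work, and batching them into larger tasks amortizes it. A toy sketch (the functions here are hypothetical, not from the question):

```python
import dask

@dask.delayed
def tiny(x):
    # Sub-millisecond task: scheduling overhead dominates the work.
    return x + 1

@dask.delayed
def batched(xs):
    # One task processes a whole batch, amortizing the overhead.
    return [x + 1 for x in xs]

# 1000 tiny tasks ...
many_small = dask.compute(*[tiny(i) for i in range(1000)])

# ... versus 10 batched tasks over the same data.
data = list(range(1000))
batches = [data[i:i + 100] for i in range(0, len(data), 100)]
few_large = dask.compute(*[batched(b) for b in batches])

# Same results, far fewer tasks for the scheduler to manage.
assert list(many_small) == [y for b in few_large for y in b]
```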

Looking at your images I don't think that any of these apply to you.

Instead, I see that there are gaps of inactivity followed by bursts of activity. My guess is that this is caused by some code that you are running locally, and that your code looks something like the following:

for i in ...:
    results = dask.compute(...) # do some dask work
    next_inputs = ...  # do some local work

So you're being blocked by local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph), or it might be your own code's (maybe building the inputs for the next computation takes some time).
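If that guess is right, timing the two phases of the loop will confirm it. A sketch under that assumption, where `work` and `make_next_inputs` are placeholders for your distributed and local code respectively:

```python
import time
import dask

@dask.delayed
def work(x):
    # Stand-in for the distributed computation.
    return x + 1

def make_next_inputs(results):
    # Stand-in for the local work between rounds; the sleep
    # simulates expensive client-side code.
    time.sleep(0.05)
    return list(results)

inputs = [1, 2, 3]
dask_time = local_time = 0.0
for _ in range(3):
    t0 = time.perf_counter()
    results = dask.compute(*[work(x) for x in inputs])  # cluster phase
    t1 = time.perf_counter()
    inputs = make_next_inputs(results)                  # local phase
    t2 = time.perf_counter()
    dask_time += t1 - t0
    local_time += t2 - t1

print(f"cluster: {dask_time:.2f}s, local: {local_time:.2f}s")
```

If `local` dominates, the white space in the task stream is time the workers spend waiting on the client.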

I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html
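For the client-side portion, the standard library's `cProfile` is often enough. A sketch, where `build_next_inputs` is a hypothetical stand-in for whatever local code runs between Dask computations:

```python
import cProfile
import io
import pstats

def build_next_inputs(results):
    # Placeholder for the local work you want to profile.
    return [r * 2 for r in results]

profiler = cProfile.Profile()
profiler.enable()
build_next_inputs(list(range(100_000)))
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Whatever dominates the cumulative-time column is what is keeping your workers idle between rounds.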