gathering a large dataframe back into master in dask distributed

Question

I have a large (~180K row) dataframe for which

df.compute()

hangs when running dask with the distributed scheduler in local mode on an AWS m5.12xlarge (98 cores). All the workers remain nearly idle. However,

df.head(df.shape[0].compute(), -1)

completes quickly, with good utilization of the available cores.

Logically, the two calls above should be equivalent. What causes the difference? Is there some parameter I should pass to compute in the first version to speed it up?

MRocklin · Accepted Answer · 2019-06-16T08:25:17

When you call .compute(), you are asking for the entire result to be gathered into your local process as a single pandas dataframe. If that result is large, it might not fit in memory. Do you need the entire result locally? If not, perhaps you want .persist() instead.
