0
votes

I went through Dask tutorials and they always start with the initialization of the Client:

from dask.distributed import Client

client = Client(n_workers=4)

I am mostly interested in using Dask's read_csv function for reading DataFrames in parallel on my laptop.

import dask.dataframe as dd
df = dd.read_csv('trainset.csv').compute()

Despite setting n_workers=4, Dask uses all the cores when reading the CSV. It behaves the same whether I initialize the Client or not. Do I even need to initialize the Client when I use Dask locally and only for reading files? Is it initialized implicitly by Dask?

1 Answer

2
votes

Dask's default scheduler is the simple "threaded" scheduler, which cannot run on multiple machines. However, if you create a distributed Client, that scheduler becomes the default instead, even if it is a "local" client running on just one machine. The threaded scheduler remains the out-of-the-box default because it came first and because installing distributed pulls in extra dependencies such as tornado. In some limited cases the threaded scheduler can also be faster, because it is simpler, but the distributed scheduler has more features and diagnostics, so it is generally recommended.
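
Roughly, the difference looks like this (a minimal sketch reusing trainset.csv from your question; the scheduler= keyword to compute() lets you pick a scheduler for a single call without changing the default):

import dask.dataframe as dd
from dask.distributed import Client

# No Client created yet: compute() runs on the default threaded scheduler
df_threaded = dd.read_csv('trainset.csv').compute()

# Creating a Client, even a purely local one, registers the distributed
# scheduler as the new default for subsequent compute() calls
client = Client(n_workers=4)
df_distributed = dd.read_csv('trainset.csv').compute()

# You can also choose a scheduler explicitly for one call
df_explicit = dd.read_csv('trainset.csv').compute(scheduler='threads')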

The older single-machine schedulers: https://docs.dask.org/en/latest/setup/single-machine.html

The distributed scheduler, which can also be used on a single machine: https://docs.dask.org/en/latest/setup/single-distributed.html
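
If you only need the single-machine schedulers, here is a short sketch of setting one globally via dask.config.set instead of per call (assuming the same trainset.csv; 'processes' and 'synchronous' are the other single-machine options):

import dask
import dask.dataframe as dd

# Make the threaded scheduler the global default explicitly;
# 'processes' or 'synchronous' (handy for debugging) also work here
dask.config.set(scheduler='threads')

df = dd.read_csv('trainset.csv').compute()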