1 vote

I'm currently working on moving a machine learning workflow (scikit-learn) from a single machine to a Slurm cluster via Dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it should be as simple as wrapping the fit in joblib.parallel_backend('dask'). However, the location of the data that gets read in confuses me, and none of the tutorials mention it. Should I read the data with dask.dataframe to make sure it is passed to the cluster, or does it not matter if I just read it with a pandas DataFrame (in which case the data sits in the RAM of whichever machine runs the Jupyter notebook)?
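
For context, I connect to the cluster roughly like this (the resource numbers are placeholders for my actual Slurm settings):

    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    # Launch Dask workers as Slurm jobs (resources are placeholders)
    cluster = SLURMCluster(cores=4, memory="8GB")
    cluster.scale(jobs=2)

    # Connect this notebook to the scheduler
    client = Client(cluster)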

Thank you very much.

I'm not sure if I'm getting it correctly, but if the data is on your scheduler and you read it with dask.dataframe, it gets scattered across the cluster. Otherwise, if your data is partitioned (on S3, for example), then reading it with Dask means every worker gets a partition. - rpanai
For me, it makes more sense to read the data via dask.dataframe. Like you said, the data will then be scattered across the cluster, where the computation takes place. However, the tutorial confuses me because it uses a plain pandas DataFrame instead of dask.dataframe and distributes it with joblib.parallel_backend('dask'): grid_search.fit(data.data, data.target). I'm not sure whether this code automatically scatters the data from the local machine to the cluster. - dispink

1 Answer

0 votes

If your data is small enough (which it is in the tutorial) and the preprocessing steps are rather trivial, then it is okay to read it in with pandas. This reads the data into your local session, not yet onto any of the Dask workers. Once you enter the with joblib.parallel_backend('dask') block, the data is copied to each worker process and the scikit-learn work is done there.
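
Roughly, that path looks like the sketch below (the file name, columns, estimator, and parameter grid are placeholders; it assumes you already have a dask.distributed Client connected to your Slurm cluster):

    import pandas as pd
    import joblib
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Read locally: the DataFrame lives in the memory of the notebook machine
    df = pd.read_csv("train.csv")                      # placeholder file
    X, y = df.drop(columns="target"), df["target"]     # placeholder columns

    grid_search = GridSearchCV(LogisticRegression(max_iter=1000),
                               {"C": [0.1, 1.0, 10.0]}, cv=3)

    # Inside this block joblib ships the data to the workers and runs the
    # cross-validation fits there; the fitted search object comes back locally
    with joblib.parallel_backend("dask"):
        grid_search.fit(X, y)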

If your data is large or you have intensive preprocessing steps, it's best to "load" the data with Dask and then use Dask's built-in preprocessing and grid search where possible. In this case the data is actually loaded directly by the workers, because of Dask's lazy execution paradigm. Dask's grid search will also cache repeated steps of the cross-validation and can speed up computation immensely. More can be found here: https://ml.dask.org/hyper-parameter-search.html
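
A rough sketch of that Dask-native path, assuming dask-ml is installed (file pattern, columns, estimator, and grid are again placeholders):

    import dask.dataframe as dd
    from sklearn.linear_model import LogisticRegression
    from dask_ml.model_selection import GridSearchCV

    # Lazy read: partitions are only loaded when the workers actually need them
    ddf = dd.read_csv("data-*.csv")                            # placeholder files
    X = ddf.drop(columns="target").to_dask_array(lengths=True)
    y = ddf["target"].to_dask_array(lengths=True)

    # dask-ml's drop-in GridSearchCV builds one task graph for the whole search
    # and avoids recomputing pipeline steps shared between candidates
    grid_search = GridSearchCV(LogisticRegression(max_iter=1000),
                               {"C": [0.1, 1.0, 10.0]}, cv=3)
    grid_search.fit(X, y)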