I'm currently working on moving a machine learning (Scikit-Learn) workflow from a single machine to a Slurm cluster via Dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it's quite simple using joblib.parallel_backend('dask'). However, the location of the read-in data confuses me, and none of the tutorials mention it. Should I use dask.dataframe to read in the data to make sure it is passed to the cluster, or does it not matter if I just read it in with pandas (in which case the data stays in the RAM of the machine where I run the Jupyter notebook)?
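For context, this is roughly the pattern I'm following, as a minimal sketch; the queue name, resources, file name, and column names are placeholders, not my actual setup:

```python
import joblib
import pandas as pd
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Start Dask workers as Slurm jobs (queue/resources are placeholders).
cluster = SLURMCluster(queue="normal", cores=4, memory="16GB")
cluster.scale(jobs=2)
client = Client(cluster)

# Data is read with pandas, so it lives in the RAM of the machine
# running the Jupyter notebook.
df = pd.read_csv("train.csv")  # placeholder file
X, y = df.drop(columns="target"), df["target"]

grid_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# The Dask backend ships the fit tasks to the cluster workers.
with joblib.parallel_backend("dask"):
    grid_search.fit(X, y)
```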
Thank you very much.
If you read it with dask.dataframe, it scatters on the cluster. Otherwise, if your data is partitioned (on S3, for example), again reading with dask, every worker gets a partition. - rpanai

with joblib.parallel_backend('dask'): grid_search.fit(data.data, data.target)
I'm not sure if this code automatically scatters the data from the local machine to the cluster. - dispink
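To illustrate what the comments describe, here is a minimal sketch of the two ways the data could end up on the cluster. The S3 path, file name, and column names are placeholders, and the scatter= keyword is my assumption based on reading the Dask joblib backend documentation:

```python
import dask.dataframe as dd
import joblib
import pandas as pd
from dask.distributed import Client
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # or Client(cluster) with the SLURMCluster from the question

# Variant 1: read partitioned data with dask.dataframe, so each worker
# loads its own partition(s) and nothing has to fit in the notebook
# machine's RAM (path is a placeholder).
ddf = dd.read_csv("s3://my-bucket/train-*.csv")
ddf = ddf.persist()  # keep the partitions in worker memory

# Variant 2: read with pandas locally, then scatter the data to the
# workers up front so it is not re-sent with every task
# (scatter= is my assumption from the backend docs).
df = pd.read_csv("train.csv")
X, y = df.drop(columns="target"), df["target"]
grid_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)
```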