I am new to Dask and not sure what best practice is when setting up a Dask distributed cluster. When configuring the workers, which is more efficient: two workers with 4GB of memory each, or eight workers with 1GB of RAM each? Does it vary depending on the data that's going to be processed? We have about 5-10GB of data in Parquet format that needs to be processed. Can you suggest a common setup to start with? Also, when increasing the number of workers, do we need to increase the memory of the scheduler as well?
1 Answer
It will depend on the type of function you execute. If your functions are pure Python, it is better to have many workers (separate processes), otherwise execution will be serialized by Python's GIL. On the other hand, if your functions mainly call code that releases the GIL (e.g. NumPy, pandas, or Parquet I/O via PyArrow), then fewer workers with multiple threads each can be beneficial.
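To make the trade-off concrete, here is a minimal sketch of both layouts using `dask.distributed.LocalCluster` (the worker counts and memory limits match the numbers from the question; pick one configuration, not both):

```python
from dask.distributed import LocalCluster, Client

# Mostly pure-Python functions: many single-threaded worker processes,
# so tasks run in parallel across processes instead of contending on the GIL.
cluster = LocalCluster(n_workers=8, threads_per_worker=1, memory_limit="1GB")

# Mostly GIL-releasing work (NumPy/pandas/Parquet I/O): fewer worker
# processes, each with several threads sharing a larger memory pool.
# cluster = LocalCluster(n_workers=2, threads_per_worker=4, memory_limit="4GB")

client = Client(cluster)
print(client)  # shows the number of workers, threads, and total memory
```

The same `n_workers` / `threads_per_worker` split applies when launching workers manually with `dask-worker --nworkers ... --nthreads ...`. The scheduler itself holds only task metadata, not your data, so its memory needs grow with the number of tasks rather than with worker memory; for a handful of workers the default is usually fine.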