
This is a follow-up to this question. I'm now trying to run Dask across multiple EC2 nodes on AWS.

I'm able to start up the scheduler on the first machine:

[screenshot: scheduler startup log]

I then start workers on several other machines. From those machines I can reach the scheduler with nc -zv ${HOST} ${PORT}, and the workers do seem to connect to the scheduler, as evidenced by each worker's stdout: Registered to: tcp://10.201.101.108:31001. Almost immediately, though, each worker starts warning that its event loop is unresponsive.

[screenshot: worker log with the event-loop timeout warning]
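As an aside, the same reachability probe can be done from Python with only the standard library (a minimal sketch; the address is the scheduler address from above):

    import socket

    # Equivalent to `nc -zv 10.201.101.108 31001`: create_connection raises
    # OSError if the scheduler port is unreachable from this machine.
    with socket.create_connection(("10.201.101.108", 31001), timeout=5):
        print("scheduler port is reachable")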

From the scheduler's machine, in my Jupyter notebook, I then connect to the scheduler:

    from dask.distributed import Client

    dask_client = Client('10.201.101.108:31001')

But the work does not propagate to the worker nodes (worker-node CPU stays below 1%), nor even to the worker running on the same machine as the scheduler. The task is highly parallelizable: when run on a single machine (i.e., using Client(processes=False)), it consumes every core.
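To illustrate the kind of fan-out that should saturate the workers if the cluster were healthy, here is a minimal sketch; process_chunk and chunks are hypothetical placeholders, not the actual workload:

    from dask.distributed import Client

    def process_chunk(chunk):
        # Hypothetical stand-in for the real per-chunk work
        return sum(x * x for x in chunk)

    dask_client = Client('10.201.101.108:31001')
    chunks = [range(i, i + 1000) for i in range(0, 100000, 1000)]  # hypothetical input
    futures = dask_client.map(process_chunk, chunks)  # should spread across all workers
    results = dask_client.gather(futures)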


1 Answer


It is not uncommon to see the "Event loop was unresponsive" warning when first connecting, depending on your network.

Some things to check (a consolidated version of these checks is sketched after the list):

  1. client.get_versions(check=True)
  2. Does client.scheduler_info()['workers'] have anything? If not, then you might have trouble connecting.
  3. Consider looking at the worker logs with client.get_worker_logs().
  4. Try running a simple computation like client.submit(lambda x: x + 1, 10).result().
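Taken together, the four checks can be run as one small script (a sketch, assuming the scheduler address from the question):

    from dask.distributed import Client

    client = Client('10.201.101.108:31001')

    # 1. Raises a version mismatch error if client, scheduler, and workers
    #    are running inconsistent package versions
    client.get_versions(check=True)

    # 2. An empty mapping here means no workers have registered
    workers = client.scheduler_info()['workers']
    print(len(workers), 'worker(s) registered:', sorted(workers))

    # 3. Worker logs often show why registration or heartbeats fail
    for address, log in client.get_worker_logs().items():
        print(address, log, sep='\n')

    # 4. A trivial round trip; if this hangs, tasks are not reaching workers
    assert client.submit(lambda x: x + 1, 10).result() == 11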