We're trying to start a dask cluster using ECS on AWS. Our current setup:
- Two services - a dask-scheduler service and a dask-worker service, each with a task definition. Each service has one task (in the future the dask-worker task can scale out).
- The dask-scheduler maps ports 8786, 8787, & 9786 from the container to the host. The dask-worker task maps no ports.
- A classic load balancer sits in front of the dask-scheduler and listens on those three ports over TCP. Even though we only have one dask-scheduler task, the load balancer provides a static address across scheduler restarts.
- The dask-worker is started with the load balancer's address as its argument. The dask-scheduler is started with no arguments.
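To rule out basic reachability, here is a minimal sketch that checks whether the three scheduler ports accept TCP connections at a given address. The helper names `port_is_open` and `check_scheduler_ports` are made up for illustration. A successful connect only shows that something is listening at that address, not that dask's own protocol works end to end through the load balancer.

```python
import socket

# Ports the scheduler container maps to the host, as listed above.
SCHEDULER_PORTS = (8786, 8787, 9786)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_scheduler_ports(host, ports=SCHEDULER_PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_scheduler_ports("my-dask-elb.example.com")` (a placeholder hostname, to be replaced with the load balancer's DNS name) from the worker's host would show which of the three listeners are reachable from there.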
Unfortunately, I'm not having much luck. I'm getting these log messages:
06:10:24 distributed.core - INFO - Connection from 172.31.35.94:49003 to Scheduler
06:10:24 distributed.core - INFO - Lost connection: ('172.31.35.94', 49003)
06:10:24 distributed.core - INFO - Close connection from 172.31.35.94:49003 to Scheduler
06:10:54 distributed.core - INFO - Connection from 172.31.35.94:49009 to Scheduler
06:10:54 distributed.core - INFO - Lost connection: ('172.31.35.94', 49009)
06:10:54 distributed.core - INFO - Close connection from 172.31.35.94:49009 to Scheduler
06:11:07 distributed.core - INFO - Connection from 172.31.35.94:49018 to Scheduler
06:11:07 distributed.core - INFO - Connection from 172.31.35.94:49019 to Scheduler
06:11:07 distributed.scheduler - INFO - Receive client connection: 941a5c1a-8ac2-11e6-a74c-0242ac110001
06:11:24 distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler
06:11:24 distributed.core - INFO - Lost connection: ('172.31.35.94', 49023)
06:11:24 distributed.core - INFO - Close connection from 172.31.35.94:49023 to Scheduler
06:11:54 distributed.core - INFO - Connection from 172.31.35.94:49033 to Scheduler
06:11:54 distributed.core - INFO - Lost connection: ('172.31.35.94', 49033)
06:11:54 distributed.core - INFO - Close connection from 172.31.35.94:49033 to Scheduler
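The dropped connections look regularly spaced, so here is a small sketch that measures the gaps between the "Lost connection" entries. The timestamps are copied from the log above; the function name `gaps_in_seconds` is made up for illustration. A fixed interval would point to some periodic probe opening and closing a socket (possibly the load balancer's health check, though that is a guess and not something I have confirmed) rather than to the worker itself.

```python
from datetime import datetime

# Timestamps of the "Lost connection" entries, copied from the log above.
LOST_CONNECTION_TIMES = ["06:10:24", "06:10:54", "06:11:24", "06:11:54"]


def gaps_in_seconds(times):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(LOST_CONNECTION_TIMES)
print(gaps)  # [30, 30, 30]
```

The connections at 06:11:07 do not fit this pattern and are not dropped; they are the ones followed by the "Receive client connection" entry.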
I think it's an issue with the load balancer. When I run the same setup with static IPs, it works fine.
Any ideas why this should be a problem? I've tried running in --no-nanny mode and passing the load balancer address to --host on the scheduler, both to no avail.
distributed.core - INFO - Collecting unused streams. open: 512, active: 0
– Maximilian