2 votes

We're trying to start a dask cluster using ECS on AWS. Our current setup:

  • Two services - a dask-scheduler service and a dask-worker service, each with a task definition. Each service has one task (in the future the dask-worker task can scale out).
  • The dask-scheduler maps ports 8786, 8787, & 9786 from the container to the host. The dask-worker task maps no ports.
  • A classic load balancer sits in front of the dask-scheduler and listens on those three ports on TCP. Even though we only have one dask-scheduler task, the load balancer provides a static address across scheduler restarts.
  • The dask-worker is started with the load balancer's address as its argument. The dask-scheduler is started with no arguments.
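Concretely, the startup described above amounts to something like the following argv sketch (the load balancer DNS name is a hypothetical placeholder):

```python
# Hypothetical load balancer DNS name -- substitute your own.
LB_ADDRESS = "my-dask-lb.us-east-1.elb.amazonaws.com"

# The scheduler takes no arguments; by default it listens on 8786
# (scheduler), 8787 (Bokeh dashboard) and 9786 (HTTP) -- the three
# ports the load balancer forwards.
scheduler_cmd = ["dask-scheduler"]

# Each worker is pointed at the load balancer rather than at the
# scheduler task directly, so its configuration survives scheduler restarts.
worker_cmd = ["dask-worker", f"{LB_ADDRESS}:8786"]
```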

Unfortunately, I'm not having much luck. I'm getting these log messages:


06:10:24
distributed.core - INFO - Connection from 172.31.35.94:49003 to Scheduler

06:10:24
distributed.core - INFO - Lost connection: ('172.31.35.94', 49003)

06:10:24
distributed.core - INFO - Close connection from 172.31.35.94:49003 to Scheduler

06:10:54
distributed.core - INFO - Connection from 172.31.35.94:49009 to Scheduler

06:10:54
distributed.core - INFO - Lost connection: ('172.31.35.94', 49009)

06:10:54
distributed.core - INFO - Close connection from 172.31.35.94:49009 to Scheduler

06:11:07
distributed.core - INFO - Connection from 172.31.35.94:49018 to Scheduler

06:11:07
distributed.core - INFO - Connection from 172.31.35.94:49019 to Scheduler

06:11:07
distributed.scheduler - INFO - Receive client connection: 941a5c1a-8ac2-11e6-a74c-0242ac110001

06:11:24
distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler

06:11:24
distributed.core - INFO - Lost connection: ('172.31.35.94', 49023)

06:11:24
distributed.core - INFO - Close connection from 172.31.35.94:49023 to Scheduler

06:11:54
distributed.core - INFO - Connection from 172.31.35.94:49033 to Scheduler

06:11:54
distributed.core - INFO - Lost connection: ('172.31.35.94', 49033)

06:11:54
distributed.core - INFO - Close connection from 172.31.35.94:49033 to Scheduler

I think it's an issue with the load balancer. When I run the same setup with static IPs, it works fine.

Any ideas why this is a problem? I've tried running in --no-nanny mode, and I've tried passing the load balancer address to --host on the scheduler, to no avail.

First, cool setup. I'm quite interested to see where this goes. I personally have no suggestions here other than to ensure that the ports you need open are open and that everyone can see each other in the network. – MRocklin
Thanks @MRocklin. Do you know if the workers would need any ports mapped? And is it anything to do with the HTTP ports? I couldn't find any documentation on those. – Maximilian
After leaving the scheduler running and idle for a while, I get three of these every five seconds: distributed.core - INFO - Collecting unused streams. open: 512, active: 0 – Maximilian
@Maximilian: any chance you got this working? Would be curious to hear what the issue was (cc @jalessio). – kuanb
I didn't, unfortunately. I think it was something to do with the load balancer only being compatible with traffic going in one direction. – Maximilian

2 Answers

0 votes

I've been struggling with the same issue and here's what I've found.

You have to run the ECS tasks in awsvpc network mode to get ECS to assign a unique IP address to each Docker container it starts. If you look at the error messages, you can see the workers are connecting from addresses that are internal to Docker:

distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler

That 172.31.35.94 IP does not exist on the network the AWS instances run on; it is internal to Docker. But the Docker containers run on different machines, so the scheduler can't find the worker at that address. I haven't found a way of telling dask-worker the external address of the AWS instance that runs the container.

So, the only option I've found is to run all the tasks in awsvpc network mode, in which case ECS assigns a private IP (of the form 192.168.0.0/24 in my setup) to each container. The problem with this is that you can't connect to the Bokeh dashboard anymore, as the container IP address is now private.
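The relevant task-definition change is just the network mode; a minimal sketch with the field names used by the ECS RegisterTaskDefinition API (family name, image, and scheduler address are placeholders):

```python
# "awsvpc" gives each task its own elastic network interface and IP.
task_definition = {
    "family": "dask-worker",              # placeholder family name
    "networkMode": "awsvpc",
    "containerDefinitions": [{
        "name": "dask-worker",
        "image": "daskdev/dask:latest",   # placeholder image
        "command": ["dask-worker", "dask-scheduler.local:8786"],
        "memory": 512,
    }],
}
```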

So you'll need to then additionally run some NAT service to tunnel traffic from the public web to your scheduler.


You'll need to create a security group (let's call it dask) and open the Dask ports (8786 and the ephemeral ports) on that security group, at least for the subnet the containers run in, then launch the scheduler and worker tasks with that security group.
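In EC2 API terms, the ingress rules for this hypothetical dask group might look like the following sketch (the group id is a placeholder; the structure matches the IpPermissions shape accepted by the EC2 AuthorizeSecurityGroupIngress API):

```python
DASK_SG = "sg-0123456789abcdef0"  # placeholder security-group id

# Allow the scheduler port plus the workers' ephemeral source ports,
# but only from members of the same "dask" security group.
ip_permissions = [
    {"IpProtocol": "tcp", "FromPort": 8786, "ToPort": 8786,
     "UserIdGroupPairs": [{"GroupId": DASK_SG}]},
    {"IpProtocol": "tcp", "FromPort": 32768, "ToPort": 65535,  # ephemeral range
     "UserIdGroupPairs": [{"GroupId": DASK_SG}]},
]
```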

Notice in the logs below that the workers are connecting from dynamic ports above 35000; this means the security group must keep these ports open, at least within the subnet. You could optionally configure each worker to connect from a specific port using --worker-port, and then open only that port.
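As a sanity check, the worker's source port can be read straight off one of the connection log lines quoted in the question (stdlib sketch):

```python
import re

# One of the scheduler log lines from the question.
line = "distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler"

# The address after "Connection from" is the worker's source IP and its
# ephemeral source port -- the range the security group must allow.
m = re.search(r"Connection from (\d+\.\d+\.\d+\.\d+):(\d+)", line)
worker_ip, worker_port = m.group(1), int(m.group(2))
print(worker_ip, worker_port)  # -> 172.31.35.94 49023
```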

The logs from the container running the scheduler should look something like the following: [screenshot of scheduler logs omitted]

0 votes

This is definitely a networking issue preventing communication between the instance and ECS. For the load balancer health checks to pass, your dask-scheduler security group must allow inbound traffic on the specified ports. Confirm the items below:

What’s your VPC subnet? Is it the same one used by the ECS instances?

With a dynamic IP, can you confirm end-to-end worker-scheduler communication at layer 2 or 3?

If you curl the service ports, do you get a valid response?
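The curl check above can be reduced to a bare TCP connectivity test; a stdlib sketch (the load balancer hostname in the usage comment is a placeholder):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Crude reachability check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_connect("my-dask-lb.us-east-1.elb.amazonaws.com", 8786)
```

Run this from a worker container against the scheduler's address to separate security-group problems from Dask-level problems.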

Can you confirm that you have a valid, working security group with the correct mappings?

Lastly, is the container agent service running fine?

The best guidance on AWS ECS task and EC2 instance networking design is available in the AWS developer docs.