How can I scale down celery worker network overhead?

Question

I am running a web application that runs celery on AWS. However, all worker processes are running in a private data center (campus supercomputer). I have 34 separate worker processes running to consume jobs, the rabbitmq and Redis instances used for the broker and backend exist on AWS in my EC2 instance.

I was shocked last month to find out that, with no jobs submitted to the application, I still used nearly 700GB of network bandwidth (outgoing traffic only!) on my EC2 instance hosting rabbit and Redis. This traffic is entirely caused by celery worker overhead communication with the rabbit instance. There are nearly 17 messages/second being sent to each worker instance despite no actual compute jobs to process.

My tasks are long-running (at least multi-second, and sometimes multi-minute), heavy compute jobs, so high latency for task retrieval is totally acceptable--timescales on seconds is fine. Ideally, I'd like to tell my celery workers to just check in for new tasks once every few seconds and stop all other network overhead communication.

Is there a way to reduce the overall network overhead for celery workers?

Colton Hicks Colton Hicks · Accepted Answer · 2021-04-06T13:15:57

As per this article on CloudAMQP, in addition to other settings, they recommend running workers with the following three flags to limit worker chattiness:

celery -A my_celery_app worker --without-heartbeat --without-gossip --without-mingle

It appears this question, posed in other forms, has some discussion on stack overflow here.

The Celery documentation is miserably opaque on these flags. There is some information on heartbeats here.

After doing some profiling myself using iptraf (quick overview here), I was able to learn that the --without-gossip flag cuts celery overhead by 95%. It appears the gossip feature subscribes all workers to all other worker related events, like heartbeats and clock synchronization. This creates an N^2 scaling curve with N being the number of workers. This quickly creates tons of network chattiness and unless you have written your own code that leverages the gossip feature, it appears all this communication is totally unnecessary.

My profiling confirms that --without-mingle only reduces worker startup communication but has no impact on long-term network overhead.

Heartbeats do appear to consume some network overhead. Surprisingly, with 34 workers and a broker_heartbeat of 120 seconds and broker_heartbeat_checkrate of 3.0 I'd still generate 21GB/month (rough profiling) of network overhead just for these heartbeat checks. I'm not clear on the application implications of disabling them from the documentation--will the workers ever detect if the broker becomes unavailable? I did basic checks myself by killing my RabbitMQ instance and monitoring the worker logs (workers with --without-heartbeat passed). They appear to detect the loss of the broker very quickly (within a few seconds) and reconnect just fine as soon as I start the broker back up. So my basic observations suggest that the heartbeat is not required to maintain the generally expected behavior of celery workers. What's the point of the heartbeat anyway then? It's not clear to me.

The above three flags appear to drop all unnecessary worker network overhead by eliminating all superfluous message passing. It also appears that celery workers generally behave as expected without these feature enabled.

How can I scale down celery worker network overhead?

1 Answers