We boot up a cluster of 250 worker nodes in AWS at night to handle some long-running distributed tasks.

The worker nodes run Celery with the following command:

celery -A celery_worker worker --concurrency=1 -l info -n background_cluster.i-1b1a0dbb --without-heartbeat --without-gossip --without-mingle -- celeryd.prefetch_multiplier=1

We use RabbitMQ as our broker, and there is only one RabbitMQ node.

About 60% of our nodes claim to be listening, but will not pick up any tasks.

Their logs look like this:

 -------------- celery@background_cluster.i-1b1a0dbb v3.1.18 (Cipater)
---- **** -----
--- * ***  * -- Linux-3.2.0-25-virtual-x86_64-with-Ubuntu-14.04-trusty
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         celery_worker:0x7f10c2235cd0
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     disabled
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ----
--- ***** ----- [queues]
 -------------- .> background_cluster exchange=root(direct) key=background_cluster


[tasks]
  . more.celery_worker.background_cluster

[2015-10-10 00:20:17,110: WARNING/MainProcess] celery@background_cluster.i-1b1a0dbb
[2015-10-10 00:20:17,110: WARNING/MainProcess] consuming from
[2015-10-10 00:20:17,110: WARNING/MainProcess] {'background_cluster': <unbound Queue background_cluster -> <unbound Exchange root(direct)> -> background_cluster>}
[2015-10-10 00:20:17,123: INFO/MainProcess] Connected to amqp://our_server:**@10.0.11.136:5672/our_server
[2015-10-10 00:20:17,144: WARNING/MainProcess] celery@background_cluster.i-1b1a0dbb ready.

However, RabbitMQ shows that there are messages waiting in the queue.

If I log in to any of the worker nodes and issue this command:

celery -A celery_worker inspect active

...then every (previously stalled) worker node immediately grabs a task and starts cranking.

Any ideas as to why?

Might it be related to these switches?

--without-heartbeat --without-gossip --without-mingle

1 Answer

It turns out that this was a bug in Celery where using --without-gossip kept events from draining. Celery's implementation of gossip is fairly new, and it apparently takes care of draining events implicitly, so turning it off leaves things a little wonky.
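
The question notes that running the inspect command makes stalled workers pick up tasks. As a stopgap until a fixed version is deployed, that call could be scripted. The following is only a sketch under that assumption, not the fix from the PR: the helper names inspect_command and nudge_workers are hypothetical, and the only Celery piece used is the inspect command quoted in the question.

```python
import subprocess


def inspect_command(app_module="celery_worker"):
    """Build the argv for the `inspect active` call quoted in the question."""
    return ["celery", "-A", app_module, "inspect", "active"]


def nudge_workers(app_module="celery_worker", run=subprocess.call):
    """Run the inspect command once and return its exit code.

    `run` is injectable (an assumption of this sketch) so the helper can be
    exercised without a broker or a Celery installation.
    """
    return run(inspect_command(app_module))
```

Calling nudge_workers() from a scheduled job on one node would repeat what the question did by hand. Whether this reliably unsticks every worker is not confirmed by anything beyond the behavior reported above.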

The details are outlined in this GitHub issue: https://github.com/celery/celery/issues/1847

The fix is currently on master, in this PR: https://github.com/celery/celery/pull/2823

So you can solve this one of three ways: