9 votes

We've got a Django 1.3 application with django-celery 2.5.5 that's been running fine in production for months, but all of a sudden one of the Celery tasks is now failing to execute.

The RabbitMQ broker and the Celery workers are running on a separate machine, and celeryconfig.py is configured to use that particular RabbitMQ instance as the broker.

On the application server I tried to manually fire off the celery task via python manage.py shell.

The actual task is called like so:

>>> result = tasks.runCodeGeneration.delay(code_generation, None)
>>> result
<AsyncResult: 853daa7b-8be5-4a25-a1d0-1552b38a0d21>
>>> result.state
'PENDING'

It returns an AsyncResult as expected, but its state stays 'PENDING' forever.
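The next thing I want to check, from the same manage.py shell, is whether any worker even sees the task. A rough sketch of that, assuming the celery.task.control.inspect broadcast API that ships with Celery 2.5:

from celery.task.control import inspect

i = inspect()           # broadcast to every worker connected to the broker
print i.active()        # tasks each worker is currently executing
print i.reserved()      # tasks a worker has prefetched but not yet started
print i.scheduled()     # tasks waiting on an ETA/countdown

If all three come back empty (or None because no worker replied), the task message presumably never lands on a queue the workers actually consume.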

To see if the RabbitMQ broker received the message I ran the following:

$ rabbitmqctl list_queues name messages messages_ready messages_unacknowledged | grep 853daa
853daa7b8be54a25a1d01552b38a0d21        0       0       0

I'm not sure what this means. RabbitMQ certainly seems to have received some sort of request, otherwise how else could a queue have been created for the task with id 853daa7b8be54a25a1d01552b38a0d21? It just doesn't seem to hold any messages.

I've tried restarting both Celery and RabbitMQ and the problem persists.

Celery is run like so: $ python /home/[project]/console/manage.py celeryd -B -c2 --loglevel=INFO

Note that the celerybeat/scheduled tasks seem to be running just fine.

[EDIT]: There is no RabbitMQ configuration as it is being inlined by the init.d script: /usr/lib/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -noshell -noinput -sname rabbit@hostname -boot /var/lib/rabbitmq/mnesia/rabbit@hostname-plugins-expand/rabbit -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/[email protected]"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/[email protected]"} -os_mon start_cpu_sup true -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@hostname"

[EDIT2]: Here's the celeryconfig we're using for the workers. The same config is used for the producer, except that localhost is of course changed to the box the RabbitMQ broker is on.

from datetime import timedelta

BROKER_HOST = "localhost"

BROKER_PORT = 5672
BROKER_USER = "console"
BROKER_PASSWORD = "console"

BROKER_VHOST = "console"
BROKER_URL = "amqp://guest:guest@localhost:5672//"

CELERY_RESULT_BACKEND = "amqp"

CELERY_IMPORTS = ("tasks", )

CELERYD_HIJACK_ROOT_LOGGER = True
CELERYD_LOG_FORMAT = "[%(asctime)s: %(levelname)s/%(processName)s/%(name)s] %(message)s"


CELERYBEAT_SCHEDULE = {
    "runs-every-60-seconds": {
        "task": "tasks.runMapReduceQueries",
        "schedule": timedelta(seconds=60),
        "args": ()
    },
}

[EDIT3]: Our infrastructure is set up like number 2 below: Celery worker/broker & django-celery application

If I understand you correctly, you have only one task that is not executing while all the rest work fine? If that is the case, could it be that the task is scheduled on a queue that is never consumed? - Mauro Rocco
That is correct. Is there a way I can verify that what you say is happening? And what can I do to fix it? - sw00
You should check, and maybe share here, the two celeryconfigs: the one on the servers where the workers are running and the one you are using from the console when scheduling the task. As you are using django-celery, the celeryconfig is the Django settings itself. Another thing you can do is check the RabbitMQ management web console for queues that keep growing and are never consumed; you can also check all the consumers subscribed to a queue. rabbitmq.com/management.html - Mauro Rocco
Thanks for the help. However, using the management web console we can see that a queue is created and is in the "Active" state, but there are no messages in it... - sw00
Could be that the messages are consumed immediately, but that should be evident from the other statistics. If you really want some help I suggest you paste the settings, of course not with the real IP and password :-) - Mauro Rocco

1 Answer

3 votes

We solved the problem.

There was a long-running celerybeat task (~430 s) that was being scheduled to run every 60 seconds. With the worker running at a concurrency of only 2 (-c2), it hogged all of the worker processes, so every other task just sat in the queue and stayed PENDING.
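For anyone who runs into the same thing, one way to keep a task like that from starving everything else is to route it to its own queue and give it a dedicated worker. A minimal sketch, assuming Celery 2.5's CELERY_ROUTES setting (the queue name "longtasks" is made up here):

# In celeryconfig.py: send the long-running beat task to its own queue
# instead of the default "celery" queue the other tasks use.
CELERY_ROUTES = {
    "tasks.runMapReduceQueries": {"queue": "longtasks"},
}

Then start a second worker that only consumes that queue (celeryd -Q longtasks) and let the existing worker keep serving the default queue.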