I have a Celery broker running on a cloud server (Django app), and two workers on local servers in my office connected behind a NAT. The local workers frequently lose connection, and have to be restarted to re-establish connection with the broker. Usually celeryd restart hangs the first time I try it, so I have to ctr+C and retry once or twice to get it back up and connected. The workers' logs two most common errors:
[2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
next(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
cb(*cbargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
reader(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
raise ConnectionError('Socket was disconnected')
ConnectionError: Socket was disconnected
[2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
raise RecoverableConnectionError('connection already closed')
How do I go about debugging this? Is the fact that the workers are behind a NAT an issue? Is there a good tool to monitor whether the workers have lost connection? At least with that, I could get them back online by manually restarting the worker.