I am using Celery with RabbitMQ to process data from API requests. The process goes as follows:
Request > API > RabbitMQ > Celery Worker > Return
Ideally I would spawn more Celery workers, but I am restricted by memory constraints.
Currently, the bottleneck in my process is fetching and downloading the data from the URLs passed into the worker. Roughly, the process looks like this:
def celery_gets_job(url):
    data = fetches_url(url)        # takes 0.1s to 1.0s (bottleneck)
    result = processes_data(data)  # takes 0.1s
    return result
This is unacceptable, as the worker is tied up for the whole time it is fetching the URL. I am looking at improving this through threading, but I am unsure of the best practices.
Is there a way to make the Celery worker download the incoming data asynchronously while processing data at the same time in a different thread?
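For example, if each task could accept a small batch of URLs instead of one, something like this sketch is one interpretation of what I mean (the broker URL, the batch-style celery_gets_jobs task, and the bodies of fetches_url/processes_data are all placeholders):

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

from celery import Celery

app = Celery("tasks", broker="amqp://localhost//")  # illustrative broker URL

def fetches_url(url):
    # stand-in for the question's fetch step (the 0.1s to 1.0s bottleneck)
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def processes_data(data):
    # stand-in for the question's 0.1s processing step
    return len(data)

@app.task
def celery_gets_jobs(urls):
    results = []
    # The fetches are I/O-bound, so threads overlap their network waits
    # even under the GIL; processing stays in the main thread and runs
    # as each download completes.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetches_url, u) for u in urls]
        for fut in as_completed(futures):
            results.append(processes_data(fut.result()))
    return results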
Should I have separate workers fetching and processing, with some form of message passing, possibly via RabbitMQ?
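That is, something along these lines, where fetching and processing become separate tasks chained through the broker (task names, queue names, and the broker URL are illustrative, and fetches_url/processes_data are stubs for the functions above):

from celery import Celery, chain

app = Celery("tasks", broker="amqp://localhost//")  # illustrative broker URL

# Route each task to its own queue so differently tuned workers can
# consume them independently.
app.conf.task_routes = {
    "tasks.fetch": {"queue": "fetch"},
    "tasks.process": {"queue": "process"},
}

def fetches_url(url): ...      # stub for the question's fetch step
def processes_data(data): ...  # stub for the question's processing step

@app.task(name="tasks.fetch")
def fetch(url):
    return fetches_url(url)

@app.task(name="tasks.process")
def process(data):
    return processes_data(data)

def celery_gets_job(url):
    # chain() hands fetch's return value to process via RabbitMQ
    return chain(fetch.s(url), process.s()).apply_async()

Each queue could then be consumed by a differently tuned worker, e.g. something like celery -A tasks worker -Q fetch -P gevent -c 100 for the I/O-bound side and a small prefork pool for -Q process.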
celery_gets_job has multiple non-atomic operations, and this will create problems when using multithreading. You can use a Queue where the data is populated by a pool of processes running fetches_url(url) and another process (or processes) carries out processes_data(data). – shrishinde
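A rough sketch of that queue-and-pool pattern, using only the standard library; fetches_url/processes_data are stand-ins for the question's functions, and the fetch side is shown single-process for brevity:

from multiprocessing import Process, Queue

def fetches_url(url): ...      # stand-in: the slow network fetch
def processes_data(data): ...  # stand-in: the fast processing step

def fetcher(urls, data_q, n_consumers):
    # Populate the queue with downloaded data.
    for url in urls:
        data_q.put(fetches_url(url))
    # One sentinel per consumer so they all shut down cleanly.
    for _ in range(n_consumers):
        data_q.put(None)

def processor(data_q, result_q):
    # Consume downloaded data and process it as it arrives.
    while True:
        data = data_q.get()
        if data is None:
            break
        result_q.put(processes_data(data))

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]  # illustrative
    data_q, result_q = Queue(), Queue()
    consumers = [Process(target=processor, args=(data_q, result_q))
                 for _ in range(2)]
    for p in consumers:
        p.start()
    fetcher(urls, data_q, len(consumers))
    for p in consumers:
        p.join()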