I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch). The computation runs on an NVIDIA GPU.
I need some input on how I can achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB of RAM, and one GPU with 12 GB of memory.
- 1 worker occupies 4 GB of RAM and 2.2 GB of GPU memory. The maximum number of workers I can run is 5, because of GPU memory (2.2 GB × 5 workers = 11 GB).
- 1 worker handles 1 HTTP request at a time (max simultaneous requests = 5).
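For context, here is a minimal sketch of what each worker looks like. This is illustrative only: the route name, input format, and the `torch.nn.Linear` placeholder stand in for my real model and API, which are more complex.

```python
# Minimal sketch of the setup described above (names and the model are
# placeholders, not the actual application code).
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)

# Each Gunicorn worker process loads its own copy of the model onto the
# GPU -- this is where the ~2.2 GB of GPU memory per worker goes.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)  # placeholder for the real model
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    x = torch.tensor(data["inputs"], dtype=torch.float32, device=device)
    with torch.no_grad():  # inference only; skip autograd bookkeeping
        y = model(x)
    return jsonify({"outputs": y.cpu().tolist()})
```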
My specific questions are:
- How can I increase the concurrency/parallelism?
- Do I have to specify the number of threads for computation on the GPU?
My current Gunicorn command is:
gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000