
I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch); the computation runs on an NVIDIA GPU.

I need some input on how I can achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB of RAM, and 1 GPU with 12 GB of memory.

  • 1 worker occupies 4 GB of system memory and 2.2 GB of GPU memory. The maximum number of workers I can run is 5, because of GPU memory (2.2 GB × 5 workers = 11 GB).
  • 1 worker = 1 HTTP request (max simultaneous requests = 5)

My specific questions are:

  1. How can I increase the concurrency/parallelism?
  2. Do I have to specify the number of threads for computation on the GPU?

My current Gunicorn command is:

gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000
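
For reference, a threaded variant of this command (a sketch, not from the post; it assumes the threads inside one worker can share that worker's model copy, so GPU memory stays at roughly 2.2 GB per worker while each worker serves several requests at once):

gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --threads=4 --worker-class=gthread

With the gthread worker class, each worker handles up to --threads requests concurrently; PyTorch releases the GIL during GPU kernels, so threads can overlap GPU work, but anything the handler shares (such as the tokenizer) must then be thread-safe, which is exactly the issue the answer below addresses.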


1 Answer


Apparently, fast tokenizers are not thread-safe.

AutoTokenizer appears to be a wrapper that uses either the fast or the slow tokenizer internally. Its default is the fast one (not thread-safe), so you have to switch to the slow one (thread-safe); that's why you add the use_fast=False flag.

I was able to solve this by:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
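
For context, a minimal sketch of how the slow tokenizer fits into a threaded Flask handler (the model name, the /predict route, and the JSON request shape here are illustrative assumptions, not from the original post):

# Minimal sketch: a Flask endpoint sharing one slow (thread-safe) tokenizer.
# model_name, the route, and the JSON shape are illustrative assumptions.
from flask import Flask, request, jsonify
from transformers import AutoTokenizer

app = Flask(__name__)
model_name = "bert-base-uncased"  # assumption: substitute your actual model
# use_fast=False selects the slow, pure-Python tokenizer, which is safe to
# share across the threads of a single Gunicorn worker
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    inputs = tokenizer(text, return_tensors="pt")
    # ... run the PyTorch model on the GPU with `inputs` here ...
    return jsonify({"tokens": inputs["input_ids"].shape[1]})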

Best, Chirag Sanghvi