This might point to a larger than expected amount of time that your hosts are NOT actively processing messages.
From your example of 2000 consumers polling at an interval of 2s, but only topping out at 600 in flight messages - some very rough math (600/2000=0.3
) would indicate your hosts are only spending 30% of their time actually processing. In the simplest case, this would happen if a poll/process/delete of a message takes only 600ms, leaving average of 1400ms of idle time between deleting one message and receiving the next.
A good pattern for doing high volume message processing is to think of message processing in terms of thread pools - one for fetching messages, one for processing, and one for deleting (with a local in-memory queue to transition messages between each pool). Each pool has a very specific purpose, and can be more easily tuned to do that purpose really well:
- Have enough fetchers (using the batch ReceiveMessage API) to keep your processors unblocked
- Limit the size of the in-memory queue between fetchers and processors so that a single host doesn't put too many messages in flight (blocking other hosts from handling them)
- Add as many processor threads as your host can handle
- Keep metrics on how long processing takes, and provide ability to abort processing if it exceeds a certain time threshold (related to visibility timeout)
- Use enough deleters to keep up with processing (also using the batch DeleteMessage API)
By recording metrics on each stage and the in-memory queues between each stage, you can easily pinpoint where your bottlenecks are and fine-tune the system further.
Other things to consider:
- Use long polling - set WaitTimeSeconds property in the ReceiveMessage API to minimize empty responses
- When you see low throughput, make sure your queue is saturated - if there are very few items in the queue and a lot of processors, many of those processors are going to sit idle waiting for messages.
- Don't poll on an interval - poll as soon as you're done processing the previous messages.
- Use batching to request/delete multiple messages at once, reducing time spent on round-trip calls to SQS