0 votes

I have a standard AWS SQS queue and multiple EC2 instances (~2K) actively polling that queue at an interval of 2 seconds. I'm using the AWS Java SDK to poll the queue, issuing a ReceiveMessageRequest that returns a single message per request.
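For reference, each consumer does something roughly like the sketch below (simplified; I'm using the v1 AWS SDK for Java here, and the queue URL and processing logic are just placeholders, not the real values):

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SingleMessagePoller {
    public static void main(String[] args) throws InterruptedException {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        while (true) {
            // Ask for at most one message per request, as described above
            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(1);

            for (Message message : sqs.receiveMessage(request).getMessages()) {
                process(message);                                        // application-specific work
                sqs.deleteMessage(queueUrl, message.getReceiptHandle()); // remove from the queue
            }

            Thread.sleep(2000); // fixed 2-second polling interval
        }
    }

    private static void process(Message message) {
        // placeholder for the real processing logic
    }
}
```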

My expectation is that the number of in-flight messages shown in the SQS console is the number of messages received by consumers and not yet deleted from the queue (i.e., the number of messages actively being processed at any instant). The problem is that the number of in-flight messages is much lower than the number of consumers I have at any instant. As mentioned, I have ~2K consumers, but I only see an in-flight message count in the approximate range of 300-600.

Is my assumption wrong that the in-flight message count equals the number of messages currently being processed? Also, is there any limitation in SQS, EC2, or the SQS Java SDK that caps the number of messages that can be processed at an instant?

Your assumptions are correct. Are you using a Standard queue or a FIFO queue? - John Rotenstein
@JohnRotenstein It's a standard queue. - Master Po

2 Answers

2 votes

Generally speaking, as the number of consumers goes up, the number of messages in flight will go up as well. Each consumer can request up to 10 messages per read request, but in practice, even if each consumer always requests 10, it will receive anywhere from 0 to 10 messages - especially when the number of messages is low and the number of consumers is high.

So your thinking is more or less correct: you can't precisely predict how many messages are in flight at any given time based on the number of consumers currently running, but there is a loose correlation between the two.

2 votes

This might point to a larger-than-expected amount of time during which your hosts are NOT actively processing messages.

From your example of 2000 consumers polling at an interval of 2s, but only topping out at 600 in-flight messages, some very rough math (600/2000 = 0.3) indicates your hosts are only spending 30% of their time actually processing. In the simplest case, this would happen if a poll/process/delete cycle for a message takes only 600ms, leaving an average of 1400ms of idle time between deleting one message and receiving the next.

A good pattern for high-volume message processing is to split the work across thread pools - one for fetching messages, one for processing, and one for deleting, with a local in-memory queue to hand messages between each pool (a sketch follows the list below). Each pool has a very specific purpose and can be more easily tuned to do that purpose really well:

  • Have enough fetchers (using the batch ReceiveMessage API) to keep your processors unblocked
  • Limit the size of the in-memory queue between fetchers and processors so that a single host doesn't put too many messages in flight (blocking other hosts from handling them)
  • Add as many processor threads as your host can handle
  • Keep metrics on how long processing takes, and provide the ability to abort processing if it exceeds a certain time threshold (related to visibility timeout)
  • Use enough deleters to keep up with processing (also using the batch DeleteMessage API)
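A minimal sketch of this pattern, assuming the v1 AWS SDK for Java (the pool sizes, queue bounds, queue URL, and handle method are illustrative placeholders, not tuned values):

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PipelinedConsumer {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

    // Bounded hand-off queues keep one host from pulling more messages than it can process.
    private final BlockingQueue<Message> toProcess = new ArrayBlockingQueue<>(20);
    private final BlockingQueue<Message> toDelete  = new ArrayBlockingQueue<>(100);

    public void start() {
        ExecutorService fetchers   = Executors.newFixedThreadPool(2);  // tune per host
        ExecutorService processors = Executors.newFixedThreadPool(8);
        ExecutorService deleters   = Executors.newFixedThreadPool(1);

        for (int i = 0; i < 2; i++) fetchers.submit(this::fetchLoop);
        for (int i = 0; i < 8; i++) processors.submit(this::processLoop);
        deleters.submit(this::deleteLoop);
    }

    private void fetchLoop() {
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(10);              // batch receive
        while (!Thread.currentThread().isInterrupted()) {
            try {
                for (Message m : sqs.receiveMessage(request).getMessages()) {
                    toProcess.put(m);                      // blocks when processors are saturated
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void processLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Message m = toProcess.take();
                handle(m);                                 // application-specific work
                toDelete.put(m);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void deleteLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Message m = toDelete.take();
                sqs.deleteMessage(queueUrl, m.getReceiptHandle()); // see the batch variant further below
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void handle(Message message) {
        // placeholder for the real processing logic
    }
}
```

The bounded BlockingQueue between fetchers and processors is what provides back-pressure: when processors fall behind, `put` blocks and the fetchers stop pulling new messages into flight.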

By recording metrics on each stage and the in-memory queues between each stage, you can easily pinpoint where your bottlenecks are and fine-tune the system further.

Other things to consider:

  • Use long polling - set the WaitTimeSeconds property on the ReceiveMessage call to minimize empty responses
  • When you see low throughput, make sure your queue is saturated - if there are very few items in the queue and a lot of processors, many of those processors are going to sit idle waiting for messages.
  • Don't poll on an interval - poll as soon as you're done processing the previous messages.
  • Use batching to request/delete multiple messages at once, reducing time spent on round-trip calls to SQS
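Putting the first and last points together, a rough sketch of long polling plus batched receives and deletes with the v1 AWS SDK for Java (the queue URL is a placeholder):

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.ArrayList;
import java.util.List;

public class LongPollingExample {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // Long polling: the call waits up to 20s for messages instead of returning empty immediately.
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withWaitTimeSeconds(20)
                .withMaxNumberOfMessages(10);   // batch receive, up to 10 per call

        List<Message> messages = sqs.receiveMessage(request).getMessages();

        // ... process the messages ...

        // Batch delete: one round trip removes up to 10 messages.
        if (!messages.isEmpty()) {
            List<DeleteMessageBatchRequestEntry> entries = new ArrayList<>();
            for (int i = 0; i < messages.size(); i++) {
                entries.add(new DeleteMessageBatchRequestEntry(
                        String.valueOf(i), messages.get(i).getReceiptHandle()));
            }
            sqs.deleteMessageBatch(queueUrl, entries);
        }
    }
}
```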