
I have 20K messages in an SQS queue. I also have a Lambda function that processes the SQS messages and puts the data into an Elasticsearch server.

I have configured SQS as the Lambda's trigger and limited the Lambda's SQS batch size to 10. I have also limited the function so that only one instance of it can run at a given time.
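For reference, the setup described above maps onto two Lambda API settings: an SQS event source mapping with `BatchSize` 10, and a reserved concurrency of 1 on the function. The sketch below only builds the keyword arguments for boto3's `create_event_source_mapping` and `put_function_concurrency` calls. The helper name `trigger_settings`, the queue ARN and the function name are hypothetical placeholders, and `reserved * batch_size` is the expectation the question is testing, not a documented guarantee.

```python
def trigger_settings(queue_arn: str, function_name: str,
                     batch_size: int = 10, reserved: int = 1):
    """Hypothetical helper: build kwargs for boto3 Lambda client calls."""
    # Keyword arguments for lambda_client.create_event_source_mapping(**mapping)
    mapping = {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
    }
    # Keyword arguments for lambda_client.put_function_concurrency(**concurrency)
    concurrency = {
        "FunctionName": function_name,
        "ReservedConcurrentExecutions": reserved,
    }
    # The question's assumption: each running instance holds at most one batch,
    # so in-flight messages should never exceed reserved * batch_size.
    expected_max_in_flight = reserved * batch_size
    return mapping, concurrency, expected_max_in_flight
```

With a real client you would pass the dictionaries on, e.g. `lambda_client.create_event_source_mapping(**mapping)` and `lambda_client.put_function_concurrency(**concurrency)` with `lambda_client = boto3.client("lambda")`.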

However, I sometimes see over 10K in-flight messages in the AWS console. Shouldn't there be at most 10 in-flight messages?

Because of this, the Lambda is only able to process about 9K of the SQS messages properly.

Below is a screenshot showing that I have limited the Lambda to only one instance running at a given time.

[Screenshot: the function's concurrency limit, set to 1]

You're obviously not limiting the number of simultaneous Lambdas. By default, you're allowed to run 1000 simultaneously, and 1000 invocations holding a batch of 10 messages each is roughly the 10K in-flight messages you're seeing. I'd verify that you're limiting correctly. In general, single-threading a Lambda is the opposite of what it was designed to do. - stdunbar
@stdunbar I have added an image to show the current concurrency limit, and it's 1. - user1187968
The documentation says that Lambda deletes the message "once your Lambda function successfully completes". I wonder if your function is not signalling that it "successfully completed" (whatever that means)? Is it possible that your functions are timing out before completing? - John Rotenstein
@JohnRotenstein From the AWS docs: "...the handler can return a value... if you use the Event invocation type (asynchronous execution), the value is discarded." I checked the analytics; the Lambda instances did NOT time out. - user1187968
@user1187968 did you get an answer? I'm facing the same problem with the same configuration. - Joaquín Bucca
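On the "successfully completes" point, here is a minimal sketch of an SQS-triggered Python handler, assuming the message bodies are JSON documents with an `id` field. `index_document` is a hypothetical stand-in for the Elasticsearch write. The handler signals success simply by returning normally; if it raises, Lambda does not delete the batch, and the messages become visible in the queue again once the visibility timeout expires.

```python
import json


def index_document(doc):
    """Hypothetical placeholder for the Elasticsearch write."""
    if "id" not in doc:
        raise ValueError("document has no id")
    return doc["id"]


def handler(event, context):
    # The SQS trigger delivers up to BatchSize messages in event["Records"].
    indexed = []
    for record in event["Records"]:
        doc = json.loads(record["body"])
        # Let exceptions propagate: a failed invocation means the batch is not
        # deleted and reappears after the queue's visibility timeout.
        indexed.append(index_document(doc))
    # Returning normally is what marks the batch as processed.
    return {"indexed": len(indexed)}
```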

1 Answer


I've been doing some testing and contacting AWS tech support at the same time.

What I believe at the moment is this:

From the AWS Lambda documentation: "Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute."
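The scaling rule quoted above is easy to work through numerically. This is a sketch under two assumptions: that the increase happens in steps of one full minute, and that the ceiling is 1000 (the default account concurrency mentioned in the comments and the poller limit reported by support). The function names are hypothetical.

```python
import math

INITIAL_BURST = 5      # from the quoted documentation
STEP_PER_MINUTE = 60   # from the quoted documentation


def poller_concurrency(minutes: int, ceiling: int = 1000) -> int:
    """Concurrent invocations allowed after `minutes` full minutes
    (assumed stepwise), capped at an assumed `ceiling`."""
    return min(INITIAL_BURST + STEP_PER_MINUTE * minutes, ceiling)


def minutes_to_reach(target: int) -> int:
    """Full minutes of scaling needed before `target` concurrent invocations."""
    return max(0, math.ceil((target - INITIAL_BURST) / STEP_PER_MINUTE))
```

Under these assumptions the integration starts at 5, reaches 65 after one minute, and needs 17 minutes to hit 1000.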

1/ The thing that does the polling is a separate entity. It is most likely a Lambda function that long-polls the SQS queue and then invokes our Lambda functions.

2/ That Poller-Lambda does not take our Receiver-Lambda into account at all. It does not care whether the function is running at max capacity, or how much concurrency is available to the Receiver-Lambda.

3/ Because of that combination, the behavior is not what we expected from the Lambda-SQS integration. Worse, if millions of messages suddenly burst into your queue, the Receiver-Lambda's concurrency can never catch up with the number of messages the Poller-Lambda is sending, resulting in lost work.
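Points 1 to 3 can be illustrated with a toy, discrete-time model. This is not AWS's actual poller algorithm; it is a hypothetical simulation assuming a fixed number of pollers, one-second ticks, and a redrive policy that dead-letters a message once it has already been received `max_receives` times. A batch whose invocation is throttled simply stays in flight until its visibility timeout expires. The `respect_reserved` flag switches between a poller that ignores the function's reserved concurrency (the Q1 2019 support answer) and one that backs off (the Q2 2019 update).

```python
from dataclasses import dataclass


@dataclass
class Msg:
    receives: int = 0    # times a poller has received this message
    visible_at: int = 0  # tick at which it becomes receivable again
    done: bool = False   # deleted after a successful invocation
    dead: bool = False   # moved to the dead-letter queue


def simulate(n_msgs, pollers, batch_size, reserved, work_s,
             visibility_s, max_receives, seconds, respect_reserved):
    """Toy model; returns (peak in-flight, messages done, messages dead-lettered)."""
    msgs = [Msg() for _ in range(n_msgs)]
    running = []  # (finish_tick, batch) for invocations that were not throttled
    peak = 0
    for t in range(seconds):
        # Invocations that finished successfully delete their whole batch.
        still_running = []
        for finish, batch in running:
            if finish <= t:
                for m in batch:
                    if not m.dead:
                        m.done = True
            else:
                still_running.append((finish, batch))
        running = still_running

        for _ in range(pollers):
            if respect_reserved and len(running) >= reserved:
                continue  # this poller backs off instead of receiving more
            batch = []
            for m in msgs:
                if len(batch) == batch_size:
                    break
                if m.done or m.dead or m.visible_at > t:
                    continue
                if m.receives >= max_receives:
                    m.dead = True  # redrive: receive count exhausted
                    continue
                m.receives += 1
                m.visible_at = t + visibility_s
                batch.append(m)
            if batch and len(running) < reserved:
                running.append((t + work_s, batch))
            # Otherwise the invocation is throttled and the batch stays
            # in flight until its visibility timeout expires.

        in_flight = sum(1 for m in msgs
                        if not (m.done or m.dead) and m.visible_at > t)
        peak = max(peak, in_flight)
    return peak, sum(m.done for m in msgs), sum(m.dead for m in msgs)
```

In this model, `simulate(1000, 5, 10, 1, 5, 30, 3, 300, respect_reserved=True)` never has more than 10 messages in flight and dead-letters nothing, while the same call with `respect_reserved=False` drives the in-flight count far above 10 and eventually dead-letters messages that were only ever throttled.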

The test:

  • Create one Lambda function that takes 30 seconds to return true;
  • Set that function's concurrency to 50;
  • Push 300 messages into the queue (visibility timeout: 10 minutes, batch size: 1, no redrive)
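As a rough sanity check on this test setup, here is a back-of-the-envelope sketch that assumes each invocation handles one full batch and ignores the ramp-up quoted earlier. With a concurrency of 50, a batch size of 1 and 30 seconds of work, the function clears at most 50 messages every 30 seconds, so 300 messages need at least 3 minutes. The helper name `min_drain_seconds` is hypothetical.

```python
import math


def min_drain_seconds(messages: int, concurrency: int,
                      batch_size: int, work_seconds: int) -> int:
    """Lower bound on the time to drain a queue when every invocation
    processes one full batch in work_seconds (ramp-up ignored)."""
    per_wave = concurrency * batch_size     # messages handled per wave
    waves = math.ceil(messages / per_wave)  # waves needed to drain the queue
    return waves * work_seconds
```

For the test above, `min_drain_seconds(300, 50, 1, 30)` gives 180 seconds; any messages a poller receives beyond the 50 being processed have to wait for a free slot during that time.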

The result:

  • The number of available messages increases gradually
  • At first, there are few enough messages for the Receiver-Lambda to process
  • After half a minute, there are more messages available than the Receiver-Lambda can handle
  • These messages would be discarded to the dead-letter queue because the Poller-Lambda is unable to invoke the Receiver-Lambda

I will update this answer as soon as I get confirmation from AWS support.

Support's answer as of Q1 2019 (TL;DR version):

1/ The assumption was correct: there is a "Poller".

2/ That Poller does not take reserved concurrency into account as part of its algorithm.

3/ That Poller has a hard limit of 1000.

Q2 2019:

The above information needs to be updated. Support said that the poller does correctly consider reserved concurrency, but the reserved concurrency should be at least 5. The SQS-Lambda integration is still being updated and this answer will not be, so please consult AWS if you run into any weird issues.