1
votes

Background

This architecture relies solely on Lambda's asynchronous invocation mechanism, as described here:

https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html

I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3 MB). The data is a JSON array containing one-to-many records. The collector function splits these records out and publishes them individually to an SNS topic.
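
For illustration, a rough sketch of the collector's fan-out step; the topic ARN environment variable, the fetch_batch helper, and the record shape are just placeholders for this example, not details of my actual setup:

```python
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["PARSER_TOPIC_ARN"]  # placeholder env var name


def fetch_batch():
    # Placeholder for the real fetch; returns a JSON array of records
    return '[{"id": "a1", "payload": 1}, {"id": "a2", "payload": 2}]'


def handler(event, context):
    records = json.loads(fetch_batch())

    # Publish each record individually so the parser is invoked once per record
    for record in records:
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(record))

    return {"published": len(records)}
```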

A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS invokes the parser function asynchronously, once per record, which means the built-in AWS-managed Lambda asynchronous queue begins to fill up as the parser's instances max out at 3. When throttling occurs, the Lambda queueing mechanism retries with incremental backoff until the invocation request can be processed by the parser function.
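
The concurrency limit of 3 is just reserved concurrency on the parser; as a sketch (the function name "parser" is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the parser at 3 concurrent executions. Because SNS invokes it
# asynchronously, invocations beyond this are throttled and Lambda
# queues and retries them.
lambda_client.put_function_concurrency(
    FunctionName="parser",
    ReservedConcurrentExecutions=3,
)
```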

It is imperative that no record gets lost during this process, as records cannot be resurrected. I will be using dead-letter queues where needed to ensure they ultimately end up somewhere in case of error.
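
As a sketch, attaching an async dead-letter queue to the parser would look something like this (the function name and queue ARN are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Route async invocation events that exhaust their retries to an SQS DLQ
# so failed records still end up somewhere inspectable.
lambda_client.update_function_configuration(
    FunctionName="parser",
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:eu-west-1:123456789012:parser-dlq"
    },
)
```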

Testing this method resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses, but I'm relying on these to trigger the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as what I'd have to build and trigger myself if I wanted to retry consuming a message coming from SQS.

Questions

1. Is the built-in AWS-managed Lambda asynchronous queue reliable?

The parser could be subject to a sustained load of 200+ invocations per minute for prolonged periods, so I want to understand whether the Lambda queue can handle this as sensibly as SQS would. The part that concerns me most is this statement:

Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.

This implies that an incoming invocation may simply vanish into thin air. Also, my implementation relies on the retry behaviour when the function throttles.
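
Because of the "multiple times" wording, I'm planning to make the parser idempotent anyway. Something along these lines, using a hypothetical DynamoDB table keyed on a record ID (neither the table nor the ID field exists in my setup yet):

```python
import json

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "processed-records"  # hypothetical idempotency table


def process(record):
    # Placeholder for the real parsing work
    print("parsing", record)


def handler(event, context):
    for sns_record in event["Records"]:
        record = json.loads(sns_record["Sns"]["Message"])
        record_id = record["id"]  # assumes each record carries a unique id

        try:
            # The conditional put only succeeds the first time this id is seen
            dynamodb.put_item(
                TableName=TABLE,
                Item={"record_id": {"S": record_id}},
                ConditionExpression="attribute_not_exists(record_id)",
            )
        except dynamodb.exceptions.ConditionalCheckFailedException:
            continue  # duplicate delivery, already handled

        process(record)
```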

2. When a message is in the queue, what happens when the message timeout is exceeded?

I can't find a definitive answer, but I'm hoping the message would end up in the configured dead-letter queue.
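
For reference, the async invocation settings I'd be tuning here are maximum event age, retry attempts, and the on-failure destination (an alternative to the DLQ above). Roughly like this, with placeholder names and ARN:

```python
import boto3

lambda_client = boto3.client("lambda")

# Configure how long Lambda keeps retrying queued async events, how many
# retries follow a function error, and where events go if they can't be
# delivered successfully within those limits.
lambda_client.put_function_event_invoke_config(
    FunctionName="parser",
    MaximumEventAgeInSeconds=21600,  # 6 hours, the maximum Lambda allows
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-1:123456789012:parser-failures"
        }
    },
)
```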

3. Why would I use SQS over the Lambda queue when SQS presents other problems?

See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:

https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/

https://medium.com/@zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0

I can't find any articles or discussions of how the Lambda queue performs.

Thanks for reading!


2 Answers

0
votes

Quite an interesting question. There's a presentation that covered queues in detail, but I can't find it at the moment. The premise is the same as this: queues are leaky buckets.

So what if I add more leaky buckets? Well, you've delayed the leaking, but now it's leaking into another bucket. Have you solved the problem or just delayed it?

What if I vibrate the buckets at different frequencies?

Further reading:

TL;DR;

It depends

0
votes

I think the biggest advantage of using your own queue is that you, as a user, have visibility into the state of your backpressure.

Using the Lambda async invoke method, you can get throttling exceptions with only the 'guarantee' that Lambda will retry over an interval. If you use an SQS source queue instead, you have complete visibility into the state of your message processing at all times, with no ambiguity.
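
For example, you can check the backlog whenever you like (the queue URL below is just a placeholder):

```python
import boto3

sqs = boto3.client("sqs")

# Inspect the backlog directly: messages waiting plus messages in flight.
attrs = sqs.get_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/parser-input",
    AttributeNames=[
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesNotVisible",
    ],
)
print(attrs["Attributes"])
```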

Secondly, regarding overpulling: in theory this is a concern, but in practice it has never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously, set your retry policy appropriately and use a DLQ, as transient/unpredictable errors CAN occur.
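
A sketch of that kind of retry policy plus DLQ on the source queue (the URL and ARN are placeholders):

```python
import json

import boto3

sqs = boto3.client("sqs")

# After 5 failed receives, SQS moves the message to the DLQ instead of
# redelivering it indefinitely.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/parser-input",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:123456789012:parser-input-dlq",
            "maxReceiveCount": "5",
        })
    },
)
```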