7 votes

The AWS SQS -> Lambda integration allows you to process incoming messages in a batch, where you configure the maximum number of messages that can be received in a single batch. If you throw an exception during processing, to indicate failure, none of the messages in the batch are deleted from the incoming queue, and they can all be picked up by another Lambda invocation once the visibility timeout has passed.

Is there any way to keep the batch processing, for performance reasons, but allow some messages from the batch to succeed (and be deleted from the inbound queue) and only leave some of the batch un-deleted?


3 Answers

15 votes

The problem with manually re-enqueueing the failed messages is that you can get into an infinite loop where those items perpetually fail, get re-enqueued, and fail again. Since they are being re-sent to the queue, their receive count gets reset every time, which means they'll never fail over into a dead-letter queue. You also lose the benefits of the visibility timeout. This is also bad for monitoring purposes, since you'll never know you're in a bad state unless you go and manually check your logs.

A better approach would be to manually delete the successful items and then throw an exception to fail the rest of the batch. The successful items will be removed from the queue, all the items that actually failed will hit their normal visibility timeout periods and retain their receive count values, and you'll be able to actually use and monitor a dead letter queue. This is also overall less work than the other approach.

Considerations

  • Only override the default behavior if there has been a partial batch failure. If all the items succeeded, let the default behavior take its course
  • Since you're tracking the failures of each queue item, you'll need to catch and log each exception as they come in so that you can see what's going on later
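
A minimal sketch of this approach in Python, assuming the queue URL is available in a QUEUE_URL environment variable and process is a hypothetical stand-in for your own business logic:

import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed to be configured on the function


def handler(event, context):
    successes = []
    failures = []

    for record in event["Records"]:
        try:
            process(record["body"])  # hypothetical stand-in for your business logic
            successes.append(record)
        except Exception as exc:
            # Catch and log each exception so you can see what went wrong later.
            print(f"Failed to process message {record['messageId']}: {exc}")
            failures.append(record)

    if not failures:
        # All items succeeded: let the default behavior delete the whole batch.
        return

    # Partial failure: delete only the successful messages, then fail the batch.
    # The failed messages keep their receive counts and hit the normal visibility
    # timeout, so a dead-letter queue still works as expected.
    for record in successes:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=record["receiptHandle"])

    raise RuntimeError(f"{len(failures)} of {len(event['Records'])} messages failed")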
3 votes

One option is to manually send the failed messages back to the queue, and then reply to SQS with a success so that the original copies are deleted and there are no duplicates.

You could do something like keeping a fail count: if all the messages failed, you can simply return a failed status for the whole batch; otherwise, if the fail count is less than 10 (10 being the max batch size you can get from an SQS -> Lambda event), you can individually send the failed messages back to the queue and then reply with a success.

Additionally, to avoid any possible infinite retry loop, add a property to the event such as a "retry" count before sending it back to the queue, and drop the event when "retry" is greater than X.
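
A rough sketch of this pattern, again assuming a QUEUE_URL environment variable and a hypothetical process helper, with an illustrative MAX_RETRIES cutoff:

import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed to be configured on the function
MAX_RETRIES = 3                      # illustrative cutoff for dropping messages


def handler(event, context):
    for record in event["Records"]:
        try:
            process(record["body"])  # hypothetical stand-in for your business logic
        except Exception:
            # Read the current retry count from a message attribute (0 if absent).
            attrs = record.get("messageAttributes", {})
            retries = int(attrs.get("retry", {}).get("stringValue", "0"))

            if retries < MAX_RETRIES:
                # Re-send the failed message with an incremented retry counter.
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=record["body"],
                    MessageAttributes={
                        "retry": {
                            "DataType": "Number",
                            "StringValue": str(retries + 1),
                        }
                    },
                )
            # else: drop the message (or forward it somewhere for inspection)

    # Returning normally reports success, so the original copies of the failed
    # messages are deleted and only the re-sent copies remain on the queue.
    return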

1 vote

I recently encountered this problem, and the best way to handle it without writing any extra code on our side is to use the FunctionResponseTypes property of the EventSourceMapping. With this, we just have to return the list of failed message IDs and the event source will take care of deleting the successful messages. Please check out Using SQS and Lambda.

CloudFormation template to configure the event source mapping for the Lambda (FunctionResponseTypes is the important property):

"FunctionEventSourceMapping": {
  "Type": "AWS::Lambda::EventSourceMapping",
  "Properties": {
    "BatchSize": "100",
    "Enabled": "True",
    "EventSourceArn": {"Fn::GetAtt":  ["SQSQueue", "Arn"]},
    "FunctionName": "FunctionName",
    "MaximumBatchingWindowInSeconds": "100",
    "FunctionResponseTypes": ["ReportBatchItemFailures"] # This is important
  }
}

After you configure your event source with the above settings, report batch item failures will be enabled on the trigger.

Then we just have to return a response in the following format from our Lambda:

{"batchItemFailures": [{"itemIdentifier": "85f26da9-fceb-4252-9560-243376081199"}]}

Provide the list of failed message IDs in the batchItemFailures list. If your Lambda runtime is Python, return a dict in the above format; for a Java-based runtime you can use aws-lambda-java-events.

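Sample Python code, as a minimal sketch (process is a hypothetical stand-in for your own logic):

def handler(event, context):
    batch_item_failures = []

    for record in event["Records"]:
        try:
            process(record["body"])  # hypothetical stand-in for your business logic
        except Exception:
            # Report only the failed messages; Lambda deletes the rest automatically.
            batch_item_failures.append({"itemIdentifier": record["messageId"]})

    # An empty list means the whole batch succeeded.
    return {"batchItemFailures": batch_item_failures}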

Advantages of this approach are

  1. You don't have to add any code to manually delete the messages from the SQS queue
  2. You don't have to include any third-party library or boto3 just for deleting messages from the queue, which helps reduce your final artifact size
  3. Keep it simple and stupid

On a side note, make sure your Lambda has the required permissions on SQS to receive and delete messages.

Thanks