21
votes

I have a Lambda with an SQS trigger. When it gets hit, a batch of records from SQS comes in (usually about 10 at a time, I think). If I return a failed status code from the handler, all 10 messages will be retried. If I return a success code, they'll all be removed from the queue. What if 1 out of those 10 messages failed and I want to retry just that one?

exports.handler = async (event) => {

    for(const e of event.Records){
        try {
            let body = JSON.parse(e.body);
            // do things
        }
        catch(e){
            // one message failed, i want it to be retried
        }        
    }

    // returning this causes ALL messages in 
    // this batch to be removed from the queue
    return {
        statusCode: 200,
        body: 'Finished.'
    };
};

Do I have to manually re-add that ones message back to the queue? Or can I return a status from my handler that indicates that one message failed and should be retried?

3
it's sad that there is still no simple way to handle such a case. - ysfaran

3 Answers

7
votes

Yes you have to manually re-add the failed messages back to the queue.

What I suggest doing is setting up a fail count, so that if all messages failed you can simply return a failed status for all messages, otherwise if the fail count is < 10 then you can individually send back the failed messages to the queue.

2
votes

You've to programmatically delete each message from after processing it successfully.

So you can have a flag set to true if anyone of the messages failed and depending upon it you can raise error after processing all the messages in a batch so successful messages will be deleted and other messages will be reprocessed based on retry policies.

So as per the below logic only failed and unprocessed messages will get retried.

import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    for message in event['records']:
        queue_url = "form queue url recommended to set it as env variable"
        message_body = message["body"]
        print("do some processing :)")
        message_receipt_handle = message["receiptHandle"]
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message_receipt_handle
        )

there is also another way to save successfully processed message id into a variable and perform batch delete operation based on message id

response = client.delete_message_batch(
    QueueUrl='string',
    Entries=[
        {
            'Id': 'string',
            'ReceiptHandle': 'string'
        },
    ]
)
1
votes

You need to design your app iin diffrent way here is few ideas not best but will solve your problem.

Solution 1:

Note :When an SQS event source mapping is initially created and enabled, or first appear after a period with no traffic, then the Lambda service will begin polling the SQS queue using five parallel long-polling connections, as per AWS documentation, the default duration for a long poll from AWS Lambda to SQS is 20 seconds.

Solution 2:

Use AWS StepFunction

StepFunction will call lambda and handle the retry logic on failure with configurable exponential back-off if needed.

**Solution 3: **

CloudWatch scheduled event to trigger a Lambda function that polls for FAILED.

Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events invokes your Lambda function asynchronously.