
I have built my own application around AWS Lambda and Salesforce. I have around 10 users on my internal app, so we are not talking about big usage.

On a normal day, around 500-1000 SQS tasks are processed, with each task taking around 1-60 seconds depending on its complexity.

This is working perfectly.

  • Timeout for my Lambda is 900 seconds.
  • BatchSize = 1
  • Using Python 3.8

I've created a decorator which allows me to run some of my functions through SQS when they need to be processed asynchronously with FIFO logic. Everything is working well.
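The decorator itself is not the problem, but to give an idea, it works roughly like this (a simplified sketch, not my exact code: run_async is just an illustrative name, and the DynamoDB storage of the real kwargs is left out):

import json
import uuid
from functools import wraps

import boto3

import settings  # internal settings module (provides AsyncTaskQueueName)


def run_async(func):
    """Sketch of the decorator (run_async is an illustrative name).

    Called normally, it pushes a pointer to the function onto the FIFO queue so
    that process_data() below picks it up; called with is_in_async_processing=True
    (i.e. from inside the worker), it just runs the real function.
    """
    @wraps(func)
    def wrapper(**kwargs):
        if kwargs.pop('is_in_async_processing', False):
            return func(**kwargs)

        sqs = boto3.client('sqs')
        queue_url = sqs.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']

        task_uuid = str(uuid.uuid1())
        # Assumption: the real kwargs are stored in DynamoDB under task_uuid
        # (not shown here); only these pointers travel through SQS.
        body = {
            'python_module': func.__module__,
            'python_function': func.__name__,
            'uuid': task_uuid,
        }
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(body),
            MessageGroupId=func.__name__,      # FIFO ordering per function
            MessageDeduplicationId=task_uuid,  # FIFO deduplication id
        )
    return wrapper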

My Lambda function doesn't return anything at the end, but it completes with success (standard scenario). However, I have noticed that some tasks were going into my DLQ (I only allow processing once; if a message gets re-presented, it goes into the DLQ immediately).

The thing I don't get is why this is happening:

  • Lambda ends with success --> normally the task should be deleted from the initial SQS queue.

So I've added a manual deletion of the processed task at the very end of the function. I've logged the result returned by boto3's delete_message and I get a 200 status, so everything looks OK... However, once in a while (about 1 out of 100, so roughly 10 times per day in my case) I can see the task going into the DLQ... Re-queueing the same task into my standard queue without changing anything, it gets processed successfully (again) and deleted (as expected initially).

What is most problematic to me is that the message sometimes still ends up in the DLQ even though it was deleted. What could be the problem?
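For reference, this is how the relevant queue settings can be read back from AWS to compare them (a small sketch; the queue name is the one from my event source ARN):

import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='async_task-production.fifo')['QueueUrl']

# Read back what the queue is really configured with: the visibility timeout
# (which decides when an in-flight message is re-presented) and the redrive
# policy (whose maxReceiveCount decides when it is moved to the DLQ).
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['VisibilityTimeout', 'RedrivePolicy'],
)['Attributes']
print(attrs)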

Example of my async processor

import json
import sys

import boto3

import settings  # internal settings module (provides AsyncTaskQueueName)
import utils     # internal helpers (raise_exception, jsonize_datetime)


def process_data(event, context):
    """
    By convention, the message sent to the queue AsyncTaskQueueName must be a dict with the following parameters:
    - python_module: used to locate the module containing the method to call asynchronously
    - python_function: used to locate the method to call asynchronously within that module
    - uuid: uuid to get the params stored in DynamoDB
    """

    print('Start Processing Async')

    client = boto3.client('sqs')
    queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']

    # BatchSize = 1, so there is only one record to process
    for record in event['Records']:
        try:
            kwargs = json.loads(record['body'])

            print(f'Start Processing Async Data Record:\n{kwargs}')

            python_module = kwargs['python_module']
            python_function = kwargs['python_function']

            # Calling the function we wanted async, and letting it do its stuff... (working OK)
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)

            print('End Processing Async Data Record')

            # When the problem I'm monitoring occurs, execution reaches this line and
            # res comes back with HTTP status 200. That's where I'm losing my mind:
            # the uuid in the DLQ is the same as in the queue, so it is definitely
            # the same message that has been moved to the DLQ.
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            print(f'End Deleting Async Data Record with status: {res}')

        except Exception:
            # Set the visibility timeout to 0 so that the task goes straight to the DLQ
            client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record['receiptHandle'],
                VisibilityTimeout=0
            )

            utils.raise_exception(f'There was a problem during async processing. Event:\n'
                                  f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')

Example of today's bug, with logs from CloudWatch. Initial event:

{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY', 'body': '{..., "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}

Res (after calling delete_message):

End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}

BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm going crazy.

I have a similar problem with a CFN stack I'm currently working on. If I keep spamming the queue, some of the messages do go through, but only about 5/10 succeed and the rest go to the DLQ. - TResponse

1 Answer


OK, the problem was that the timeout in my serverless.yml was 900 seconds, but that was not the value deployed in AWS. I may have changed it manually to 1 minute, so my long tasks were released after 1 minute and then went immediately to the DLQ.

Hence the deletion had no effect, since the task had already been moved to the DLQ by the time the deletion was made.
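As a sanity check, the timeout that is actually deployed can be compared with the value declared in serverless.yml (a minimal sketch; 'my-function' is a placeholder for the real function name):

import boto3

lambda_client = boto3.client('lambda')

# The timeout actually deployed in AWS; in my case it had been changed manually
# to 60 seconds even though serverless.yml still declared 900.
config = lambda_client.get_function_configuration(FunctionName='my-function')
print(config['Timeout'])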