tl;dr: I'm trying to figure out what about the messages below could cause them to fail processing and trigger the redrive policy that sends them to a dead-letter queue (DLQ). The AWS documentation for DLQs says:

Sometimes, messages can’t be processed because of a variety of possible issues, such as erroneous conditions within the producer or consumer application or an unexpected state change that causes an issue with your application code. For example, if a user places a web order with a particular product ID, but the product ID is deleted, the web store's code fails and displays an error, and the message with the order request is sent to a dead-letter queue.

The context here is that my company uses a CloudFormation setup to run a virus scanner against files that users upload to our S3 buckets.

  • The buckets have bucket event notifications which publish PUT actions to an SQS queue (the wiring is sketched below this list).
  • An EC2 instance subscribes to that queue and runs each file uploaded to those buckets through a virus scanner.
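
For reference, the bucket-to-queue wiring is equivalent to a notification configuration like the one below, sketched as a Ruby SDK call purely for illustration (the bucket name, queue ARN, and account number are placeholders; the real setup lives in our CloudFormation template):

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Illustrative only: placeholder names, real configuration is in CloudFormation.
s3.put_bucket_notification_configuration(
  bucket: 'uploadcenter',
  notification_configuration: {
    queue_configurations: [
      {
        id: 'VirusScan',  # surfaces as configurationId in the event message
        queue_arn: 'arn:aws:sqs:us-east-1:123456789012:virus-scan-queue',
        events: ['s3:ObjectCreated:Put']
      }
    ]
  }
)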

The messages entering the queue come from S3 bucket events, which seems to rule out "erroneous conditions within the producer." Could an SQS redrive policy fire if a subscriber to the queue fails to process the message?

This is one of the messages which was sent to the DLQ (I've changed letters and numbers in each of the IDs):

{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2019-09-30T20:21:13.762Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "AWS:AIDAIQ6ZKWSHYT34HC0X2"
      },
      "requestParameters": {
        "sourceIPAddress": "52.161.96.193"
      },
      "responseElements": {
        "x-amz-request-id": "9F500CA65B966D84",
        "x-amz-id-2": "w1R6BLPAI68na+xNssfdscQjfOQk56gmof+Bp4nF/rY90jBWnlqliHLrnwHWx20329clJckCIzhI="
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "VirusScan",
        "bucket": {
          "name": "uploadcenter",
          "ownerIdentity": {
            "principalId": "A2CSGHOAZOCNTU"
          },
          "arn": "arn:aws:s3:::sharingcenter"
        },
        "object": {
          "key": "Packard/f43edeee-6d58-118f-f8b8-4ec57f9cdb54Transformers/Transformers.mp4",
          "size": 1317070058,
          "eTag": "4a828a976dbdfe6fe1931f8e96437e2",
          "sequencer": "005D20633476B28AE7"
        }
      }
    }
  ]
}

I've been puzzling over this message and similar ones, trying to figure out what may have triggered the redrive policy. Could it have been caused by the EC2 instance failing to process the message? There's nothing in the Ruby script on the instance which would publish a message to the DLQ. Each of these files is uncommonly large. Is it possible that something in the process choked on a file because of its size, and that caused the redrive? If an EC2 failure couldn't have caused the redrive, what is it about the message itself that would cause SQS to send it to the DLQ?

1 Answer

Amazon SQS is typically used as follows:

  • Something publishes a message to a queue (in your case, an S3 PUT event)
  • Worker(s) request a message from the queue and process the message
    • The message becomes "invisible" so that other workers cannot see it
    • If the message was processed successfully, the worker tells SQS to delete the message
    • If the worker does not delete the message within the visibility timeout period, SQS makes the message visible on the queue again
  • If a message is received more than a configured number of times without being deleted (the redrive policy's maxReceiveCount), it is moved to a nominated Dead Letter Queue
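
A minimal sketch of that receive/process/delete loop in Ruby with the aws-sdk-sqs gem (the queue URL and the scan step are placeholders, not your actual code):

require 'aws-sdk-sqs'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/virus-scan-queue'  # placeholder

loop do
  # Long-poll for a message; once received, it becomes "invisible" to other
  # workers for the queue's visibility timeout.
  resp = sqs.receive_message(queue_url: queue_url,
                             max_number_of_messages: 1,
                             wait_time_seconds: 20)
  resp.messages.each do |msg|
    scan_for_viruses(msg.body)  # hypothetical stand-in for the virus scan

    # Deleting the message is the "success" signal. If this line is never
    # reached in time, the message reappears on the queue and its receive
    # count counts toward the redrive policy's maxReceiveCount.
    sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
  end
end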

Please note that there are no "subscribers" to SQS queues. Rather, applications call the SQS API and request a message.

The fact that you are getting messages in the DLQ indicates that the worker (virus checker) is not deleting the messages within the visibility timeout.

It is possible that the virus checker requires more time to scan large files (the object in your sample message is about 1.3 GB), in which case you could increase the visibility timeout on the queue to give it more time.
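
For example (900 seconds is an assumed value, and the same attribute can be set in your CloudFormation template; the maximum is 12 hours):

require 'aws-sdk-sqs'

sqs = Aws::SQS::Client.new(region: 'us-east-1')

# Give each message up to 15 minutes of invisibility per receive.
sqs.set_queue_attributes(
  queue_url: 'https://sqs.us-east-1.amazonaws.com/123456789012/virus-scan-queue',  # placeholder
  attributes: { 'VisibilityTimeout' => '900' }  # seconds, passed as a string
)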

The workers can also signal back to SQS that they are still working on a message by calling ChangeMessageVisibility, which resets the timeout. This would require modifying the virus checker to send such a heartbeat at regular intervals, as sketched below.
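
This sketch is meant to slot into the per-message block of the loop above (same placeholder names; the intervals are assumptions to tune against your scan times, and it assumes the queue's visibility timeout is longer than the sleep interval):

# Hypothetical heartbeat: keep the message invisible while a long scan runs.
heartbeat = Thread.new do
  loop do
    sleep 240  # refresh comfortably before the current timeout lapses
    sqs.change_message_visibility(queue_url: queue_url,
                                  receipt_handle: msg.receipt_handle,
                                  visibility_timeout: 300)  # invisible for another 300s from now
  end
end

begin
  scan_for_viruses(msg.body)  # the long-running work
ensure
  heartbeat.kill  # stop the heartbeat whether the scan finished or raised
end

sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)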

Bottom line: The worker (virus checker) is not completing the task within the timeout period.