tl;dr: I'm trying to figure out what about the messages below could cause SQS to fail to process them and trigger the redrive policy, which sends them to a dead-letter queue (DLQ). The AWS documentation for DLQs says:
Sometimes, messages can’t be processed because of a variety of possible issues, such as erroneous conditions within the producer or consumer application or an unexpected state change that causes an issue with your application code. For example, if a user places a web order with a particular product ID, but the product ID is deleted, the web store's code fails and displays an error, and the message with the order request is sent to a dead-letter queue.
The context here is that my company uses a CloudFormation setup to run a virus scanner against files that users upload to our S3 buckets.
- The buckets have bucket events that publish `PUT` actions to an SQS queue.
- An EC2 instance subscribes to that queue and runs the files uploaded to those buckets through a virus scanner.
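For context, the redrive behavior lives on the source queue's configuration, not in the consumer. At the API level SQS stores it as a JSON string in the queue's `RedrivePolicy` attribute (a CloudFormation `AWS::SQS::Queue` carries the same two keys under its `RedrivePolicy` property). A minimal sketch of what that value looks like; the DLQ ARN and the `maxReceiveCount` of 5 are hypothetical placeholders, not values from our stack:

```ruby
require 'json'

# Hypothetical values: substitute the DLQ ARN and threshold from the
# CloudFormation template that defines the virus-scan queue.
DLQ_ARN = 'arn:aws:sqs:us-east-1:123456789012:virus-scan-dlq'
MAX_RECEIVE_COUNT = 5

# After a message has been received more than maxReceiveCount times
# without being deleted, SQS moves it to deadLetterTargetArn.
redrive_policy = JSON.generate({
  'deadLetterTargetArn' => DLQ_ARN,
  'maxReceiveCount' => MAX_RECEIVE_COUNT.to_s
})

puts redrive_policy
```

Comparing the real `maxReceiveCount` on our queue with how often the scanner polls would be the first thing to check.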
The messages that enter the queue come from S3 bucket events, so that seems to rule out "erroneous conditions within the producer." Could an SQS redrive policy be triggered if a subscriber to the queue fails to process the message?
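Whether a consumer failure can trigger a redrive depends on how SQS counts receives. Below is a toy, in-memory model of the documented behavior (not the AWS SDK; `ToyQueue` is a hypothetical name): each receive increments the message's receive count and hides it for the visibility timeout, only an explicit delete removes it, and once the count would exceed `maxReceiveCount` the message goes to the DLQ. Real SQS is distributed and its counts are approximate, so treat this as a sketch of the semantics only.

```ruby
# Toy, in-memory model of SQS redrive semantics (not the AWS SDK).
# Simplifications: one consumer, exact receive counts, integer clock.
class ToyQueue
  Message = Struct.new(:body, :receive_count, :visible_at)

  attr_reader :dead_letters

  def initialize(visibility_timeout:, max_receive_count:)
    @visibility_timeout = visibility_timeout
    @max_receive_count = max_receive_count
    @messages = []
    @dead_letters = []
  end

  def send_message(body)
    @messages << Message.new(body, 0, 0)
  end

  # Returns the first message visible at time `now`, or nil.
  def receive(now)
    msg = @messages.find { |m| m.visible_at <= now }
    return nil unless msg

    if msg.receive_count >= @max_receive_count
      # Received maxReceiveCount times and never deleted: redrive to the DLQ.
      @messages.delete_if { |m| m.equal?(msg) }
      @dead_letters << msg
      return nil
    end

    msg.receive_count += 1
    msg.visible_at = now + @visibility_timeout # hidden until the timeout expires
    msg
  end

  # Only an explicit delete takes a message out of the queue for good.
  def delete(msg)
    @messages.delete_if { |m| m.equal?(msg) }
  end
end

# A consumer that keeps failing: it receives the message each time it
# becomes visible again but never deletes it.
queue = ToyQueue.new(visibility_timeout: 30, max_receive_count: 3)
queue.send_message('{"Records": []}')
(0..150).step(30) { |t| queue.receive(t) }

puts queue.dead_letters.size # prints 1
```

In this model, a consumer that receives a message and then crashes, raises, or runs past the visibility timeout looks the same to the queue: the message is never deleted, so it lands in the DLQ without the consumer ever publishing to it.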
Here is one of the messages sent to the DLQ (I've changed letters and numbers in each of the IDs):
```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2019-09-30T20:21:13.762Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "AWS:AIDAIQ6ZKWSHYT34HC0X2"
      },
      "requestParameters": {
        "sourceIPAddress": "52.161.96.193"
      },
      "responseElements": {
        "x-amz-request-id": "9F500CA65B966D84",
        "x-amz-id-2": "w1R6BLPAI68na+xNssfdscQjfOQk56gmof+Bp4nF/rY90jBWnlqliHLrnwHWx20329clJckCIzhI="
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "VirusScan",
        "bucket": {
          "name": "uploadcenter",
          "ownerIdentity": {
            "principalId": "A2CSGHOAZOCNTU"
          },
          "arn": "arn:aws:s3:::sharingcenter"
        },
        "object": {
          "key": "Packard/f43edeee-6d58-118f-f8b8-4ec57f9cdb54Transformers/Transformers.mp4",
          "size": 1317070058,
          "eTag": "4a828a976dbdfe6fe1931f8e96437e2",
          "sequencer": "005D20633476B28AE7"
        }
      }
    }
  ]
}
```
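For reference, this is roughly how a consumer might pull the bucket, key, and size out of a message like the one above. A minimal sketch, assuming the script parses the message body as plain JSON; `s3_objects` is a hypothetical helper name. Two details worth handling: S3 URL-encodes object keys in event notifications (a space arrives as `+`), and S3 sends an `s3:TestEvent` message with no `Records` when a notification is first configured.

```ruby
require 'json'
require 'uri'

# Extract bucket, decoded key, and size from an S3 event notification body.
# Messages without "Records" (such as s3:TestEvent) yield an empty list.
def s3_objects(message_body)
  JSON.parse(message_body).fetch('Records', []).map do |record|
    s3 = record.fetch('s3')
    {
      bucket: s3.fetch('bucket').fetch('name'),
      key: URI.decode_www_form_component(s3.fetch('object').fetch('key')),
      size_bytes: s3.fetch('object').fetch('size')
    }
  end
end

body = <<~JSON
  {"Records": [{"s3": {"bucket": {"name": "uploadcenter"},
    "object": {"key": "Packard/f43edeee-6d58-118f-f8b8-4ec57f9cdb54Transformers/Transformers.mp4",
               "size": 1317070058}}}]}
JSON

s3_objects(body).each do |obj|
  gib = obj[:size_bytes] / 1024.0**3
  puts format('%s/%s: %.2f GiB', obj[:bucket], obj[:key], gib)
end
```

For the message above this reports about 1.23 GiB, which puts a number on "uncommonly large."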
I've been puzzling over this message and similar ones, trying to figure out what may have triggered the redrive policy. Could it have been caused by the EC2 instance failing to process the message? There's nothing in the Ruby script on the instance that would publish a message to the DLQ. Each of these files is uncommonly large (the one above is roughly 1.3 GB). Is it possible that something in the process choked on the file because of its size, and that caused the redrive? If an EC2 failure can't cause a redrive, what is it about the message that would cause SQS to send it to the DLQ?
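One way to test the size hypothesis is to time each scan against the queue's visibility timeout (30 seconds unless the queue sets another value; the maximum is 12 hours). If a scan outlasts it, SQS makes the message visible again and the next receive counts toward `maxReceiveCount`. Requesting the `ApproximateReceiveCount` attribute when receiving messages would also show whether these files were delivered more than once. The wrapper below is a hypothetical helper, not part of our script or the AWS SDK; `timed_scan` and its parameters are made up for illustration.

```ruby
# Hypothetical diagnostic wrapper around the per-message scan.
# Times the block and logs when it ran longer than the visibility timeout,
# the point at which SQS would make the message visible again.
# The ensure clause logs slow scans even if the block raises.
def timed_scan(key, size_bytes, visibility_timeout:, log: $stderr)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  if elapsed > visibility_timeout
    log.puts format('scan of %s (%d bytes) took %.1fs, longer than the %ss visibility timeout',
                    key, size_bytes, elapsed, visibility_timeout)
  end
end

# Example: a stand-in "scan" that outlasts a 0.1 s visibility timeout.
timed_scan('Transformers.mp4', 1_317_070_058, visibility_timeout: 0.1) { sleep 0.2 }
```

If the warnings line up with the messages that end up in the DLQ, size-driven timeouts become the prime suspect.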