2
votes

I've got a consumer that I suspect is taking longer than the default Message Visibility to process a given message, but is eventually succeeding.

  • If Consumer A gets receipt R1M1 for message M1
  • then the visibility timeout elapses
  • then Consumer B gets receipt R2M1 for message M1
  • then Consumer A calls deleteMessage(R1M1)

is the message deleted, or does it remain on the queue, since another consumer has a more-valid receipt for the message?

I'm observing that many of the more-complicated messages in my queue have many (50-1000) receipts, but I'm not logging any failures to process messages. I suspect that I'm successfully processing each message many times, and then the delete action is silently failing.


1 Answer

4
votes

The API Reference documentation actually contradicts itself on the same page about this.

DeleteMessage

Deletes the specified message from the specified queue. You specify the message by using the message's receipt handle and not the MessageId you receive when you send the message.

Even if the message is locked by another reader due to the visibility timeout setting, it is still deleted from the queue.

This seems straightforward enough, until you keep reading.

Note

The receipt handle is associated with a specific instance of receiving the message. If you receive a message more than once, the receipt handle you get each time you receive the message is different. If you don't provide the most recently received receipt handle for the message when you use the DeleteMessage action, the request succeeds, but the message might not be deleted.

So, the answer to your question is "yes, absolutely, except no, not necessarily."

But it does explain why you'd have silent failures -- delete apparently doesn't fail, if the request is valid.

This is probably a fundamental artifact of the distributed nature of SQS -- if the particular node inside SQS that delivered the message were to fail, it could be that older message receipts might be lost. I'm speculating, of course.

Fundamentally, though, you do seem to have a design flaw if this is a situation you're encountering. You should either send a subsequent request to increase the visibility timeout while the message is still being processed, or set your default visibility timeout high enough that this will never happen under normal conditions. The maximum value is 12 hours, which is far too long for most use cases.
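To illustrate the first option, here is a minimal sketch of a "heartbeat" that keeps resetting the visibility timeout while a slow handler runs. It assumes `sqs` is a boto3 SQS client (or anything exposing the same `change_message_visibility` call) and that `message` is a dict as returned by `receive_message`. The function name, `handler`, and the two timing parameters are hypothetical names chosen for this example. Note that `VisibilityTimeout` in this call is counted from the moment of the call rather than added to the remaining time, so it should be comfortably larger than `interval`.

```python
import threading


def process_with_heartbeat(sqs, queue_url, message, handler,
                           visibility_timeout=60, interval=30):
    """Run handler(message), resetting the visibility timeout every `interval` seconds."""
    stop = threading.Event()

    def heartbeat():
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=visibility_timeout,
            )

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        return handler(message)
    finally:
        # Stop the heartbeat whether the handler succeeded or raised.
        stop.set()
        worker.join()
```

If the handler raises, the heartbeat stops and the message becomes visible again once the last timeout set by the heartbeat runs out.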

Also, your consumer needs a way to verify whether the message has already been acted on.
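One way to sketch that check is to key on the `MessageId`, which stays the same across repeated receives of the same message (unlike the receipt handle). In the example below, the `already_done` set is a stand-in for whatever durable store you would really use, such as a database table; `handle_once` and `do_work` are hypothetical names for illustration.

```python
def handle_once(message, already_done, do_work):
    """Run do_work(message) unless this MessageId was already handled.

    Returns True if the work was performed, False if it was skipped.
    """
    message_id = message["MessageId"]
    if message_id in already_done:
        return False
    do_work(message)
    # Record only after success, so a failed attempt is retried later.
    already_done.add(message_id)
    return True
```

With several consumers running at once, the check and the record would need to be a single atomic operation in the shared store (for example, a conditional insert); the in-memory set here does not provide that.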

Think of the visibility timeout as a retry timer.

An example from my infrastructure is a system that reacts to a file being dropped into a temporary staging bucket in S3. The queue consumer looks up the file, and does some database queries to determine which system or systems may want that file. It then copies the file to the target system(s) bucket(s), and depending on the rule, it may create database entries and/or send a message to a different queue for processing that file. This normally happens within a few seconds, and if everything goes well, the message is deleted from the queue. If something goes wrong, the consumer just forgets about the message and goes back to polling the queue.

The default visibility timeout for this queue is set to 5 minutes, which is much longer than the process normally takes, because that's how soon I want the message retried if processing fails. That's how you want to use the visibility timeout.
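The "delete on success, forget on failure" pattern described above could look roughly like the sketch below. It assumes `sqs` is a boto3 SQS client (or an object with the same `receive_message` and `delete_message` calls); `poll_once` and `handler` are hypothetical names, and this is not the actual consumer from my system.

```python
def poll_once(sqs, queue_url, handler, wait_seconds=20):
    """Receive one batch and delete each message only if handler succeeds.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,  # long polling
    )
    deleted = 0
    # The "Messages" key is absent when nothing was received.
    for message in response.get("Messages", []):
        try:
            handler(message)
        except Exception:
            # Forget the message; SQS makes it visible again for a retry
            # once the visibility timeout elapses.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Nothing in the failure path talks to SQS at all, which is the point: the visibility timeout does the retry scheduling.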

Note that processing would never take anywhere near 5 minutes under normal conditions.

After a message has been received 5 times without being deleted (the number is configurable; my setting is 5), SQS removes it from the main queue and moves it into a Dead Letter Queue instead. That queue is consumed by a separate process that stores the message and alerts me to the fact that it exceeded its allowed number of receives and was never deleted -- indicating either a poison pill message, some kind of unhandled error, or a chronic failure condition.
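For reference, both settings are ordinary queue attributes. The sketch below builds the `Attributes` dict that boto3's `set_queue_attributes` (or `create_queue`) accepts, using the 5 minute timeout and 5 receives from my setup as defaults. The helper name `queue_attributes` is hypothetical, and the ARN in the usage note is a placeholder, not a real queue.

```python
import json


def queue_attributes(dead_letter_arn, max_receives=5, visibility_seconds=300):
    """Build the Attributes dict for a main queue with a dead letter queue."""
    return {
        # SQS attribute values are passed as strings.
        "VisibilityTimeout": str(visibility_seconds),
        # RedrivePolicy is itself a JSON document encoded as a string.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dead_letter_arn,
            "maxReceiveCount": str(max_receives),
        }),
    }
```

Applying it would look something like `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=queue_attributes("arn:aws:sqs:us-east-1:123456789012:example-dlq"))`, where `sqs` and `queue_url` are assumed to exist already.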