1
votes

We're seeing some odd behaviour while processing a long-running message on an Azure Service Bus queue. We are renewing the lock as we run over a fairly large dataset and this particular message generally takes around 1 hour to complete. The process works great and runs through to completion, unless it is running over midnight UTC.

Bang on midnight UTC, our process throws an exception:

Microsoft.ServiceBus.Messaging.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue.

We are able to reproduce this night after night, and it never happens if we don't run the process over midnight.

Could it be that the message lock "ExpiresAtUtc" timestamp calculation doesn't handle crossing over from one day to another very gracefully?

---- UPDATE ----

Some more information that might help:

The long running process keeps the message from becoming visible again on the queue by calling RenewLock. When this processing continues over midnight we notice that the message becomes visible on the queue and processing is restarted. The message is not dropped and does not move to the dead letter queue. It's lock simply expires and the message becomes visible on the queue again so it is picked up by a processor and the process is restarted. As long as the process does not cross the boundary of midnight UTC it completes successfully.

Here are snippets of code we're using to connect / enqueue / dequeue from the queues:

Connect to queue:

private QueueClient GetQueue<TMessage>() => QueueClient.CreateFromConnectionString(this.configSection.Value.ConnectionString, typeof(TMessage).Name, ReceiveMode.PeekLock);

Enqueue message:

using (var brokeredMessage = new BrokeredMessage(message) {ContentType = "application/json"})
{
    await GetQueue<TMessage>().SendAsync(brokeredMessage).ConfigureAwait(false);
}

Dequeue message:

GetQueue<TMessage>().OnMessageAsync(
    async msg =>
    {
        TMessage body = null;
        try
        {
            body = msg.GetBody<TMessage>();

            await handler.HandleMessageAsync(body, msg.RenewLockAsync).ConfigureAwait(false);
            await msg.CompleteAsync().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            await msg.AbandonAsync().ConfigureAwait(false);
        }
    },
    new OnMessageOptions
    {
        AutoComplete = false
    }
); 

Below is a screenshot of the Azure Service Bus Metrics - Successful Requests, over midnight UTC (I'm in a UTC+1 timezone so that's 1:00 AM where I am) showing how at midnight something affecting the queue happens:

enter image description here

Below is a screenshot of our internal logging, the processing just stops and then starts up again after just over a minute - when the lock expires and the message becomes visible on the queue again so is picked up by the processor:

enter image description here

2
Could it be that the message lock "ExpiresAtUtc" timestamp calculation doesn't handle crossing over from one day to another very gracefully? Don't think so. Would be nice if you could share a simple repro.Sean Feldman
Hi @SeanFeldman - I've added a bit more info. Does this help?codefrenzy
Interesting. Looking at the chart, it seems like there's a disconnect for 2 minutes. Lock renewal is a client operation, so if client is failing to connect to the server, lock will be lost and message will become available to other instances. If lock duration is set to the maximum, 5 minutes, lock renewal will be performed earlier than that. What is the lock duration set on the queue? Also, what version of the ASB client are you using?Sean Feldman
@SeanFeldman ... lock duration is set to 1 minute and ASB client version is 4.1.6. We call RenewLock with every page of data processed (around 4-6 seconds) and the process completes fine provided it doesn't run over midnight. Its almost like the ASB ignores the RenewLock calls from 00:00 - 00:01 and then honours them again after that time (too late and causes a MessageLockLostException). We could try upping the lock timeout but it seems like a bandaid. Something happens at midnight and I hate throwing code at something until its fixed without understanding what was wrong in the first place.iDevForFun
@iDevForFun I don't know what to say other than would like to run a repro on my end to see if this behaviour can be reproduced in another environment/queue/namespace. Is there an option to put together a very minimal code repro of what you've got?Sean Feldman

2 Answers

1
votes

I am from the Azure Service Bus team. LockDuration is just duration of time, it is never internally represented as a timestamp. A message is technically locked for the LockDuration from the instant it is received, not locked until some instant in the future. For example, a message is locked for 30 seconds from the time it is received no matter what time it is. So midnight UTC is never a special case or any such thing. Our automated tests run every night and we haven't this case before. But since you are reproducing it every night, there must be something interesting going on. I don't yet know what it is. I would like to know the region where you are reproducing this. One of our team members will soon contact you.

0
votes

I'm more used to work with Topic/Subscriptions in ASB (huge scale) but as far as I know queues and Topics share the same internal infrastructure.

Basically, in our software, we use BrokeredMessage.RenewLock() as the way to keep a message locked for long running processes.

To do the work, and keep the message locked until the process ends, we start a separate Task, that every 30 seconds or so (depends on your queue "Lock Duration" configuration) we execute a brokeredMessage.RenewLock(), that is, we keep pinging the lock so is not eventually lost.

The error you are reporting is not local, is thrown by ASB when you try to 'complete' the message, deadletter it, or trying some other operation that require you to hold a lock on the message, you most likely lost the lock by timeout.

The code we use to keep the lock alive looks very similar to this:

  private void EngageAutoRenewLock(BrokeredMessage message, CancellationTokenSource renewLockCancellationTokenSource)
        {
            var renewalThread = Task.Factory.StartNew(() =>
            {
                Trace(string.Format("Starting lock renewal thread for MsgID {0}", message.MessageId));

                while (!renewLockCancellationTokenSource.IsCancellationRequested)
                {
                    // we use Sleep instead of Task.Delay to avoid get an exception when token is cancelled
                    System.Threading.Thread.Sleep(30000); 
                    // check again after sleeping
                    if (!renewLockCancellationTokenSource.IsCancellationRequested)
                    {
                        try
                        {
                            // renewlock wraps a RenewLockAsync() call
                            message.RenewLock();
                        }
                        catch (Exception ex)
                        {
                            /* Do Nothing. 

                                This exception can happen due the async nature of the RenewLock Operation (async Begin/End), it is possible we 
                                completed/abandoned/deadlettered the message in the main thread in the middle of the RenewLock operation. 

                                Additionally, we should keep retrying to renew lock until renewLockCancellationTokenSource has ben signaled. 
                                This will allow to handle other transient renewal issues. */

                            Trace(string.Format("Lock renewal failed due an exception for MsgID {0} - {1}", message.MessageId, ex.Message));
                        }
                        Trace(string.Format("Lock renewed MsgID {0}", message.MessageId));
                    }
                    else
                    {
                        Trace(string.Format("Lock renewed MsgID {0} not happened due IsCancellationRequested == true", message.MessageId));
                    }
                }
                Trace(string.Format("Lock renewal has been cancelled by cancellationToken (ok) for MsgID {0}", message.MessageId));
            }, renewLockCancellationTokenSource.Token);
        }