I have a Stateful service with 1000 partitions and 1 replica.

In its RunAsync method, this service has an infinite while loop in which I call a Reliable Queue to get messages. If there are no messages, I wait 5 seconds and then retry. I used to do exactly that with Azure Storage Queue with success.

But with Service Fabric I'm getting thousands of FabricNotReadableExceptions, the service becomes unstable, and I'm not able to update or delete it; I have to delete the entire cluster. I tried to update it and after 18 hours it was still stuck, so there is something terribly wrong in what I'm doing.

This is the method code:

    public async Task<QueueObject> DeQueueAsync(string queueName)
    {
        var q = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
        using (var tx = StateManager.CreateTransaction())
        {
            try
            {
                var dequeued = await q.TryDequeueAsync(tx);
                if (dequeued.HasValue)
                {
                    await tx.CommitAsync();
                    var result = dequeued.Value;
                    return JSON.Deserialize<QueueObject>(result);
                }
                else
                {
                    return null;
                }
            }
            catch (Exception e)
            {
                ServiceEventSource.Current.ServiceMessage(this, $"!!ERROR!!: {e.Message} - Partition: {Partition.PartitionInfo.Id}");
                return null;
            }
        }
    }

This is the RunAsync method:

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            var message = await DeQueueAsync("MyQueue");
            if (message != null)
            {
                //process, takes around 500ms
            }
            else
            {
                Thread.Sleep(5000);
            }
        }
    }

I also replaced Thread.Sleep(5000) with Task.Delay and got thousands of "A task was canceled" errors.

What am I missing here? Is the loop too fast, so that SF cannot update the other replicas in time? Should I remove all the replicas, leaving just one?

Should I use the new ConcurrentQueue instead?

I have the problem both in production and locally, and it doesn't matter whether I use 50 or 1000 partitions.

I'm stuck and confused. Thanks

1 Answer

You need to honor the cancellationToken that is passed in to your RunAsync implementation. Service Fabric will cancel the token when it wants to stop your service for any reason - including upgrades - and it will wait indefinitely for RunAsync to return after cancelling the token. This could explain why you couldn't upgrade your application.

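I would suggest checking cancellationToken.IsCancellationRequested inside your loop and breaking out once cancellation has been requested. It also helps to pass the token to the 5-second wait, so the loop does not sit in Thread.Sleep while Service Fabric is waiting for RunAsync to return.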
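
To make this concrete, here is a minimal sketch of a RunAsync loop that honors the token. It assumes the Microsoft.ServiceFabric.Services package is referenced. QueueWorkerService, the empty QueueObject class and the stubbed DeQueueAsync are placeholders for the types and method in the question, not real SDK members.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

// Placeholder for the question's message type.
public class QueueObject { }

// Hypothetical service class name, used only for this sketch.
internal sealed class QueueWorkerService : StatefulService
{
    public QueueWorkerService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            // Throws OperationCanceledException once Service Fabric cancels
            // the token, so RunAsync ends instead of looping forever.
            cancellationToken.ThrowIfCancellationRequested();

            QueueObject message = await DeQueueAsync("MyQueue");
            if (message != null)
            {
                // process, takes around 500ms
            }
            else
            {
                // Non-blocking wait that is abandoned as soon as the token
                // is cancelled, unlike Thread.Sleep(5000).
                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
    }

    // Placeholder: the body of the question's DeQueueAsync would go here.
    private Task<QueueObject> DeQueueAsync(string queueName)
    {
        return Task.FromResult<QueueObject>(null);
    }
}
```

With this shape, cancellation surfaces as an OperationCanceledException (Task.Delay throws its subclass TaskCanceledException, whose message is "A task was canceled."). Letting that exception leave RunAsync after the token has been cancelled is the normal way to stop, so it should not be caught and logged as an error inside the loop.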

FabricNotReadableException can happen for a variety of reasons - the answer to this question has a comprehensive explanation, but the takeaway is

You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted.
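
As a rough illustration of treating the exception as retriable, the sketch below wraps an operation in a retry loop. RetryWhileNotReadableAsync is a hypothetical helper written for this answer, not part of the Service Fabric SDK, and the delay between attempts is an arbitrary choice. FabricNotReadableException lives in the System.Fabric namespace.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;

public static class ReliableCollectionRetry
{
    // Hypothetical helper: runs the operation and retries it for as long as
    // it fails with FabricNotReadableException. Any other exception
    // (including FabricNotPrimaryException) propagates to the caller.
    public static async Task<T> RetryWhileNotReadableAsync<T>(
        Func<Task<T>> operation,
        TimeSpan delay,
        CancellationToken cancellationToken)
    {
        while (true)
        {
            // Stop retrying as soon as the service is asked to shut down.
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                return await operation();
            }
            catch (FabricNotReadableException)
            {
                // Transient state: wait a little and try the call again.
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

For this to have any effect with the code in the question, the catch (Exception e) block in DeQueueAsync would have to let FabricNotReadableException through (or the transaction body would be passed to the helper as the operation). As written, the question's method swallows every exception and returns null, which the loop treats the same as an empty queue.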