I have a Stateful service with 1000 partitions and 1 replica.
In its RunAsync method, the service runs an infinite while loop that calls a Reliable Queue to get messages. If there are no messages, I wait 5 seconds and then retry. I used to do exactly that with Azure Storage Queue, with success.
But with Service Fabric I'm getting thousands of FabricNotReadableExceptions, the service becomes unstable, and I'm not able to update or delete it; I have to delete the entire cluster. I tried to update it, and after 18 hours it was still stuck, so there is something terribly wrong in what I'm doing.
This is the method code:
    public async Task<QueueObject> DeQueueAsync(string queueName)
    {
        var q = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
        using (var tx = StateManager.CreateTransaction())
        {
            try
            {
                var dequeued = await q.TryDequeueAsync(tx);
                if (dequeued.HasValue)
                {
                    await tx.CommitAsync();
                    var result = dequeued.Value;
                    return JSON.Deserialize<QueueObject>(result);
                }
                else
                {
                    return null;
                }
            }
            catch (Exception e)
            {
                ServiceEventSource.Current.ServiceMessage(this, $"!!ERROR!!: {e.Message} - Partition: {Partition.PartitionInfo.Id}");
                return null;
            }
        }
    }
This is the RunAsync method:
    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            var message = await DeQueueAsync("MyQueue");
            if (message != null)
            {
                //process, takes around 500ms
            }
            else
            {
                Thread.Sleep(5000);
            }
        }
    }
I also replaced Thread.Sleep(5000) with Task.Delay, and then I was getting thousands of "A task was canceled" errors.
What am I missing here? Is the loop too fast, so that SF cannot update the other replicas in time? Should I remove all the replicas, leaving just one?
Should I use the new Reliable Concurrent Queue (IReliableConcurrentQueue) instead?
I have the problem both in production and locally, and the partition count (50 or 1000) doesn't matter.
I'm stuck and confused. Thanks