During a load test of the application (which uses dynamic load reporting services), the whole application stopped working because one replica of a stateful partition started reporting the following warning:
Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happens after the replica is load-balanced, and it happened to the 4th replica of the partition even though we only target 3. So even if Service Fabric just kills that replica, the application should be fine (the primary and two other secondaries are up). However, the whole thing jams: from the logging I can see that at least 10k events still need to be processed, but processing stops entirely.
In the images above you can see the details of the particular replica. The only differences between this replica and the other secondary replicas are in the following values:
- Read Status
- Write Status
- Current Service Operation
- Queue Memory Size (in Replication Queue)
- First Sequence Number (in Replication Queue)
- Last Replication Operation Received Time Utc
- Last Copy Operation Received Time Utc
- Last Acknowledgement Sent Time Utc
I also find it odd that the Replica Status says Ready, and not Reconfiguring anymore, while the read/write status says it is still reconfiguring. I'm running the newest SDK (2.1.163, released 18-07-2016). I thought the bugfix for this was in there, but even though it became much harder to reproduce, it still occurs. Does anyone know what might be causing this or how to fix it?
Edit: Screenshot of the failing partition
Edit: Results of debugging, based on Vaclav's answer (22-7-2016)
After Vaclav's response I started logging everything in RunAsync to determine what was actually causing the problem, i.e. which part of the code did not exit when cancellation was requested. As Vaclav pointed out, the method did not stop when cancellation was requested. However, the code section in which it gets stuck appears to be native Service Fabric code:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    await tx.CommitAsync();
}
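For context, this snippet sits inside the service's RunAsync processing loop. A minimal sketch of that surrounding loop is shown below; the queue name, item type, and loop structure are assumptions for illustration, not the actual service code.

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // Assumed setup: "eventQueue" and the string item type are placeholders.
    IReliableQueue<string> queue =
        await StateManager.GetOrAddAsync<IReliableQueue<string>>("eventQueue");

    while (!cancellationToken.IsCancellationRequested)
    {
        using (ITransaction tx = StateManager.CreateTransaction())
        {
            // Dequeue with the default 4 second timeout and the RunAsync token,
            // then commit; this is the section that never completes once the
            // replica starts its role change.
            await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
            await tx.CommitAsync();
        }
    }
}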
The queue is a ReliableQueue, the timeout is set to the default 4 seconds, and the cancellation token is the one passed to RunAsync. After adding logging between each line we got the following logging pattern:
//pre transaction
using (ITransaction tx = StateManager.CreateTransaction())
{
    //pre dequeue
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    //dequeued
    await tx.CommitAsync();
    //committed
}
//post transaction
At each line I also logged the value of the cancellation request, and a background task logged when cancellation was actually requested. As a result we got, for example, this:
pre transaction: False
predequeue: False
dequeued: False
CancelationTokenFired: True
The precise location could vary, but the last log before CancelationTokenFired was always one of:
- pre transaction
- predequeue
- dequeued
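For reference, one way to get that kind of immediate log entry when the token fires is to register a callback on the RunAsync token. The sketch below illustrates this, assuming the default project template's ServiceEventSource for logging; the message texts are just the markers used above.

// Log the moment the RunAsync token is cancelled, alongside the per-line markers.
// ServiceEventSource here is the template's logger; substitute whatever logging the service uses.
cancellationToken.Register(() =>
    ServiceEventSource.Current.ServiceMessage(Context, "CancelationTokenFired: True"));

// Per-line marker with the current token state, e.g. before the transaction:
ServiceEventSource.Current.ServiceMessage(Context,
    "pre transaction: {0}", cancellationToken.IsCancellationRequested);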
As stated before, this is on the most recent SDK (released 18-07-2016), which supposedly contained a bugfix for a similar problem. The problem also occurred on the older SDK, and even more frequently back then, but even on the new version it is still reproducible on every run.