During a load test of the application (which uses dynamic load reporting services), the whole application stopped working because one replica of a stateful partition started reporting the following warning:
Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happens after the replica is load-balanced, and it happened to the 4th replica of the partition even though we only target 3. So even if Service Fabric just kills that replica, the application should be fine (the primary and two other secondaries are up). However, the whole thing jams: from the logging I can see that at least 10k events still need to be processed, but processing stops entirely.
In the images above you can see the details of the particular replica. The only differences between this replica and the other secondary replicas are in the following values:
- Read Status
- Write Status
- Current Service Operation
- Queue Memory Size (in Replication Queue)
- First Sequence Number (in Replication Queue)
- Last Replication Operation Received Time Utc
- Last Copy Operation Received Time Utc
- Last Acknowledgement Sent Time Utc
I also find it odd that the Replica Status says Ready, and not Reconfiguring anymore, while the read/write status says it is still reconfiguring. I'm running the newest SDK (2.1.163, released 18-07-2016). I thought the bugfix for this was in there, but even though it became much harder to reproduce, it still occurs. Does anyone know what might be causing this or how to fix it?
Edit: Screenshot of the failing partition
Edit: Results of debugging, based on Vaclav's answer (22-7-2016)
After Vaclav's response I started logging everything in RunAsync to determine what was actually causing the problem, i.e. which part of the code did not exit when cancellation was requested. As Vaclav pointed out, the method did not stop when cancellation was requested. However, the code section in which it gets stuck appears to be native Service Fabric code:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    await tx.CommitAsync();
}
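For context, this snippet sits inside the service's RunAsync processing loop. A minimal sketch of that surrounding loop is shown below; the queue name, item type, and loop structure are assumptions for illustration, not the actual service code.

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    // Assumed setup: "eventQueue" and the string item type are placeholders.
    IReliableQueue<string> queue =
        await StateManager.GetOrAddAsync<IReliableQueue<string>>("eventQueue");

    while (!cancellationToken.IsCancellationRequested)
    {
        using (ITransaction tx = StateManager.CreateTransaction())
        {
            // Dequeue with the default 4 second timeout and the RunAsync token,
            // then commit; this is the section that never completes once the
            // replica starts its role change.
            await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
            await tx.CommitAsync();
        }
    }
}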
The queue is a ReliableQueue, the timeout is set to the default 4 seconds, and the cancellation token is the one passed to RunAsync. After adding logging between each line we got the following logging pattern:
//pre transaction
using (ITransaction tx = StateManager.CreateTransaction())
{
    //pre dequeue
    await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
    //dequeued
    await tx.CommitAsync();
    //committed
}
//post transaction
At each line I also logged the value of the cancellation request, and a background task logged when cancellation was actually requested. As a result we got, for example, this:
pre transaction: False
predequeue: False
dequeued: False
CancelationTokenFired: True
The precise location could vary, but the last log before CancelationTokenFired was always one of:
- pre transaction
- predequeue
- dequeued
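For reference, one way to get that kind of immediate log entry when the token fires is to register a callback on the RunAsync token. The sketch below illustrates this, assuming the default project template's ServiceEventSource for logging; the message texts are just the markers used above.

// Log the moment the RunAsync token is cancelled, alongside the per-line markers.
// ServiceEventSource here is the template's logger; substitute whatever logging the service uses.
cancellationToken.Register(() =>
    ServiceEventSource.Current.ServiceMessage(Context, "CancelationTokenFired: True"));

// Per-line marker with the current token state, e.g. before the transaction:
ServiceEventSource.Current.ServiceMessage(Context,
    "pre transaction: {0}", cancellationToken.IsCancellationRequested);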
As stated before, this is on the most recent SDK (released 18-07-2016), which supposedly contained a bugfix for a similar problem. The problem also occurred on the older SDK, and even more frequently back then, but even on the new version it is still reproducible on every run.