On our live system we've suddenly started seeing errors where Service Fabric fails to complete failovers. The system was last deployed in May and had been running fine since then. We have not installed any updates on the VMs. The error message is:
Error event: SourceId='System.RA', Property='ReplicaChangeRoleStatus'.
Replica had multiple failures during change role on _stdNT_4. API call: IStatefulServiceReplica.ChangeRole(P); Error = System.Fabric.FabricObjectClosedException (-2147017730)
The object is closed.
System.Runtime.InteropServices.COMException (-2147017730)
Exception from HRESULT: 0x80071BFE
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Services.Remoting.V1.FabricTransport.Runtime.FabricTransportServiceRemotingListener.<>c__DisplayClass10_0.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__26.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__18.MoveNext()
For more information see: http://aka.ms/sfhealth
We also see a System.Fabric.ServiceFabricException with the message: "An error occurred during this operation. Please check the trace logs for more information."
I can't find any other useful errors in the traces, or in the Event log on the VMs.
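For what it's worth, the decimal code (-2147017730) and the HRESULT (0x80071BFE) in the trace above are the same 32-bit value, so they point at one error rather than two. Below is a minimal Python sketch that shows the conversion and splits the value into the standard HRESULT fields (severity bit, facility, code). The function names are just illustrative and are not part of any Service Fabric API.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


def decode_hresult(value: int) -> dict:
    """Split an HRESULT into its standard fields.

    Layout: bit 31 = severity (1 means failure),
    bits 16-26 = facility, bits 0-15 = error code.
    """
    unsigned = to_unsigned32(value)
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),
        "facility": (unsigned >> 16) & 0x7FF,
        "code": unsigned & 0xFFFF,
    }


# The signed value reported next to FabricObjectClosedException in the trace.
info = decode_hresult(-2147017730)
print(info)
```

This gives hex 0x80071BFE, a failure severity, facility 7 (FACILITY_WIN32) and code 7166 (0x1BFE), which can be handy when searching for the error by either representation.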
The only interesting thing is that the affected service is the only one of our services that is stateful. We made it stateful in the last release so that we could use actor reminders.
Once the cluster starts failing, it will keep moving the primary from one node to another forever. We fixed the problem by re-deploying to a fresh cluster, but the problem came back a few days later.
I would like some advice on how we might be able to diagnose the problem, or if anyone has seen anything similar.
We are using Service Fabric version 6.1.456, ASP.NET Core version 1.1.2 and .NET Framework version 4.7.1.