
On our live system we've suddenly started encountering errors, where Service Fabric is failing to do failovers. The system was last deployed back in May and has been running fine since then. We have not installed any updates on the VMs. The error message is:

Error event: SourceId='System.RA', Property='ReplicaChangeRoleStatus'. Replica had multiple failures during change role on _stdNT_4. API call: IStatefulServiceReplica.ChangeRole(P); Error = System.Fabric.FabricObjectClosedException (-2147017730) The object is closed. System.Runtime.InteropServices.COMException (-2147017730) Exception from HRESULT: 0x80071BFE at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Services.Remoting.V1.FabricTransport.Runtime.FabricTransportServiceRemotingListener.<>c__DisplayClass10_0.<b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__26.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__18.MoveNext() For more information see: http://aka.ms/sfhealth

We also see System.Fabric.ServiceFabricException. An error occurred during this operation. Please check the trace logs for more information.

I can't find any other useful errors in the traces, or in the Event log on the VMs.

The only interesting observation is that the affected service is the only one of our services that is stateful. We made it stateful in the last release so that we could use actor reminders.

Once the cluster starts failing, it will keep moving the primary from one node to another forever. We fixed the problem by re-deploying to a fresh cluster, but the problem came back a few days later.

I would like some advice on how we might be able to diagnose the problem, or if anyone has seen anything similar.

Using Service Fabric version 6.1.456, ASP.NET Core version 1.1.2, and .NET Framework version 4.7.1.


1 Answer


Stateful services have the concept of Primary and Secondary replicas for each partition.

That means only the primary can handle work (read/write operations), while the secondaries replicate the state changes that occurred on the primary.

When provisioning these replicas, SF calls ChangeRole on the replica to promote it to primary. This triggers OpenAsync() to open the listeners for incoming calls and starts any work related to that replica.
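As a point of reference, the FabricTransportServiceRemotingListener that appears in your stack trace is typically registered like this (a minimal sketch using the Service Fabric V1 Remoting SDK types; your actual registration may differ):

```csharp
// Inside your StatefulService-derived class. SF calls OpenAsync on each of
// these listeners when the replica becomes primary, and CloseAsync when it
// is demoted or closed.
protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    return new[]
    {
        new ServiceReplicaListener(context =>
            new FabricTransportServiceRemotingListener(context, this))
    };
}
```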

When you perform an upgrade, or the cluster rebalances your services, ChangeRole is called again to demote a primary to secondary. This signals the cancellation token your service received (in RunAsync) and closes the listeners, and your service should stop any work in progress (like loops or blocking operations). OnChangeRoleAsync is also called if it is overridden in your service.

The common mistake in these scenarios is code that does not observe the cancellation token, or otherwise does not honor the role change by stopping pending work. This causes the service to hang during the role change, producing exactly these failures.
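A minimal sketch of a RunAsync loop that honors the token (the work itself is a placeholder; the pattern of checking the token and passing it to awaited calls is what matters):

```csharp
// Inside your StatefulService-derived class. SF cancels this token when
// the replica is demoted or closed; honoring it lets ChangeRole complete.
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        // Throws OperationCanceledException promptly on demotion,
        // instead of leaving ChangeRole hanging.
        cancellationToken.ThrowIfCancellationRequested();

        // ... perform one unit of work here ...

        // Pass the token to every awaited call so blocking waits
        // are also interrupted on role change.
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}
```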

If your service does not respond to these API calls in a reasonable amount of time, Service Fabric can forcibly terminate your service. Usually this only happens during application upgrades or when a service is being deleted. This timeout is 15 minutes by default.

Take a look at these docs for more info: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle#stateful-service-startup