
[Update below]

We're encountering an issue where 'System.Fabric.FabricNotPrimaryException' is thrown very regularly from a new service that we're developing.

The data flow is:

  • Message is published to a particular partition by modding a unique identifier and looking up the correct partition ID. This is confirmed to be working correctly, so we know that the data is going to the correct partition.
  • Message is stored in a Reliable Queue for future processing. This step appears to work correctly and we do not receive SF exceptions at this stage.
  • Message is dequeued and handed to a handler that decides whether state in a Reliable Dictionary needs to be updated. If it does, the update is processed using a 'retry' helper within a transaction. This is where the error occurs, and it is thrown back to the main service class. A sketch of this step follows below.
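
For reference, a minimal sketch of that third step, with our actual retry helper omitted; the collection names ("messages", "state") and the string payload type are illustrative, not our real code:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

public class ProcessorSketch
{
    private readonly IReliableStateManager stateManager;

    public ProcessorSketch(IReliableStateManager stateManager)
    {
        this.stateManager = stateManager;
    }

    public async Task ProcessNextAsync(CancellationToken cancellationToken)
    {
        var queue = await stateManager.GetOrAddAsync<IReliableQueue<string>>("messages");
        var state = await stateManager.GetOrAddAsync<IReliableDictionary<string, string>>("state");

        using (ITransaction tx = stateManager.CreateTransaction())
        {
            // Dequeue the next message; this step never throws for us.
            ConditionalValue<string> message = await queue.TryDequeueAsync(tx);
            if (message.HasValue)
            {
                // This is the step where FabricNotPrimaryException surfaces.
                await state.AddOrUpdateAsync(
                    tx, message.Value, message.Value, (key, existing) => message.Value);
            }

            await tx.CommitAsync();
        }
    }
}
```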

Tracing a particular instance of an exception through the flow shows that:

  • Message reached the correct partition
  • Message was added to reliable storage on the partition
  • Message was handled
  • FabricNotPrimaryException was thrown when trying to update existing state
  • The partition ID and node name logged with the recorded exception both match the details of the Primary replica for the expected partition when viewing the SF cluster.

This last point makes the 'FabricNotPrimaryException' very confusing to me, so I was wondering if there is more information I can log to prove conclusively whether this is the root cause or not.
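
For context, this is roughly the detail we can capture at the catch site today. It's a sketch: DescribePartition is a hypothetical helper, but IStatefulServicePartition and StatefulServiceContext do expose the read/write status and replica identity directly:

```csharp
using System.Fabric;

public static class DiagnosticsSketch
{
    // Hypothetical helper: capture the replica's own view of itself at the
    // moment the exception is caught, alongside partition and node identity.
    public static string DescribePartition(
        IStatefulServicePartition partition, StatefulServiceContext context)
    {
        return $"Node: {context.NodeContext.NodeName}, " +
               $"Partition: {context.PartitionId}, " +
               $"Replica: {context.ReplicaId}, " +
               $"ReadStatus: {partition.ReadStatus}, " +
               $"WriteStatus: {partition.WriteStatus}";
    }
}
```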

Additionally, are there any other scenarios, bar the obvious, that cause this exception to be thrown?

This is the stack from an example of the exception:

System.Fabric.FabricNotPrimaryException:
   at System.Fabric.Store.TStore`5.ThrowIfNotWritable (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Fabric.Store.TStore`5+d__218.MoveNext (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary`2+<GetOrAddAsync>d__109.MoveNext (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager+<>c__DisplayClass32_0+<b__2>d.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Resiliency.Retry.RetryHelper+<>c__DisplayClass2_0`1+<<ExecuteInTransaction>b__0>d.MoveNext (Resiliency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Resiliency.Retry.RetryHelper+<ExecuteInTransaction>d__2`1.MoveNext (Resiliency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager+<ApplyUpdate>d__32.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Clients.CoreEngine.Generic.Handlers.UpdateSystemEventHandler+d__7.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at Clients.CoreEngine.Generic.CoreEngineProcessor+d__25.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)

Additional info: 24/04/2018

We have been able to recreate this on a local cluster with a debugger attached. It seems that:

  • Flow works perfectly upon initial 'clean' deploy, regardless of load
  • After a cluster rebalance (triggered via a service upgrade on the Azure cluster, and by restarting nodes on the local cluster), sporadic NotPrimary exceptions start to be thrown. These only seem to be thrown when calling 'AddOrUpdate' on a reliable dictionary; writes to a reliable queue work as expected. When inspecting one of these exceptions in the debugger, it shows the replica as Primary, with PartitionInfo.ReadStatus as Granted and the same for PartitionInfo.WriteStatus.
  • Once the cluster has repaired itself and the cluster manager shows all partitions in a healthy state, a large percentage of subsequent messages cause the same exception with the exact same state.

This is the stack from one of these exceptions, captured as an event:

"Timestamp": "2018-04-24T18:03:02.4053087+01:00", "ProviderName": "Clients-CoreEngineSvc-SAMPLE_CLIENT", "Id": 8, "Message": "'CoreEngineProcessor - OnProcessorMessage,' Exception: On processor: [Primary]. ReadStatus: [Granted]. WriteStatus: [Granted]", "ProcessId": 20732, "Level": "Error", "Keywords": "0x0000F00000000080", "EventName": "ServiceException", "ActivityID": null, "RelatedActivityID": null, "Payload": { "serviceName": "fabric:/Clients.Generic.App/CoreEngineSvc", "serviceTypeName": "CoreEngineSvcType", "partitionId": "6ee32f92-d94e-4cba-b4d1-7ce335625c9c", "applicationName": "fabric:/Clients.Generic.App", "applicationTypeName": "Clients.Generic.AppType", "nodeName": "_Node_0", "operationClass": "CoreEngineProcessor", "operationMethod": "OnProcessorMessage", "exceptionMessage": "", "unWrappedException": "Microsoft.ServiceFabric.Data.Impl ::: ::: at System.Fabric.Store.TStore5.ThrowIfNotWritable(Int64 tracer) at System.Fabric.Store.TStore5.d__224.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary2.<AddOrUpdateAsync>d__98.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter1.GetResult() at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager.d__40.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter.GetResult() at Clients.CoreEngine.Generic.CoreEngineProcessor.d__28.MoveNext() ", "exceptionString": "System.Fabric.FabricNotPrimaryException at System.Fabric.Store.TStore5.ThrowIfNotWritable(Int64 tracer) at System.Fabric.Store.TStore5.d__224.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary2.<AddOrUpdateAsync>d__98.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter1.GetResult() at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager.d__40.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter.GetResult() at Clients.CoreEngine.Generic.CoreEngineProcessor.d__28.MoveNext()", "info": "On processor: [Primary]. ReadStatus: [Granted]. WriteStatus: [Granted]", "correlationId": "00000000-0000-0000-0000-000000000000", "fixtureId": 8173405

Worth noting that we were using v2.7.198 initially, but found a reported issue suggesting that messages could potentially be sent to secondary replicas in that version. We are now on v6.0.380 with no observable changes in behaviour. – Alex Zevenbergen
'Message reached the correct partition' – how have you determined this? Are you seeing different behaviors when running locally as opposed to, say, in a staging environment? – KnowHoper
Our hashing is deterministic, so we can calculate the expected partition a message should go to based on the assigned key in SF Explorer. We're logging the partition ID with our exceptions and other info logs, and we see this behaviour in multiple environments. I thought I had tracked it down by adding logging to the function passed into the .GetOrAdd method on the reliable dictionary; it appeared that the passed function was always being called directly before the exception appeared. Initially this function could occasionally be long-running, so is it possible that a timeout threshold is being breached? – Alex Zevenbergen
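
For reference on that last question: the reliable collection methods have overloads that take an explicit timeout (the default is 4 seconds), and a breached timeout surfaces as a TimeoutException rather than a FabricNotPrimaryException. A minimal sketch with illustrative names:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

public static class TimeoutSketch
{
    // Illustrative only: pass an explicit timeout to GetOrAddAsync when the
    // value factory may be long-running. Exceeding it throws TimeoutException.
    public static async Task<string> GetOrAddWithTimeoutAsync(
        IReliableStateManager stateManager,
        IReliableDictionary<string, string> dictionary,
        string key,
        CancellationToken cancellationToken)
    {
        using (ITransaction tx = stateManager.CreateTransaction())
        {
            string value = await dictionary.GetOrAddAsync(
                tx, key, k => "default", TimeSpan.FromSeconds(30), cancellationToken);
            await tx.CommitAsync();
            return value;
        }
    }
}
```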

1 Answer


Overall, your case sounds like a combination of overly frequent or misconfigured resource-balancing settings, coupled with callers not re-resolving to the replica's new location once the exception has fired.

You can always get FabricNotPrimaryException, even after you have reached the current Primary and are doing work. The Read and Write status flags are racy (for a lot of the same reasons that FabricNotReadableException can show up and should be retried). Those flags are there mainly to help people bail out early if they're about to do some long work (e.g. transaction query plan generation, or rebuilding some index) so that they can check and bail before they even start. They're a useful optimization, but they don't tell you anything beyond the moment at which they are checked.
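
To illustrate, a guard like the following (a minimal sketch; the expensive-work method is a placeholder) is only meaningful at the instant it runs, and the write can still fail afterwards:

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;

public class StatusGuardSketch
{
    private readonly IStatefulServicePartition partition;

    public StatusGuardSketch(IStatefulServicePartition partition)
    {
        this.partition = partition;
    }

    public async Task RunLongWorkAsync(CancellationToken cancellationToken)
    {
        // Bail out early before starting expensive work. This check is only
        // valid at the moment it executes; it is an optimization, not a guarantee.
        if (this.partition.WriteStatus != PartitionAccessStatus.Granted)
        {
            throw new FabricNotPrimaryException();
        }

        // Even after the check passes, any subsequent write can still throw
        // FabricNotPrimaryException if the replica is demoted mid-flight.
        await DoExpensiveWorkAsync(cancellationToken);
    }

    private static Task DoExpensiveWorkAsync(CancellationToken cancellationToken)
        => Task.Delay(TimeSpan.FromSeconds(5), cancellationToken); // stand-in for real work
}
```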

When you see FabricNotPrimaryException you should just bail out and return the call to the client, so that it knows it was talking to the wrong replica and can re-resolve, find the new Primary, and pick up where it left off.
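
A minimal sketch of that pattern using ServicePartitionResolver; the service URI is taken from your event payload, the partition key is illustrative, and CallServiceAsync is a hypothetical stand-in for your remote call:

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Client;

public static class ReResolveSketch
{
    public static async Task CallWithReResolveAsync(CancellationToken cancellationToken)
    {
        var resolver = ServicePartitionResolver.GetDefault();
        var serviceUri = new Uri("fabric:/Clients.Generic.App/CoreEngineSvc");
        var key = new ServicePartitionKey(42); // illustrative partition key

        ResolvedServicePartition resolved =
            await resolver.ResolveAsync(serviceUri, key, cancellationToken);

        try
        {
            await CallServiceAsync(resolved, cancellationToken);
        }
        catch (FabricNotPrimaryException)
        {
            // The cached result is stale: re-resolving with the previous
            // result forces a refresh, so the retry targets the new Primary.
            resolved = await resolver.ResolveAsync(resolved, cancellationToken);
            await CallServiceAsync(resolved, cancellationToken);
        }
    }

    private static Task CallServiceAsync(
        ResolvedServicePartition partition, CancellationToken cancellationToken)
        => Task.CompletedTask; // stand-in for the real remote call
}
```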

For the most part when you see NotPrimary, one of two things has happened:

  1. The Primary has just been demoted. This accounts for the race conditions you see: when the client called and said "give me the Primary", or when you checked the write status, this was the Primary, but while doing the work that status was revoked. Normally this happens because the Primary is moving to some other node. Normal things that cause movement are nodes coming up or down, other services getting created or deleted, or application upgrades. If you're not doing upgrades, it's almost certainly the Cluster Resource Manager rebalancing and moving things around. If you're seeing too much movement, it's worth digging into that configuration to see whether you should be using different metrics, balancing thresholds, or a completely different balancing strategy to reduce unnecessary movement (see the config sketch after this list). By default SF uses just the default metrics and tries to keep things perfectly balanced, so you can definitely get disruptive levels of balancing that interrupt your work if the environment is at all dynamic.

  2. The underlying Service Fabric lease for the node has expired. The node is about to be brought down. When the lease at the federation layer fails, all Primaries on that node immediately lose their status so that they can't do more work (since they're not technically in the cluster). This is less common, but it can account for Primaries losing status when there's no movement in the cluster.
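
As an illustration of the kinds of knobs involved, these parameters live in the PlacementAndLoadBalancing section of the cluster settings; the values below (in seconds) are placeholders to tune for your own cluster, not recommendations:

```xml
<!-- Sketch only: delays proactive rebalancing so replicas move less often. -->
<Section Name="PlacementAndLoadBalancing">
  <Parameter Name="MinLoadBalancingInterval" Value="600" />
  <Parameter Name="BalancingDelayAfterNodeDown" Value="600" />
  <Parameter Name="BalancingDelayAfterNewNode" Value="600" />
</Section>
```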