[Update below]
We're encountering an issue whereby 'System.Fabric.FabricNotPrimaryException' is thrown very regularly from a new service that we're developing.
The data flow is:
- Message is published to a particular partition by modding a unique identifier and looking up the correct partition id. This is confirmed to be working correctly so we know that the data is going to the correct partition.
- Message is stored in a Reliable Queue for future processing. This step appears to work correctly and we do not receive SF exceptions at this stage
- Message is dequeued and handed to a handler that decides whether state in a ReliableDictionary needs to be updated. If it does, the update is processed using a 'rety' helper within a transaction. This is where the error occurs, and is thrown back to the main service class
Tracing a particular instance of an exception through the flow shows that:
- Message reached the correct partition
- Message was added to reliable storage on the partition
- Message was handled
- FabricNotPrimaryException was thrown when trying to update existing state
- Partition id and node name logged with the recorded exception both match the details of the Primary replica for the expected partition when viewing the SF cluster.
This last point makes the 'FabricNotPrimaryException' very confusing to me so i was wondering if there is more information that i can log to prove conclusively whether this is the root cause or not?
Additionally are there any other scenarios, bar the obvious, that cause this exception to be thrown?
This is the stack from an example of the exception:
System.Fabric.FabricNotPrimaryException: at System.Fabric.Store.TStore
5.ThrowIfNotWritable (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35) at System.Fabric.Store.TStore
5+d__218.MoveNext (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary2+<GetOrAddAsync>d__109.MoveNext (Microsoft.ServiceFabric.Data.Impl, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter
1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager+<>c__DisplayClass32_0+<b__2>d.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at Resiliency.Retry.RetryHelper+<>c__DisplayClass2_01+<<ExecuteInTransaction>b__0>d.MoveNext (Resiliency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at Resiliency.Retry.RetryHelper+<ExecuteInTransaction>d__2
1.MoveNext (Resiliency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
1.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager+<ApplyUpdate>d__32.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter
at Clients.CoreEngine.Generic.Handlers.UpdateSystemEventHandler+d__7.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089) at System.Runtime.CompilerServices.TaskAwaiter.GetResult (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
at Clients.CoreEngine.Generic.CoreEngineProcessor+d__25.MoveNext (Clients.CoreEngine.Generic, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null)
Additional info: 24/04/2018
We have been able to recreate this on a local cluster with a debugger attached. It seems that:
- Flow works perfectly upon initial 'clean' deploy, regardless of load
- After a cluster rebalance (triggered via a service upgrade on Azure cluster and by restarting nodes on local cluster) sporadic NotPrimary exceptions start to get thrown. These only seem to be throw when calling 'AddOrUpdate' on a reliable dictionary - writes to a reliable queue work as expected. When inspecting one of these exceptions in the debugger it shows the replica as primary and PartitionInfo.ReadStatus as Granted and the same for PartitionInfo.WriteStatus
- Once the cluster has repaired itself and the cluster manager shows all partitions in a healthy state, a large percentage of future messages cause the same exception with the exact same state.
This is stack from one of these exceptions captured as an event:
"Timestamp": "2018-04-24T18:03:02.4053087+01:00", "ProviderName": "Clients-CoreEngineSvc-SAMPLE_CLIENT", "Id": 8, "Message": "'CoreEngineProcessor - OnProcessorMessage,' Exception: On processor: [Primary]. ReadStatus: [Granted]. WriteStatus: [Granted]", "ProcessId": 20732, "Level": "Error", "Keywords": "0x0000F00000000080", "EventName": "ServiceException", "ActivityID": null, "RelatedActivityID": null, "Payload": { "serviceName": "fabric:/Clients.Generic.App/CoreEngineSvc", "serviceTypeName": "CoreEngineSvcType", "partitionId": "6ee32f92-d94e-4cba-b4d1-7ce335625c9c", "applicationName": "fabric:/Clients.Generic.App", "applicationTypeName": "Clients.Generic.AppType", "nodeName": "_Node_0", "operationClass": "CoreEngineProcessor", "operationMethod": "OnProcessorMessage", "exceptionMessage": "", "unWrappedException": "Microsoft.ServiceFabric.Data.Impl ::: ::: at System.Fabric.Store.TStore
5.ThrowIfNotWritable(Int64 tracer) at System.Fabric.Store.TStore
5.d__224.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary2.<AddOrUpdateAsync>d__98.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter
1.GetResult() at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager.d__40.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter.GetResult() at Clients.CoreEngine.Generic.CoreEngineProcessor.d__28.MoveNext() ", "exceptionString": "System.Fabric.FabricNotPrimaryException at System.Fabric.Store.TStore5.ThrowIfNotWritable(Int64 tracer) at System.Fabric.Store.TStore
5.d__224.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary2.<AddOrUpdateAsync>d__98.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter
1.GetResult() at Clients.CoreEngine.Generic.StateManager.CoreEngineStateManager.d__40.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.TaskAwaiter.GetResult() at Clients.CoreEngine.Generic.CoreEngineProcessor.d__28.MoveNext()", "info": "On processor: [Primary]. ReadStatus: [Granted]. WriteStatus: [Granted]", "correlationId": "00000000-0000-0000-0000-000000000000", "fixtureId": 8173405