0
votes

I've asked this question on Github also - https://github.com/Azure/service-fabric-issues/issues/379

I have (n) actors that are executing on a continuous reminder every second.

These actor's have been running fine for the last 4 days when out of no where every instance receives the below exception on calling StateManager.GetStateAsync. Subsequently, I see all the actors are deactivated.

I cannot find any information relating to this exception being encountered by reliable actors.

Once this exception occurs and the actors are deactivated, they do not get re-activated.

What are the conditions for this error to occur and how can I further diagnose the problem?

"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()

Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:

Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false. Partition reconfiguration is taking longer than expected. fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d S/S IB _Node1_4 Up 131456742276273986 S/P RD _Node1_2 Up 131456742361691499 P/S RD _Node1_0 Down 131457861497316547 (Showing 3 out of 4 replicas. Total available replicas: 1.)

With a warning in the primary replica of that partition:

Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.

And a warning in the ActiveSecondary:

Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0

3 out of 5 Nodes are showing the following error:

Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.

More Information:

My cluster setup consists of 5 nodes of D1 virtual machines.

Event viewer errors in Microsoft-Service Fabric application:

I see quite a lot of

Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl. System.ComponentModel.Win32Exception (0x80004005): The handle is invalid at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime) at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs) at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)

and a heap of warnings like:

Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = ‎2017‎-‎08‎-‎01T04:51:39.789083900Z

And finally I think I might be at the root of all this. Event Viewer Application Logs has a whole ream of errors like:

Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.

Ok so, that error is pointing to the D drive, which is Temporary Storage. It has 549 MB free of 50 GB. Should Service fabric really be persisting to Temporary Storage ?

1

1 Answers

2
votes

Re: the errors - yeah looks like disk full causing failures. Just to close the loop here - looks like you found out that your state wasn't actually getting distributed in the cluster, and once you fixed that you stopped seeing disk full. Your capacity planning should hopefully make more sense now.

Regarding safety: TLDR: Using the temporary drive is fine because you're using Service Fabric. If you weren't then using that drive for real data would be a very bad idea.

Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single machine app to think that data written there is safe, since Azure may Service heal the VM in response to a bunch of different things.

In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that a lot of the management operations that would destroy that data are managed in the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on that node, we move your service somewhere else before allowing that to happen, and try to stall the update in the meantime. Some more info on that is here.