12
votes

We are using the following method in a Stateful Service on Service Fabric. The service has partitions. Sometimes we get a FabricNotReadableException from this piece of code.

public async Task HandleEvent(EventHandlerMessage message)
{
    var queue = await StateManager.GetOrAddAsync<IReliableQueue<EventHandlerMessage>>(EventHandlerServiceConstants.EventHandlerQueueName);
    using(ITransaction tx = StateManager.CreateTransaction())
    {
      await queue.EnqueueAsync(tx, message);
      await tx.CommitAsync();
    }
}

Does that mean that the partition is down and is being moved? Or that we hit a secondary partition? Because there is also a FabricNotPrimaryException that is being raised in some cases.

I have seen the MSDN link (https://msdn.microsoft.com/en-us/library/azure/system.fabric.fabricnotreadableexception.aspx). But what does

Represents an exception that is thrown when a partition cannot accept reads.

mean? What happened that a partition cannot accept a read?

3
msdn.microsoft.com/en-us/library/azure/… Google is your friend on this one - TheLethalCoder
@TheLethalCoder that does not make it any clearer :( - Michiel Overeem

3 Answers

15
votes

Under the covers Service Fabric has several states that can impact whether a given replica can safely serve reads and writes. They are:

  • Granted (you can think of this as normal operation)
  • Not Primary
  • No Write Quorum (again mainly impacting writes)
  • Reconfiguration Pending

FabricNotPrimaryException, which you mention, can be thrown whenever a write is attempted on a replica that is not currently the Primary, and maps to the Not Primary state.

FabricNotReadableException maps to the other states (you don't really need to worry or differentiate between them), and can happen in a variety of cases. One example is if the replica you are trying to perform the read on is a "Standby" replica (a replica which was down and which has been recovered, but there are already enough active replicas in the replica set). Another example is if the replica is a Primary but is being closed (say due to an upgrade or because it reported fault), or if it is currently undergoing a reconfiguration (say for example that another replica is being added). All of these conditions will result in the replica not being able to satisfy writes for a small amount of time due to certain safety checks and atomic changes that Service Fabric needs to handle under the hood.

You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted. If you get FabricNotPrimaryException, it should generally be propagated back to the client (or the client notified in some other way) so that the client re-resolves the service to find the current Primary. The default communication stacks that Service Fabric ships take care of watching for non-retriable exceptions and re-resolving on your behalf.
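The "just try the call again" advice can be sketched as a small retry helper. This is only a minimal illustration, not part of Service Fabric: the names TransientRetry, RunAsync, maxAttempts and initialDelayMs are hypothetical, and the attempt count and delays are arbitrary assumptions. The helper knows nothing about Service Fabric; the caller decides which exceptions count as retriable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not a Service Fabric API): retries an async operation
// while the caller-supplied predicate classifies the exception as transient.
public static class TransientRetry
{
    public static async Task RunAsync(
        Func<Task> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 5,          // assumed limit, tune for your service
        int initialDelayMs = 100,     // assumed starting delay
        CancellationToken cancellationToken = default(CancellationToken))
    {
        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            // On the last attempt the filter is false, so the exception propagates.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delayMs, cancellationToken);
                delayMs = Math.Min(delayMs * 2, 2000); // capped exponential backoff
            }
        }
    }
}
```

With the method from the question, a call might look like `await TransientRetry.RunAsync(() => HandleEvent(message), ex => ex is FabricNotReadableException);`. Because HandleEvent creates its own transaction, every attempt runs in a fresh transaction. FabricNotPrimaryException is deliberately not matched by that predicate, so it still reaches the caller.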

There are two current known issues with FabricNotReadableException.

  1. FabricNotReadableException should have two variants. The first should be explicitly retriable (FabricTransientNotReadableException) and the second should be FabricNotReadableException. The first version (Transient) is the most common and is probably what you are running into, certainly what you would run into in the majority of cases. The second (non-transient) would be returned in the case where you end up talking to a Standby replica. Talking to a standby won't happen with the out of the box transports and retry logic, but if you have your own it is possible to run into it.
  2. The other issue is that FabricNotReadableException does not currently derive from FabricTransientException. It should, because that would make it easier to determine what the correct behavior is.
1
votes

Posted as an answer (to asnider's comment - Mar 16 at 17:42) because it was too long for comments! :)

I am also stuck in this catch-22. My service starts and immediately receives messages. I want to encapsulate the service startup in OpenAsync and set up some ReliableDictionary values, then start receiving messages. However, at this point the Fabric is not Readable and I need to split this "startup" between OpenAsync and RunAsync :(

RunAsync in my service and OpenAsync in my client also seem to have different Cancellation tokens, so I need to work around how to deal with this too. It just all feels a bit messy. I have a number of ideas on how to tidy this up in my code but has anyone come up with an elegant solution?

It would be nice if ICommunicationClient had a RunAsync interface that was called when the Fabric becomes ready/readable and cancelled when the Fabric shuts down the replica - this would seriously simplify my life. :)
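One possible way to tidy this up, as a rough sketch only: let the listener accept messages straight away, but make each handler wait on a gate that RunAsync opens once the startup work is done. ReadyGate, SignalReady and WaitAsync are hypothetical names, not Service Fabric APIs, and the sketch assumes that reliable state is usable by the time RunAsync runs, which is what the posts here describe.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: handlers await WaitAsync, RunAsync calls SignalReady.
public sealed class ReadyGate
{
    private readonly TaskCompletionSource<bool> ready =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    // Opens the gate; calling it more than once is harmless.
    public void SignalReady()
    {
        ready.TrySetResult(true);
    }

    // Completes when the gate is open, or throws OperationCanceledException
    // if the token is cancelled first.
    public async Task WaitAsync(CancellationToken cancellationToken)
    {
        Task cancelled = Task.Delay(Timeout.Infinite, cancellationToken);
        Task finished = await Task.WhenAny(ready.Task, cancelled);
        await finished; // rethrows the cancellation if the token fired first
    }
}
```

In this arrangement the message handler would start with `await gate.WaitAsync(token);` before touching the StateManager, and RunAsync would call `gate.SignalReady();` after setting up its ReliableDictionary values. The gate only covers startup ordering; a FabricNotReadableException raised later, for example during a reconfiguration, still has to be handled separately.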

0
votes

I was running into the same problem. My listener was starting up before the main thread of the service. I queued the list of listeners needing to be started, and then activated them all early on in the main thread. As a result, all messages coming in were able to be handled and placed into the appropriate reliable storage. My simple solution (this is a service bus listener):

public Task<string> OpenAsync (CancellationToken cancellationToken)
{
  string uri;

  Start ();
  uri = "<your endpoint here>";
  return Task.FromResult (uri);
}

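// Shared by all listener instances; guards operationsStarted and pendingStarts.
public static readonly object lockOperations = new object ();
public static bool operationsStarted = false;
// Listeners opened before RunAsync has started; they wait here until StartOperations runs.
public static List<ClientAuthorizationBusCommunicationListener> pendingStarts = new List<ClientAuthorizationBusCommunicationListener> ();

// Called from RunAsync: starts every queued listener exactly once.
public static void StartOperations ()
{
  lock (lockOperations)
  {
    if (!operationsStarted)
    {
      foreach (ClientAuthorizationBusCommunicationListener listener in pendingStarts)
      {
        listener.DoStart ();
      }
      operationsStarted = true;
    }
  }
}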

private static void QueueStart (ClientAuthorizationBusCommunicationListener listener)
{
  lock (lockOperations)
  {
    if (operationsStarted)
    {
      listener.DoStart ();
    }
    else
    {
      pendingStarts.Add (listener);
    }
  }
}

private void Start ()
{
  QueueStart (this);
}

private void DoStart ()
{
  ServiceBus.WatchStatusChanges (HandleStatusMessage,
    this.clientId,
    out this.subscription);
}

========================

In the main thread, you call the function to start listener operations:

protected override async Task RunAsync (CancellationToken cancellationToken)
{
  ClientAuthorizationBusCommunicationListener.StartOperations ();

...

This problem likely manifested itself here as the bus in question already had messages and started firing the second the listener was created. Trying to access anything in state manager was throwing the exception you were asking about.