1 vote

I've got a stateful service running in a Service Fabric cluster that I now know fails to honor a cancellation token passed into it. My fault.

I'm ready to release the fix, but during the upgrade I'm expecting the faulty primary replica to get stuck, since it won't honor the token passed in.
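
For context, the broken version is essentially a RunAsync loop that never observes the token, so Service Fabric can't shut the replica down cleanly. The fixed version is roughly along these lines (a simplified sketch; the class name and DoWorkAsync are placeholders, not my actual code):

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        while (true)
        {
            // The buggy version never looked at the token, so the replica couldn't
            // be shut down cleanly during the upgrade; the fix observes it each pass.
            cancellationToken.ThrowIfCancellationRequested();

            await DoWorkAsync(cancellationToken);                         // real work goes here
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }

    private Task DoWorkAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
```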

I can use Restart-ServiceFabricDeployedCodePackage or even Restart-ServiceFabricNode to manually take down the stuck replica, but that will result in a brief service interruption during the upgrade process.

Is there any way to release this fix with zero downtime?

This is not possible for a stateful service; you will need to have downtime on the upgrade. Once you have a version that supports the cancellation token, you will be fine. – Murray Foxcroft
I appreciate the feedback. Needed to know what my options were. If you want to post that as an answer, I'll accept it. – Sam Schneider

2 Answers

2 votes

This is not possible for a stateful service using the Service Fabric infrastructure; you will need to have downtime on the upgrade. Once you have a version that supports the cancellation token, you will be fine.

That said, depending on how the state is used, and if you have a load balancer between your clients and the service, you can stand up a second service instance on the new, fixed version and use the load balancer to drain your traffic across to the new version, upgrade the old one, drain traffic back to it, and then drop the second service you created. This allows for a zero-downtime scenario.

1 vote

The only workarounds I can think of are worse, since they turn off parts of the health checks during upgrades and "force" the process to come down. That doesn't make things more graceful or reduce downtime, and it has the side effect of potentially causing other health issues to be ignored.

There's always some downtime, even with fully rolling upgrades, since swapping a primary to another node is never instantaneous and callers need to discover the new location. With those commands, you're just converting a more graceful shutdown and cleanup into a failure, which results in the same primary swap. That shouldn't make a huge difference, since clients (and SF) have to deal with failure normally anyway.
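
To make "clients have to deal with failure anyway" concrete, this is the kind of handling I mean, as a minimal sketch (InvokeWithRetryAsync and the delegate you pass it are illustrative names only; in real code you'd narrow the catch to your transport's transient exceptions, and as far as I know the built-in remoting client stack already does similar retries for you):

```csharp
using System;
using System.Threading.Tasks;

internal static class ResilientCaller
{
    // Retry a service call a few times, backing off so a new primary has time
    // to come up and be discovered before the next attempt.
    public static async Task<T> InvokeWithRetryAsync<T>(
        Func<Task<T>> callServiceAsync,
        int maxAttempts = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await callServiceAsync();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Treat the failure as transient (primary moved, connection dropped)
                // and try again shortly; the final attempt lets the exception surface.
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
        }
    }
}
```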

I'd keep using those commands since they give you good manual control over which replicas/processes to poke when things get stuck.