0
votes

I'm trying to find a workaround for the following limitation: when starting an Akka Cluster from scratch, one has to make sure that the first seed node is started. That's a problem for me, because if I ever have to restart my whole system from scratch in an emergency, who knows whether the one machine everything relies on will be up and running properly? And I might not have the luxury of taking the time to change the system configuration. Hence my attempt to create the cluster manually, without relying on a static seed node list.

Now, it's easy for me to have all Akka systems register themselves somewhere (e.g. on a network filesystem, by touching a file periodically). Therefore, when starting up, a new system could:

  1. Look up the list of all systems that are supposedly alive (i.e. who touched the file system recently).
  2. a. If there is none, then the new system joins itself, i.e. starts the cluster alone. b. Otherwise it tries to join the cluster with Cluster(system).joinSeedNodes using all the other supposedly alive systems as seeds.
  3. If 2. b. doesn't succeed in reasonable time, the new system tries again, starting from 1. (looking up again the list of supposedly alive systems, as it might have changed in the meantime; in particular all other systems might have died and we'd ultimately fall into 2. a.).
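
Steps 1 and 2 above can be sketched roughly as follows. This is only a minimal illustration under several assumptions: `SeedDiscovery`, `aliveNodes`, `plan`, `JoinSelf` and `JoinSeeds` are hypothetical names, each node is assumed to touch a file named after its own node id in a shared registry directory, and Scala 2.13+ is assumed for `scala.jdk.CollectionConverters`. The sketch only decides what to do; the actual join calls are indicated in comments.

```scala
import java.nio.file.{Files, Path}
import scala.concurrent.duration._
import scala.jdk.CollectionConverters._

object SeedDiscovery {
  // Outcome of step 2: either start the cluster alone or join via the other nodes.
  sealed trait JoinPlan
  case object JoinSelf extends JoinPlan
  final case class JoinSeeds(seeds: List[String]) extends JoinPlan

  // Step 1: a node is "supposedly alive" if its heartbeat file was touched
  // no longer than maxAge ago. File names are assumed to be node ids.
  def aliveNodes(registry: Path, maxAge: FiniteDuration,
                 now: Long = System.currentTimeMillis()): List[String] = {
    val entries = Files.list(registry)
    try {
      entries.iterator().asScala
        .filter(p => now - Files.getLastModifiedTime(p).toMillis <= maxAge.toMillis)
        .map(_.getFileName.toString)
        .toList
        .sorted
    } finally entries.close()
  }

  // Step 2: decide between 2.a (join self) and 2.b (join the others as seeds).
  // With Akka this would presumably map to
  //   JoinSelf         -> Cluster(system).join(Cluster(system).selfAddress)
  //   JoinSeeds(seeds) -> Cluster(system).joinSeedNodes(<addresses for seeds>)
  // where turning a node id into an Address is left out here.
  def plan(self: String, alive: List[String]): JoinPlan = {
    val others = alive.filterNot(_ == self)
    if (others.isEmpty) JoinSelf else JoinSeeds(others)
  }
}
```

Note that two systems starting at the same moment could both see an empty list and both end up in 2.a, so this sketch does not by itself prevent two separate clusters from forming.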

I'm unsure how to implement 3.: How do I know whether joining has succeeded or failed? (Do I need to subscribe to cluster events?) And is it possible, in case of failure, to call Cluster(system).joinSeedNodes again? The official documentation is not very explicit on this point, and I'm not 100% sure how to interpret the following in my case (can I make several attempts, using different seeds?):

An actor system can only join a cluster once. Additional attempts will be ignored. When it has successfully joined it must be restarted to be able to join another cluster or to join the same cluster again.

Finally, let me clarify that I'm building a small cluster (it's just 10 systems for the moment and it won't grow very big) and that it has to be restarted from scratch now and then (I cannot assume the cluster will be alive forever).

Thx

3 Answers

1
votes

I'm answering my own question to let people know how I sorted out my issues in the end. Michal Borowiecki's answer mentioned the ConstructR project and I built my answer on their code.

How do I know whether joining has succeeded or failed? After issuing Cluster(system).joinSeedNodes I subscribe to cluster events and start a timeout:

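import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberLeft, MemberUp}
import scala.concurrent.duration._
private case object JoinTimeout
...
Cluster(context.system).subscribe(self, InitialStateAsEvents, classOf[MemberUp], classOf[MemberLeft])
// scheduleOnce needs an implicit ExecutionContext in scope, e.g. via `import context.dispatcher`
context.system.scheduler.scheduleOnce(15.seconds, self, JoinTimeout)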

The receive is:

val address = Cluster(system).selfAddress
...
case MemberUp(member) if member.address == address =>
  // Hooray, I joined the cluster!
case JoinTimeout =>
  // Oops, couldn't join
  system.terminate()

Is it possible in case of failure to call Cluster(system).joinSeedNodes again? Maybe, maybe not. But actually I simply terminate the actor system if joining didn't succeed and restart it for another try (so it's a "let it crash" pattern at the actor system level).
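
The restart-and-retry idea can be sketched as a small loop around the whole join attempt. This is only an illustration, not the code I actually run: `RestartLoop`, `retry` and `tryJoin` are hypothetical names, and `tryJoin` stands for "create a fresh actor system, attempt to join, and report whether MemberUp for the own address arrived before the timeout".

```scala
import scala.annotation.tailrec

object RestartLoop {
  // Runs tryJoin with the attempt number (1, 2, ...) until it reports success
  // or maxAttempts is exhausted. Returns the number of the successful attempt.
  // In a real setup each call to tryJoin would start a new ActorSystem and the
  // previous one would already have been terminated after its JoinTimeout.
  @tailrec
  def retry(maxAttempts: Int, attempt: Int = 1)(tryJoin: Int => Boolean): Option[Int] =
    if (attempt > maxAttempts) None
    else if (tryJoin(attempt)) Some(attempt)
    else retry(maxAttempts, attempt + 1)(tryJoin)
}
```

Because every attempt uses a brand-new actor system, the "can only join once" restriction from the documentation never comes into play.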

0
votes

You don't need seed nodes. You only need them if you want the cluster to form automatically at startup.

You can start your individual applications and then have them "manually" join the cluster at any point in time. For example, if you have HTTP enabled, you can use the akka-management library (or implement a subset of it yourself; it consists of basic cluster library functions, just nicely wrapped).

I strongly discourage the touch approach. How do you synchronize reading and writing the touch files between nodes? What if one node reads a transient state (while another node is writing it)?

I'd say either go full auto (with multiple seed-nodes), or go full "manual" and have another system be in charge of managing the clusterization of your nodes. By that I mean you start them up individually, and they join the cluster only when ordered to do so by the external supervisor (also very helpful to manage split-brains).

0
votes

We've started using the ConstructR extension instead of a static list of seed nodes:

https://github.com/hseeberger/constructr

This doesn't have the limitation of a statically-configured 1st seed-node having to be up after a full cluster restart.

Instead, it relies on a highly available lookup service. ConstructR supports etcd natively, and there are extensions for (at least) ZooKeeper and Consul available. Since we already have a ZooKeeper cluster for Kafka, we went for ZooKeeper:

https://github.com/typesafehub/constructr-zookeeper