I'm trying to figure out the best way to approach the following load-distribution / process-uniqueness problem in an Elixir application.
The application
My Elixir application is started on n different nodes (randomly chosen from a large pool, with no fixed IP or host name known upfront), which form a cluster. I am not sure yet what the best way of doing node discovery is, but let's ignore that for now.
In short, the application's main purpose is to keep two systems in sync over time; it is basically an integration. There is one integration per user, and a new integration can be added or an existing one removed at any time.
The problem
I'd like to have one Erlang process per integration, as it is very elegant conceptually and brings many benefits (such as a natural synchronization point for each integration). It also seems to be the way to go to scale the system.
The issue is that this process obviously needs to be unique across the whole cluster (it is difficult to predict what would happen to the data if two processes attempted to synchronize the same integration), and I'd like the work to be redistributed automatically as nodes fail or new integrations come in.
Also, when deploying a new version of the application, the new cluster is started before the old one is shut down (we do not rely on hot-code reloading). This phase of transition needs to be handled somehow.
Possible solution
One solution could be to rely on a global process. When starting, nodes register themselves, connect to the other registered nodes, then attempt to start their copy of a global Scheduler process whose only role is to start integration processes across nodes.
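To make the Scheduler idea concrete, here is a minimal sketch of what I have in mind, assuming registration through Erlang's built-in `:global` name registry. The module name, `ensure_started/0`, and `start_integration/1` are hypothetical names chosen for illustration, and the sketch only records which node an integration is assigned to; actually spawning the integration process on that node is left out.

```elixir
defmodule Scheduler do
  use GenServer

  # Every node calls this on boot. Only one registration of the global
  # name can succeed; the other nodes get the pid of the winner back.
  def ensure_started do
    case GenServer.start(__MODULE__, %{}, name: {:global, __MODULE__}) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end

  # Ask the single scheduler which node should run the given integration.
  def start_integration(integration_id) do
    GenServer.call({:global, __MODULE__}, {:start_integration, integration_id})
  end

  @impl true
  def init(assignments), do: {:ok, assignments}

  @impl true
  def handle_call({:start_integration, id}, _from, assignments) do
    case Map.fetch(assignments, id) do
      {:ok, node} ->
        # Already assigned: return the same node instead of assigning twice.
        {:reply, {:ok, node}, assignments}

      :error ->
        # Assumption: pick a random connected node. A real version would
        # start the integration process there and monitor it.
        node = Enum.random([Node.self() | Node.list()])
        {:reply, {:ok, node}, Map.put(assignments, id, node)}
    end
  end
end
```

Calling `Scheduler.ensure_started()` on a second node returns the pid that won the registration, and repeated `Scheduler.start_integration(id)` calls for the same id return the same node.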
Although this provides fault tolerance, it does not guarantee one process per integration, as the cluster can be split in two by a network partition. It also does not handle the brief period during which both the old and the new clusters are online and the old cluster is still doing work.
Some kind of global locking mechanism (via a shared Redis instance?) could be used to deal with both network partitions and the application restarting, but that seems fairly hacky.
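For reference, the kind of lock I am picturing is roughly the following sketch, based on Redis' `SET key value NX PX ttl` command. The module name, key format, and the `command` argument are all assumptions for illustration: `command` is any one-argument function that sends a command list to Redis and returns `{:ok, reply}` or `{:error, reason}`, for example `&Redix.command(conn, &1)` if the Redix client is used. Renewing and releasing the lock are left out, and a TTL-based lock like this is still only best effort (a paused process can outlive its lease).

```elixir
defmodule IntegrationLock do
  # Try to take the lock for one integration on behalf of `owner`
  # (e.g. a node name). NX makes the SET succeed only if the key does
  # not exist yet; PX makes the lock expire on its own after `ttl_ms`.
  def acquire(command, integration_id, owner, ttl_ms \\ 30_000) do
    key = "integration-lock:#{integration_id}"

    case command.(["SET", key, owner, "NX", "PX", Integer.to_string(ttl_ms)]) do
      {:ok, "OK"} -> :acquired
      {:ok, nil} -> :taken
      {:error, reason} -> {:error, reason}
    end
  end
end
```

An integration process would only start syncing after getting `:acquired`, and would have to refresh the key before the TTL runs out.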
Any suggestions?
Thanks!