I have a Hazelcast cluster with 2 instances (running in Docker containers) and a Replicated Map that is populated at initialization on the first instance. Everything works fine and fast. Recently, though, I have run into the following situation several times:
- the first instance restarted all of a sudden, rejoined the Hazelcast cluster and started to sync data from the second instance, but the sync did not finish
- the second instance also restarted immediately afterwards, for no apparent reason; it rejoined the Hazelcast cluster and synced all the data the first instance had at that point
I ended up with a cluster that looked healthy but in reality contained only the partial data that the first instance had synced before the second instance shut down. It took at least a day to notice this bad state and refresh the data.
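To make the sequence concrete, here is a toy model of what I think happened. It is plain Java with no Hazelcast API calls; the class name `PartialSyncModel`, the `sync` helper and the entry counts (1000 total, 300 synced before the crash) are made up for illustration. It also assumes that a restarted instance comes back empty and does not re-run the initial load.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: each "node" is an in-memory map, not a Hazelcast instance.
public class PartialSyncModel {

    // Copies at most `limit` entries from source to target,
    // mimicking a sync that may be interrupted part-way.
    static void sync(Map<String, String> source, Map<String, String> target, int limit) {
        int copied = 0;
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (copied++ >= limit) {
                break;
            }
            target.put(e.getKey(), e.getValue());
        }
    }

    // Returns the entry counts of {node1, node2} after the double restart.
    static int[] simulate(int total, int syncedBeforeCrash) {
        Map<String, String> node1 = new HashMap<>();
        Map<String, String> node2 = new HashMap<>();

        // Initial state: node1 loads everything, node2 receives a full copy.
        for (int i = 0; i < total; i++) {
            node1.put("key-" + i, "value-" + i);
        }
        sync(node1, node2, total);

        // node1 restarts empty and starts syncing from node2, but is cut short.
        node1 = new HashMap<>();
        sync(node2, node1, syncedBeforeCrash);

        // node2 restarts empty and syncs whatever node1 currently holds.
        node2 = new HashMap<>();
        sync(node1, node2, Integer.MAX_VALUE);

        return new int[] { node1.size(), node2.size() };
    }

    public static void main(String[] args) {
        int[] sizes = simulate(1000, 300);
        // Both nodes agree with each other, yet most of the data is gone.
        System.out.println("node1=" + sizes[0] + ", node2=" + sizes[1]);
    }
}
```

In this model both nodes end up with the same 300 entries, so nothing looks inconsistent from inside the cluster even though 700 entries are missing.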
This problem has happened in multiple environments (test and prod). The reasons why the instances restarted are unknown. My Hazelcast version is 3.7.2. I assume the same can happen with 3 or more instances too, though with lower probability.
What are the best practices in a case like this? Thanks!