6
votes

I have an erlang application currently running on four nodes with a replicated mnesia db that stores minimal data regarding connected clients. The mnesia replication has been working seamlessly in the past (as far as I know anyway) but a client recently noticed that one of the nodes is missing some ids related to his application.

I'm not really sure how this happened. Our network may have had a hiccup at the time. Maybe? But, of more urgency at the moment is getting the data into a good state across all nodes. Is there a way to tell mnesia to replicate from a known-good node?

2

2 Answers

3
votes

Mnesia is legendary about this issue. It's a huge PITA.

Looking at it from CAP theorem's point of view, most systems built with Mnesia end up being C-A (consistency-availability with no partition tolerance) systems. For most of the time you have (and heavily rely on) its hard consistency. Then a network partition happens... It's still available for writes, but these writes destroy consistency. And later on, Mnesia has no mechanism for automatic data repair.

Everyone who uses Mnesia in a cluster should familiarize themselves with these tradeoffs. Your problem is a clear sign that using Mnesia was a poor choice. Double so if this data is critical to you.

I too use Mnesia in such a way (sometimes we all need speed you know). But I make sure to only use it to store data that I can easily reconstruct. In general, if you need it stored on disk, Mnesia is no good, except for toy projects.

I make sure to always have this function at hand:

reinit_mnesia_cluster() ->
    rpc:multicall(mnesia, stop, []),
    AllNodes = [node() | nodes()],
    mnesia:delete_schema(AllNodes),
    mnesia:create_schema(AllNodes),
    rpc:multicall(mnesia, start, []).

Use it only after the network partition has been resolved and all nodes are reachable. This will erase all Mnesia replicas and start it anew. Again, if you can't live with what it does, then using Mnesia was a poor choice.

For important data that needs hard consistency, use SQL. For important data that needs availability, use Riak. For shared state that needs speed, use Redis. Mnesia is no replacement for these systems, although at first it does seem so.

Edit on 2014-11-16: Here is a much better article on the topic, explaining in detail what I said above https://medium.com/@jlouis666/mnesia-and-cap-d2673a92850

1
votes

Honestly, I think the cleanest way to get an out-of-sync Mnesia to replicate from a known good node is to shut down the application on the bad node, and delete all its Mnesia database files, then do the following.

Write an escript that starts Mnesia up standalone using the "bad" node name and Mnesia directory, replicates the tables from a known good node, and shuts Mnesia down. Run that escript on the bad node.

The act of replicating the tables and shutting Mnesia down gracefully puts the node back in sync with the cluster. Then, when you start the application up on the bad node, it will join up and stay in sync with the cluster.

Of course, this description lacks precise details, but that's the gist of it. There are surely less brute force ways of doing this, but unless you have massive amounts of data to replicate, I think this way is the quickest and cleanest.