How to solve Mnesia - inconsistent_database error in clustered ejabberd environment?

Question

We have an ejabberd cluster consisting of two hosts, and we run into issues when the hosts are restarted: inconsistent_database errors appear in the logs. However, we cannot conclusively determine what in the configuration or in the module_init executions actually causes this behaviour. Deleting the Mnesia database on node1 may resolve the issue, but that is not desirable from an administration standpoint.

We would like to request a review of the data and configuration below, along with feedback on what may be causing this behaviour and how to mitigate it.

Thank you in advance.

The environment configuration is as follows:

Ejabberd version: 16.03
Number of hosts: 2
odbc_type: MySQL

Error logged:

    ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, other_node}

Repro steps:

Restart node1
Restart node2

NB: it does not repro if the hosts are restarted in reverse order.

MnesiaInfo:

Two tables appear to differ in memory size, and possibly in content, between the two nodes: muc_online_room and our custom table, renamed SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME below:

Node1:

---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <---
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0        records occupying 299      words of mem
muc_online_room: with 348      records occupying 10757    words of mem
http_bind      : with 0        records occupying 299      words of mem
carboncopy     : with 0        records occupying 299      words of mem
oauth_token    : with 0        records occupying 299      words of mem
session        : with 0        records occupying 299      words of mem
session_counter: with 0        records occupying 299      words of mem
sql_pool       : with 10       records occupying 439      words of mem
route          : with 4        records occupying 405      words of mem
iq_response    : with 0        records occupying 299      words of mem
temporarily_blocked: with 0        records occupying 299      words of mem
s2s            : with 0        records occupying 299      words of mem
route_multicast: with 0        records occupying 299      words of mem
shaper         : with 2        records occupying 321      words of mem
access         : with 28       records occupying 861      words of mem
acl            : with 6        records occupying 459      words of mem
local_config   : with 32       records occupying 1293     words of mem
schema         : with 19       records occupying 2727     words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME     : with 2457     records occupying 49953    words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes   = [SCRUBBED_NODE2,SCRUBBED_NODE1]
stopped db nodes   = [] 
master node tables = []
remote             = []
ram_copies         = [access,acl,carboncopy,http_bind,iq_response,
                      local_config,mod_register_ip,muc_online_room,route,
                      route_multicast,s2s,session,session_counter,shaper,
                      sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies        = [oauth_token,schema]
disc_only_copies   = []
[{'SCRUBBED_NODE1',disc_copies},
 {'SCRUBBED_NODE2',disc_copies}] = [schema,
                                                                  oauth_token]
[{'SCRUBBED_NODE1',ram_copies}] = [local_config,
                                                                 acl,access,
                                                                 shaper,
                                                                 sql_pool,
                                                                 mod_register_ip]
[{'SCRUBBED_NODE1',ram_copies},
 {'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
                                                                 s2s,
                                                                 temporarily_blocked,
                                                                 iq_response,
                                                                 route,
                                                                 session_counter,
                                                                 session,
                                                                 carboncopy,
                                                                 http_bind,
                                                                 muc_online_room,
                                                                 SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
2623 transactions committed, 35 aborted, 26 restarted, 60 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok

Node2:

mnesia:info().
---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <---
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0        records occupying 299      words of mem
muc_online_room: with 348      records occupying 8651     words of mem
http_bind      : with 0        records occupying 299      words of mem
carboncopy     : with 0        records occupying 299      words of mem
oauth_token    : with 0        records occupying 299      words of mem
session        : with 0        records occupying 299      words of mem
session_counter: with 0        records occupying 299      words of mem
route          : with 4        records occupying 405      words of mem
sql_pool       : with 10       records occupying 439      words of mem
iq_response    : with 0        records occupying 299      words of mem
temporarily_blocked: with 0        records occupying 299      words of mem
s2s            : with 0        records occupying 299      words of mem
route_multicast: with 0        records occupying 299      words of mem
shaper         : with 2        records occupying 321      words of mem
access         : with 28       records occupying 861      words of mem
acl            : with 6        records occupying 459      words of mem
local_config   : with 32       records occupying 1293     words of mem
schema         : with 19       records occupying 2727     words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME     : with 2457     records occupying 38232    words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes   = ['SCRUBBED_NODE1','SCRUBBED_NODE2']
stopped db nodes   = [] 
master node tables = []
remote             = []
ram_copies         = [access,acl,carboncopy,http_bind,iq_response,
                      local_config,mod_register_ip,muc_online_room,route,
                      route_multicast,s2s,session,session_counter,shaper,
                      sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies        = [oauth_token,schema]
disc_only_copies   = []
[{'SCRUBBED_NODE1',disc_copies},
 {'SCRUBBED_NODE2',disc_copies}] = [schema,
                                                                  oauth_token]
[{'SCRUBBED_NODE1',ram_copies},
 {'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
                                                                 s2s,
                                                                 temporarily_blocked,
                                                                 iq_response,
                                                                 route,
                                                                 session_counter,
                                                                 session,
                                                                 carboncopy,
                                                                 http_bind,
                                                                 muc_online_room,
                                                                 SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
[{'SCRUBBED_NODE2',ram_copies}] = [local_config,
                                                                 acl,access,
                                                                 shaper,
                                                                 sql_pool,
                                                                 mod_register_ip]
2998 transactions committed, 18 aborted, 0 restarted, 99 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
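
The per-table differences between the two dumps can be checked mechanically rather than by eye. Below is a minimal Python sketch, assuming the mnesia:info() output of each node has been saved as text. The helper names parse_tables and diff_tables are hypothetical, and the sample lines are short excerpts from the dumps above.

```python
import re

# Matches "Active tables" lines from mnesia:info(), for example:
# "muc_online_room: with 348      records occupying 10757    words of mem"
TABLE_LINE = re.compile(
    r"^\s*(\S+)\s*:\s*with\s+(\d+)\s+records occupying\s+(\d+)\s+words of mem"
)


def parse_tables(info_text):
    """Return {table_name: (records, words_of_mem)} for one node's dump."""
    tables = {}
    for line in info_text.splitlines():
        match = TABLE_LINE.match(line)
        if match:
            tables[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return tables


def diff_tables(node1_text, node2_text):
    """Return only the tables whose (records, words) differ between the nodes."""
    node1 = parse_tables(node1_text)
    node2 = parse_tables(node2_text)
    names = sorted(set(node1) | set(node2))
    return {
        name: (node1.get(name), node2.get(name))
        for name in names
        if node1.get(name) != node2.get(name)
    }


# Excerpts from the Node1 and Node2 dumps above (hypothetical variable names).
NODE1_SAMPLE = """
muc_online_room: with 348      records occupying 10757    words of mem
route          : with 4        records occupying 405      words of mem
"""
NODE2_SAMPLE = """
muc_online_room: with 348      records occupying 8651     words of mem
route          : with 4        records occupying 405      words of mem
"""

if __name__ == "__main__":
    for name, (on_node1, on_node2) in diff_tables(NODE1_SAMPLE, NODE2_SAMPLE).items():
        print(name, "node1:", on_node1, "node2:", on_node2)
```

Note that identical record counts with different word counts do not by themselves prove that the contents differ; they may equally reflect a different internal memory layout of the table on each node.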

Mickaël Rémond · Accepted Answer · 2016-09-27T10:46:42

NB: it does not repro if the hosts are restarted in reverse order.

The inconsistent_database error is there to protect your data. If you stopped the cluster nodes in one order, you have to restart them in the reverse order. Otherwise, the first node stopped will have recorded that other nodes were still active, and it will wait for them to provide the up-to-date information in order to prevent data loss.
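
The ordering rule can be written down as a small helper for an operations script. This is only an illustrative Python sketch: the function name restart_order and the node names are hypothetical, and it assumes the stop order of the nodes was recorded somewhere.

```python
def restart_order(stop_order):
    """Return the start order for the cluster.

    The node that was stopped last holds the most recent data,
    so it is started first; the node stopped first is started last.
    """
    return list(reversed(stop_order))


if __name__ == "__main__":
    # Hypothetical example: node1 was stopped first, then node2.
    stopped = ["node1", "node2"]
    print(restart_order(stopped))
```

With more than two nodes the same rule applies: start the nodes in the exact reverse of the order in which they were stopped.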
