We have a two-node ejabberd cluster that runs into problems when the hosts are restarted: inconsistent_database errors are logged. However, we have not been able to determine conclusively which part of our configuration or which module_init execution causes this behaviour. Deleting the Mnesia database on node1 may help resolve the issue, but that is not desirable from an administration standpoint.
We would like to request a review of the data and configuration below, along with feedback on what may be causing this behaviour and how to mitigate it.
Thank you in advance.
The environment configuration is as follows:
- ejabberd version: 16.03
- Number of hosts: 2
- odbc_type: MySQL
Error logged:
** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, other_node}
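To see exactly when this event fires relative to each restart, a small subscriber to Mnesia system events could log it on each node. This is a sketch, assuming the module (hypothetically named mnesia_partition_watch) is loaded on both nodes and start/0 is called once Mnesia is running. mnesia:subscribe(system) and the {mnesia_system_event, {inconsistent_database, Context, Node}} message are documented Mnesia behaviour; Context is running_partitioned_network in the log above, and starting_partitioned_network is the other common value. Everything else here is illustrative.

```erlang
%% Hypothetical helper module: logs Mnesia inconsistent_database system
%% events so they can be correlated with the order of node restarts.
-module(mnesia_partition_watch).
-export([start/0, format_event/1]).

%% Spawn a process that subscribes itself to Mnesia system events on
%% the local node and then waits for them.
start() ->
    spawn(fun() ->
                  {ok, _Node} = mnesia:subscribe(system),
                  loop()
          end).

%% Receive system events forever; only inconsistent_database is logged.
loop() ->
    receive
        {mnesia_system_event, Event} ->
            case format_event(Event) of
                ignore -> ok;
                Msg -> error_logger:warning_msg("~s~n", [Msg])
            end,
            loop()
    end.

%% Turn an inconsistent_database event into a readable line; ignore others.
format_event({inconsistent_database, Context, Node}) ->
    lists:flatten(io_lib:format("inconsistent_database (~p) reported against ~p",
                                [Context, Node]));
format_event(_Other) ->
    ignore.
```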
Steps to reproduce:
- Restart node1
- Restart node2
NB: the issue does not reproduce if the hosts are restarted in the reverse order (node2 first, then node1).
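mnesia:info() output:
Two tables appear to differ between the nodes: muc_online_room and our custom table (renamed to SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME below). Each has the same number of records on both nodes (348 and 2457 respectively) but a different memory footprint (10757 vs 8651 words for muc_online_room and 49953 vs 38232 words for the custom table, node1 vs node2), so their content may differ as well: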
Node1:
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
mod_register_ip: with 0 records occupying 299 words of mem
muc_online_room: with 348 records occupying 10757 words of mem
http_bind : with 0 records occupying 299 words of mem
carboncopy : with 0 records occupying 299 words of mem
oauth_token : with 0 records occupying 299 words of mem
session : with 0 records occupying 299 words of mem
session_counter: with 0 records occupying 299 words of mem
sql_pool : with 10 records occupying 439 words of mem
route : with 4 records occupying 405 words of mem
iq_response : with 0 records occupying 299 words of mem
temporarily_blocked: with 0 records occupying 299 words of mem
s2s : with 0 records occupying 299 words of mem
route_multicast: with 0 records occupying 299 words of mem
shaper : with 2 records occupying 321 words of mem
access : with 28 records occupying 861 words of mem
acl : with 6 records occupying 459 words of mem
local_config : with 32 records occupying 1293 words of mem
schema : with 19 records occupying 2727 words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME : with 2457 records occupying 49953 words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes = [SCRUBBED_NODE2,SCRUBBED_NODE1]
stopped db nodes = []
master node tables = []
remote = []
ram_copies = [access,acl,carboncopy,http_bind,iq_response,
local_config,mod_register_ip,muc_online_room,route,
route_multicast,s2s,session,session_counter,shaper,
sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies = [oauth_token,schema]
disc_only_copies = []
[{'SCRUBBED_NODE1',disc_copies},
{'SCRUBBED_NODE2',disc_copies}] = [schema,
oauth_token]
[{'SCRUBBED_NODE1',ram_copies}] = [local_config,
acl,access,
shaper,
sql_pool,
mod_register_ip]
[{'SCRUBBED_NODE1',ram_copies},
{'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
s2s,
temporarily_blocked,
iq_response,
route,
session_counter,
session,
carboncopy,
http_bind,
muc_online_room,
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
2623 transactions committed, 35 aborted, 26 restarted, 60 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
Node2:
mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
mod_register_ip: with 0 records occupying 299 words of mem
muc_online_room: with 348 records occupying 8651 words of mem
http_bind : with 0 records occupying 299 words of mem
carboncopy : with 0 records occupying 299 words of mem
oauth_token : with 0 records occupying 299 words of mem
session : with 0 records occupying 299 words of mem
session_counter: with 0 records occupying 299 words of mem
route : with 4 records occupying 405 words of mem
sql_pool : with 10 records occupying 439 words of mem
iq_response : with 0 records occupying 299 words of mem
temporarily_blocked: with 0 records occupying 299 words of mem
s2s : with 0 records occupying 299 words of mem
route_multicast: with 0 records occupying 299 words of mem
shaper : with 2 records occupying 321 words of mem
access : with 28 records occupying 861 words of mem
acl : with 6 records occupying 459 words of mem
local_config : with 32 records occupying 1293 words of mem
schema : with 19 records occupying 2727 words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME : with 2457 records occupying 38232 words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes = ['SCRUBBED_NODE1','SCRUBBED_NODE2']
stopped db nodes = []
master node tables = []
remote = []
ram_copies = [access,acl,carboncopy,http_bind,iq_response,
local_config,mod_register_ip,muc_online_room,route,
route_multicast,s2s,session,session_counter,shaper,
sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies = [oauth_token,schema]
disc_only_copies = []
[{'SCRUBBED_NODE1',disc_copies},
{'SCRUBBED_NODE2',disc_copies}] = [schema,
oauth_token]
[{'SCRUBBED_NODE1',ram_copies},
{'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
s2s,
temporarily_blocked,
iq_response,
route,
session_counter,
session,
carboncopy,
http_bind,
muc_online_room,
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
[{'SCRUBBED_NODE2',ram_copies}] = [local_config,
acl,access,
shaper,
sql_pool,
mod_register_ip]
2998 transactions committed, 18 aborted, 0 restarted, 99 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
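Rather than reading the size and memory figures off two separate dumps, the comparison could be scripted. The sketch below assumes an extra module can be loaded on the nodes; the module name mnesia_table_compare and its functions are hypothetical, while mnesia:table_info/2 (with the size and memory items) and rpc:call/4 are standard OTP calls. One caveat when interpreting the result: memory is reported in words, and equal record counts with different memory is not by itself proof of divergent content. For example, a record that stores a pid is likely to take more words on a node where that pid refers to a process on the other node than on the node where the process runs.

```erlang
%% Hypothetical helper: compare record count and memory of tables
%% across two cluster nodes.
-module(mnesia_table_compare).
-export([collect/2, diff/2]).

%% Read {Table, Size, MemoryWords} for each table from Node over rpc.
%% Intended for tables with a replica on Node (e.g. the ram_copies tables
%% shared by both nodes) so the numbers reflect Node's own copy.
%% A failed call shows up as {badrpc, Reason} in place of a number.
collect(Node, Tables) ->
    [{T,
      rpc:call(Node, mnesia, table_info, [T, size]),
      rpc:call(Node, mnesia, table_info, [T, memory])}
     || T <- Tables].

%% Keep only the tables whose size or memory differ between two
%% collect/2 results built from the same table list (same order).
diff(A, B) ->
    [{T, {S1, M1}, {S2, M2}}
     || {{T, S1, M1}, {T, S2, M2}} <- lists:zip(A, B),
        S1 =/= S2 orelse M1 =/= M2].

%% Example from an Erlang shell on either node (scrubbed names as above):
%%   Tabs = [muc_online_room, 'SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME'],
%%   mnesia_table_compare:diff(
%%       mnesia_table_compare:collect('SCRUBBED_NODE1', Tabs),
%%       mnesia_table_compare:collect('SCRUBBED_NODE2', Tabs)).
```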