
I have some problems creating a MongoDB shard cluster. I am trying to use 4 servers: 3 for the mongo databases (host1, host2 and host3) and one for the application side (for the mongos process). On each database server I start 4 processes:

$ mongod --configsvr --smallfiles --noprealloc --port 27020 --dbpath /mongodb/conf --logappend --logpath=/mongodb/logs/logsmongodcfg.log
$ mongod --shardsvr  --smallfiles --noprealloc --replSet repl1 --port 27030 --dbpath /mongodb/repl1 --logappend --logpath=/mongodb/logs/mongod_shard1.log
$ mongod --shardsvr  --smallfiles --noprealloc --replSet repl2 --port 27031 --dbpath /mongodb/repl2 --logappend --logpath=/mongodb/logs/mongod_shard2.log
$ mongod --shardsvr  --smallfiles --noprealloc --replSet repl3 --port 27032 --dbpath /mongodb/repl3 --logappend --logpath=/mongodb/logs/mongod_shard3.log

As you can see, each server in the cluster runs one config server and 3 mongod processes for the replica set implementation. On the application server I start only one mongos process:

mongos --configdb host1:27020,host2:27020,host3:27020 --port 27017 --logappend --logpath=/var/log/mongo/mongos.log

After this I try to configure sharding:

mongo 127.0.0.1:27017/admin
db.runCommand( { addShard : "repl1/host1:27030,host2:27030,host3:27030" } );
db.runCommand( { addShard : "repl2/host1:27031,host2:27031,host3:27031" } );
db.runCommand( { addShard : "repl3/host1:27032,host2:27032,host3:27032" } );
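Note that addShard only registers the replica sets with the cluster; for data to actually be distributed you also have to enable sharding on a database and shard a collection. A minimal sketch from the mongos shell, where the database name mydb, collection users, and shard key userId are placeholders (not from the question):

```javascript
// Run against mongos (port 27017). All names below are illustrative.
db.adminCommand({ listShards: 1 });               // verify repl1..repl3 are registered
db.adminCommand({ enablesharding: "mydb" });      // allow collections in mydb to be sharded
db.adminCommand({ shardcollection: "mydb.users",
                  key: { userId: 1 } });          // shard on an indexed field
```

These lowercase command names (enablesharding, shardcollection) are what the sh.enableSharding()/sh.shardCollection() shell helpers call underneath.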

And this scheme works, but there is one big problem. If I shut down one of the hosts, mongos cannot connect to the other hosts or to the newly elected replica set primaries. In the mongos logs I get the following:

Thu Jun 14 21:10:37 [CheckConfigServers] DBClientCursor::init call() failed
Thu Jun 14 21:10:37 [ReplicaSetMonitorWatcher] trying reconnect to host1:27030
Thu Jun 14 21:10:42 [ReplicaSetMonitorWatcher] reconnect host1:27030 failed couldn't connect to server      host1:27030
Thu Jun 14 21:10:42 [ReplicaSetMonitorWatcher] trying reconnect to host1:27032
Thu Jun 14 21:10:47 [ReplicaSetMonitorWatcher] reconnect host1:27032 failed couldn't connect to server host1:27032
Thu Jun 14 21:10:56 [LockPinger] SyncClusterConnection connecting to [host1:27020]

So if any of the 3 config servers goes down, mongos gets a connection exception. What's wrong, and how can I resolve this problem?


1 Answer


So, a couple of things here. First, if you are not running 2.0.6, update to it - there are several relevant fixes (like https://jira.mongodb.org/browse/SERVER-2988, which was actually fixed in 2.0.5, but there are some other nice-to-haves in 2.0.6) that can help if you are starting mongos with a config server down.

Next, if you shut down a config server, your cluster metadata goes read-only, and the mongos cannot do several things (balancing, splits, etc.) until the config server comes back online. So it is going to complain about the fact that one is down until you restore it.
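You can confirm this yourself from the mongos shell: with one config server down, reads of the cluster metadata still work, only metadata writes (splits, chunk migrations) are blocked. A rough sketch (exact output shapes vary by version):

```javascript
// Run against mongos; reading cluster metadata works with one config server down.
sh.status();                              // shards, sharded databases, chunk ranges
use config
db.shards.find();                         // the shards added earlier
db.settings.find({ _id: "balancer" });    // whether the balancer is explicitly disabled
```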

The replica set monitor thread will similarly continue to ping the members of the replica set that are down and fail to connect to them (it's not actually an ICMP ping; it is a TCP connection attempt).

Basically, these log messages are expected until you bring things back up.
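To tell this expected noise apart from a real failover problem, it can help to check the replica set directly: connect to a surviving member and confirm a new primary was elected. A sketch, using a port from the question (host2 is chosen arbitrarily as the surviving member):

```javascript
// Connect to any surviving member of repl1, e.g.: mongo host2:27030
rs.status();     // myState: 1 = PRIMARY, 2 = SECONDARY; shows which member took over
rs.isMaster();   // quick check: the ismaster / primary fields
```

If rs.status() shows a primary, the mongos reconnect messages for the downed member are harmless and will stop once that host is back.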