6 votes

I have a Mongo cluster with 2 shards, RS1 and RS2. RS1 has about 600G (*), RS2 about 460G. A few minutes ago, I added a new shard, RS3. When I connect to mongos and check status, here is what I see:

mongos> db.printShardingStatus()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
        {  "_id" : "RS1",  "host" : "RS1/dbs1d1:27018" }
        {  "_id" : "RS2",  "host" : "RS2/dbs1d2:27018" }
        {  "_id" : "RS3",  "host" : "RS3/dbs3a:27018" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "demo",  "partitioned" : false,  "primary" : "RS1" }
        {  "_id" : "cm_prod",  "partitioned" : true,  "primary" : "RS1" }
                cm_prod.profile_daily_stats chunks:
                                RS2     16
                                RS1     16
                        too many chunks to print, use verbose if you want to force print
                cm_prod.profile_raw_stats chunks:
                                RS2     157
                                RS1     157
                        too many chunks to print, use verbose if you want to force print
                cm_prod.video_latest_stats chunks:
                                RS1     152
                                RS2     153
                        too many chunks to print, use verbose if you want to force print
                cm_prod.video_raw_stats chunks:
                                RS1     3257
                                RS2     3257
                        too many chunks to print, use verbose if you want to force print
          [ ...various unpartitioned DBs snipped...]

So, the new RS3 shard appears in the list of shards, but not in the list of "how many chunks does each shard have". I would have expected it to appear in that list with a count of 0 for all sharded collections.

Is this expected behavior that will sort itself out if I wait a bit?


2 Answers

3 votes

It will start to have chunks moved over to it, yes. In fact, it will be the default target for every chunk move for the foreseeable future (the basic selection rule is to move a chunk from the shard with the most chunks to the shard with the fewest). Each shard primary can only take part in a single migration at a time, so with that many chunks to move it is going to take some time, especially if the other two shards are busy.
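If you want to watch the counts converge without re-running db.printShardingStatus() each time, here is a minimal sketch you can run against the mongos (the namespace is just one of the sharded collections from your status output; substitute any other):

use config;
var ns = "cm_prod.video_raw_stats";  // any sharded namespace from the status output
db.shards.find().forEach(function(s) {
    // count the chunks owned by each shard for that namespace
    print(s._id + ": " + db.chunks.count({ ns : ns, shard : s._id }) + " chunks");
});

You should see the RS3 count climb from 0 as migrations complete.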

I have seen cases where people have turned off the balancer and forgotten about it. Given that your other two shards are balanced pretty well, I don't think that is the case here, but just in case...

You can check on the status of the balancer by connecting to the mongos and then doing the following:

use config;
db.settings.find( { _id : "balancer" } )

Make sure that "stopped" is not set to true.
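If it is set to true, you can clear the flag so the balancer is allowed to run again. A sketch working directly against the config database (the sh.setBalancerState(true) shell helper does the same thing, if your shell version has it):

use config;
// clear the "stopped" flag so the balancer can resume
db.settings.update({ _id : "balancer" }, { $set : { stopped : false } });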

To see what is currently holding the balancer lock, and hence doing the balancing at that time:

use config;
db.locks.find({ _id : "balancer" });

Finally, to check what the balancer is actually doing, look at the mongos log on that machine. The balancer writes lines prefixed with [Balancer] to the log. You can also look for migration messages in the logs of the primary mongod instances.
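The config database also keeps a changelog of sharding events, so you can check recent migration activity from the shell as well (a sketch; moveChunk.start / moveChunk.commit entries mark migrations):

use config;
// most recent chunk migration events, newest first
db.changelog.find({ what : /moveChunk/ }).sort({ time : -1 }).limit(10);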

EDIT: This was likely caused by SERVER-7003, a bug found in 2.2.0 post-release. If there are deletes in the range (chunk) being migrated off the source shard, it can sometimes cause this sort of paralysis, where all chunk migrations are aborted and the target shard appears to always be taking part in a migration when in fact it is not.

Since this has been fixed in 2.2.1, an upgrade is the recommended way to resolve the issue, though it can also be resolved by restarts and/or when the bad state on the target shard clears itself, as seems to have happened in the comments below.
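To confirm which version each piece is running before upgrading, you can check from the shell against the mongos and each shard primary, for example:

db.version();  // reports the server version of whatever you are connected to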

2 votes

Instead, use db.printShardingStatus(true); it will print the full list of shards, chunks, and all the other details.
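The sh.status(true) shell helper should print the same verbose output, if you prefer that form:

sh.status(true);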