
I've got a 3 shard cluster consisting of the following shards:

  • bp-rs0
  • bp-rs1
  • bp-rs3

I want to remove one shard: bp-rs3.

I executed db.adminCommand( { removeShard: "bp-rs3" } ) and got back what I would expect, the typical acknowledgment.

The output said I needed to drop or movePrimary one database. I no longer needed it, so I dropped it. I'm not sure whether that caused my problem, which is:

For a few hours now, running db.adminCommand( { removeShard: "bp-rs3" } ) has returned exactly the following:

{
    "msg" : "draining ongoing",
    "state" : "ongoing",
    "remaining" : {
        "chunks" : 334,
        "dbs" : 0
    },
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [ ],
    "ok" : 1,
    "operationTime" : Timestamp(1629235413, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1629235413, 2),
        "signature" : {
            "hash" : BinData(0,"IkfHFSkxh7gQheeWlXsI/tTjU1U="),
            "keyId" : 6978594490403520515
        }
    }
}
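In a healthy drain, remaining.chunks should fall every time the status is polled. As an illustration (plain JavaScript rather than the mongo shell, since interpreting the document doesn't need a live cluster; the response shape is taken from the output above, and "completed" is the state removeShard reports once draining finishes), a small helper can classify a removeShard response:

```javascript
// Classify a removeShard status document.
// The field shapes mirror the response shown above.
function drainStatus(res) {
  if (res.state === "completed") return "done";
  const rem = res.remaining || { chunks: 0, dbs: 0 };
  if (rem.dbs > 0) {
    // Draining is blocked until these databases are dropped or moved.
    return "blocked: drop or movePrimary " + res.dbsToMove.join(", ");
  }
  if (rem.chunks > 0) return "draining: " + rem.chunks + " chunks left";
  return "draining: waiting on completion";
}

console.log(drainStatus({
  state: "ongoing",
  remaining: { chunks: 334, dbs: 0 },
  dbsToMove: []
}));
// draining: 334 chunks left
```

The key symptom here is that polling keeps returning the same chunk count, which points at a stalled migration rather than a slow one.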

Note the 334 remaining chunks; that count hasn't changed for a long time.

This wouldn't be too much of an issue, but my most-used collection is now unqueryable, which makes the app it serves unusable.

I get the following error when trying to query my only partitioned collection:

{
    "message" : "Encountered non-retryable error during query :: caused by :: Could not find host matching read preference { mode: 'primary' } for set bp-rs1",
    "ok" : 0,
    "code" : 133,
    "codeName" : "FailedToSatisfyReadPreference",
    "operationTime" : "Timestamp(1629232940, 1)",
    "$clusterTime" : {
        "clusterTime" : "Timestamp(1629232944, 2)",
        "signature" : {
            "hash" : "IlYQ/HU+EWYsm8CL2xtCziX6xtY=",
            "keyId" : "6978594490403520515"
        }
    },
    "name" : "MongoError"
}

I don't know why bp-rs1 would be affected at all; bp-rs0 is the primary shard for the database.

sh.status() returns the following:

--- Sharding Status --- 
  sharding version: {
    "_id" : NumberInt(1),
    "minCompatibleVersion" : NumberInt(5),
    "currentVersion" : NumberInt(6),
    "clusterId" : ObjectId("602d2def7771e35f1961e454")
  }
  shards:
        {  "_id" : "bp-rs0",  "host" : "bp-rs0/xxx:27020,xxx:27020",  "state" : NumberInt(1) }
        {  "_id" : "bp-rs1",  "host" : "bp-rs1/xxx:27020",  "state" : NumberInt(1) }
        {  "_id" : "bp-rs3",  "host" : "bp-rs3/xxx:27020",  "state" : NumberInt(1),  "draining" : true }
  active mongoses:
        "4.0.3" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set bp-rs1
        Time of Reported error:  Tue Aug 17 2021 23:09:45 GMT+0100 (British Summer Time)
        Migration Results for the last 24 hours: 
                241 : Success
                1 : Failed with error 'aborted', from bp-rs3 to bp-rs1
  databases:
        {  "_id" : "xxx",  "primary" : "bp-rs0",  "partitioned" : true,  "version" : {  "uuid" : UUID("c6301dba-1f34-4043-be6f-1e99dc9a8fb9"),  "lastMod" : NumberInt(1) } }
                xxx.listings
                        shard key: { "meta.canonical" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                bp-rs0  696
                                bp-rs1  695
                                bp-rs3  334
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : NumberInt(1) }
                        unique: false
                        balancing: true
                        chunks:
                                bp-rs0  1
                        { "_id" : MinKey } -->> { "_id" : MaxKey } on : bp-rs0 Timestamp(1, 0) 

Is there something I can do, either to roll back and start again, or just to make everything work as it should?

Thanks in advance


1 Answer


I connected to bp-rs1 and found that the mongod service had crashed for some reason. I started it back up, and the migration finished as I had expected.

I don't know the exact cause, but it may have been that I dropped that database while the drain was in progress.
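For anyone hitting the same symptom, the quickest check is to run rs.status() against the set named in the error and look for members whose health is not 1. A small illustration (plain JavaScript using the same document shape rs.status() returns; the host names below are made up):

```javascript
// Given an rs.status()-style document, list members that are not healthy.
// health: 1 means reachable; health: 0 means the member is down.
function unhealthyMembers(status) {
  return status.members
    .filter(function (m) { return m.health !== 1; })
    .map(function (m) { return m.name; });
}

// Hypothetical example of a set where one mongod has crashed:
console.log(unhealthyMembers({
  set: "bp-rs1",
  members: [
    { name: "host-a:27020", health: 1, stateStr: "PRIMARY" },
    { name: "host-b:27020", health: 0, stateStr: "(not reachable/healthy)" }
  ]
}));
// [ 'host-b:27020' ]
```

Restarting (or otherwise recovering) any down member, then waiting for the set to elect a primary, lets the balancer resume the chunk migrations.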