I'm using Mongo 2.6.9 with a cluster of 2 shards, each shard has 3 replicas, one of which is hidden. It's a 5 machines deployment running on RedHat where 4 machines contain a single replica of 1 shard and the 5th machine contains the hidden replicas of both shards. There is a load running of around 250 inserts per second and 50 updates per second. These are simple inserts and updates of pretty small documents. In addition there is a small load of small files inserted to GridFS (around 1 file / second). The average file size is less than 1 MB.
There are 14 indexes defined for the involved collections. Those will be required when I will be adding the application that will be reading from the DB.
From the logs of the primary replicas I see during the whole run a huge amount of simple inserts and updates or even GetLastError requests that take hundreds of ms or even sometimes seconds (the default logging level only shows queries that took more than 100ms). For example this simple update uses an index for the query and doesn't update any index:
2015-10-12T06:12:17.258+0000 [conn166086] update chatlogging.SESSIONS query: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" } update: { $set: { EndTime: new Date(1444630335126) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:430 2131ms
2015-10-12T06:12:17.259+0000 [conn166086] command chatlogging.$cmd command: update { update: "SESSIONS", updates: [ { q: { _id: "743_12101506113018605820fe43610c0a81eb9_IM" }, u: { $set: { EndTime: new Date(1444630335126) } }, multi: false, upsert: false } ], writeConcern: { w: 1 }, ordered: true, metadata: { shardName: "S1R", shardVersion: [ Timestamp 17000|3, ObjectId('56017697ca848545f5f47bf5') ], session: 0 } } ntoreturn:1 keyUpdates:0 numYields:0 reslen:155 2132ms
All inserts and updates are made with w:1, j:1.
The machines have plenty of available CPU and memory. The disk I/O is significant, but not coming anywhere near 100% when these occur.
I really need to figure out what's causing this unexpectedly slow responsiveness of the DB. It's very possible that I need to change something in the way the DB is set up. Mongo runs with default configuration including the logging level.
An update -
I've continued looking into this issue and here are additional details that I'm hoping will allow to pinpoint the root cause of the problem or at least point me to the right direction:
The total DB size for a single shard is more than 200GB at the moment. The indexes being almost 50GB. Here is the relevant part from db.stats() and the mem part from db.ServerStatus() from the primary replica of one of the shards:
"collections" : 7,
"objects" : 73497326,
"avgObjSize" : 1859.9700916465995,
"dataSize" : 136702828176,
"storageSize" : 151309253648,
"numExtents" : 150,
"indexes" : 14,
"indexSize" : 46951096976,
"fileSize" : 223163187200,
"nsSizeMB" : 16,
"mem" : { "bits" : 64, "resident" : 5155, "virtual" : 526027, "supported" : true, "mapped" : 262129, "mappedWithJournal" : 524258 },
The servers have 8GB of RAM, out of which the mongod process use around 5GB. So the majority of the data and probably more important the indexes is not kept in memory. Can this be the our root cause? When I previously wrote that the system has plenty of free memory, I was refering to the fact that the mongod process isn't using as much as it could and also that most of the RAM is used for cached memory that can be released if required:
Here is the output of mongostat from the same mongod:
I do see few faults in these, but these numbers look too low to me to indicate a real problem. Am I wrong?
Also I don't know whether the numbers seen in "locked db" are considered reasonable, or do those indicate that we have locks contention?
During the same timeframe when these stats were taken, many simple update operations that find a document based on an index and don't update an index, like the following one took hundreds of ms:
2015-10-19T09:44:09.220+0000 [conn210844] update chatlogging.SESSIONS query: { _id: "838_19101509420840010420fe43620c0a81eb9_IM" } update: { $set: { EndTime: new Date(1445247849092) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:0 locks(micros) w:214 126ms
Many other types of insert or update operations take hundreds of ms too. So the issue looks to be system wide and not related to a specific type of query. Using mtools I'm not able to find operations that scan lots of documents.
I'm hoping that here I'll be able to get help with regards to finding the root cause of the problem. I can provide whatever additional info or statistics from the system.
Thank you in advance,
Leonid