MongoDB Aggregation Framework performance slow over millions of documents

Question

background

Our system is carrier grade and extremely robust, it has been load tested to handle 5000 transactions per second, and for each transaction a document is inserted into a single MongoDB collection (no updates or queries in this application, it is write-only). That amounts to ~700MM documents per day which is our benchmark.

The MongoDB deployment is not yet sharded, we have 1x replicaset with 1 master and 2 slaves all of which are type m2.2xlarge instances on ec2. Each instance is backed by a 1TB RAID0 stripe consisting of 8 volumes (no PIOPS). We are using the node-mongodb-native driver with c++ native BSON parser for optimal write performance and have attempted to model the document structure accordingly.

note

Documents are tiny (120 bytes)
The document includes a “time bucket” (h[our], d[ay], m[onth], y[ear]) along with the “t[ime]” field
We have an index on the collection to query by “c[ustomer]” and “a” which is a highly random but non-unique tag
We have looked into partitioning data into separate collections, though in this example all data is hot.
We are also looking into pre-aggregation though this cannot be done in realtime.

requirement

For reporting we need to calculate the amount of unique “a” tags per month, along with their totals by customer over any given period
A report takes about 60sec to run over a sample (full collection) of 9.5MM documents stored over 2 hours. Details below:

document

{
  _id: ObjectID(),
  a: ‘string’,
  b: ‘string’,
  c: ‘string’ or <int>,
  g: ‘string’ or <not_exist>,
  t: ISODate(),
  h: <int>,
  d: <int>,
  m: <int>,
  y: <int>
}

index

col.ensureIndex({ c: 1, a: 1, y: 1, m: 1, d: 1, h: 1 });

aggregation query

col.aggregate([
    { $match: { c: 'customer_1', y: 2013, m: 11 } },
    { $group: { _id: { c: '$c', y: '$y', m: '$m' }, a: { $addToSet: '$a' }, t: { $sum: 1 } } },
    { $unwind: '$a' },
    { $group: { _id: { c: '$_id.c', y: '$_id.y', m: '$_id.m', t: '$t' }, a: { $sum: 1 } } },
    { $sort: { '_id.m': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            y: '$_id.y', 
            m: '$_id.m',
            a: 1,
            t: '$_id.t'
        }
    },
    { $group: { _id: { c: '$c', y: '$y' }, monthly: { $push: { m: '$m', a: '$a', t: '$t' } } } },
    { $sort: { '_id.y': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            y: '$_id.y', 
            monthly: 1
        }
    },
    { $group: { _id: { c: '$c' }, yearly: { $push: { y: '$y', monthly: '$monthly' } } } },
    { $sort: { '_id.c': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            yearly: 1
        }
    }    
]);

aggregation result

[
    {
        "yearly": [
            {
                "y": 2013,
                "monthly": [
                    {
                        "m": 11,
                        "a": 3465652,
                        "t": 9844935
                    }
                ]
            }
        ],
        "c": "customer_1"
    }
]

63181ms

aggregation explain

{
        "cursor" : "BtreeCursor c_1_a_1_y_1_m_1_d_1_h_1",
        "isMultiKey" : false,
        "n" : 9844935,
        "nscannedObjects" : 0,
        "nscanned" : 9844935,
        "nscannedObjectsAllPlans" : 101,
        "nscannedAllPlans" : 9845036,
        "scanAndOrder" : false,
        "indexOnly" : true,
        "nYields" : 27,
        "nChunkSkips" : 0,
        "millis" : 32039,
        "indexBounds" : {
                "c" : [ [ "customer_1", "customer_1" ] ],
                "a" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
                "y" : [ [ 2013, 2013 ] ],
                "m" : [ [ 11, 11 ] ],
                "d" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
                "h" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
        }
}

questions

Given the high frequency of inserts, and our need to perform ranged aggregation queries over time. Is the time bucket good practice considering the application can insert 30MM documents in a single hour period?
We were of the understanding that MongoDB can query billions of documents in seconds:
- Surely our aggregation query over 9.5MM documents could return in 1sec or so?
- Are we using the right technique to achieve this and if not where should we be focusing our efforts to getting report results almost instantly?
- Is it possible without sharding at this stage?
Would MapReduce (parallel) be a better alternative?

10M docs processed in 60 seconds is 6 microseconds per doc. That feels about right. Sounds like some background MapReduce jobs are in order to pre-aggregate as much as possible. — JohnnyHK
this is probably a more complex question than can be answered here, but there are several indicators in the explain - 32 seconds was spent scanning the index. that means the other 31 were spent doing the rest of the aggregation. I suspect that there are more efficient ways to store the document and index (reducing their size) - is there a reason "a" is where it is in the index (rather than at the end?) — Asya Kamsky
@JohnnyHK I tend to agree with you that background pre-aggregation is necessary, though that said I still believe there is room for improvement in the speed of this query. — Ashley Brener
@AsyaKamsky by more efficient ways of storage are you referring to partitioning the collection, or the actual size of the document which really couldn't be made much smaller other than removing the time bucket. In relation to this, "a"'s position in the index is legacy since prior to the time bucket's existence, the index was { c: 1, a: 1, t: 1 } and the aggregation query was ranged ($gte, $lt). Our understanding is that in this case, the ranged field should be at the end of the index. Is the use of a time bucket generally a more performant model than querying on time fields? — Ashley Brener
BTW index scan time drops to 25sec with "a" positioned at the end of the index, total query time 52sec. — Ashley Brener

Asya Kamsky Asya Kamsky · Accepted Answer · 2014-01-30T06:07:09

You wonder why your aggregation is taking so "long". Aside from the points made by @Avish in his answer (you are doing some unnecessary steps) you also have to consider your physical resources and where the time is going.

Here is part of your "explain":

    "n" : 9844935,
    "nscannedObjects" : 0,
    "nscanned" : 9844935,
    "scanAndOrder" : false,
    "indexOnly" : true,
    "nYields" : 27,
    "millis" : 32039,

Things of note are the fact that the aggregation took 32 seconds (not 50), it never had to fetch a single document as it got all of the information from the index itself. It didn't have to do any in-memory sorts. But it did have to yield 27 times... Why is that? There are two reasons that reader processes yield - one is when there is a write waiting (writers take precedence and long running reads have to yield to them) or there was a page fault - all operations must yield when any data they are trying to access is not in RAM (this is to prevent a process from blocking others from doing work while they are waiting for their data to be loaded into RAM).

Questions that come to mind are: was the DB cold? Do the indexes fit in RAM? Were there writes happening at the same time that the reads had to contend with?

I would check that the index can fit in RAM, run the "touch" command to make sure it is in RAM, simplify my aggregation pipeline not to do unnecessary work and then run it again, a couple of times in a row and see how the timings look then.