background
Our system is carrier grade and extremely robust, it has been load tested to handle 5000 transactions per second, and for each transaction a document is inserted into a single MongoDB collection (no updates or queries in this application, it is write-only). That amounts to ~700MM documents per day which is our benchmark.
The MongoDB deployment is not yet sharded, we have 1x replicaset with 1 master and 2 slaves all of which are type m2.2xlarge instances on ec2. Each instance is backed by a 1TB RAID0 stripe consisting of 8 volumes (no PIOPS). We are using the node-mongodb-native driver with c++ native BSON parser for optimal write performance and have attempted to model the document structure accordingly.
note
- Documents are tiny (120 bytes)
- The document includes a “time bucket” (h[our], d[ay], m[onth], y[ear]) along with the “t[ime]” field
- We have an index on the collection to query by “c[ustomer]” and “a” which is a highly random but non-unique tag
- We have looked into partitioning data into separate collections, though in this example all data is hot.
- We are also looking into pre-aggregation though this cannot be done in realtime.
requirement
- For reporting we need to calculate the amount of unique “a” tags per month, along with their totals by customer over any given period
- A report takes about 60sec to run over a sample (full collection) of 9.5MM documents stored over 2 hours. Details below:
document
{
_id: ObjectID(),
a: ‘string’,
b: ‘string’,
c: ‘string’ or <int>,
g: ‘string’ or <not_exist>,
t: ISODate(),
h: <int>,
d: <int>,
m: <int>,
y: <int>
}
index
col.ensureIndex({ c: 1, a: 1, y: 1, m: 1, d: 1, h: 1 });
aggregation query
col.aggregate([
{ $match: { c: 'customer_1', y: 2013, m: 11 } },
{ $group: { _id: { c: '$c', y: '$y', m: '$m' }, a: { $addToSet: '$a' }, t: { $sum: 1 } } },
{ $unwind: '$a' },
{ $group: { _id: { c: '$_id.c', y: '$_id.y', m: '$_id.m', t: '$t' }, a: { $sum: 1 } } },
{ $sort: { '_id.m': 1 } },
{
$project: {
_id: 0,
c: '$_id.c',
y: '$_id.y',
m: '$_id.m',
a: 1,
t: '$_id.t'
}
},
{ $group: { _id: { c: '$c', y: '$y' }, monthly: { $push: { m: '$m', a: '$a', t: '$t' } } } },
{ $sort: { '_id.y': 1 } },
{
$project: {
_id: 0,
c: '$_id.c',
y: '$_id.y',
monthly: 1
}
},
{ $group: { _id: { c: '$c' }, yearly: { $push: { y: '$y', monthly: '$monthly' } } } },
{ $sort: { '_id.c': 1 } },
{
$project: {
_id: 0,
c: '$_id.c',
yearly: 1
}
}
]);
aggregation result
[
{
"yearly": [
{
"y": 2013,
"monthly": [
{
"m": 11,
"a": 3465652,
"t": 9844935
}
]
}
],
"c": "customer_1"
}
]
63181ms
aggregation explain
{
"cursor" : "BtreeCursor c_1_a_1_y_1_m_1_d_1_h_1",
"isMultiKey" : false,
"n" : 9844935,
"nscannedObjects" : 0,
"nscanned" : 9844935,
"nscannedObjectsAllPlans" : 101,
"nscannedAllPlans" : 9845036,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 27,
"nChunkSkips" : 0,
"millis" : 32039,
"indexBounds" : {
"c" : [ [ "customer_1", "customer_1" ] ],
"a" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
"y" : [ [ 2013, 2013 ] ],
"m" : [ [ 11, 11 ] ],
"d" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
"h" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
}
}
questions
Given the high frequency of inserts, and our need to perform ranged aggregation queries over time. Is the time bucket good practice considering the application can insert 30MM documents in a single hour period?
We were of the understanding that MongoDB can query billions of documents in seconds:
- Surely our aggregation query over 9.5MM documents could return in 1sec or so?
- Are we using the right technique to achieve this and if not where should we be focusing our efforts to getting report results almost instantly?
- Is it possible without sharding at this stage?
Would MapReduce (parallel) be a better alternative?