I have MongoDB cluster with 4 shards. My shard key is: { client_id: 1, date: 1 } Used collection have about 50 M of documents.
I will paste 1 document as sample data:
{
"_id" : ObjectId("54e069353e9104db470065e6"),
"campaign_name" : "SC - Biker Planet",
"adgroup_name" : "motociclista cerca",
"client_id" : NumberLong(143),
"adgroup_id" : NumberLong(28469),
"campaign_id" : NumberLong(849),
"device" : "desktop",
"clicks" : NumberLong(0),
"conv" : NumberLong(0),
"cost" : NumberLong(0),
"impressions" : NumberLong(1),
"date" : ISODate("2014-02-22T05:00:00.000Z")
}
Now, i have installed MongoDB 3.0 RC9 (with great wiredTiger storage engine), and i am comparing following 2 queries:
A)
db.google_raw_id.aggregate([{
$match: {
client_id: 143,
campaign_name: 'SC - Dating For Seniors',
date: {
$gte: ISODate("2014-01-10T00:00:00.0Z"),
$lte: ISODate("2015-01-10T00:00:00.0Z")
}
}
}, {
$group: {
_id: "$campaign_name",
cost: {$sum: "$cost"},
clicks: {$sum: "$clicks"},
impressions: {$sum: "$impressions"}
}
}])
and
B)
db.google_raw_id.aggregate([{
$match: {
client_id: 143,
date: {
$gte: ISODate("2014-01-10T00:00:00.0Z"),
$lte: ISODate("2015-01-10T00:00:00.0Z")
}
}
}, {
$group: {
_id: "$campaign_name",
cost: {$sum: "$cost"},
clicks: {$sum: "$clicks"},
impressions: {$sum: "$impressions"}
}
}, {
$match: {
_id: 'SC - Dating For Seniors',
}
}])
Query A take about 0.35 sec to be executed, and query B take about 1.1 sec, and they both return same result. As you can see only diff between them is that i moved "campaign_name" filter from $match before $group to $match after $group.
Is this ok approach to use, since shard key { client_id, date } only work fast if there is no other filters in first $match, or i am doing something wrong in my configuration?
UPDATE
Here i have explain for query A and B with MongoDB aggregation used http://pastebin.com/8fW7G6ty
Here i have query A and B simplified to use .find() http://pastebin.com/xa7UYBuh
I also have index on campaign_name, but it seems that aggregation query is slower if campaign_name is used in first match, because that's not part of shard key, and that's why Mongo have to check in more than 1 shard.
To me that's logical explanation why query B is faster. In query B, Mongo reduce result to smaller data set, and than apply campaign_name on that, so it's faster and does not go over other shards. Just i was expecting that MongoDB can figure out that automatically :)
$matchstages and turn them into find queries and run explain. - wdberkeleycampaign_namewon't make it query any more shards than it would have already - it can still target the query usingclient_idanddate. - wdberkeleycampaign_namewon't make it query any more shards than it would hve already - it can still target the query usingclient_idand `date. - wdberkeley