I have MongoDB cluster with 4 shards. My shard key is: { client_id: 1, date: 1 } Used collection have about 50 M of documents.
I will paste 1 document as sample data:
{ "_id" : ObjectId("54e069353e9104db470065e6"), "campaign_name" : "SC - Biker Planet", "adgroup_name" : "motociclista cerca", "client_id" : NumberLong(143), "adgroup_id" : NumberLong(28469), "campaign_id" : NumberLong(849), "device" : "desktop", "clicks" : NumberLong(0), "conv" : NumberLong(0), "cost" : NumberLong(0), "impressions" : NumberLong(1), "date" : ISODate("2014-02-22T05:00:00.000Z") }
Now, i have installed MongoDB 3.0 RC9 (with great wiredTiger storage engine), and i am comparing following 2 queries:
A)
db.google_raw_id.aggregate([{ $match: { client_id: 143, campaign_name: 'SC - Dating For Seniors', date: { $gte: ISODate("2014-01-10T00:00:00.0Z"), $lte: ISODate("2015-01-10T00:00:00.0Z") } } }, { $group: { _id: "$campaign_name", cost: {$sum: "$cost"}, clicks: {$sum: "$clicks"}, impressions: {$sum: "$impressions"} } }])
and
B)
db.google_raw_id.aggregate([{ $match: { client_id: 143, date: { $gte: ISODate("2014-01-10T00:00:00.0Z"), $lte: ISODate("2015-01-10T00:00:00.0Z") } } }, { $group: { _id: "$campaign_name", cost: {$sum: "$cost"}, clicks: {$sum: "$clicks"}, impressions: {$sum: "$impressions"} } }, { $match: { _id: 'SC - Dating For Seniors', } }])
Query A take about 0.35 sec to be executed, and query B take about 1.1 sec, and they both return same result. As you can see only diff between them is that i moved "campaign_name" filter from $match before $group to $match after $group.
Is this ok approach to use, since shard key { client_id, date } only work fast if there is no other filters in first $match, or i am doing something wrong in my configuration?
UPDATE
Here i have explain for query A and B with MongoDB aggregation used http://pastebin.com/8fW7G6ty
Here i have query A and B simplified to use .find() http://pastebin.com/xa7UYBuh
I also have index on campaign_name, but it seems that aggregation query is slower if campaign_name is used in first match, because that's not part of shard key, and that's why Mongo have to check in more than 1 shard.
To me that's logical explanation why query B is faster. In query B, Mongo reduce result to smaller data set, and than apply campaign_name on that, so it's faster and does not go over other shards. Just i was expecting that MongoDB can figure out that automatically :)
$match
stages and turn them into find queries and run explain. – wdberkeleycampaign_name
won't make it query any more shards than it would have already - it can still target the query usingclient_id
anddate
. – wdberkeleycampaign_name
won't make it query any more shards than it would hve already - it can still target the query usingclient_id
and `date. – wdberkeley