2
votes

I have MongoDB cluster with 4 shards. My shard key is: { client_id: 1, date: 1 } Used collection have about 50 M of documents.

I will paste 1 document as sample data:


    {
        "_id" : ObjectId("54e069353e9104db470065e6"),
        "campaign_name" : "SC - Biker Planet",
        "adgroup_name" : "motociclista cerca",
        "client_id" : NumberLong(143),
        "adgroup_id" : NumberLong(28469),
        "campaign_id" : NumberLong(849),
        "device" : "desktop",
        "clicks" : NumberLong(0),
        "conv" : NumberLong(0),
        "cost" : NumberLong(0),
        "impressions" : NumberLong(1),
        "date" : ISODate("2014-02-22T05:00:00.000Z")
    }

Now, i have installed MongoDB 3.0 RC9 (with great wiredTiger storage engine), and i am comparing following 2 queries:

A)


    db.google_raw_id.aggregate([{
        $match: {
            client_id: 143,
            campaign_name: 'SC - Dating For Seniors',
            date: {
                $gte: ISODate("2014-01-10T00:00:00.0Z"),
                $lte: ISODate("2015-01-10T00:00:00.0Z")
            }
        }
    }, {
        $group: {
            _id: "$campaign_name",
            cost: {$sum: "$cost"},
            clicks: {$sum: "$clicks"},
            impressions: {$sum: "$impressions"}
        }
    }])

and

B)


    db.google_raw_id.aggregate([{
        $match: {
            client_id: 143,
            date: {
                $gte: ISODate("2014-01-10T00:00:00.0Z"),
                $lte: ISODate("2015-01-10T00:00:00.0Z")
            }
        }
    }, {
        $group: {
            _id: "$campaign_name",
            cost: {$sum: "$cost"},
            clicks: {$sum: "$clicks"},
            impressions: {$sum: "$impressions"}
        }
    }, {
        $match: {
            _id: 'SC - Dating For Seniors',
        }
    }])

Query A take about 0.35 sec to be executed, and query B take about 1.1 sec, and they both return same result. As you can see only diff between them is that i moved "campaign_name" filter from $match before $group to $match after $group.

Is this ok approach to use, since shard key { client_id, date } only work fast if there is no other filters in first $match, or i am doing something wrong in my configuration?

UPDATE

Here i have explain for query A and B with MongoDB aggregation used http://pastebin.com/8fW7G6ty

Here i have query A and B simplified to use .find() http://pastebin.com/xa7UYBuh

I also have index on campaign_name, but it seems that aggregation query is slower if campaign_name is used in first match, because that's not part of shard key, and that's why Mongo have to check in more than 1 shard.

To me that's logical explanation why query B is faster. In query B, Mongo reduce result to smaller data set, and than apply campaign_name on that, so it's faster and does not go over other shards. Just i was expecting that MongoDB can figure out that automatically :)

1
That's surprising. Just looking at it, B is naively more computation (naively because I don't know much about the actual documents in the collection, just their shape). The two pipelines should always produce the same results, so if the latter is better, then use it, for now. If you want to examine why B is faster, post an explain for both queries. You'll need to take the two initial $match stages and turn them into find queries and run explain.wdberkeley
I have updated my post with 2 explained queries links. B should query just 1 shard i think, because shard key is { client_id, date } but when campaign_name is used, that will query other shards too. At least that sound logical to me. So i think that best approach would be in first $match to have just shard key fields, and later to apply another $match.Nenad Milosavljević
It's a lot of information, but I don't see anything to explain the difference in time. Including campaign_name won't make it query any more shards than it would have already - it can still target the query using client_id and date.wdberkeley
It's a lot of information, but I don't see anything to explain the difference in time. Including campaign_name won't make it query any more shards than it would hve already - it can still target the query using client_id and `date.wdberkeley

1 Answers

0
votes

A quicker because after first 'match' you have already grouped records (all results after 'match' have the same 'campaign_name'). In 'group' just need to calculate them.

B slower because after first match you have more results and in 'group' need to work with more data. In other words script will calculte results for all found campaigns.