0
votes

I am in the process of moving from mysql to mongodb. Started learning mongodb yesterday.

I have a big mysql table (over 4 million rows, with over 300 fields each) which I am moving to mongodb.

Let's assume, the products table have the following fields -

_id, category, and 300+ other fields.

To find the top 5 categories in the products along with their count, I have the following mysql query

Select category, count(_id) as N from products group by category order by N DESC limit 5;

I have an index on category field and this query takes around 4.4 sec in mysql.

Now, I have successfully moved this table to mongodb and this is my corresponding query for finding top 5 categories with their counts.

db.products.aggregate([{$group : {_id:"$category", N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}]);

I again have an index on category but the query doesn't seem to be using it (explain : true says so) and it is also taking around 13.5 sec for this query.

Having read more about mongodb aggregation pipeline optimization, I found out that we need to use sort prior to aggregation for index to work but I am sorting on the derived field from aggregation so can't bring it before the aggregate function.

How do I optimize queries like these in mongodb?

========================================================================= Output of explain

db.products.aggregate([{$group : {_id:"$category",N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}], { explain: true });
{
        "waitedMS" : NumberLong(0),
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                            },
                            "fields" : {
                                    "category" : 1,
                                    "_id" : 0
                            },
                            "queryPlanner" : {
                                    "plannerVersion" : 1,
                                    "namespace" : "mydb.products",
                                    "indexFilterSet" : false,
                                    "parsedQuery" : {
                                            "$and" : [ ]
                                    },
                                    "winningPlan" : {
                                            "stage" : "COLLSCAN",
                                            "filter" : {
                                                    "$and" : [ ]
                                            },
                                            "direction" : "forward"
                                    },
                                    "rejectedPlans" : [ ]
                            }
                    }
            },
            {
                    "$group" : {
                            "_id" : "$category",
                            "N" : {
                                    "$sum" : {
                                            "$const" : 1
                                    }
                            }
                    }
            },
            {
                    "$sort" : {
                            "sortKey" : {
                                    "N" : -1
                            },
                            "limit" : NumberLong(5)
                    }
            }
    ],
    "ok" : 1
}
1
Can you post the output of the explain query please - Alex
Added output of explain query. - Aakash

1 Answers

0
votes

There are currently some limitations in what aggregation framework can do to improve the performance in our use case, however, you should be able to speed up the query by sorting on category first. This will force the query to use the index you have added and should speed up the group query in the second part of your pipeline:

 db.products.aggregate([ 
    { "$sort" : { "category" : 1 },
    {$group : {_id:"$category",N:{$sum:1}}},
    {$sort:{N: -1}},{$limit:5}]);