Aggregation pipeline slow with large collection

Question

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or get averages from). I'm currently running against some performance issues and I'm hoping to gain some advice on how I could optimize/scale MongoDB or suggestions on alternative solutions. I'm running the latest stable MongoDB version using WiredTiger. The documents basically look like the following:

{
  "dimensions": {
    "account_id": ObjectId("590889944befcf34204dbef2"),
    "url": "https://test.com",
    "date": ISODate("2018-03-04T23:00:00.000+0000")
  },
  "metrics": {
    "cost": 155,
    "likes": 200
  }
}

I have three indexes on this collection, as there are various aggregations being ran on this collection:

account_id
date
account_id and date

The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:

db.large_collection.aggregate(

    [
        {
            $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
        },

        {
            $match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] }}
        },

        {
            $group: { 
              cost: { $sum: "$metrics.cost" }, 
              likes: { $sum: "$metrics.likes" }, 
              _id: { 
                year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }, 
                week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } } 
              } 
            }
        },

        { 
            $project: {
                cost: 1, 
                likes: 1 
            }
        }
    ],

    {
        cursor: {
            batchSize: 50
        },
        allowDiskUse: true
    }

);

This query takes about 25-30 seconds to complete and I'm looking to reduce this to at least 5-10 seconds. It's currently a single MongoDB node, no shards or anything. The explain query can be found here: https://pastebin.com/raw/fNnPrZh0 and executionStats here: https://pastebin.com/raw/WA7BNpgA As you can see, MongoDB is using indexes but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.

Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MonogDB the right tool here?

Does the full index fit in memory? docs.mongodb.com/manual/tutorial/ensure-indexes-fit-ram — Kevin Smith
@KevinSmith Yes it does. Total index size is 2.1GiB on a VPS with 8GB. Not much running apart from MongoDB. Perhaps something to note is that if I remove the $group and $project stages, the query returns within 200ms. — mtricht
Interesting, Does including cost and likes on the index help? that way it wouldn't need to go back and look up the document for the group — Kevin Smith
@KevinSmith Didn't help sadly; I added the execution stats to my question if that helps anything. — mtricht

Xavier Guihot Xavier Guihot · Accepted Answer · 2018-03-23T19:11:03

The following could improve performances if and only if precomputing dimensions within each record is an option.

If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.

This hasn't been benchmarked.

One of the costly parts of this query probably comes from working with dates.

First during the $group stage while computing for each matching record the year and the iso week associated to a specific time zone.
Then, to a lesser extent, during the initial filtering, when keeping dates from the 3 last months.

The idea would be to store in each record the year and the isoweek, for the given example this would be { "year" : 2018, "week" : 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent 1M3 complex date operations).

In a similar fashion, we could also store in each record the associated month, which would be { "month" : "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5] before applying a more precise and costlier filtering on the exact timestamps. This would spare the initial costlier Date filtering on 200M records to a simple Int filtering.

Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):

db.large_collection.aggregate([
  { $addFields: {
    "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
    "prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
  }},
  { "$out": "large_collection_precomputed" }
])

which will store these documents:

{
  "dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
  "metrics" : { "cost" : 155, "likes" : 200 },
  "prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}

And let's query:

db.large_collection_precomputed.aggregate([
  // Initial gross filtering of dates (months) (on 200M documents):
  { $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
  { $match: {
    "dimensions.account_id": { $in: [
      ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
    ]}
  }},
  // Exact filtering of dates (costlier, but only on ~1M5 documents).
  { $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
  { $group: {
    // The _id is now extremly fast to retrieve:
    _id: { year: "$prec.year", "week": "$prec.week" },
    cost: { $sum: "$metrics.cost" },
    likes: { $sum: "$metrics.likes" }
  }},
  ...
])

In this case we would use indexes on account_id and month.

Note: Here, months are stored as String ("201803") since I'm not sure how to cast them to Int within an aggregation query. But best would be to store them as Int when records are inserted

As a side effect, this obviously will make the storage disk/ram of the collection heavier.

Aggregation pipeline slow with large collection

1 Answers