0
votes

MongoDB aggregation documentation on $out says:

"Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline. The $out operator lets the aggregation framework return result sets of any size."

https://docs.mongodb.org/manual/reference/operator/aggregation/out/

So one issue may be that the aggregation runs out of memory or uses a lot of it. But how does $out help here? If the aggregation returns a lot of buckets, they still have to be held in memory first.

1
If your issue is actually something like an error saying "BSON size limit exceeded", then $out will not help here. As the documentation says, it just outputs to a collection. This feature was introduced in MongoDB 2.6, but then again so were cursors. Both allow a total output greater than 16MB (which was the limit in previous versions), but the per-document limit still applies. - Blakes Seven

1 Answer

3
votes

The $out operator is useful when you have a use case whose result takes a long time to compute but doesn't need to be fully up to date all the time.

Let's say you have a website where you want a list of the top ten currently most popular articles on the frontpage (most hits in the past 60 minutes). To create this statistic, you need to parse your access log collections with a pipeline like this:

  • $match the last hour
  • $group by article-id and user to filter out reloads
  • $group again by article-id to get the hit count for each article (one hit per distinct user)
  • $sort by count, descending
  • $limit to 10 results
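
As a rough sketch, the stages above could be written as a pipeline like the one below. The field names `timestamp`, `article_id` and `user_id` are assumptions made for illustration; the real names depend on how your access log documents are structured.

```python
from datetime import datetime, timedelta, timezone


def build_top_articles_pipeline(now, window_minutes=60, top_n=10):
    """Build the five-stage pipeline described above as plain Python data.

    Assumed (hypothetical) document shape: each access log entry has
    `timestamp`, `article_id` and `user_id` fields.
    """
    since = now - timedelta(minutes=window_minutes)
    return [
        # Keep only hits from the last hour.
        {"$match": {"timestamp": {"$gte": since}}},
        # One document per (article, user) pair, so reloads collapse.
        {"$group": {"_id": {"article": "$article_id", "user": "$user_id"}}},
        # Count the distinct users per article.
        {"$group": {"_id": "$_id.article", "hits": {"$sum": 1}}},
        # Most popular first.
        {"$sort": {"hits": -1}},
        {"$limit": top_n},
    ]


pipeline = build_top_articles_pipeline(datetime.now(timezone.utc))
```

The resulting list can be passed to a driver's aggregate call, for example `collection.aggregate(pipeline)` in PyMongo.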

When you have a very popular website with a lot of content, this can be a fairly expensive aggregation. And when the list is on the frontpage, it has to run for every single frontpage hit, which can put a nasty load on your database and considerably slow down page loads.

How do we solve this problem?

Instead of performing that aggregation on every page hit, we perform it once every minute with a cronjob which uses $out to put the aggregated top ten list into a new collection. You can then query the cached results in that collection directly. Getting all 10 results from a 10-document collection will be far faster than performing that aggregation all the time.
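
A minimal sketch of that caching step, assuming PyMongo-style objects: `access_log` is expected to be a collection with an `aggregate()` method and `db` a database that supports `db[name].find()`. The collection name `top_articles`, the `hits` field and the helper names are hypothetical and chosen only for this example.

```python
def with_out_stage(pipeline, target_collection):
    """Return a copy of the pipeline with $out appended as the last stage."""
    if any("$out" in stage for stage in pipeline):
        raise ValueError("pipeline already contains a $out stage")
    return list(pipeline) + [{"$out": target_collection}]


def refresh_top_articles(access_log, pipeline, target_collection="top_articles"):
    """Run the expensive aggregation once; intended to be called from a cronjob.

    The results are written to `target_collection` instead of being
    returned to the caller.
    """
    access_log.aggregate(with_out_stage(pipeline, target_collection))


def read_top_articles(db, target_collection="top_articles"):
    """Cheap read for the frontpage: fetch the cached documents."""
    return list(db[target_collection].find().sort("hits", -1))
```

The cronjob would call `refresh_top_articles` once a minute, while each page hit only calls `read_top_articles`. If the target collection already exists, $out replaces it once the aggregation has completed, so readers see either the old list or the new one.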