1
votes

I'm using mongodb aggregate to sample documents from a large collection.

https://docs.mongodb.com/manual/reference/operator/aggregation/sample/

After making several consecutive calls, I see mongod's memory usage climbing, and after around the 12th call it crashes with an OutOfMemory error.

How can I tell Mongodb to free up the memory after it has finished processing a query?

3
check out the fields index, or read the mongo log for COLLSCAN information – Diego Otero

3 Answers

0
votes

This happens because of how the $sample operator works. As mentioned in the documentation,

In order to get N random documents:

  • If N is greater than or equal to 5% of the total documents in the collection, $sample performs a collection scan, performs a sort, and then selects the top N documents. As such, the $sample stage is subject to the sort memory restrictions.

  • If N is less than 5% of the total documents in the collection: with the WiredTiger storage engine, $sample uses a pseudo-random cursor over the collection to sample N documents; with the MMAPv1 storage engine, $sample uses the _id index to randomly select N documents.
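The 5% threshold described above can be sketched in plain Python (the function name and return strings are illustrative, not part of MongoDB's API):

```python
def sample_strategy(n, collection_size):
    """Return which strategy $sample uses, per the documentation quoted above."""
    if n >= 0.05 * collection_size:
        # Collection scan + top-N sort: subject to sort memory restrictions.
        return "collection scan + sort"
    # Small samples avoid the sort entirely.
    return "pseudo-random cursor (WiredTiger) / _id index (MMAPv1)"
```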

So I suspect the number of random documents you are requesting is greater than 5% of your collection. What you need is to set allowDiskUse to True:

collection.aggregate(pipeline, allowDiskUse=True)
0
votes

You should set the allowDiskUse option to true. For example:

db.books.aggregate(
    [
        { $group : { _id : "$author", books: { $push: "$title" } } }
    ],
    { allowDiskUse: true }
)

Each pipeline stage has a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB produces an error. To handle large datasets, use the allowDiskUse option to let aggregation pipeline stages write data to temporary files.

You can read more about this here.

-1
votes

It turns out the issue was the WiredTiger storage engine cache. I'm running on an EC2 instance, where the default cache size led to the OOM error. I was able to solve it by assigning a smaller cache size, like this:

mongod --dbpath /a/path/to/db --logpath /a/path/to/log --storageEngine wiredTiger --wiredTigerEngineConfigString="cache_size=200M" --fork
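The same cache limit can be set in a mongod.conf file instead of on the command line. Note that the documented YAML setting is storage.wiredTiger.engineConfig.cacheSizeGB, whose minimum value is 0.25 GB, so this fragment approximates (rather than exactly matches) the 200M used above; paths are placeholders:

```yaml
storage:
  dbPath: /a/path/to/db
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 0.25   # minimum allowed value; ~256 MB
systemLog:
  destination: file
  path: /a/path/to/log
processManagement:
  fork: true
```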