0
votes

I've got a question on the design of documents in order to be able to efficiently perform aggregation. I will take a dummy example of document :

{
   product: "Name of the product",
   description: "A new product",
   comments: [ObjectId(xxxxx), ObjectId(yyyy),....]
}

As you could see, I have a simple document which describes a product and wraps some comments on it. Imagine this product is very popular so that it contains millions of comments. A comment is a simple document with a date, a text and eventually some other features. The probleme is that such a product can easily be larger than 16MB so I need not to embed comments in the product but in a separate collection.

What I would like to do now, is to perform aggregation on the product collection, a first step could be for example to select various products and sort the comments by date. It is a quite easy operation with embedded documents, but how could I do with such a design ? I only have the ObjectId of the comments and not their content. Of course, I'd like to perform this aggregation in a single operation, i.e. I don't want to have to perform the first part of the aggregation, then query the results and perform another aggregation.

I dont' know if that's clear enough ? ^^

1
Can you please provide some realistic sample data and output you expect? - Vishwas
I'm afraid no. It is more a conceptual question, but I think the given sample document is a good example to start with. The question is how to deal with referenced documents when performing aggregation with MongoDB. The answer is generally to embed referenced documents, but what if it is not possible due to size limitations ? - GuillaumeA
For your example, you would just select on the product field in the comment collection and then sort - it doesn't even require an aggregation. If you want to get the product documents and the comment documents, you need to do an application-level join. I would search for comments and then resolve the product reference rather than the other way around. There's no way to get info from more than one collection in one operation. - wdberkeley

1 Answers

0
votes

I would go about it this way: create a temp collection that is the exact copy of the product collection with the only exception being the change in the schema on the comments array, which would be modified to include a comment object instead of the object id. The comment object will only have the _id and the date field. The above can be done in one step:

var comments = [];
db.product.find().forEach( function (doc){
    doc.comments.forEach( function(x) {
        var obj = {"_id": x };
        var comment = db.comment.findOne(obj);
        obj["date"] = comment.date;
        comments.push(obj);
    });
    doc.comments = comments;
    db.temp.insert(doc);
});

You can then run your aggregation query against the temp collection:

db.temp.aggregate([
    {
        $match: {
            // your match query
        }
    },
    {
        $unwind: "$comments"
    },    
    {
        $sort: { "comments.date": 1 } // sort the pipeline by comments date
    }
]);