MongoDB 2.6 aggregation updates the $out collection

Question

I'm currently using MongoDB 2.6 through MongoHQ. I have several mapReduce jobs that crunch raw data from a collection (c1) to produce a new collection (c2). I also have an aggregation pipeline that parses (c2) to generate a new collection (c3) with the great $out operator.

However, I need to add extra fields to (c3) outside of the aggregation pipeline and keep them even after a new run of the aggregation. It seems that the aggregation, matching on the _id key, just overwrites the content instead of updating it. So if I have previously added an extra field like foo : 'bar' to (c3) and I re-run the aggregation, I lose the foo field.

Based on documentation (http://docs.mongodb.org/manual/reference/operator/aggregation/out/#pipe._S_out)

Replace Existing Collection

If the collection specified by the $out operation already exists, then upon completion of the aggregation, the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection.

Is there a better way, or a tricky one :-), to update the $out collection instead of overwriting records with the same _id? I could write a Python or JavaScript script to do the job, but I would like to avoid making many database calls and do it in a smarter way, like aggregation. Maybe it is not possible, in which case I will look for a different and more 'classical' path.

Thanks for your help

Neil Lunn · Accepted Answer · 2014-05-02T12:00:36

Well, not directly with the $out operator: much like mapReduce output, it is pretty much an "overwrite" operation (though mapReduce does have "merge" and "reduce" modes as well).
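For reference, the mapReduce output modes mentioned above are selected through the "out" option. This is only a sketch of the option shapes; the collection name "c2" is taken from the question, and mapFn and reduceFn in the comment are hypothetical placeholders for your own functions:

```javascript
// "replace" (the default for a named output) swaps the whole collection
var replaceOut = { "out": { "replace": "c2" } };

// "merge" keeps existing documents, but overwrites any with the same key
var mergeOut = { "out": { "merge": "c2" } };

// "reduce" runs the reduce function over the new and the existing result
var reduceOut = { "out": { "reduce": "c2" } };

// In the mongo shell one of these would be passed as the third argument:
//   db.c1.mapReduce(mapFn, reduceFn, reduceOut);
```

Note that "merge" still replaces a whole document when the key matches, so on its own it would not keep an extra field either.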

But since you are on MongoDB 2.6, aggregate() does actually return a "cursor". The "client/server" interaction may not be as optimal as you want, but you also have "bulk update" operations, so you can do something along the lines of:

var cursor = db.collection.aggregate([
    // pipeline here
]);

var batch = [];

while ( cursor.hasNext() ) {
    var doc = cursor.next();

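    var updoc = {
        "q": { "_id": doc._id },
        "u": {
            // update operators for the new fields from the pipeline go here,
            // except for
            "$setOnInsert": {
                // the fields you expect to add from before
            }
        },
        // "upsert" sits beside "q" and "u", not inside the update document
        "upsert": true
    };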

    batch.push(updoc);

    // try to do sensible under 16MB updates, number may vary
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({
            "update": "newcollection",
            "updates": batch
        });
        batch = [];    // reset the content
    }

}

// flush whatever is left; an empty "updates" array would be rejected
if ( batch.length > 0 ) {
    db.runCommand({
        "update": "newcollection",
        "updates": batch
    });
}
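To make the shape of the update statements concrete, here is a minimal sketch in plain JavaScript, assuming every field the pipeline emits should simply be "$set" on the target document. The helper names toUpsert and toBatches and the sample documents are hypothetical, not part of any MongoDB API:

```javascript
// Hypothetical helper: turn one aggregation result into an upsert statement
// for the "update" command. Only the pipeline's own fields are "$set", so a
// field added elsewhere (for example "foo") is never mentioned and survives.
// Note: a result holding nothing but _id would give an empty "$set", which
// the server rejects, so such documents need separate handling.
function toUpsert(doc) {
    var fields = {};
    Object.keys(doc).forEach(function (key) {
        if (key !== "_id") {
            fields[key] = doc[key];
        }
    });
    return {
        "q": { "_id": doc._id },
        "u": { "$set": fields },
        "upsert": true
    };
}

// Hypothetical helper: split the statements into batches of at most `size`.
function toBatches(statements, size) {
    var batches = [];
    for (var i = 0; i < statements.length; i += size) {
        batches.push(statements.slice(i, i + size));
    }
    return batches;
}

// Sample data standing in for what the aggregation cursor would return
var results = [
    { "_id": 1, "total": 10 },
    { "_id": 2, "total": 7 },
    { "_id": 3, "total": 4 }
];
var batches = toBatches(results.map(toUpsert), 2);
```

Each entry of batches could then be passed as the "updates" array of one update command. Since toBatches never produces an empty batch, no command is sent with nothing to do.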

And of course, though there will be many naysayers, and not without reason because you really need to weigh up the consequences (which are very real), you can always wrap what is essentially a JavaScript call with db.eval() in order to get full server-side execution.

But where possible (that is, unless you have a completely remote database solution), it is generally advised to take the "client/server" option and keep the process as "close" (in networking terms) to the server as possible.
