3 votes

When I perform a mapReduce operation over a MongoDB collection with a small number of documents, everything works fine.

But when I run it against a collection with about 140,000 documents, I get some strange results:

Map function:

function() { emit(this.featureType, this._id); }

Reduce function:

function(key, values) { return { count: values.length, ids: values }; }

As a result, I would expect something like this (for each mapped key):

{
    "_id": "FEATURE_TYPE_A",
    "value": {
        "count": 140000,
        "ids": [
            "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
            "db364b3f-045f-4cb8-a52e-2267df40066c",
            "d2152826-6777-4cc0-b701-3028a5ea4395",
            "7ba366ae-264a-412e-b653-ce2fb7c10b52",
            "513e37b8-94d4-4eb9-b414-6e45f6e39bb5",
            ...
        ]
    }
}

But instead I get this strange document structure:

{
    "_id": "FEATURE_TYPE_A",
    "value": {
        "count": 706,
        "ids": [
            {
                "count": 101,
                "ids": [
                    {
                        "count": 100,
                        "ids": [
                            "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
                            "db364b3f-045f-4cb8-a52e-2267df40066c",
                            "d2152826-6777-4cc0-b701-3028a5ea4395",
                            "7ba366ae-264a-412e-b653-ce2fb7c10b52",
                            "513e37b8-94d4-4eb9-b414-6e45f6e39bb5",
                            ...
                        ]
                    },
                    ...
                ]
            },
            ...
        ]
    }
}

Could someone explain to me whether this is the expected behavior, or am I doing something wrong?

Thanks in advance!

The number of documents would seem to be your problem. 140,000 is a lot to dump into what is seemingly only a few (or, by your example, one) arrays. Why the need to do this? Interestingly, it does work with aggregate. - Neil Lunn
I'm saving the output to a new collection, and the size of the resulting document is not bigger than 16MB, so as far as I understand, the system should be able to manage it correctly. - user1275011
There is a reason for this, which is included in the documentation. The answer below explains this and how to correct it with various methods. - Neil Lunn

1 Answer

7 votes

The case here is unusual, and I'm not sure it is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.

  • MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

What that basically says is that your current operation expects the "reduce" function to be called only once, but this is not the case. The input will in fact be "broken up" and passed in as chunks of manageable size. This multiple calling of "reduce" makes another point very important:

Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:

  • the type of the return object must be identical to the type of the value emitted by the map function.

Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result: the "mapper" must emit values in the same form as the "reducer" returns them, and the reduce process itself must be written with this in mind.

So first the mapper revised:

function () { emit(this.featureType, { count: 1, ids: [this._id] }); }

Which is now consistent with the final output form. This is important when considering the reducer which you now know will be invoked multiple times:

function (key, values) {

  var ids = [];
  var count = 0;

  // Each input value has the same shape as the output of this function,
  // so partial results from earlier reduce passes fold in safely
  values.forEach(function(value) {
    count += value.count;
    value.ids.forEach(function(id) {
      ids.push( id );
    });
  });

  return { count: count, ids: ids };

}
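
For completeness, here is a minimal sketch of how the pair might be run together; the source collection name "collection" follows the aggregation examples below, and the output collection name "mapReduceOut" is an assumption, not something from the question:

db.collection.mapReduce(
    // mapper: emit in the same shape the reducer returns
    function () { emit(this.featureType, { count: 1, ids: [this._id] }); },
    // reducer: fold counts and id arrays together
    function (key, values) {
        var ids = [];
        var count = 0;
        values.forEach(function(value) {
            count += value.count;
            value.ids.forEach(function(id) { ids.push(id); });
        });
        return { count: count, ids: ids };
    },
    // write the results out to a collection, as the question intends
    { "out": "mapReduceOut" }
)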

What this means is that each invocation of the reduce function expects the same input shape as it outputs: a count field and an array of ids. It gets to the final result by essentially:

  • Reducing one chunk of results, #chunk1
  • Reducing another chunk of results, #chunk2
  • Combining the reduce output of the reduced chunks, #chunk1 and #chunk2

That may not seem immediately apparent, but the behavior is by design: the reducer gets called many times in this way to process large sets of emitted data, so it "aggregates" gradually rather than in one big step.
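
To see this concretely, the revised reducer can safely consume its own previous output. The following is a small plain-JavaScript sketch with made-up ids, not part of the original question:

var reduce = function (key, values) {
  var ids = [];
  var count = 0;
  values.forEach(function(value) {
    count += value.count;
    value.ids.forEach(function(id) { ids.push(id); });
  });
  return { count: count, ids: ids };
};

// Two partial reduce passes over separate chunks of emitted values
var chunk1 = reduce("FEATURE_TYPE_A", [{ count: 1, ids: ["a"] }, { count: 1, ids: ["b"] }]);
var chunk2 = reduce("FEATURE_TYPE_A", [{ count: 1, ids: ["c"] }]);

// Re-reducing the partial results gives the same shape and the right totals
reduce("FEATURE_TYPE_A", [chunk1, chunk2]);   // { count: 3, ids: ["a", "b", "c"] }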


The aggregation framework makes this a lot more straightforward: from MongoDB 2.6 and upwards the results can even be output to a collection, so even if you had more than one result and the combined output was greater than 16MB, this would not be a problem.

db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 },
        "ids": { "$push": "$_id" }
    }},
    { "$out": "ouputCollection" }
])

So that will not break, and it will actually return the expected result, with the complexity greatly reduced as the operation is indeed very straightforward.

But as I have already said, your purpose in returning the array of "_id" values seems unclear given its sheer size. So if all you really wanted was a count per "featureType", you would use basically the same approach rather than trying to force mapReduce into finding the length of a very large array:

db.collection.aggregate([
    { "$group": {
        "_id": "$featureType",
        "count": { "$sum": 1 },
    }}
])
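
Each output document then carries just the key and its count, along the lines of the following (the count value here is illustrative):

{ "_id": "FEATURE_TYPE_A", "count": 140000 }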

In either form though, the results will be correct, and they will be produced in a fraction of the time that the mapReduce operation as constructed would take.