11 votes

I was trying to use MongoDB 2.4.3 (also tried 2.4.4) with mapReduce on a cluster with 2 shards, each with 3 replicas. I have a problem with the results of the mapReduce job not being reduced into the output collection. I tried an incremental map-reduce. I also tried "merging" instead of reducing, but that didn't work either.
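The merge attempt isn't shown below; it would have looked along these lines (a sketch, since the exact command isn't in the question, with the merge output action in place of reduce):

db.coll.mapReduce(map, reduce, {out: {merge: "events", "sharded": true}})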

The mapReduce command, run on mongos (coll isn't sharded):

db.coll.mapReduce(map, reduce, {out: {reduce: "events", "sharded": true}})
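The map and reduce functions aren't included in the question; a minimal pair like the following would produce the counts shown below (the field name "day" and the counting logic are assumptions, not the actual functions):

// map emits one key/value pair per input document;
// "day" is an assumed field name
var map = function () {
    emit(this.day, { count: 1 });
};

// reduce folds the values for one key into a single value; MongoDB may
// call it repeatedly, so it must be idempotent
var reduce = function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.count; });
    return { count: total };
};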

Running this command yields the following output:

{
    "result" : "events",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(28304112)
    },
    "timeMillis" : 418,
    "timing" : {
        "shardProcessing" : 11,
        "postProcessing" : 407
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,192.168.…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14151042)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14153070)
        }
    },
    "ok" : 1,
}

So I see that mapReduce is run over 2 records, which results in 2 records being output. However, in the postProcessCounts for both shards the input count stays 0. I couldn't find any related error messages in the MongoDB log files, and searching the output collection for a record by _id also yields no result.
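For example, a lookup like the following came back empty (the key "some-key" is hypothetical, standing in for one of the emitted keys):

// hypothetical _id lookup against the sharded output collection
db.events.find({ _id: "some-key" })
// returns nothing, although the shard reported 2 output records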

I then tried to reproduce this with a newly created output collection, which I also sharded on a hashed _id and gave the same indexes, but I wasn't able to reproduce the problem.
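The setup of the new collection was roughly as follows (a sketch; the database name "stats" and the exact index set are assumptions, since the question doesn't show them):

// shard the new output collection on a hashed _id
sh.shardCollection("stats.events_test2", { _id: "hashed" })

When outputting the same input to this different collection: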

db.coll.mapReduce(map, reduce, {out: {reduce: "events_test2", "sharded": true}})

This time the result is stored in the output collection, and I got the following output:

{
    "result" : "events_test2",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(4)
    },
    "timeMillis" : 321,
    "timing" : {
        "shardProcessing" : 68,
        "postProcessing" : 253
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        }
    },
    "ok" : 1,
}

When I run the script again with the same input, outputting again to the second collection, the postProcessCounts show that it is now reducing. So the map and reduce functions do their job fine. Why doesn't it work on the first, larger collection? Am I doing something wrong here? Are there any special limitations on collections that can be used as map-reduce output?

For simplicity, since this collection isn't sharded (and is small), why don't you run mapReduce into an unsharded output collection? – Asya Kamsky
Also, initially you say coll is not sharded, but later you say you tried again with a new collection that you also sharded. So you lost me on whether the initial collection is sharded and why you are sharding the output collection. – Asya Kamsky
The input collection isn't sharded, but the output collections are. So the problem is: in the first sharded output collection no output is written, although in the second sharded output collection output is written. For testing purposes I used a small input here to make it easier to see what is going on; I was planning to do this with larger inputs in the future. Also, updating existing records (with reduce, see docs.mongodb.org/manual/tutorial/perform-incremental-map-reduce) is very convenient. – Mark
I've tried multiple permutations like you describe and I cannot reproduce your problem. – Asya Kamsky
I have the same problem with MongoDB 3.0.4; did you find a workaround? – josebetomex

1 Answer

0 votes

mapReduce is run over 2 records, which results in 2 records being output. However, in the postProcessCounts for both shards the input count stays 0.

Map is run over 2 records. If those two records have different keys, then map will emit 2 keys with one value each, which is normal.

But something I noticed in an older version of MongoDB (not sure whether it applies in your case) is that if the values array for a key in the reduce phase has a length of 1, the reduce step is skipped.
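This matches the documented behavior that MongoDB never calls the reduce function for a key that has only a single value, which would explain the "reduce" : 0 counts above: 2 input documents emitting 2 distinct keys leave each key with exactly one value. A minimal illustration (inline output used for brevity):

// with two documents emitting two distinct keys, reduce is never invoked:
// MongoDB skips reduce for any key that has only a single value
var map = function () { emit(this._id, 1); };
var reduce = function (key, values) { return Array.sum(values); };
db.coll.mapReduce(map, reduce, { out: { inline: 1 } })
// counts: { "input" : 2, "emit" : 2, "reduce" : 0, "output" : 2 }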

Is the output collection empty in the first case?