I am attempting to use Map/Reduce to accomplish partial merges into an existing collection. The MR itself works correctly, but I am having trouble getting back only the merged results.

Here are the stats from the MR with an output type of reduce:

{ 
    "result" : "calculation",
    "timeMillis" : 222,
    "counts" : { 
        "input" : 492,
        "emit" : 920,
        "reduce" : 64,
        "output" : 435078
     },
    "ok" : 1.0
 }

I would expect output to be the number of docs actually merged, not the size of the entire collection. Is there any way to get this?

I tried to merge a modified:true flag into the target docs. This way a query could be made that returns only the documents that were modified in the target collection. After the query, I then set the flag back to false.

While this works correctly, it starts thrashing the index because of the massive number of changes being made and then flipped back, so the disk I/O rate shoots up and MR performance plummets.

Ideally, calling result.GetResults() from the C# driver would naturally return the documents that were modified by the MR without the need to use flags.

Update:

Specifically, I have one collection that is "write only" which the MR runs on to merge into a "read" collection.

If there were a document set like

{
   "_id":BsonId,
   "key":"key1",
   "valarray":["one"]
},
{
   "_id":BsonId,
   "key":"key2",
   "valarray":["one"]
}

then MR into the blank query collection would yield

{
  "_id":"key1",
  "value":
  {
     "valarray":["one"]
  }
},
{
  "_id":"key2",
  "value":
  {
     "valarray":["one"]
  }
}

and I would expect that the counts would be: input = 2, emit = 2, reduce = 0, output = 2

If a new document were then inserted into the write collection

{
   "_id":BsonId,
   "key":"key1",
   "valarray":["two"]
}

then the map-reduce collection would be

{
  "_id":"key1",
  "value":
  {
     "valarray":["one", "two"]
  }
},
{
  "_id":"key2",
  "value":
  {
     "valarray":["one"]
  }
}

The counts are then: input = 1, emit = 1, reduce = 1, output = 2

And through the C# driver, calling result.GetResults() would iterate over the whole target collection. The issue is that I do not want to iterate over the collection, I only want to iterate over the documents in the target collection that were modified by the MR. In this case, it should return "_id":"key1" but not "_id":"key2".
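
To make the counts above concrete, here is a minimal in-memory Python sketch of the merge behaviour as this question describes it. It is only an illustration under stated assumptions, not the server's implementation: plain dicts stand in for the collections, the names `map_doc`, `reduce_values`, and `map_reduce_into` are hypothetical, and the counters follow the accounting used in the example above. It also returns the keys touched by the run, which is the information being asked for.

```python
from collections import defaultdict


def map_doc(doc):
    # Hypothetical map step: emit one (key, value) pair per source document.
    yield doc["key"], {"valarray": list(doc["valarray"])}


def reduce_values(key, values):
    # Hypothetical reduce step: concatenate the arrays in order.
    merged = []
    for value in values:
        merged.extend(value["valarray"])
    return {"valarray": merged}


def map_reduce_into(source_docs, target):
    """Simulate a reduce-style merge of source_docs into target.

    target is a dict of _id -> value standing in for the read collection.
    Returns (counts, touched_keys).
    """
    counts = {"input": 0, "emit": 0, "reduce": 0, "output": 0}
    grouped = defaultdict(list)
    for doc in source_docs:
        counts["input"] += 1
        for key, value in map_doc(doc):
            counts["emit"] += 1
            grouped[key].append(value)

    for key, values in grouped.items():
        if key in target:
            # An existing document takes part in the reduce.
            values = [target[key]] + values
        if len(values) > 1:
            counts["reduce"] += 1
            target[key] = reduce_values(key, values)
        else:
            target[key] = values[0]

    # "output" reflects the whole target, not just what this run touched.
    counts["output"] = len(target)
    return counts, list(grouped)


read_collection = {}
first = [
    {"key": "key1", "valarray": ["one"]},
    {"key": "key2", "valarray": ["one"]},
]
print(map_reduce_into(first, read_collection))
print(map_reduce_into([{"key": "key1", "valarray": ["two"]}], read_collection))
```

In this sketch the second run reports an output of 2 even though only key1 changed, which mirrors the mismatch between the output count and the set of modified documents.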

So what is your question exactly? Perhaps you could show the problem you are trying to solve, your mapReduce code and a sample of different documents that you are trying to work with. - Neil Lunn
Thanks, is the update enough? - mikkelfishman
It does show where you want to get to. But only the code shows how you are getting there and where it is falling short. But at a guess, you currently have no way of knowing which items are actually in your target in order to determine what to update or insert. - Neil Lunn
Yes, under the docs, it says the option for reduce is "Merge the new result with the existing result if the output collection already exists. If an existing document has the same key as the new result, apply the reduce function to both the new and the existing documents and overwrite the existing document with the result." I would think that the result coming back from the MR should represent the docs that were merged, not the entire collection. This is definitely a core implementation question. - mikkelfishman
Well it's up to you whether you want to post code or not. But I can't see how people can see where this is tripping up on your expectations unless you do. It is part of the question after all. - Neil Lunn

1 Answer

The problem in a nutshell: you have a relatively small number of documents to merge, but the merge is thrashing against the whole collection, and you don't want it to.

The thing here is that you want to apply a reduce function not only over the output documents resulting from the input stage but also over the documents that already exist. So the implementation seems to run the reduce over the whole output collection in order to merge with the results.

So what you want is a targeted result, where only the documents being updated are actually modified. There is a way I can see to achieve this, but it is going to take some steps and a bit more code.

  1. Run your regular mapReduce operation. But instead of directing the output to your target collection, output to a temporary input collection.

  2. Using the keys from that output, get the corresponding documents from your target and insert them into a temporary target collection.

  3. Run a modified mapReduce that takes the temporary input and applies your reduce function through that and the temporary target collection items. This part is doing the work you want but only on the items to be updated, and in a smaller collection.

  4. Once modified, take that result and apply it with update operations on your main target.

Thinking of it that way, you have a workaround that gets the results you want into the target without the output stage thrashing over every document in the collection. The trade-off is the extra steps, but the gains would seem to outweigh the performance problems incurred by doing this in one step.
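
The four steps above can be sketched roughly as follows. This is an in-memory Python illustration only, assuming the `key`/`valarray` document shape from the question: dicts stand in for the temporary and main collections, and `reduce_values` and `staged_merge` are hypothetical names rather than anything provided by MongoDB or its drivers. It also assumes the reduce function gives the same result when applied in stages, which incremental merging needs anyway.

```python
def reduce_values(key, values):
    # Hypothetical reduce: concatenate the arrays in order.
    merged = []
    for value in values:
        merged.extend(value["valarray"])
    return {"valarray": merged}


def staged_merge(new_docs, target):
    """Merge new_docs into target and return only the modified documents.

    target is a dict of _id -> value standing in for the main collection.
    """
    # Step 1: map-reduce the new input into a temporary input collection.
    temp_input = {}
    for doc in new_docs:
        key = doc["key"]
        value = {"valarray": list(doc["valarray"])}
        if key in temp_input:
            temp_input[key] = reduce_values(key, [temp_input[key], value])
        else:
            temp_input[key] = value

    # Step 2: copy only the matching target documents into a temporary target.
    temp_target = {key: target[key] for key in temp_input if key in target}

    # Step 3: reduce the temporary input against the temporary target.
    merged = {}
    for key, value in temp_input.items():
        if key in temp_target:
            merged[key] = reduce_values(key, [temp_target[key], value])
        else:
            merged[key] = value

    # Step 4: apply the merged documents to the main target (update or insert).
    target.update(merged)
    return merged


target = {"key1": {"valarray": ["one"]}, "key2": {"valarray": ["one"]}}
modified = staged_merge([{"key": "key1", "valarray": ["two"]}], target)
print(modified)
```

The return value holds only the documents that were written, so a caller can iterate over that instead of the whole target collection. Against a real database, steps 2 and 4 would be a keyed query and a batch of upserts respectively.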