Thanks for all your messages.
As I do not want to use a cursor (it costs too many requests), I got the job done by combining two map-reduce jobs and one aggregation. It is quite heavy, but it works and may give others some ideas.
Of course, I would be very pleased to hear about other good alternatives.
So, I have a collection c1, which is the result of a previous map-reduce job, as you can see from the value object.
c1 : { _id:'xxxx', value:{ language:'...', keyword: '...', params: '...', field1: val1, field2: val2}}
The unique key xxxx is the concatenation of value.language, value.keyword and value.params, as follows:
xxxx = language + '_' + keyword + '_' + params
I have another collection c2 : { _id : ObjectID, language:'...', keyword:'...', field1: val1, field2: val2, labels: 'yyyyy'}, which is more or less a
projection of the c1 collection, with an extra field labels: a string of comma-separated labels. This c2 collection is the central repository for all combinations of language and keyword, with their attached field values.
Target
The target is to group all records from the c1 collection by the
group key language + '_' + keyword, make some calculations on the
other fields and store the result in the c2 collection, while keeping
the old 'labels' field from c2 for the same key. So field1 and field2 of
this c2 collection are recalculated each time we launch the whole
batch, but the labels field stays unchanged.
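To make the target concrete, here is a minimal in-memory sketch in plain Python, independent of MongoDB. The helper name `recompute_c2` and the sample documents are made up for illustration, summing is only an assumed stand-in for the "calculations" on field1 and field2, and the 'x' placeholder for keys missing from c2 mirrors the marker used later in this post.

```python
def recompute_c2(c1_docs, c2_docs):
    """Recompute field1/field2 from c1 while keeping the labels stored in c2."""
    merged = {}
    # Start from c2: keep language, keyword and labels, reset the two fields.
    for doc in c2_docs:
        key = doc["language"] + "_" + doc["keyword"]
        merged[key] = {
            "language": doc["language"],
            "keyword": doc["keyword"],
            "field1": 0,
            "field2": 0.0,
            "labels": doc["labels"],
        }
    # Add every c1 record to its group; summing is an assumed calculation.
    for doc in c1_docs:
        value = doc["value"]
        key = value["language"] + "_" + value["keyword"]
        entry = merged.setdefault(key, {
            "language": value["language"],
            "keyword": value["keyword"],
            "field1": 0,
            "field2": 0.0,
            "labels": "x",  # placeholder for keys that c2 did not contain
        })
        entry["field1"] += int(value["field1"])
        entry["field2"] += float(value["field2"])
    return list(merged.values())


# Made-up sample data in the shapes described above.
c1_sample = [
    {"_id": "en_mongo_a", "value": {"language": "en", "keyword": "mongo",
                                    "params": "a", "field1": 2, "field2": 1.5}},
    {"_id": "en_mongo_b", "value": {"language": "en", "keyword": "mongo",
                                    "params": "b", "field1": 3, "field2": 0.5}},
]
c2_sample = [
    {"language": "en", "keyword": "mongo", "field1": 99, "field2": 9.9,
     "labels": "db,nosql"},
]
new_c2 = recompute_c2(c1_sample, c2_sample)
```

The old field1 and field2 values of c2 (99 and 9.9 here) are discarded, while 'db,nosql' survives. The rest of this post gets the same effect on the server side.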
As described in my first message, aggregation or map-reduce jobs alone cannot reach this target, because the 'labels' field would be removed.
As I do not want to use cursors and forEach loops, which are very costly in network and database requests (I have a big collection and I use a MongoHQ service), I tried to solve the problem with map-reduce and aggregation jobs only.
1st Phase
So, first I run a map-reduce job (m1) that makes a sort of copy of the c2 collection, but resets field1 and field2 to 0. The result is stored in a c3 collection.
function m1Map(){
    // m1 runs on c2, whose documents are flat (no 'value' sub-object)
    var language = this['language'];
    var keyword = this['keyword'];
    var labels = this['labels'];
    var key = language + '_' + keyword;
    // Copy the document with field1 and field2 reset to 0
    emit(key,{'language':language,'keyword':keyword,'field1': 0, 'field2': 0.0, 'labels' : labels});
}
function m1Reduce(key,values){
    // c2 has one document per key, so the first value is enough
    var language = values[0]['language'];
    var keyword = values[0]['keyword'];
    var labels = values[0]['labels'];
    return {'language':language,'keyword':keyword,'field1': 0, 'field2': 0.0, 'labels' : labels};
}
So now c3 is a copy of the c2 collection with field1 and field2 set to 0. Here is the shape of this collection:
c3 : { _id:'language_keyword', value:{ language:'...', keyword: '...', field1: 0, field2: 0.0, labels: '...'}}
2nd Phase
In a second step I run a map-reduce job (m2) that groups the c1 collection values by the key language + '_' + keyword and projects an extra field 'labels' with a fixed value, 'x' in my example. This 'x' value is never used in the c2 collection; it is a special marker. The output of this m2 job is stored in the same c3 collection, using the 'reduce' option in the out directive. The Python script is described further down.
function m2Map(){
    var language = this['value']['language'];
    var keyword = this['value']['keyword'];
    var field1 = this['value']['field1'];
    var field2 = this['value']['field2'];
    var key = language + '_' + keyword;
    emit(key,{'language':language,'keyword':keyword,'field1': field1, 'field2': field2, 'labels' : 'x'});
}
Then I make the calculations in the reduce function:
function m2Reduce(key,values){
    // Init
    var language = values[0]['language'];
    var keyword = values[0]['keyword'];
    var field1 = 0;
    var field2 = 0;
    var labels;
    var bLabel = 0;
    for (var i = 0; i < values.length; i++){
        if (values[i]['labels'] == 'x') {
            // These values come from the m2 map emit, not from the previous content of the c2 collection
            // ('x' is never used in the c2 collection)
            field1 += parseInt(values[i]['field1'], 10);
            field2 += parseFloat(values[i]['field2']);
        } else {
            // These values come from the c2 collection
            if (bLabel == 0) {
                // Keep the former value of the 'labels' field
                labels = values[i]['labels'];
                bLabel = 1;
            } else {
                // Concatenate the 'labels' fields if there are 2 records; theoretically impossible, as c2 has only one record per unique key
                // Anyway, a good check afterwards :-)
                labels += ','+values[i]['labels'];
            }
        }
    }
    if (bLabel == 0) {
        // If all values come from the map emit, force the 'x' value for labels again, in case these values are re-used in another reduce call
        labels = 'x';
    }
    return {'language':language,'keyword':keyword, 'field1': field1, 'field2': field2, 'labels' : labels};
}
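To check the labelling trick without a database, the reduce logic can be transcribed to plain Python. This is only a sketch: `m2_reduce` is a hand-made port of m2Reduce, the sample values are invented, and it assumes the emitted values are reduced first and the result is then reduced together with the document already stored in c3.

```python
def m2_reduce(key, values):
    """Python port of m2Reduce: sum the 'x' values, keep the real labels."""
    field1 = 0
    field2 = 0.0
    labels = None
    for value in values:
        if value["labels"] == "x":
            # Value emitted by the m2 map (or a partial result of such values).
            field1 += int(value["field1"])
            field2 += float(value["field2"])
        elif labels is None:
            # Value coming from the c2 copy: keep its labels.
            labels = value["labels"]
        else:
            # Should not happen, c2 has one record per key.
            labels += "," + value["labels"]
    if labels is None:
        labels = "x"  # still only map output, keep the marker
    return {
        "language": values[0]["language"],
        "keyword": values[0]["keyword"],
        "field1": field1,
        "field2": field2,
        "labels": labels,
    }


# Invented sample: two values emitted by m2Map for the same key.
emitted = [
    {"language": "en", "keyword": "mongo", "field1": 2, "field2": 1.5, "labels": "x"},
    {"language": "en", "keyword": "mongo", "field1": 3, "field2": 0.5, "labels": "x"},
]
# Document already in c3 after phase 1 (fields reset, labels kept).
existing = {"language": "en", "keyword": "mongo", "field1": 0, "field2": 0.0,
            "labels": "db,nosql"}

partial = m2_reduce("en_mongo", emitted)
final = m2_reduce("en_mongo", [existing, partial])
```

The intermediate result still carries 'x', so its sums are counted again in the final call, and the final document takes its labels from the c3 copy.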
The Python script that runs the two map-reduce jobs m1 and m2
(see the pymongo installation guide for the import: http://api.mongodb.org/python/2.7rc0/installation.html)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pymongo import MongoReplicaSetClient
from bson.code import Code
from bson.son import SON

# MongoHQ
uri = 'mongodb://user:passwd@url_node1:port,url_node2:port/mydb'
client = MongoReplicaSetClient(uri, replicaSet='set-xxxxxxx')
db = client.mydb
coll1 = db.c1
coll2 = db.c2


def load_code(path):
    # Read a JavaScript file and wrap it as BSON Code
    with open(path, 'r') as js_file:
        return Code(js_file.read())


# Load map and reduce functions
m1_map = load_code('m1Map.js')
m1_reduce = load_code('m1Reduce.js')
m2_map = load_code('m2Map.js')
m2_reduce = load_code('m2Reduce.js')

# Run the map-reduce queries
# m1: copy c2 into c3 with field1 and field2 reset to 0
results = coll2.map_reduce(m1_map, m1_reduce, "c3", query={})
# m2: group c1 and merge the result into c3 through m2Reduce
results = coll1.map_reduce(m2_map, m2_reduce, out=SON([("reduce", "c3")]), query={})
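The key point of the second call is the 'reduce' output action: when a key already exists in the output collection, the reduce function is applied to the existing and the new value instead of overwriting the document. The following plain-Python sketch mimics that behaviour on dictionaries. `merge_with_reduce` and `sum_counts` are hypothetical names, and the order in which the two values are passed is an assumption.

```python
def merge_with_reduce(target, new_results, reduce_fn):
    """Mimic the 'reduce' output action on a dict keyed by _id."""
    for key, new_value in new_results.items():
        if key in target:
            # Key already present: combine old and new through the reducer.
            target[key] = reduce_fn(key, [target[key], new_value])
        else:
            # New key: store the job result as is.
            target[key] = new_value
    return target


def sum_counts(key, values):
    # Toy reduce function, used only for this illustration.
    return {"count": sum(value["count"] for value in values)}


c3_like = {"en_mongo": {"count": 0}}
job_output = {"en_mongo": {"count": 5}, "fr_mongo": {"count": 2}}
merge_with_reduce(c3_like, job_output, sum_counts)
```

This is why the reduce function of m2 has to recognise which of its inputs comes from the phase 1 copy and which comes from the m2 map.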
3rd Phase
At this point, the c3 collection is complete, with the computed field1 and field2 values and the labels preserved. So now we run a last aggregation pipeline to copy the c3 content (in map-reduce form, with a compound value) into a more classical c2 collection with flattened fields, without the value object.
db.c3.aggregate([{$project : { _id: 0, keyword: '$value.keyword', language: '$value.language', field1: '$value.field1', field2 : '$value.field2', labels : '$value.labels'}},{$out:'c2'}])
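For readers less familiar with $project, here is what that stage does to a single document, written as a small Python function. It is only an illustration on an invented document, `flatten` is a hypothetical helper, and the real work stays on the server in the pipeline above.

```python
# Fields lifted from the 'value' object, as listed in the $project stage.
FLAT_FIELDS = ("keyword", "language", "field1", "field2", "labels")


def flatten(c3_doc):
    """Lift value.* to the top level and drop the map-reduce _id."""
    value = c3_doc["value"]
    return {name: value[name] for name in FLAT_FIELDS}


# Invented c3 document in the map-reduce output shape.
c3_doc = {
    "_id": "en_mongo",
    "value": {"language": "en", "keyword": "mongo", "field1": 5,
              "field2": 2.0, "labels": "db,nosql"},
}
flat_doc = flatten(c3_doc)
```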
Et voilà! The target is reached. This solution is quite long, with two map-reduce jobs and one aggregation pipeline, but it is an alternative for those who do not want to use costly cursors or an external loop.
Thanks.