11
votes

I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.

Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (eg. popularity) and then group it by an id. The only way I see I can do this is through Mongo's MapReduce functionality.

I can't use .group() because I'm working with more than 10,000 keys and I also need to be able to sort the dataset.

My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.

db.runCommand({
  'mapreduce': 'products',
  'map': function() {
    emit({
      product_id: this.product_id,
      popularity: this.popularity
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
  'sort': {popularity: -1}
});

I already have a descending index on the popularity datapoint, so it's definitely not working because of a lack of that:

{ 
  "v" : 1, 
  "key" : { "popularity" : -1 }, 
  "ns" : "app.products", 
  "name" : "popularity_-1" 
}

I just cannot figure out why it doesn't want to sort.

Instead of inlining the result set, I can't output it to another collection and then run a .find().sort({popularity: -1}) on that because of the way this feature is going to work.

2
have you considered using 2.2 and aggregation framework?Asya Kamsky
Sorting currently isn't supported for an inline Map/Reduce (see SERVER-3973 to watch/vote on this feature). An inline Map/Reduce (or an aggregate query) is also limited to the max document size (currently 16Mb in MongoDB 2.0). What is the issue preventing output to a temporary collection which could be queried in sort order? Another option would be to sort the result in your client app, eg using usort().Stennie
@Aysa: I have, but 2.2 isn't production ready yet.Jon Ursenbach
@Stennie: Seems to be the case. The issue preventing me from outputting it to a temporary collection is that I can't have this feature creating a new collection for every request that comes into the app. The Mongo docs don't read like I can create a "temporary" collection anymore anyways. Of course I can delete it after I'm done, but that doesn't seem like a great solution. And I'm not worried about the 16MB document size limit at all.Jon Ursenbach
@JonUrsenbach: The second release candidate for MongoDB 2.2 came out this week (rc1, following rc0 in mid-July). I expect a production release isn't far off; would be worth testing.Stennie

2 Answers

15
votes

First of all, Mongo map/reduce are not designed to be used in as a query tool (as it is in CouchDB), it is design for you to run background tasks. I use it at work to analyze traffic data.

What you are doing wrong however is that you're applying the sort() to your input, but it is useless because when the map() stage is done the intermediate documents are sorted by each keys. Because your key is a document, it is being sort by product_id, popularity.

This is how I generated my dataset

function generate_dummy_data() {
    for (i=2; i < 1000000; i++) { 
        db.foobar.save({
          _id: i, 
         category_id: parseInt(Math.random() * 30), 
         popularity:    parseInt(Math.random() * 50)
        }) 
    }
}

And this my map/reduce task:

var data = db.runCommand({
  'mapreduce': 'foobar',
  'map': function() {
    emit({
      sorting: this.popularity * -1,
      product_id: this._id,
      popularity: this.popularity,
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
});

And this is the end result (very long to paste it here):

http://cesarodas.com/results.txt

This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting how ever you like just remember that the final sorting is by key regardless of you how your input is sorted.

Anyway as I said before you should avoid doing queries with Map/Reduce it was designed for background processing. If I were you I would design my data in such a way I could access it with simple queries, there is always a trade-off in this case complex insert/updates to have simple queries (that's how I see MongoDB).

9
votes

As noted in discussion on the original question:

  • Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include relying on the emitted key order (see @crodas's answer); outputting to a collection and querying that collection with sort order; or sorting the results in your application using something like usort().

  • OP's preference is for inline results rather than creating/deleting temporary collections.

  • The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.

Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:

db.products.aggregate(
  { $match: { category_id: 20 }},
  { $group : {
     _id : "$product_id",
     'popularity' : { $sum : "$popularity" },
  }},
  { $sort: { 'popularity': -1 }}
)

.. and sample output:

{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}