
I have a few collections of metrics that are stored pre-aggregated into hour and minute collections like this:

{
    "_id" : "12345CHA-2RU020130104",
    "metadata" : {
        "adaptor_id" : "CHA-2RU",
        "processor_id" : NumberLong(0),
        "date" : ISODate("2013-01-04T00:00:00Z"),
        "processor_type" : "CHP",
        "array_serial" : NumberLong(12345)
    },
    "hour" : {
        "11" : 4.6665907,
        "21" : 5.9431519999999995,
        "7" : 0.6405864,
        "17" : 4.712744,
        ---etc---
    },
    "minute" : {
        "11" : {
            "33" : 4.689972,
            "32" : 4.7190895,
            ---etc---
        },
        "3" : {
            "45" : 5.6883,
            "59" : 4.792,
            ---etc---
        }
    }
}

The minute collection has a sub-document for each hour which has an entry for each minute with the value of the metric at that minute.

My question is about the aggregation framework: how should I process this collection if I wanted to find all minutes where the metric was above a certain high-water mark? Investigating the aggregation framework turned up the $unwind operator, but that seems to only work on arrays.

Would the map/reduce functionality be better suited for this? With that I could simply emit any entry above the high-water mark and count them.
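For context, the emit-side logic I have in mind would just walk the hour keys of the minute sub-document and pick out readings above the threshold. A minimal sketch in plain JavaScript (the `HIGH_WATER` value and sample document are illustrative, not real data):

```javascript
// Sketch of the map-side logic: walk the nested "minute" sub-document
// and collect every (hour, minute, value) triple above a threshold.
var HIGH_WATER = 4.7; // placeholder threshold

function minutesAbove(doc, highWater) {
  var hits = [];
  for (var hour in doc.minute) {
    var minutes = doc.minute[hour];
    for (var min in minutes) {
      if (minutes[min] > highWater) {
        hits.push({ hour: hour, minute: min, value: minutes[min] });
      }
    }
  }
  return hits;
}

// Sample document shaped like the collection above
var sample = {
  minute: {
    "11": { "33": 4.689972, "32": 4.7190895 },
    "3":  { "45": 5.6883,   "59": 4.792 }
  }
};

minutesAbove(sample, HIGH_WATER); // → 3 entries (11:32, 3:45, 3:59)
```

In a real map function, each hit would be passed to emit() instead of pushed onto a local array.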

$project may be useful for transforming your objects into something you can aggregate on downstream. Not submitting this as an answer because it's not an exact fit for what you're doing, but I explored several different ad-hoc aggregation techniques here: devsmash.com/blog/… – jmar777
@Chris: I think you're stuck with MapReduce for this. The aggregate operators don't have any mechanisms for using "keys" as "values". – mjhm
I wouldn't say I'm stuck with it; it's not bad. I was just wondering if there were performance improvements to be had with the aggregation framework. – Chris Matta

1 Answer


You could build an array of "keys" using a reduce function that iterates through the object's attributes:

 reduce: function(obj, prev) {
    // push one { hour, minutes } entry per hour key in the minute sub-document
    for (var key in obj.minute) {
        prev.results.push({ hour: key, minutes: obj.minute[key] });
    }
 }

which will give you something like:

  {
          "results" : [
                  {
                          "hour" : "11",
                          "minutes" : {
                                  "33" : 4.689972,
                                  "32" : 4.7190895
                          }
                  },
                  {
                          "hour" : "3",
                          "minutes" : {
                                  "45" : 5.6883,
                                  "59" : 4.792
                          }
                  }
          ]
  }

I've just done a quick test using group(); you'll need something more complex to iterate through the sub-sub-documents (minutes), but hopefully this points you in the right direction.

db.yourcoll.group({
    initial: { results: [] },
    reduce: function(obj, prev) {
        for (var key in obj.minute) {
            prev.results.push({ hour: key, minutes: obj.minute[key] });
        }
    }
});

In the finalizer you could reshape the data again. It's not going to be pretty; it might be easier to hold the minute and hour data as arrays rather than as elements of the document.

Hope it helps a bit.