I have a huge txt data store from which I want to gather some stats.
Using Hadoop Streaming and Python I know how to implement a MapReduce job that gathers stats on a single column, e.g. counting how many records there are for each of ~100 categories. I write a simple mapper.py and reducer.py and plug them into the hadoop-streaming command as -mapper and -reducer respectively.
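For context, here is roughly what that single-column job looks like (a simplified sketch, not my real code; the input is tab-separated and the column index is just illustrative):

```python
#!/usr/bin/env python
# mapper.py -- emit one (category, 1) pair per input record.
# Assumes tab-separated lines; the column position is illustrative.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2:
        category = fields[2]          # hypothetical column position
        print("%s\t1" % category)

# reducer.py -- sum the 1s per category (streaming sorts input by key).
import sys

current_key = None
count = 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key = key
        count = 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))
```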
Now, I am at a bit of a loss as to how to practically approach a more complex task: gathering various stats on several other columns in addition to the categories above (e.g. geographies, types, dates). All of that data lives in the same txt files.
Do I chain the mapper/reducer tasks together? Do I start with long key-value pairs (with all the data included) and "strip" them of the interesting values one by one while processing? Or is this the wrong path? I'm looking for practical advice on how people "glue" together multiple MapReduce tasks over a single data source from within Python.
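To make the second option concrete, this is the kind of mapper I have in mind: a single pass that emits a (stat, value) composite key for every column of interest, with the same summing reducer as above. The column positions and stat names here are purely illustrative, not my real layout. Is this a sane way to go, or does it fight against how streaming jobs are normally composed?

```python
#!/usr/bin/env python
# multi_mapper.py -- one pass over the data, emitting a composite key
# "stat:value" for each column of interest, so one job produces counts
# for several columns at once. Column indices are hypothetical.
import sys

COLUMNS = {            # stat name -> column index (illustrative)
    "category": 2,
    "geography": 5,
    "type": 7,
}

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    for stat, idx in COLUMNS.items():
        if idx < len(fields):
            # the reducer then just sums counts per composite key
            print("%s:%s\t1" % (stat, fields[idx]))
```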