2
votes

I have a huge txt data store on which I want to gather some stats.

Using Hadoop-streaming and Python I know how to implement a MapReduce for gathering stats on a single column, e.g. count how many records there are for each of 100 categories. I create a simple mapper.py and reducer.py (roughly as sketched below), and plug them into the hadoop-streaming command as -mapper and -reducer respectively.
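For reference, my current single-column setup looks roughly like this (a simplified sketch; the tab delimiter and the column index are just assumptions about my file layout):

```python
#!/usr/bin/env python
# mapper.py -- emit (category, 1) for every input record
import sys

CATEGORY_COL = 3  # hypothetical index of the category column

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > CATEGORY_COL:
        print("%s\t1" % fields[CATEGORY_COL])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per category (hadoop-streaming sorts by key)
import sys

current_key = None
count = 0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key = key
        count = int(value)

if current_key is not None:
    print("%s\t%d" % (current_key, count))
```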

Now I am at a bit of a loss as to how to practically approach a more complex task: gathering various stats on several other columns in addition to the categories above (e.g. geographies, types, dates, etc.). All that data is in the same txt files.

Do I chain the mapper/reducer jobs together? Do I start with long key-value pairs (with all the data included) and "strip" them of the interesting values one by one while processing? Or is this the wrong path? I need practical advice on how people "glue" various MapReduce tasks for a single data source from within Python.


1 Answer

1
votes

This question is quite generic. Chaining multiple MapReduce jobs is the most common pattern in production solutions. That said, as a programmer you should try to use as few MR jobs as possible for the best performance, which means being smart about choosing the key-value pairs for each job (see the sketch below); of course, this depends on the use case. Some people combine Hadoop Streaming, Pig, Hive, Java MR, etc. to solve a single business problem. You can set up the dependencies between jobs with a workflow management tool such as Oozie, or with plain bash scripts. For importing/exporting data between an RDBMS and HDFS, you can use Sqoop.
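To make the "smart key-value pairs" point concrete, here is one possible sketch (not the only way, and the column names and indices are assumptions about your files): a single streaming job whose mapper emits a composite key of column name plus value, so one pass over the data produces counts for categories, geographies, types, dates, etc. at the same time.

```python
#!/usr/bin/env python
# multi_mapper.py -- emit ("column|value", 1) for several columns at once
import sys

# Hypothetical column layout; adjust the names and indices to your files.
COLUMNS = {"category": 3, "geography": 5, "type": 7, "date": 9}

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    for name, idx in COLUMNS.items():
        if idx < len(fields):
            # Composite key = column name + value, so a plain summing
            # reducer produces per-value counts for every column at once.
            print("%s|%s\t1" % (name, fields[idx]))
```

A plain summing reducer (like the one you already have) works unchanged here, because it only sums per key. The output contains lines like category|books 12345 and geography|EU 6789, which you can split apart afterwards if you need separate reports per column.

If you really do need several dependent jobs, the simplest glue (short of Oozie) is a small driver script that runs one hadoop-streaming command after another, feeding the output directory of one job in as the input of the next. A minimal sketch in Python; the streaming jar path, the HDFS paths, and the second job's scripts are placeholders you would have to adjust:

```python
#!/usr/bin/env python
# driver.py -- run two dependent hadoop-streaming jobs back to back
import subprocess

# Adjust to wherever the streaming jar lives in your installation.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"

def run_job(mapper, reducer, input_path, output_path):
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-files", "%s,%s" % (mapper, reducer),
        "-mapper", "python %s" % mapper,
        "-reducer", "python %s" % reducer,
        "-input", input_path,
        "-output", output_path,
    ]
    subprocess.check_call(cmd)  # raises if a job fails, so the chain stops

# The second job consumes the first job's output directory.
run_job("multi_mapper.py", "reducer.py", "/data/raw", "/data/stats_step1")
run_job("mapper2.py", "reducer2.py", "/data/stats_step1", "/data/stats_step2")
```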

This is a very basic answer to your question. If you want further explanation of any point, let me know.