Synchronize data to HBase/HDFS and use it as input to MapReduce job

Question

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.

This example might explain more:

Let's say I have an input stream of documents, each containing a bunch of words, and these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded to (or deleted from) the filesystem.
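
To make the requirement concrete, here is a minimal sketch in Python of what "only the changes" could mean for one document. It is an illustration only: the function names `word_counts` and `diff_counts` and the tokenizing regex are assumptions, and nothing here talks to HDFS or HBase.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in one document."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def diff_counts(old, new):
    """Return (added, removed) Counters describing how `new` differs from `old`.

    Counter subtraction keeps only positive counts, so each side holds
    just the words whose count went up (added) or down (removed).
    """
    added = new - old
    removed = old - new
    return added, removed


# Hypothetical first and second arrival of the same document.
old = word_counts("the quick brown fox")
new = word_counts("the quick red fox jumps")
added, removed = diff_counts(old, new)
```

Only `added` and `removed` would then need to be applied to whatever is stored, instead of re-uploading every word.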

How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.

Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?

Don Branson · Accepted Answer · 2012-02-17T14:36:21

I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.

I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write code flexibly to make it easier to replace original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off by using HDFS, using the Hadoop command-line tools to push the data to an HDFS folder (hadoop fs -put ...). Then, I'd write an MR job or jobs to do the processing, running them manually. Once that was working, I'd probably use cron to handle scheduling of the jobs.
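
As a rough sketch of that workflow, a small Python wrapper could build the `hadoop fs -put` and `hadoop jar` commands and later be called from cron. The file names, HDFS paths, jar name, and `WordCount` class below are placeholders I made up for illustration, and the helper names `put_command`, `job_command`, and `run` are assumptions as well.

```python
import subprocess


def put_command(local_path, hdfs_dir, overwrite=True):
    """Build the `hadoop fs -put` command that copies a local file to HDFS."""
    cmd = ["hadoop", "fs", "-put"]
    if overwrite:
        cmd.append("-f")  # replace the destination file if it already exists
    cmd += [local_path, hdfs_dir]
    return cmd


def job_command(jar, main_class, input_dir, output_dir):
    """Build the `hadoop jar` command that launches a MapReduce job."""
    return ["hadoop", "jar", jar, main_class, input_dir, output_dir]


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if the command fails


# Placeholder paths and names; adjust to your own cluster layout.
run(put_command("words.txt", "/user/me/wordcount/input"))
run(job_command("wordcount.jar", "WordCount",
                "/user/me/wordcount/input", "/user/me/wordcount/output"))

# A crontab entry such as the following (hypothetical script path) would
# then run the whole thing every night at 02:00:
#   0 2 * * * python3 /path/to/sync_and_run.py
```

Running it with `dry_run=True` first just prints the commands, which is a cheap way to check them before letting cron execute anything. Note that a MapReduce job normally refuses to start if its output directory already exists, so a scheduled run needs a fresh output path or a cleanup step.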

That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.
