I'm researching Hadoop and MapReduce (I'm a beginner!) and have a simple question about HDFS. I'm a little confused about how HDFS and MapReduce work together.

Let's say I have logs from System A, Tweets, and a stack of documents from System B. When this is loaded into Hadoop/HDFS, is it all thrown into one big HDFS bucket, or would there be three separate areas (for want of a better word)? If so, what is the correct terminology?

The question stems from trying to understand how to execute a MapReduce job. If I only wanted to concentrate on the logs, for example, could that be done, or are all jobs executed against the entire content stored on the cluster?

Thanks for your guidance! TM


1 Answer


HDFS is a file system. As in your local file system, you can organize your logs and documents into multiple files and directories. When you run a MapReduce job, you usually specify a directory containing your input files, so it is possible to execute a job only on the logs from System A or only on the documents from System B.
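For illustration, a minimal job driver might look like the sketch below. The directory names (/data/system-a/logs, /output/log-analysis) are made up for this example, and the mapper/reducer setup is omitted; the point is just that the job only reads the input path you give it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-only job");
        job.setJarByClass(LogOnlyJob.class);

        // Point the job at the log directory only; the tweets and the
        // documents from System B live in other HDFS directories and are
        // simply never read by this job. (Paths here are hypothetical.)
        FileInputFormat.addInputPath(job, new Path("/data/system-a/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/output/log-analysis"));

        // Mapper and Reducer classes would be set here (omitted in this sketch).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```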

The input for your mappers is specified by the InputFormat. Most implementations derive from FileInputFormat, which reads files, but it is also possible to implement a custom InputFormat to read data from other sources. You can find an explanation of input and output formats in this Hadoop Tutorial.
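As a rough sketch of how that choice is made, the default TextInputFormat (a FileInputFormat subclass that emits one record per line) is set on the job like this; a custom format would be plugged in the same way. MyCustomInputFormat below is a hypothetical name, not a real class.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    static void configure(Job job) {
        // TextInputFormat extends FileInputFormat and reads plain files line by line.
        job.setInputFormatClass(TextInputFormat.class);

        // A custom implementation, e.g.
        //   job.setInputFormatClass(MyCustomInputFormat.class);   // hypothetical class
        // would let the job read records from sources other than plain files.
    }
}
```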