0 votes

With MapReduce v2, the output data produced by a map or reduce task is saved to the local disk or to HDFS only when all the tasks finish.

Since tasks end at different times, I expected the data to be written as each task finishes. For example, task 0 finishes, so its output is written while tasks 1 and 2 are still running; then task 2 finishes and its output is written while task 1 is still running; finally, task 1 finishes and the last output is written. But this does not happen: the outputs only appear on the local disk or in HDFS once all the tasks have finished.

I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?
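
To make this concrete, here is a minimal sketch of the kind of access I have in mind, assuming a FileOutputFormat-based job: it polls the job output directory through the HDFS FileSystem API while the job runs (the OutputWatcher class name, the output path, and the 5-second interval are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Polls the job output directory and prints every file it can see,
// including any in-progress task-attempt files under _temporary/.
public class OutputWatcher {
    public static void main(String[] args) throws Exception {
        Path outputDir = new Path(args[0]); // e.g. /user/root/wordcount-out
        FileSystem fs = FileSystem.get(new Configuration());
        while (true) {
            if (fs.exists(outputDir)) {
                RemoteIterator<LocatedFileStatus> it = fs.listFiles(outputDir, true);
                while (it.hasNext()) {
                    LocatedFileStatus status = it.next();
                    System.out.println(status.getLen() + "\t" + status.getPath());
                }
            }
            Thread.sleep(5000); // poll every 5 seconds
        }
    }
}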

Update

After setting these properties in mapred-site.xml

<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>

and these properties in hdfs-site.xml

<property><name>dfs.name.dir</name><value>/tmp/data/dfs/name/</value></property>
<property><name>dfs.data.dir</name><value>/tmp/data/dfs/data/</value></property>

and this property in core-site.xml

<property><name>hadoop.tmp.dir</name><value>/tmp/hadoop-temp</value></property>

I still can't find where the intermediate or the final output is saved while it is being produced by the tasks.

I have listed all directories with hdfs dfs -ls -R /, and in the tmp dir I have only found the job configuration files.

drwx------   - root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r--  10 root supergroup     112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r--  10 root supergroup       6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r--   1 root supergroup        797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r--   1 root supergroup      88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r--   1 root supergroup     439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r--   1 root supergroup     105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml

Where is the output saved? I am talking about the output that is stored as it is being produced by the tasks, not the final output that appears once all map or reduce tasks finish.

Have you tried looking in hdfs:///tmp? More importantly, though, why do you need that data? – OneCricketeer
In the tmp dir I can only find the configuration files, not the task output. I want to access the data as it is being produced. – xeon
I'm not entirely sure that intermediate steps of a MapReduce job are human-readable. – OneCricketeer
The output of a task is in <output dir>/_temporary/1/_temporary. – xeon

3 Answers

1 vote

The output of a task is in <output dir>/_temporary/1/_temporary.
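
As an illustration, here is a minimal sketch that reads one in-progress part file directly from that location. The output directory, task attempt ID, and part file name below are hypothetical (the application ID is copied from the listing in the question), so list the _temporary tree first to find the real names:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadInProgressOutput {
    public static void main(String[] args) throws Exception {
        // Layout used by FileOutputCommitter:
        // <output dir>/_temporary/<app attempt>/_temporary/<task attempt>/part-*
        Path inProgress = new Path("/user/root/wordcount-out/_temporary/1/_temporary/"
                + "attempt_1470912033891_0002_m_000000_0/part-m-00000");
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(inProgress);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}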

0 votes

The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are automatically cleared out when the MapReduce job execution completes. If you delete these temporary files, it can affect currently running MapReduce jobs.

-1 votes

Answer from this Stack Overflow link:

It's not a good practice to depend on temporary files, whose location and format can change anytime between releases.

Anyway, setting mapreduce.task.files.preserve.failedtasks to true will keep the temporary files for all failed tasks, and setting mapreduce.task.files.preserve.filepattern to a regex of the task ID will keep the temporary files matching that pattern, irrespective of task success or failure.
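
For example, here is a minimal sketch of setting both properties programmatically on the job's Configuration (the job name and the rest of the job setup are placeholders). Since the value is a regex, ".*" matches every task ID, whereas the bare "*" used in the question's mapred-site.xml is not a valid Java regular expression:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PreserveTaskFilesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep temporary files for all failed tasks.
        conf.setBoolean("mapreduce.task.files.preserve.failedtasks", true);
        // Keep temporary files for every task whose ID matches this regex,
        // whether the task succeeds or fails.
        conf.set("mapreduce.task.files.preserve.filepattern", ".*");
        Job job = Job.getInstance(conf, "my-job");
        // ... set mapper, reducer, input/output paths, then submit as usual.
    }
}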

There is some more information in the same post.