With MapReduce v2, the output of a map or reduce task only appears on the local disk or in HDFS once all the tasks have finished.
Since tasks end at different times, I expected the data to be written as each task finishes. For example, task 0 finishes and its output is written while tasks 1 and 2 are still running. Then task 2 finishes and its output is written while task 1 is still running. Finally, task 1 finishes and the last output is written. But this is not what happens: the outputs only appear on the local disk or in HDFS when all the tasks have finished.
I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?
Update
I have set these params in mapred-site.xml
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
and these params in hdfs-site.xml
<property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
<property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>
And this value in core-site.xml
<property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-temp</value> </property>
but I still can't find where the intermediate output or the final output is saved as the tasks produce it.
I have listed all directories with hdfs dfs -ls -R / and, in the tmp dir, I have only found the job configuration files.
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r-- 10 root supergroup 6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r-- 1 root supergroup 797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r-- 1 root supergroup 88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r-- 1 root supergroup 439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r-- 1 root supergroup 105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
Where is the output saved? I am talking about the output that is stored as it is being produced by the tasks, not the final output that appears when all map or reduce tasks finish.
hdfs:///tmp? More importantly, though, why do you need that data? – OneCricketeer
<output dir>/_temporary/1/_temporary – xeon
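To make the path in the last comment concrete, here is a minimal sketch that builds the directories where in-progress output would be expected. It assumes the job uses the default FileOutputCommitter with its original (version 1) commit algorithm, that the "1" in the path is the application attempt number, and that committed tasks are parked under a directory named after the task ID until the job commits. The helper names, the output directory, and the example IDs are hypothetical and only for illustration.

```python
from posixpath import join  # HDFS paths use forward slashes on every platform


def attempt_dir(output_dir, task_attempt_id, app_attempt=1):
    """Directory a running task attempt writes into (assumed layout)."""
    return join(output_dir, "_temporary", str(app_attempt),
                "_temporary", task_attempt_id)


def committed_task_dir(output_dir, task_id, app_attempt=1):
    """Directory a finished task is moved to before job commit (assumed, v1 algorithm)."""
    return join(output_dir, "_temporary", str(app_attempt), task_id)


if __name__ == "__main__":
    out = "/user/root/output"  # hypothetical job output directory
    # Hypothetical IDs derived from the job ID shown in the listing above.
    print(attempt_dir(out, "attempt_1470912033891_0002_r_000000_0"))
    print(committed_task_dir(out, "task_1470912033891_0002_r_000000"))
```

Listing the printed paths with hdfs dfs -ls -R while the job is still running should show the part files before they are moved into the output directory itself, if the assumptions above hold for your setup.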