I have an application that creates text files with one line each and dumps them to HDFS. This location is in turn used as the input directory for a Hadoop streaming job.
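For reference, the producer side looks roughly like the sketch below; the file names, payloads, and helper name are placeholders, and only the one-line-per-file layout and the HDFS input directory come from the actual setup:

    # Sketch of the producer side (names and payloads are placeholders):
    # write exactly one line per local file, then push each file into the
    # directory that the streaming job later reads as its input.
    import subprocess

    INPUT_DIR = "/user/hadoop/launchidlworker/input/1"  # input dir from the job command

    def dump_record(i, payload):
        local = "record_%d.txt" % i
        with open(local, "w") as f:
            f.write(payload + "\n")  # exactly one line per file
        subprocess.check_call(["hdfs", "dfs", "-put", local, INPUT_DIR])

    dump_record(0, "example payload")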
The expectation is that the number of mappers will equal the number of input splits, which in my case equals the number of files (each file is a single line, far smaller than a block, so the default TextInputFormat gives every file its own split). Somehow not all of the mappers are getting triggered, and I see a weird error in the streaming output dump:
Caused by: java.io.IOException: Cannot run program "/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1411140750872_0001/container_1411140750872_0001_01_000336/./CODE/python_mapper_unix.py": error=26, Text file busy
"python_mapper.py" is my mapper file.
Environment details: a 40-node AWS EMR cluster of r3.xlarge instances. No other job runs on this cluster while the streaming jar is running, so no external process should be trying to open the "python_mapper.py" file.
Here is the streaming jar command:
    ssh -o StrictHostKeyChecking=no -i hadoop@ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
        -files CODE \
        -file CODE/congfiguration.conf \
        -mapper CODE/python_mapper.py \
        -input /user/hadoop/launchidlworker/input/1 \
        -output /user/hadoop/launchidlworker/output/out1 \
        -numReduceTasks 0
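Before submitting I can sanity-check the mapper locally; this is just a sketch (the path and the specific checks are my own assumptions), based on the fact that streaming executes the mapper file directly as ./CODE/python_mapper.py inside each container:

    # Hypothetical pre-flight check (path and assertions are assumptions):
    # since streaming exec's the mapper file directly, it needs a shebang
    # line and the execute bit, otherwise exec-time errors surface.
    import os
    import stat

    path = "CODE/python_mapper.py"
    with open(path) as f:
        assert f.readline().startswith("#!"), "mapper needs a shebang line"
    assert os.stat(path).st_mode & stat.S_IXUSR, "mapper needs chmod +x"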