2 votes

Is it possible to execute a Hadoop Streaming job that has no input file?

In my use case, I'm able to generate the necessary records for the reducer with a single mapper and the job's execution parameters. Currently, I'm using a stub input file containing a single line; I'd like to remove this requirement.

We have two use cases in mind:

  1. I want to distribute the loading of files into HDFS from a network location that is available to all nodes. Essentially, the mapper runs ls against that location and sends the output to a small set of reducers (a mapper sketch follows this list).
  2. We will be running fits over several parameter ranges against several models. The model names do not change and go to the reducer as keys, while the list of tests to run is generated in the mapper.
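For use case 1, a minimal mapper sketch, assuming the network location is mounted at a hypothetical path like /mnt/shared on every node, the reducer count is 4, and the single-line stub input exists only to trigger the map task:

    #!/usr/bin/env python
    # Hypothetical streaming mapper for use case 1 (a sketch, not the
    # poster's actual script): the stub input line only triggers the
    # task; the real work is listing a shared network mount.
    import os
    import sys

    SHARED_DIR = "/mnt/shared"   # assumed mount visible to all nodes
    NUM_REDUCERS = 4             # assumed reducer count

    # Drain the stub input so the task consumes its (ignored) record.
    for _ in sys.stdin:
        pass

    # Emit one tab-separated (key, path) record per file; the key
    # spreads the copy work across the small set of reducers.
    for i, name in enumerate(sorted(os.listdir(SHARED_DIR))):
        print("%d\t%s" % (i % NUM_REDUCERS, os.path.join(SHARED_DIR, name)))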
You could potentially create your own InputFormat and experiment with that. But it is an interesting requirement. Can you give more details on why you need MapReduce, or what logic you are implementing? - Venkat
In use case 1, are you reducing based on some key? Is this essentially a copy operation that you are trying to distribute? - Venkat
@Venkat It is essentially a distributed copy operation - Don Albrecht

1 Answer

0 votes

According to the docs, this is not possible. The following parameters are required for execution:

  • input directoryname or filename
  • output directoryname
  • mapper executable or JavaClassName
  • reducer executable or JavaClassName

It looks like providing a dummy input file is the way to go for now; a sketch of that workaround follows.
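A minimal sketch of the workaround, assuming hypothetical HDFS paths, script names, stub contents, and reducer count; the streaming jar location also varies by installation:

    # Create a one-line stub file and put it where the job can read it.
    echo "go" > stub.txt
    hadoop fs -put stub.txt /tmp/stub.txt

    # Run the streaming job; mapper.py ignores its input and generates
    # the real records for the reducers.
    hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
        -input /tmp/stub.txt \
        -output /tmp/job-output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py \
        -numReduceTasks 4

The mapper never uses the stub's contents, so a single short line keeps the overhead of the dummy input negligible.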