
I am trying to use Hadoop streaming with a Java class as the mapper. To keep the problem simple, let us assume the Java code is the following:

import java.io.*;

class Test {

    public static void main(String[] args) {
        try {
            // Echo stdin to stdout line by line (an identity mapper).
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            String input;
            while ((input = br.readLine()) != null) {
                System.out.println(input);
            }
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
}

I can compile it with "javac Test.java" and run it from the command line as follows:

[abhattac@eat1-hcl4014 java]$ cat a.dat
abc
[abhattac@eat1-hcl4014 java]$ cat a.dat | java Test
abc
[abhattac@eat1-hcl4014 java]$

Let us assume that I have a file in HDFS: a.dat

[abhattac@eat1-hcl4014 java]$ hadoop fs -cat /user/abhattac/a.dat
abc

[abhattac@eat1-hcl4014 java]$ jar cvf Test.jar Test.class
added manifest
adding: Test.class(in = 769) (out= 485)(deflated 36%)
[abhattac@eat1-hcl4014 java]$

Now I try to use Test as the mapper in Hadoop streaming. My questions: [1] What do I provide for the -mapper command-line option? [2] What do I provide for the -file command-line option? Do I need to make a jar file out of Test.class, and if so, do I need to include a MANIFEST.MF file to indicate the main class?

I tried all of these options, but none of them seems to work. For example:

hadoop jar /export/apps/hadoop/latest/contrib/streaming/hadoop-streaming-1.2.1.45.jar -file Test.jar -mapper 'java Test' -input /user/abhattac/a.dat -output /user/abhattac/output

The command above doesn't work. The error message in task log is:

stderr logs

Exception in thread "main" java.lang.NoClassDefFoundError: Test
Caused by: java.lang.ClassNotFoundException: Test
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
Why are you using Hadoop streaming with Java? Hadoop streaming is generally used when writing map/reduce functions in languages other than Java. - SelimN

1 Answer


Since Hadoop streaming just shovels work through stdin to a command-line executable, you can run "java Test" on your Test.class the same way you would locally. There's no need to package it into a jar. (Your command shipped Test.jar with -file, but 'java Test' looks for Test.class on the classpath of the task's working directory, hence the ClassNotFoundException.)

I ran this successfully myself using your code:

hadoop jar hadoop-streaming.jar -file Test.class -mapper 'java Test' -input /input -output /output

SelimN is right that this is a pretty odd way to go about it, though, since you could just as well write a native Java mapper.

Streaming is usually used when you want to write the mapper in a scripting language such as Bash or Python instead of Java.
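If you do stay with plain Java over streaming, note that a streaming mapper normally emits tab-separated key/value pairs on stdout rather than just echoing input. A minimal sketch (the class name WordCountMapper and the word-count output format are my own illustration, not from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Streaming-style mapper sketch: emits "word<TAB>1" for each token,
// the key/value convention Hadoop streaming reads from stdout.
class WordCountMapper {

    // Turn one input line into zero or more "key\tvalue\n" output lines.
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.append(token).append('\t').append('1').append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String input;
        while ((input = br.readLine()) != null) {
            System.out.print(mapLine(input));
        }
    }
}
```

You would ship and invoke it the same way as above, e.g. -file WordCountMapper.class -mapper 'java WordCountMapper'.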