specifying own inputformat for streaming job

Question

I defined my own input format as follows, which prevents file splitting:

import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

I compiled this in Eclipse into NSTextInputFormat.class and copied the class file to the client from which the job is launched. I used the following command to launch the job, passing the class above as the input format:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.queue.name=unfunded -input 24222910/framefile -input 24225109/framefile -output Output -inputformat NSTextInputFormat -mapper ExtractHSV -file ExtractHSV -file NSTextInputFormat.class -numReduceTasks 0

This fails saying: -inputformat : class not found : NSTextInputFormat Streaming Job Failed!

I set the PATH and CLASSPATH variables to the directory containing NSTextInputFormat.class, but that still does not work. Any pointers would be helpful.

wickerwaka · Accepted Answer · 2013-12-27T18:27:58

There are a few gotchas here that can get you if you are not familiar with Java.

-inputformat (and the other command-line options that expect class names) expects a fully qualified class name; otherwise it expects to find the class in some org.apache.hadoop... namespace. So you must include a package name in your .java file:

package org.example.hadoop;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

And then specify the full name on the command line:

-inputformat org.example.hadoop.NSTextInputFormat
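
To make the lookup rule concrete, here is a minimal sketch of the kind of resolution described above: try the name exactly as given, and only for a bare name (no dots) retry it under a default package. This is not Hadoop's actual code. The resolve helper is hypothetical, and java.util stands in for the org.apache.hadoop... default package purely so the example runs without Hadoop on the classpath.

```java
public class ClassNameLookup {
    // Hypothetical helper (not a Hadoop API): resolve a class name the way the
    // answer describes, with a fallback package for bare names only.
    public static Class<?> resolve(String name, String defaultPackage) {
        try {
            // First attempt: the name exactly as typed on the command line.
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            // A bare name gets one more chance under the default package.
            if (name.indexOf('.') == -1 && defaultPackage != null) {
                try {
                    return Class.forName(defaultPackage + "." + name);
                } catch (ClassNotFoundException ignored) {
                    // Fall through to the "not found" result.
                }
            }
            return null; // corresponds to the "class not found" failure
        }
    }

    public static void main(String[] args) {
        // Fully qualified name: found on the first attempt.
        System.out.println(resolve("java.util.ArrayList", "java.util"));
        // Bare name that happens to live in the default package: found on retry.
        System.out.println(resolve("ArrayList", "java.util"));
        // Bare name outside the default package: not found.
        System.out.println(resolve("NSTextInputFormat", "java.util"));
    }
}
```

The third call mirrors the situation in the question: a bare NSTextInputFormat is not in the default package, so the lookup fails no matter where the .class file sits on disk.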

When you build the jar file the .class file must also be in a directory structure that mirrors the package name. I'm sure this is Java Packaging 101, but if you are using Hadoop Streaming then you probably aren't too familiar with Java in the first place. Passing the -d option to javac will tell it to compile the input files into .class files in directories that match the package name.

javac -classpath `hadoop classpath` -d ./output NSTextInputFormat.java

The compiled .class file will be written to ./output/org/example/hadoop/NSTextInputFormat.class. You will need to create the output directory but the other sub-directories will be created for you. The jar file can then be created like so:

jar cvf myjar.jar -C ./output/ .

And you should see some output similar to this:

added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/example/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/NSTextInputFormat.class(in = 372) (out= 252)(deflated 32%)
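
If you want to double-check the layout without reading the jar listing by eye, the following sketch converts a fully qualified class name into the entry path it should have inside the jar and looks that entry up. The JarEntryCheck class and its method names are made up for this example; only java.util.jar.JarFile from the standard library is used.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarEntryCheck {
    // Map a fully qualified class name to the entry path expected inside a jar,
    // e.g. org.example.hadoop.NSTextInputFormat -> org/example/hadoop/NSTextInputFormat.class
    public static String entryPathFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // True if the jar holds the class at the path its package name implies.
    public static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryPathFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults match the names used in this answer; override via arguments.
        String jarPath = args.length > 0 ? args[0] : "myjar.jar";
        String className = args.length > 1 ? args[1] : "org.example.hadoop.NSTextInputFormat";
        System.out.println(entryPathFor(className) + " present: " + containsClass(jarPath, className));
    }
}
```

Run it from the directory that contains myjar.jar; it throws an IOException if the jar cannot be opened. The jar then still has to be made available to the streaming job. The generic -libjars option is one common way to do that, but that step is not shown in the answer above, so treat it as an assumption to verify against your Hadoop version.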
