0
votes

Here is my code, which connects to a Hadoop machine, performs a set of validations, and writes the result to another directory.

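    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Main {

        // BigFilter, MergeGroupAggregator, getGroupByFields and fieldCols are not shown here.
        public static void main(String... strings) {

            System.setProperty("HADOOP_USER_NAME", "root");
            String in1 = "hdfs://myserver/user/root/adnan/inputfile.txt";
            String out = "hdfs://myserver/user/root/cascading/temp2";

            // Only the application jar class is set; no other Hadoop settings are passed.
            Properties properties = new Properties();
            AppProps.setApplicationJarClass(properties, Main.class);
            HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

            // Comma-delimited text with a header row, read from and written to HDFS.
            Tap inTap = new Hfs(new TextDelimited(true, ","), in1);
            Tap outTap = new Hfs(new TextDelimited(true, ","), out);

            Pipe inPipe = new Pipe("in1");

            // Filter out bad records, group them, then merge each group.
            Each removeErrors = new Each(inPipe, Fields.ALL, new BigFilter());
            GroupBy group = new GroupBy(removeErrors, getGroupByFields(fieldCols));
            Every mergeGroup = new Every(group, Fields.ALL, new MergeGroupAggregator(fieldCols), Fields.RESULTS);

            FlowDef flowDef = FlowDef.flowDef()
                    .addSource(inPipe, inTap)
                    .addTailSink(mergeGroup, outTap);

            flowConnector.connect(flowDef).complete();
        }
    }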

The job is submitted to the Hadoop machine, and I can see it in the JobTracker, but it fails with the exception below.

    cascading.tap.hadoop.io.MultiInputSplit not found
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:348)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:389)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.lang.ClassNotFoundException: Class cascading.tap.hadoop.io.MultiInputSplit not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:346)
        ... 7 more

java.lang.ClassNotFoundException: Class cascading.tap.hadoop.io.MultiInputSplit not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)

Note that:

1. I am running this from my Windows machine, and Hadoop is set up on a different box.
2. I am using the Cloudera distribution of Hadoop, CDH 4.

@pacoid could you please have a look? - Mohammad Adnan
Got the issue. CDH 4.2 has an issue with Cascading 2.1, so I changed to CDH 4.1 and it worked for me. - Mohammad Adnan

2 Answers

0
votes

Your Properties object contains nothing beyond the application jar class, so the job's configuration may be wrong on the cluster. You must pass the configuration you are using to HadoopFlowConnector. The settings from your Hadoop configuration files, the ones that are picked up when you call new Configuration(), belong in your Properties object: things like fs.default.name=file://// etc. I imagine this matters even more when you are running a Cascading job across the "wire".
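A minimal sketch of what that could look like, assuming an MR1-style cluster with a JobTracker. The class and method names (`ClusterProps`, `clusterProperties`) are hypothetical, and the ports are placeholders that you would replace with the values from your cluster's core-site.xml and mapred-site.xml. The two keys, `fs.default.name` and `mapred.job.tracker`, are the classic Hadoop property names for the default filesystem and the JobTracker address.

```java
import java.util.Properties;

public class ClusterProps {

    // Builds the Properties you would hand to new HadoopFlowConnector(properties).
    // Both arguments are assumptions about your cluster, not values from the question.
    static Properties clusterProperties(String nameNodeUri, String jobTracker) {
        Properties properties = new Properties();
        // Default filesystem, so paths resolve against HDFS rather than the local disk.
        properties.setProperty("fs.default.name", nameNodeUri);
        // Address of the JobTracker that should receive the job.
        properties.setProperty("mapred.job.tracker", jobTracker);
        return properties;
    }

    public static void main(String[] args) {
        // Placeholder host and ports; substitute your own.
        Properties properties = clusterProperties("hdfs://myserver:8020", "myserver:8021");
        properties.list(System.out);
    }
}
```

You would still call AppProps.setApplicationJarClass(properties, Main.class) on the same object before creating the connector, as in the question.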

0
votes

Got the issue. CDH 4.2 has an issue with Cascading 2.1, so I changed to CDH 4.1 and it worked for me.