1
votes

I am trying to use ORC as the input format for Hadoop Streaming.

Here is how I run it:

export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file /home/mr/mapper.py -mapper /home/mr/mapper.py \
    -file /home/mr/reducer.py -reducer /home/mr/reducer.py \
    -input /user/cloudera/input/users/orc \
    -output /user/cloudera/output/simple \
    -inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

But I am getting this error:

Error: java.io.IOException: Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
    ... 7 more

It looks like the OrcSplit class should be in hive-exec.jar.
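
One way to check that assumption on a given node is to list the jar's entries and look for the class file. The following is only a sketch: the helper names `class_to_entry` and `jar_has_class` are made up for illustration, and it assumes `unzip` is installed on the node.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Hadoop or Hive).

# Convert a fully qualified class name into the entry path it has inside a jar.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar ($1) lists an entry for the class ($2).
jar_has_class() {
    unzip -l "$1" 2>/dev/null | grep -qF "$(class_to_entry "$2")"
}

# Jar path and class name are taken from the question above.
JAR=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
CLASS=org.apache.hadoop.hive.ql.io.orc.OrcSplit

if [ ! -f "$JAR" ]; then
    echo "jar not present on this node: $JAR"
elif jar_has_class "$JAR" "$CLASS"; then
    echo "found $CLASS in $JAR"
else
    echo "$CLASS is not listed in $JAR"
fi
```

If the class is found on the submitting machine, the jar itself is fine, and the remaining question is whether the task JVMs on the other nodes can see it.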

2 Answers

1
votes

An easier solution is to have hadoop-streaming distribute the jars for you with the -libjars argument, which takes a comma-separated list of jars. Because -libjars is a generic Hadoop option, it has to come before the streaming-specific options. To take your example, you could do:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
    -file /home/mr/mapper.py -mapper /home/mr/mapper.py \
    -file /home/mr/reducer.py -reducer /home/mr/reducer.py \
    -input /user/cloudera/input/users/orc \
    -output /user/cloudera/output/simple \
    -inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
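
If more than one jar has to be shipped, the comma-separated value can be assembled with a small helper. This is a sketch only: `join_jars` is a made-up name, and the second jar path is a placeholder rather than a real file.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with commas, as -libjars expects.
# The function body runs in a subshell so the IFS change does not leak out.
join_jars() (
    IFS=,
    printf '%s\n' "$*"
)

# First path is from the example above; the second is a placeholder.
LIBJARS=$(join_jars \
    /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
    /path/to/another-dependency.jar)

# Print the option as it would appear on the hadoop command line.
printf '%s %s\n' "-libjars" "$LIBJARS"
```
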
0
votes

I found the answer. My problem was that I had set the HADOOP_CLASSPATH variable on only one node. I should either set it on every node or use the distributed cache.