I am trying to use ORC as the input format for Hadoop Streaming.
Here is how I run it:
export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
But I am getting this error:
Error: java.io.IOException: Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
    ... 7 more
It looks like the OrcSplit class should be in hive-exec.jar, which is already on HADOOP_CLASSPATH.
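To double-check that assumption, a minimal sketch like the one below could be run on the node where the job is submitted. The helper name jar_contains_class is hypothetical (it is not part of Hadoop or Hive); it only relies on the fact that a jar is a zip archive. It tells you whether the class file is inside the jar, not whether that jar is visible to the map tasks at runtime.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar holds a .class entry for the fully qualified class name.

    Hypothetical helper for illustration: a jar is a zip archive, so the class
    org.foo.Bar is stored under the entry name org/foo/Bar.class.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example call, using the jar path and class name from the command and error above:
# jar_contains_class(
#     "/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar",
#     "org.apache.hadoop.hive.ql.io.orc.OrcSplit",
# )
```

If this returns True, the class is present in the jar and the problem is more likely about where the jar is visible than about its contents.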