Read Snappy compressed Hive RCFile in Apache Pig

Question

I am trying to read Hive files in Pig using HiveColumnarLoader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html

The files are binary and begin with the strings RCF, SnappyCodec and hive.io.rcfile.column.number. They are also partitioned across multiple directories (like /day=20140701).

However, a simple script that loads, groups and counts the rows prints nothing to the output. If I add ILLUSTRATE like this:

rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;

I get an error like this:

2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
    at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
    at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...

I'm not sure whether this is caused by the Snappy compression or by a problem with how I specified the schema (I copied it from Hive's DESCRIBE output for the table).

Could anyone please confirm that HiveColumnarLoader works with Snappy-compressed files, or propose another approach?

Thanks in advance!

user2370813 · Accepted Answer · 2014-11-03T22:48:16

Have you tried the HCatLoader?

rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();
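
To check whether the loader actually returns rows, the load can be followed by a filter on one partition and a count. This is only a sketch: 'tablename' is the placeholder from the line above, and the partition column name day and its string type are assumptions taken from the /day=20140701 directory layout in the question. It also assumes the table is registered in the Hive metastore and that Pig is started with HCatalog on the classpath (for example with pig -useHCatalog).

```pig
-- Sketch only: 'tablename' is a placeholder; the 'day' partition column
-- and its string type are assumptions based on the /day=20140701 directories.
rows = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

-- HCatLoader exposes partition columns as ordinary fields, so a filter
-- placed right after the LOAD can restrict the read to one partition.
one_day = FILTER rows BY day == '20140701';

-- Count the rows of that partition to confirm data is being read.
grouped = GROUP one_day ALL;
counts = FOREACH grouped GENERATE COUNT_STAR(one_day);
DUMP counts;
```

Because HCatLoader takes the schema, storage format and compression settings from the metastore, no schema has to be copied into the script by hand. If day is stored as a numeric type, the comparison value would need to be unquoted.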
