2
votes

I have a problem similar to this one.

The following is what I used:

  1. CDH4.4 (hive 0.10)
  2. protobuf-java-2.4.1.jar
  3. elephant-bird-hive-4.6-SNAPSHOT.jar
  4. elephant-bird-core-4.6-SNAPSHOT.jar
  5. elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
  6. The jar file that includes the protoc-compiled .class files.

I followed the Protocol Buffers Java tutorial to create my data file, "testbook".

Then I used hdfs dfs -mkdir /protobuf_data to create an HDFS folder, and hdfs dfs -put testbook /protobuf_data to put "testbook" into HDFS.

Then I followed the elephant-bird web page to create the table. The syntax is like this:

create table addressbook
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    "serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
  stored as
    inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
  LOCATION '/protobuf_data/';

All of this worked.

But when I ran the query select * from addressbook; no results were returned.

And I couldn't find any error logs to help me debug.

Could someone help me?

Many thanks

1 Answer

4
votes

The problem has been solved.

At first I put the protobuf binary data directly into HDFS and no results showed, because it doesn't work that way.

When I asked some senior colleagues, they said protobuf binary data should be written into some kind of container file format, such as a Hadoop SequenceFile.

The elephant-bird page mentions this too, but at first I didn't fully understand it.

After writing the protobuf binary data into a SequenceFile, I could read the protobuf data with Hive.

And because I used the SequenceFile format, I used these lines in the create table syntax:

inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
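
Put together with the DDL from the question, the full statement would look roughly like the sketch below. The table name and serde settings are copied from the question. The LOCATION is an assumption: it reuses the question's /protobuf_data/ folder and presumes the SequenceFile was uploaded there. It also assumes each SequenceFile record's value holds one serialized AddressBook message.

```sql
-- Sketch only: the question's DDL with the SequenceFile formats swapped in.
-- Assumed location: the folder holding the SequenceFile, not the raw "testbook" file.
create table addressbook
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    "serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
  stored as
    inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
    outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
  LOCATION '/protobuf_data/';
```

If the raw "testbook" file is still sitting in that folder, it is probably safest to remove it first, since the SequenceFile input format is unlikely to read a file that is not a SequenceFile.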

Hope this helps others who are new to Hadoop, Hive, and elephant-bird too.