0
votes

Maybe this is an easy question but, I am having a difficult time resolving the issue. At this time, I have an pseudo-distributed HDFS that contains recordings that are encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.

This is the table create statement that I am using

CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
  PARTITIONED BY (dt string)
    ROW FORMAT SERDE 
        "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
    WITH serdeproperties (
      "serialization.class"="path.to.my.java.class.ProtoClass")
  STORED AS SEQUENCEFILE;

The table is created and I do not receive any runtime errors when I query the table.

When I attempt to load data as follows:

ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'

I receive an "OK" statement. However, when I query the table:

select * from test_messages limit 1;

I receive the following error:

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.

I have been reading up on Hive table and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition the date is both for performance but, more so, because the "LOAD DATA ... " statements move the files between directories in HDFS.

P.S. I have proven that I am able to run queries against hive table without partitioning.

Any thoughts ?

2

2 Answers

0
votes

I see that you have created EXTERNAL TABLE. So you cannot add or drop partition using hive. you need to create a folder using hdfs or MR or SPARK. EXTERNAL table can only be read by hive but not managed by HDFS. You can check the hdfs location '/test/dt=20171117' and you will see that folder has not been created.

My suggestion is create the folder(partition) using "hadoop fs -mkdir '/test/20171117'" then try to query the table. although it will give 0 row. but you can add the data to that folder and read from Hive.

0
votes

You need to specify a LOCATION for an EXTERNAL TABLE

CREATE EXTERNAL TABLE 
... 
LOCATION '/test';

Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.

Then, your table locations need to look like /test/dt=value in order for Hive to read them.

After you create an external table over HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore