3
votes

I'm trying to dynamically (without listing column names and types in Hive DDL) create a Hive external table on parquet data files. I have the Avro schema of underlying parquet file.

My try is to use below DDL:

CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');

My Hive table is successfully created with the right schema, but when I try to read the data :

SELECT * FROM parquet_test;

I get the following error :

java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable

Is there a way to successfully create and read Parquet files, without mentioning columns name and types list in DDL?

1
If the underlying data is in parquet format, I believe it is not possible to read using avro schema. It is more like reading french book with english dictionary in hand. - Ram Manohar
I'm only using AvroSerDe for schema inference... - tmouron
Above error(Expecting a AvroGenericRecordWritable) indicates you are trying to read parquet format record with AvroSerde. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' tells hive to use AvroSerde to decode the data, but data is not in avro format. - Ram Manohar
Yep, I'm trying to find an alternative solution, maybe I should generate a Hive query from an Avro schema. But I'm surprised that there is not a simpler solution. - tmouron
I haven't tested but try this CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc'); CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath'; - Ram Manohar

1 Answers

12
votes

Below query works:

CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc'); 

CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';