I'm trying to convert a bunch of multi-part Avro files stored on HDFS (hundreds of GBs) to Parquet files, preserving all data.
Hive can read the Avro files as an external table using:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<location>'
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
But when I try to create a parquet table:
CREATE EXTERNAL TABLE as_parquet LIKE as_avro STORED AS PARQUET LOCATION 'hdfs:///xyz.parquet';
it throws an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>
Is it possible to convert the uniontype to something that is a valid datatype for the external Parquet table?
I'm open to alternative, simpler methods as well. MR? Pig?
I'm looking for a way that's fast, simple, and has minimal dependencies to worry about.
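For context, the closest I've gotten is a sketch along these lines (untested; the table name, the columns id and payload, and the union branch types are placeholders for my real schema). It declares the Parquet table explicitly with one nullable column per union branch, then populates it using extract_union (HIVE-15434, available from Hive 2.2):

-- a sketch, not a tested recipe: as_parquet_flat, id, payload and the
-- branch types below are placeholders for the actual schema
CREATE EXTERNAL TABLE as_parquet_flat (
  id BIGINT,
  payload_int INT,       -- branch 0 of an assumed uniontype<int,string>
  payload_str STRING     -- branch 1
)
STORED AS PARQUET
LOCATION 'hdfs:///xyz.parquet';

-- extract_union (HIVE-15434, Hive 2.2+) turns a uniontype column into a
-- struct of nullable fields tag_0, tag_1, ... (one per union branch)
INSERT OVERWRITE TABLE as_parquet_flat
SELECT t.id, t.u.tag_0, t.u.tag_1
FROM (SELECT id, extract_union(payload) AS u FROM as_avro) t;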
Thanks
If your unions consist only of null and a single real type, the latter simply indicates an optional value, and you can represent the same type in Parquet using the optional keyword instead of a union. - Zoltan
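To illustrate that comment: if every union in the Avro schema is just ["null", <type>], Hive's AvroSerDe already exposes it as a plain nullable column, so a straight CTAS into Parquet may be all that's needed (a minimal sketch; the table name is made up):

-- works only when no column is a true multi-type union
CREATE TABLE as_parquet_ctas
STORED AS PARQUET
AS SELECT * FROM as_avro;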