
I'm trying to convert a bunch of multi-part Avro files stored on HDFS (100s of GBs) to Parquet files (preserving all data).

Hive can read the Avro files as an external table using:

CREATE EXTERNAL TABLE as_avro 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION '<location>' 
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
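
For reference, running DESCRIBE on this table shows which columns come out as uniontype; an Avro union of null plus a single type just maps to the plain Hive type, so it's only the real multi-type unions that matter. The column names below are placeholders:

DESCRIBE as_avro;

-- maybe_name    string                   (Avro ["null","string"] maps to a plain string)
-- payload       uniontype<int,string>    (a real multi-type union)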

But when I try to create a parquet table:

create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'

it throws an error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>

Is it possible to convert uniontype to something that is a valid datatype for the external Parquet table?

I'm open to alternative, simpler methods as well. MR? Pig?

Looking for a way that's fast, simple, and has minimal dependencies to worry about.

Thanks

Check whether the union is a real union or just a union of null and a single real type. The latter simply indicates an optional value and you can represent the same type in Parquet using the optional keyword instead of a union. - Zoltan
@Zoltan It's a real union but even if I were able to, say, select a few fields from within it using hive and then write to Parquet that would also be okay. Lateral view explode() also fails on the array field which contains the union type. - noobman

1 Answer


Try splitting this:

create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'

into 2 steps:

  1. CREATE EXTERNAL TABLE as_parquet (col1 col1_type, ... , coln coln_type) STORED AS PARQUET LOCATION 'hdfs:///xyz.parquet';

  2. INSERT INTO TABLE as_parquet SELECT * FROM as_avro;
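
If the CREATE in step 1 fails on the uniontype columns (the original error), you can declare those columns with a Parquet-friendly type, such as a struct, and convert them in the SELECT of step 2. A minimal sketch, assuming a single union column u of type uniontype<int,string> plus an id column (both names made up here), and assuming your Hive is new enough to have the extract_union UDF (Hive 2.2+, to my knowledge), which turns a union into a struct of its alternatives:

-- one struct field per union alternative
CREATE EXTERNAL TABLE as_parquet (id bigint, u struct<tag_0:int, tag_1:string>) STORED AS PARQUET LOCATION 'hdfs:///xyz.parquet';

INSERT INTO TABLE as_parquet SELECT id, extract_union(u) FROM as_avro;

Verify the struct field names with a quick SELECT extract_union(u) FROM as_avro LIMIT 1 first; I recall them being tag_0, tag_1, and so on, one per union alternative.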

Or, if you have partitions, which I guess you have for this amount of data:

INSERT INTO TABLE as_parquet PARTITION (year='2016', month='07', day='13') SELECT <all_columns_except_partition_cols> FROM as_avro WHERE year='2016' AND month='07' AND day='13';
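
And instead of running one INSERT per day, you can let Hive create the partitions dynamically. These two settings are standard Hive configuration; note that the partition columns must come last in the SELECT:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE as_parquet PARTITION (year, month, day) SELECT <all_columns_except_partition_cols>, year, month, day FROM as_avro;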

Note:
For step 1, to avoid typos or small mistakes in the column types, you can:

  • Run SHOW CREATE TABLE as_avro and copy the create statement of the as_avro table
  • Replace the table name, file format, and location
  • Run the new create statement.
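
Concretely, in the copied statement you'd swap these three lines:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 

for:

STORED AS PARQUET

then change the table name and LOCATION, and drop the avro.schema.url entry from TBLPROPERTIES.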

This works for me...