3
votes

I have a parquet file that I made by converting an avro data file. The file contains complex records. I also have the avro schema of those records, as well as the equivalent parquet schema (I got it when I converted the file). I want to create a hive table backed by the parquet file.

Because my record schema has a lot of fields, declaring the corresponding hive columns manually is very tedious and error-prone. That's why I want hive to define the columns of the table backed by my parquet file from the parquet schema of the records, in much the same way AvroSerDe uses an avro schema to define table columns. Is this supported by ParquetSerDe? How can I do that?
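
For reference, this is roughly what that looks like on the avro side (a minimal sketch; the table name and paths are made up for illustration):

-- hypothetical names/paths; assumes a hive version that understands STORED AS AVRO
CREATE EXTERNAL TABLE events_avro
STORED AS AVRO
LOCATION '/data/events_avro'
TBLPROPERTIES ('avro.schema.url'='/schemas/events.avsc');

No explicit column list anywhere; AvroSerDe derives it from the schema file. I am looking for the same thing driven by a parquet schema.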

P.S. I am aware of the possible workaround where I could first define an avro-backed table using the avro schema and then use a CTAS statement to create a parquet table from it. But that doesn't work if the schema has unions, because AvroSerDe maps them to Hive unions, which hive has practically no support for (!!), and ParquetSerDe does not know how to handle them.
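
To spell the workaround out (same made-up names as above), the second step would be a plain CTAS:

-- copy the avro-backed table into a parquet-backed one, letting hive
-- derive the parquet table's columns from the avro table's columns
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT * FROM events_avro;

As soon as the avro schema contains unions, though, the avro table gets hive uniontype columns and this CTAS fails.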


3 Answers

2
votes

I did a bit of research and found the answer, so here it is for anyone else who gets stuck on this:

ParquetSerDe currently has no support for any kind of table definition except pure DDL, where you must explicitly specify each column. There is a JIRA ticket tracking support for defining a table from an existing parquet file: HIVE-8950.
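
In other words, the only thing that works today is the fully explicit form (a minimal sketch with made-up names and types; the column list is exactly the tedious part):

-- every column must be spelled out by hand to match the parquet schema
CREATE EXTERNAL TABLE events_parquet (
  col1 STRING,
  col2 BIGINT,
  col3 ARRAY<STRING>
)
STORED AS PARQUET
LOCATION '/data/events_parquet';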

1
vote

We use Hive as part of the CDH package, which also includes Impala.

Unlike Hive, Impala already has support for schema inference from Parquet files: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_create_table.html

Notice the option for column definitions inferred from a data file:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE PARQUET 'hdfs_path_of_parquet_file'

This currently works only for Parquet files, not Avro.
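
For example (paths and names here are hypothetical), pointing the statement at one concrete data file is enough for Impala to derive all the column definitions:

-- LIKE PARQUET takes the path of an actual data file, while LOCATION
-- points at the directory holding the files; both paths are made up
CREATE EXTERNAL TABLE events
LIKE PARQUET '/user/hive/warehouse/events/part-00000.parquet'
STORED AS PARQUET
LOCATION '/user/hive/warehouse/events';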

Because of this we actually have to use Impala in some of our workflows (e.g. after a sqoop import into parquet files, or after distcp'ing from an external hadoop cluster) - quite useful!

0
votes

Unfortunately, there is no parquet.schema.literal property analogous to avro.schema.literal that could be used to define a table from its schema.

You will have to declare the individual columns in the table definition or use a CTAS statement.
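
For contrast, this is what exists on the avro side and has no parquet counterpart (a minimal sketch; the names, location, and shortened schema are invented for illustration):

-- avro lets you embed the schema directly in the table properties;
-- there is no parquet.schema.literal analogue of this
CREATE EXTERNAL TABLE somename_avro
STORED AS AVRO
LOCATION '/data/somename'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "somename", "type": "record", "name": "somename",
  "fields": [{"name": "col1", "type": "string"}]
}');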

As for union schemas not working in hive: I have been using union schema definitions in my avsc files for the data type fields, and it works pretty well.

This is the structure of my avsc:

{"namespace": "somename",
 "type": "record",
 "name": "somename",
 "fields": [
     {"name": "col1", "type": "string"},
     {"name": "col2", "type": "string"},
     {"name": "col3", "type": ["string","null"]},
     {"name": "col4", "type": ["string", "null"]},
     {"name": "col5", "type": ["string", "null"]},
     {"name": "col6", "type": ["string", "null"]},
     {"name": "col7", "type": ["string", "null"]},
     {"name": "col8", "type": ["string", "null"]}  
 ]
}
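
The reason this works is that AvroSerDe maps a two-branch union of a type with "null" to an ordinary nullable column, not to a hive uniontype; only unions of two or more non-null types become uniontype. So a table built from the schema above (location made up for illustration) ends up with plain string columns throughout:

-- made-up location; DESCRIBE on this table shows col1..col8 all as
-- string, with col3..col8 simply allowing NULLs
CREATE EXTERNAL TABLE somename
STORED AS AVRO
LOCATION '/data/somename'
TBLPROPERTIES ('avro.schema.url'='/schemas/somename.avsc');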