3
votes

I have a parquet file that I made by converting an avro data file. The file contains complex records. I also have the avro schema of those records, as well as the equivalent parquet schema (I got it when I converted the file). I want to create a hive table backed by the parquet file.

Because my record schema has a lot of fields, declaring the corresponding hive columns manually is very tedious and error-prone. That's why I want hive to define the columns of the table backed by my parquet file from the parquet schema of the records, in much the same way AvroSerDe uses an avro schema to define table columns. Is this supported by ParquetSerDe? How can I do that?
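
For reference, this is roughly what that looks like on the avro side (a minimal sketch; the table name and paths are made up for illustration):

-- hypothetical names/paths; assumes a hive version that understands STORED AS AVRO
CREATE EXTERNAL TABLE events_avro
STORED AS AVRO
LOCATION '/data/events_avro'
TBLPROPERTIES ('avro.schema.url'='/schemas/events.avsc');

No explicit column list anywhere; AvroSerDe derives it from the schema file. I am looking for the same thing driven by a parquet schema.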

P.S. I am aware of the possible workaround where I could first define an avro-backed table using the avro schema and then use a CTAS statement to create a parquet table from it. But that doesn't work if the schema has unions, because AvroSerDe maps them to Hive unions, which hive has practically no support for (!!), and ParquetSerDe does not know how to handle them.
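
To spell the workaround out (same made-up names as above), the second step would be a plain CTAS:

-- copy the avro-backed table into a parquet-backed one, letting hive
-- derive the parquet table's columns from the avro table's columns
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT * FROM events_avro;

As soon as the avro schema contains unions, though, the avro table gets hive uniontype columns and this CTAS fails.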


3 Answers

2
votes

I did a bit of research and found the answer, so here it is for anyone else who gets stuck on this:

ParquetSerDe currently has no support for any kind of table definition except pure DDL, where you must explicitly specify each column. There is a JIRA ticket tracking support for defining a table from an existing parquet file: HIVE-8950.
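
In other words, the only thing that works today is the fully explicit form (a minimal sketch with made-up names and types; the column list is exactly the tedious part):

-- every column must be spelled out by hand to match the parquet schema
CREATE EXTERNAL TABLE events_parquet (
  col1 STRING,
  col2 BIGINT,
  col3 ARRAY<STRING>
)
STORED AS PARQUET
LOCATION '/data/events_parquet';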

1
vote

We use Hive as part of the CDH package, which also includes Impala.

Unlike Hive, Impala already has support for schema inference from Parquet files: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_create_table.html

Notice the option for column definitions inferred from a data file:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE PARQUET 'hdfs_path_of_parquet_file'

This currently works only for Parquet files, not Avro.
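
For example (paths and names here are hypothetical), pointing the statement at one concrete data file is enough for Impala to derive all the column definitions:

-- LIKE PARQUET takes the path of an actual data file, while LOCATION
-- points at the directory holding the files; both paths are made up
CREATE EXTERNAL TABLE events
LIKE PARQUET '/user/hive/warehouse/events/part-00000.parquet'
STORED AS PARQUET
LOCATION '/user/hive/warehouse/events';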

Because of this we actually have to use Impala in some of our workflows (e.g. after a sqoop import into parquet files, or after distcp'ing from an external hadoop cluster) - quite useful!

0
votes

Unfortunately, there is no parquet.schema.literal property analogous to avro.schema.literal that could be used to define a table from its schema.

You will have to declare the individual columns in the table definition or use a CTAS statement.
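
For contrast, this is what exists on the avro side and has no parquet counterpart (a minimal sketch; the names, location, and shortened schema are invented for illustration):

-- avro lets you embed the schema directly in the table properties;
-- there is no parquet.schema.literal analogue of this
CREATE EXTERNAL TABLE somename_avro
STORED AS AVRO
LOCATION '/data/somename'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "somename", "type": "record", "name": "somename",
  "fields": [{"name": "col1", "type": "string"}]
}');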

As for union schemas not working in hive: I have been using union schema definitions in my avsc files for the data type fields, and it works pretty well.

This is the structure of my avsc:

{"namespace": "somename",
 "type": "record",
 "name": "somename",
 "fields": [
     {"name": "col1", "type": "string"},
     {"name": "col2", "type": "string"},
     {"name": "col3", "type": ["string","null"]},
     {"name": "col4", "type": ["string", "null"]},
     {"name": "col5", "type": ["string", "null"]},
     {"name": "col6", "type": ["string", "null"]},
     {"name": "col7", "type": ["string", "null"]},
     {"name": "col8", "type": ["string", "null"]}  
 ]
}
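
The reason this works is that AvroSerDe maps a two-branch union of a type with "null" to an ordinary nullable column, not to a hive uniontype; only unions of two or more non-null types become uniontype. So a table built from the schema above (location made up for illustration) ends up with plain string columns throughout:

-- made-up location; DESCRIBE on this table shows col1..col8 all as
-- string, with col3..col8 simply allowing NULLs
CREATE EXTERNAL TABLE somename
STORED AS AVRO
LOCATION '/data/somename'
TBLPROPERTIES ('avro.schema.url'='/schemas/somename.avsc');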