
I'm trying to create an external table using Athena. The data being read is in Parquet format, and my external table script is:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    a string,  
    b string,  
    y string  
) PARTITIONED BY (
    year bigint,
    month bigint,
    day bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1'
) LOCATION 's3://my_path/to/parquet'
TBLPROPERTIES ('has_encrypted_data'='false');

However, my parquet column names are a, b, and x. How can I map the field x to the name y in my external table?

Unfortunately that is not possible. - Roberto Congiu
Oh, that's sad :( - Rodrigo Ney
In Hive, you could create a view on top of the table with the column name changed. Unfortunately Athena does not support views yet, but their engineering team told us that they are working on supporting views. Remapping in the parquet serde is not possible, unlike in some other serdes. - Roberto Congiu
Yeah, I found a way to remap using json serde, but couldn't find for parquet. Thank you mate. - Rodrigo Ney
Yeah, that one does have the remapping... by the way, I wrote it :) - Roberto Congiu
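For reference, the JSON remapping mentioned in the comments above uses the OpenX JSON SerDe, which supports renaming a JSON key to a different column name via its 'mapping.*' serde properties. A sketch, assuming the same illustrative table and an S3 path holding JSON data (the table name and location are hypothetical):

```sql
-- Hypothetical JSON table: the JSON documents contain keys a, b, x,
-- and 'mapping.y' = 'x' exposes the JSON key x as the column y.
CREATE EXTERNAL TABLE IF NOT EXISTS my_json_table (
    a string,
    b string,
    y string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'mapping.y' = 'x'
)
LOCATION 's3://my_path/to/json';
```

The property key is 'mapping.' followed by the table column name, and its value is the JSON attribute to read from; this only works with that serde, not with ParquetHiveSerDe.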

1 Answer


Actually, this is possible, with some drawbacks.

In Athena, Parquet columns are resolved by name by default. This gives you the flexibility of reordering the columns in your table or adding new columns in the middle of the table.

If you can live without that flexibility, you can switch to index-based (positional) column access by specifying

WITH SERDEPROPERTIES ('parquet.column.index.access'='true')

In your case, this would look like

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    a string,  
    b string,  
    y string  
) PARTITIONED BY (
    year bigint,
    month bigint,
    day bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
    'parquet.column.index.access'='true',
    'serialization.format' = '1'
) LOCATION 's3://my_path/to/parquet'
TBLPROPERTIES ('has_encrypted_data'='false');

Note that this requires that your parquet columns are in the same order as you wrote them in the DDL statement.
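To illustrate the ordering requirement, here is a sketch of how positional matching works, assuming a hypothetical file whose columns are stored in the order a, b, x:

```sql
-- Parquet file column order: a, b, x
-- With 'parquet.column.index.access'='true', DDL columns are
-- matched to file columns by position, not by name:
--   DDL column 1 (a) -> file column 1 (a)
--   DDL column 2 (b) -> file column 2 (b)
--   DDL column 3 (y) -> file column 3 (x)   -- x is exposed as y
-- If the DDL instead declared the columns as (y, a, b), the column
-- y would silently read the file's first column (a) instead of x.
```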

You can read more about this behavior in the AWS Athena documentation.