
I'm trying to create an external table using Athena. The data being read is in Parquet format, and my external table script is:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    a string,  
    b string,  
    y string  
) PARTITIONED BY (
    year bigint,
    month bigint,
    day bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1'
) LOCATION 's3://my_path/to/parquet'
TBLPROPERTIES ('has_encrypted_data'='false');

However, my parquet column names are a, b, and x. How can I map the field x to the name y in my external table?

Unfortunately that is not possible. - Roberto Congiu
Oh, that's sad :( - Rodrigo Ney
In Hive, you could create a view on top of the table with the column name changed. Unfortunately Athena does not support views yet, but their engineering team told us that they are working on supporting views. Remapping in the parquet serde is not possible, unlike in some other serdes. - Roberto Congiu
Yeah, I found a way to remap using json serde, but couldn't find for parquet. Thank you mate. - Rodrigo Ney
Yeah, that one does have the remapping... by the way, I wrote it :) - Roberto Congiu
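For reference, the JSON remapping mentioned in the comments above uses the OpenX JSON SerDe, which supports renaming a JSON key to a different column name via its 'mapping.*' serde properties. A sketch, assuming the same illustrative table and an S3 path holding JSON data (the table name and location are hypothetical):

```sql
-- Hypothetical JSON table: the JSON documents contain keys a, b, x,
-- and 'mapping.y' = 'x' exposes the JSON key x as the column y.
CREATE EXTERNAL TABLE IF NOT EXISTS my_json_table (
    a string,
    b string,
    y string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'mapping.y' = 'x'
)
LOCATION 's3://my_path/to/json';
```

The property key is 'mapping.' followed by the table column name, and its value is the JSON attribute to read from; this only works with that serde, not with ParquetHiveSerDe.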

1 Answer


Actually, this is possible, with some drawbacks.

In Athena, Parquet columns are resolved by name by default. This gives you the flexibility of reordering the columns in your table or adding new columns in the middle of the table.

If you can live without that flexibility, you can switch to index-based (positional) column access by specifying

WITH SERDEPROPERTIES ('parquet.column.index.access'='true')

In your case, this would look like

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    a string,  
    b string,  
    y string  
) PARTITIONED BY (
    year bigint,
    month bigint,
    day bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
    'parquet.column.index.access'='true',
    'serialization.format' = '1'
) LOCATION 's3://my_path/to/parquet'
TBLPROPERTIES ('has_encrypted_data'='false');

Note that this requires that your parquet columns are in the same order as you wrote them in the DDL statement.
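To illustrate the ordering requirement, here is a sketch of how positional matching works, assuming a hypothetical file whose columns are stored in the order a, b, x:

```sql
-- Parquet file column order: a, b, x
-- With 'parquet.column.index.access'='true', DDL columns are
-- matched to file columns by position, not by name:
--   DDL column 1 (a) -> file column 1 (a)
--   DDL column 2 (b) -> file column 2 (b)
--   DDL column 3 (y) -> file column 3 (x)   -- x is exposed as y
-- If the DDL instead declared the columns as (y, a, b), the column
-- y would silently read the file's first column (a) instead of x.
```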

You can read more about this behavior in the AWS Athena documentation.