I wrote an ETL job to convert a bunch of JSON files into time partitioned parquet files (objects) stored on S3.
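For illustration, such a conversion might look like the following (a minimal PySpark sketch, not the actual job; the bucket paths and the event_ts column are hypothetical placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read the raw JSON files (hypothetical source path)
df = spark.read.json("s3://source-bucket/raw/")

# Derive the partition columns from an event timestamp, then write
# time-partitioned Parquet to S3 (hypothetical target path)
(df
 .withColumn("year", F.date_format("event_ts", "yyyy"))
 .withColumn("month", F.date_format("event_ts", "MM"))
 .withColumn("day", F.date_format("event_ts", "dd"))
 .write
 .partitionBy("year", "month", "day")
 .mode("append")
 .parquet("s3://target-bucket/converted/"))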

Instead of manually creating the tables in AWS Athena and using the Athena data catalog, I decided to use the AWS Glue Data Catalog, whose crawler crawled the converted Parquet files and generated a schema that seems correct. It is:

CREATE EXTERNAL TABLE `table_fd2f388f79ee6`(
  `field1` string, 
  `field2` string, 
  `data` struct<attrib1:string,gpId:string,attrib2:boolean,attrib3:array<string>,attrib4:struct<f1:int,f2:int>>)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://path'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='crawlername', 
  'averageRecordSize'='17', 
  'classification'='parquet', 
  'compressionType'='none', 
  'objectCount'='2', 
  'recordCount'='726', 
  'sizeKey'='287', 
  'typeOfData'='file')

However, even for a simple select * query I get the error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/year=2018/month=07/day=03/part-00258-e1bcec61-f24e-40a2-8fac-fdd017054c2a.c000.snappy.parquet (offset=0, length=5356): Column data.attrib type LIST not supported

Is this a bug, a constraint, or something that I need to correct?

1 Answer

Your Athena table fields need to be declared in exactly the same order as the Parquet schema, otherwise the query will fail!

If your Parquet schema is:

id: integer (nullable = false)
rating: struct (nullable = true)
  related_to: struct (nullable = true)
       category: integer (nullable = false)
       name: float (nullable = true)
       type: string (nullable = false)
  rating_results: array (nullable = true)
       element: struct (containsNull = true)
            toto: integer (nullable = false)
            tata: float (nullable = true)
            titi: string (nullable = true)
other: string (nullable = true)

Your Athena table needs to be:

`id` INT,
`rating` STRUCT<
    `related_to`: STRUCT<
        `category`: INT,
        `name`: FLOAT,
        `type`: STRING
    >,
    `rating_results`: ARRAY<
        STRUCT<
            `toto`: INT,
            `tata`: FLOAT,
            `titi`: STRING
        >
    >
>,
`other` STRING
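An easy way to get the exact field names and order is to read one of the Parquet files back with Spark and print its schema, then mirror it in the Athena DDL. A minimal sketch (the path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints the field names, types, and (crucially) the order to mirror in Athena
spark.read.parquet("s3://target-bucket/converted/").printSchema()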

Apparently, AWS Athena does not set the SerDe option

'hive.parquet.use-column-names' = 'true'

by default, and does not apply it when you set it in WITH SERDEPROPERTIES.
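For reference, this is where you would try to set it when creating the table through the Athena API (a hypothetical boto3 sketch; the table name, paths, and region are placeholders, and per the above, Athena appears to ignore the property):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# DDL with the SerDe property set explicitly
ddl = """
CREATE EXTERNAL TABLE sketch_table (`field1` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('hive.parquet.use-column-names' = 'true')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://target-bucket/converted/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)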


Also be careful: if your Parquet file is exported with Spark, then take a look at this option:

"spark.sql.parquet.writeLegacyFormat", true

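In PySpark you would set it before writing, e.g. (a sketch reusing the placeholder names from the question):

spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

# Writes the Parquet files in the legacy, Hive-compatible layout
df.write.partitionBy("year", "month", "day").mode("append").parquet("s3://target-bucket/converted/")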
More details here: Query results difference between EMR-Presto and Athena

And one last piece of advice: be careful with the decimal type (it's fixed in Presto but not in Athena): https://github.com/prestodb/presto/issues/7232