I wrote an ETL job to convert a bunch of JSON files into time partitioned parquet files (objects) stored on S3.
Instead of manually creating the tables on AWS Athena and use the Athena data catalog, I decided to use AWS Glue data store, which crawled the converted parquet files and generated a schema that seems correct. It is:
CREATE EXTERNAL TABLE `table_fd2f388f79ee6`(
`field1` string,
`field2` string,
`data` struct<attrib1:string,gpId:string,attrib2:boolean,attrib3:array<string>,attrib4:struct<f1:int,f2:int>>)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://path'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='crawlername',
'averageRecordSize'='17',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='726',
'sizeKey'='287',
'typeOfData'='file')
however, even for simple select *
query I get the error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/year=2018/month=07/day=03/part-00258-e1bcec61-f24e-40a2-8fac-fdd017054c2a.c000.snappy.parquet (offset=0, length=5356): Column data.attrib type LIST not supported
Is this a bug, constraint or something that I need to correct?