3
votes

I'm trying to create an AWS Glue ETL job that would load data from Parquet files stored in S3 into a Redshift table. The Parquet files were written using pandas with the 'simple' file scheme option into multiple folders in an S3 bucket. The layout looks like this:

s3://bucket/parquet_table/01/file_1.parquet

s3://bucket/parquet_table/01/file_2.parquet

s3://bucket/parquet_table/01/file_3.parquet

s3://bucket/parquet_table/02/file_1.parquet

s3://bucket/parquet_table/02/file_2.parquet

s3://bucket/parquet_table/02/file_3.parquet
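The files were produced roughly like this; a simplified sketch with made-up data and file names (fastparquet is the engine that exposes the 'simple' file scheme):

```python
# Simplified sketch of how the files were produced (data and names are made up).
import pandas as pd
import fastparquet

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# file_scheme='simple' writes the DataFrame as a single self-contained .parquet file.
fastparquet.write("file_1.parquet", df, file_scheme="simple")

# The resulting files were then uploaded into the folder layout shown above,
# e.g. s3://bucket/parquet_table/01/file_1.parquet
```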

I can use an AWS Glue Crawler to create a table in the AWS Glue Data Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL job that would copy the same table to Redshift.
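The ETL job script is essentially the boilerplate that Glue generates; roughly like this (database, table, and connection names are placeholders):

```python
# Sketch of the Glue job: read the crawled table from the Data Catalog and
# write it to Redshift. Database, table and connection names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler created from the S3 Parquet files.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="parquet_table"
)

# Write to Redshift through a Glue JDBC connection; the AnalysisException
# below surfaces once the job actually runs against multiple folders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "parquet_table", "database": "my_redshift_db"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```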

If I crawl a single file, or if I crawl multiple files in one folder, it works. As soon as there are multiple folders involved, I get the following error:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

Similar issues appear if I use the 'hive' file scheme instead of 'simple'. Then there are multiple folders, plus empty Parquet files that throw

java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)

Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?

Could you use Redshift Spectrum to work directly with your Parquet files? – Stephen Paulger
I ran into the same issue too. Have you figured it out? – cozyss
@rileyss Unfortunately not. I did not play around with it later. It was a test at the time. – Grbinho

2 Answers

0
votes

Redshift doesn't support the Parquet format. Redshift Spectrum does. Athena also supports the Parquet format.

0
votes

The error you're facing occurs because when Spark/Glue reads the Parquet files from S3, it expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy as Hive-style partitions, something like below:

s3://your-bucket/parquet_table/id=1/file1.parquet

s3://your-bucket/parquet_table/id=2/file2.parquet

and so on..

Then use the path below to read all the files in the bucket:

location : s3://your-bucket/parquet_table

If the data in S3 is partitioned this way, you won't face any issues.
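A minimal sketch of what that can look like, writing with pandas (pyarrow engine, needs s3fs installed) and reading back from Spark/Glue; the bucket name and the id column are placeholders:

```python
# Write a Hive-style partitioned layout with pandas/pyarrow
# (bucket name and column names are placeholders; requires s3fs).
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "b", "c"]})

# partition_cols creates key=value folders, e.g.
#   s3://your-bucket/parquet_table/id=1/<part>.parquet
#   s3://your-bucket/parquet_table/id=2/<part>.parquet
df.to_parquet(
    "s3://your-bucket/parquet_table",
    engine="pyarrow",
    partition_cols=["id"],
)

# In the Glue/Spark job, point at the table root and the partitions are
# discovered from the folder names:
#   spark.read.parquet("s3://your-bucket/parquet_table")
```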