0 votes

I get the following error when running an AWS Glue job over partitioned Parquet files: "Unable to infer a schema for Parquet. It must be specified manually."

I have set up my crawler and successfully obtained the schema for my Parquet files. I can view the data in Athena. I have created the schema manually in my target Redshift cluster.

I can load the files into Redshift via Glue if all my data is in a single folder. But when I point at a folder that contains nested folders (e.g. folder X contains subfolders 04 and 05), the Glue job fails with the message "Unable to infer a schema for Parquet. It must be specified manually."

This is strange, because it works if I put all of these files into the same folder.
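The job script isn't shown in the question, but for context, a minimal sketch of the kind of direct-from-S3 Parquet read involved might look like the following (the bucket, prefix, and job boilerplate are placeholders, not taken from the question):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read Parquet straight from an S3 prefix. "s3://my-bucket/X/" is a
    # placeholder for the top folder whose files actually sit in the nested
    # subfolders 04/ and 05/ - the scenario described above, where the read
    # fails with "Unable to infer a schema for Parquet".
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/X/"]},
        format="parquet",
    )

    job.commit()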


2 Answers

0 votes

I found a solution here that works for me: Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

It is the Scala version of the ETL Glue job.

0 votes

If you point the job directly at a partition folder, that partition will no longer appear as a column in the table schema. It is better to point at the top-level folder and use a pushdown predicate instead - https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
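As a rough Python sketch of that approach, assuming the crawler created a catalog table over the top folder with a partition column (the database, table, partition column, Redshift connection, and target table names below are all placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read from the crawled catalog table that points at the top folder, so the
    # nested folders (e.g. 04 and 05) remain available as a partition column,
    # and push the partition filter down so only those folders are scanned.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",                      # placeholder catalog database
        table_name="my_parquet_table",               # placeholder crawled table
        push_down_predicate="month in ('04', '05')"  # placeholder partition column
    )

    # Write to Redshift through a Glue connection (placeholders for the
    # connection name and the table created manually in Redshift).
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-redshift-connection",
        connection_options={"dbtable": "public.my_target_table", "database": "dev"},
        redshift_tmp_dir=args["TempDir"],
    )

    job.commit()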