0 votes

I get the following error when running an AWS Glue job over partitioned Parquet files: "Unable to infer a schema for Parquet. It must be specified manually."

I have set up my crawler and successfully obtained the schema for my Parquet files. I can view the data in Athena. I have created the schema manually in my target Redshift cluster.

I can load the files into Redshift via Glue if all my data is in a single folder. But when I point at a folder that contains nested folders (e.g. folder X contains subfolders 04 and 05), the Glue job fails with the message "Unable to infer a schema for Parquet. It must be specified manually."

This is strange, because it works if I put all of these files into the same folder.
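The job script isn't shown in the question, but for context, a minimal sketch of the kind of direct-from-S3 Parquet read involved might look like the following (the bucket, prefix, and job boilerplate are placeholders, not taken from the question):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read Parquet straight from an S3 prefix. "s3://my-bucket/X/" is a
    # placeholder for the top folder whose files actually sit in the nested
    # subfolders 04/ and 05/ - the scenario described above, where the read
    # fails with "Unable to infer a schema for Parquet".
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/X/"]},
        format="parquet",
    )

    job.commit()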


2 Answers

0 votes

I found a solution here that works for me: Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

It is the Scala version of the ETL Glue job.

0 votes

If you point the job directly at a partition folder, that partition will no longer appear as a column in the table schema. It is better to point at the top-level folder and use a pushdown predicate instead - https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
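As a rough Python sketch of that approach, assuming the crawler created a catalog table over the top folder with a partition column (the database, table, partition column, Redshift connection, and target table names below are all placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read from the crawled catalog table that points at the top folder, so the
    # nested folders (e.g. 04 and 05) remain available as a partition column,
    # and push the partition filter down so only those folders are scanned.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",                      # placeholder catalog database
        table_name="my_parquet_table",               # placeholder crawled table
        push_down_predicate="month in ('04', '05')"  # placeholder partition column
    )

    # Write to Redshift through a Glue connection (placeholders for the
    # connection name and the table created manually in Redshift).
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-redshift-connection",
        connection_options={"dbtable": "public.my_target_table", "database": "dev"},
        redshift_tmp_dir=args["TempDir"],
    )

    job.commit()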