3
votes

I'm trying to create an AWS Glue ETL job that would load data from Parquet files stored in S3 into a Redshift table. The Parquet files were written using pandas with the 'simple' file scheme option into multiple folders in an S3 bucket. The layout looks like this:

s3://bucket/parquet_table/01/file_1.parquet

s3://bucket/parquet_table/01/file_2.parquet

s3://bucket/parquet_table/01/file_3.parquet

s3://bucket/parquet_table/02/file_1.parquet

s3://bucket/parquet_table/02/file_2.parquet

s3://bucket/parquet_table/02/file_3.parquet
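The files were produced roughly like this; a simplified sketch with made-up data and file names (fastparquet is the engine that exposes the 'simple' file scheme):

```python
# Simplified sketch of how the files were produced (data and names are made up).
import pandas as pd
import fastparquet

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# file_scheme='simple' writes the DataFrame as a single self-contained .parquet file.
fastparquet.write("file_1.parquet", df, file_scheme="simple")

# The resulting files were then uploaded into the folder layout shown above,
# e.g. s3://bucket/parquet_table/01/file_1.parquet
```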

I can use an AWS Glue Crawler to create a table in the AWS Glue Data Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL job that would copy the same table to Redshift.
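The ETL job script is essentially the boilerplate that Glue generates; roughly like this (database, table, and connection names are placeholders):

```python
# Sketch of the Glue job: read the crawled table from the Data Catalog and
# write it to Redshift. Database, table and connection names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler created from the S3 Parquet files.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="parquet_table"
)

# Write to Redshift through a Glue JDBC connection; the AnalysisException
# below surfaces once the job actually runs against multiple folders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "parquet_table", "database": "my_redshift_db"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```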

If I crawl a single file, or if I crawl multiple files in one folder, it works. As soon as there are multiple folders involved, I get the following error:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

Similar issues appear if I use the 'hive' file scheme instead of 'simple'. Then there are multiple folders, plus empty Parquet files that throw

java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)

Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?

Could you use Redshift Spectrum to work directly with your Parquet files? – Stephen Paulger
I ran into the same issue too. Have you figured it out? – cozyss
@rileyss Unfortunately not. I did not play around with it later. It was a test at the time. – Grbinho

2 Answers

0
votes

Redshift doesn't support the Parquet format. Redshift Spectrum does. Athena also supports the Parquet format.

0
votes

The error you're facing occurs because when Spark/Glue reads the Parquet files from S3, it expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy as Hive-style partitions, something like below:

s3://your-bucket/parquet_table/id=1/file1.parquet

s3://your-bucket/parquet_table/id=2/file2.parquet

and so on..

Then use the path below to read all the files in the bucket:

location : s3://your-bucket/parquet_table

If the data in S3 is partitioned this way, you won't face any issues.
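A minimal sketch of what that can look like, writing with pandas (pyarrow engine, needs s3fs installed) and reading back from Spark/Glue; the bucket name and the id column are placeholders:

```python
# Write a Hive-style partitioned layout with pandas/pyarrow
# (bucket name and column names are placeholders; requires s3fs).
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "b", "c"]})

# partition_cols creates key=value folders, e.g.
#   s3://your-bucket/parquet_table/id=1/<part>.parquet
#   s3://your-bucket/parquet_table/id=2/<part>.parquet
df.to_parquet(
    "s3://your-bucket/parquet_table",
    engine="pyarrow",
    partition_cols=["id"],
)

# In the Glue/Spark job, point at the table root and the partitions are
# discovered from the folder names:
#   spark.read.parquet("s3://your-bucket/parquet_table")
```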