
I have an AWS Glue ETL job running every 15 minutes that generates one Parquet file in S3 each time.

I need to create another job that runs at the end of each hour to merge those four Parquet files in S3 into a single Parquet file, using AWS Glue ETL PySpark code.

Has anyone tried this? Any suggestions or best practices?

Thanks in advance!


1 Answer


Well, an easy option would be to convert it into a Spark DataFrame:

1) Read the Parquet files into a DynamicFrame (or, better yet, read them directly as a Spark DataFrame).
2) Call sourcedf.toDF().repartition(1) so everything is collapsed into a single partition and written out as one file.

See the sketch below.
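Something along these lines should work as the hourly job. This is only a minimal sketch: the bucket name and the incoming/merged S3 prefixes are placeholders, and it assumes each hour's four files sit under the prefix being read.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1) Read the hour's Parquet files straight into a Spark DataFrame.
#    (A DynamicFrame via glue_context.create_dynamic_frame.from_options
#    followed by .toDF() would work just as well.)
source_df = spark.read.parquet("s3://my-bucket/incoming/")  # placeholder path

# 2) Collapse everything into one partition so Spark writes a single file.
merged_df = source_df.repartition(1)

# 3) Write the merged result back out to S3 as Parquet.
merged_df.write.mode("overwrite").parquet("s3://my-bucket/merged/")  # placeholder path

job.commit()

One caveat: repartition(1) pulls all the data through a single executor, which is fine for four small files but won't scale well to very large inputs.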