
I have an AWS Glue ETL job running every 15 minutes that generates one Parquet file in S3 each time.

I need to create another job that runs at the end of each hour to merge those four Parquet files in S3 into a single Parquet file, using AWS Glue ETL PySpark code.

Has anyone tried this? Any suggestions or best practices?

Thanks in advance!


1 Answer


Well, an easy option would be to convert it into a Spark DataFrame:

1) Read the Parquet files into a DynamicFrame (or, better yet, read them directly as a Spark DataFrame).
2) Call sourcedf.toDF().repartition(1) so everything is collapsed into a single partition and written out as one file.

See the sketch below.
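Something along these lines should work as the hourly job. This is only a minimal sketch: the bucket name and the incoming/merged S3 prefixes are placeholders, and it assumes each hour's four files sit under the prefix being read.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1) Read the hour's Parquet files straight into a Spark DataFrame.
#    (A DynamicFrame via glue_context.create_dynamic_frame.from_options
#    followed by .toDF() would work just as well.)
source_df = spark.read.parquet("s3://my-bucket/incoming/")  # placeholder path

# 2) Collapse everything into one partition so Spark writes a single file.
merged_df = source_df.repartition(1)

# 3) Write the merged result back out to S3 as Parquet.
merged_df.write.mode("overwrite").parquet("s3://my-bucket/merged/")  # placeholder path

job.commit()

One caveat: repartition(1) pulls all the data through a single executor, which is fine for four small files but won't scale well to very large inputs.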