6
votes

We are using AWS Glue to convert JSON files stored in our S3 data lake to Parquet.

Here are the steps that I followed,

  1. Created a crawler to generate a table in the Glue Data Catalog from our data lake bucket, which contains JSON data.

  2. The newly created table has the following partition keys:

    Name, Year, Month, day, hour

  3. Created a Glue job to convert the data to Parquet and store it in a different bucket.

With this process, the job runs successfully, but the data in the new bucket is not partitioned; it all ends up under a single directory.
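
Here is, roughly, what the job script looks like (simplified; the database, table, and bucket names below are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawled JSON table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_json_table")

# Write the data as Parquet to the output bucket (no partition settings specified)
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},
    format="parquet")

job.commit()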

What I want to achieve is for the converted Parquet files to have the same partitions as the source table/data lake bucket.

Also, I want to increase the size of the Parquet files (i.e., reduce the number of files).

Can anyone help me with this?

1
Can you please add your write_dynamic_frame code to your question, along with the path(s) within your bucket for the resulting files? Have you tried the example code from Managing Partitions for ETL Output in AWS Glue? – Steven Ensslen

1 Answer

3
votes

Try the following when writing the dynamic frame; the partitionKeys entry in connection_options tells Glue to partition the output by those columns.

glueContext.write_dynamic_frame.from_options(
    frame=<output_dataframe>,
    connection_type="s3",
    connection_options={
        "path": "s3://<output_bucket_path>",
        # Partition the output by the same keys as the source table
        "partitionKeys": ["Name", "Year", "Month", "day", "hour"]
    },
    format="parquet")
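
With partitionKeys set, Glue writes Hive-style key=value directories under the output path (for example .../Name=.../Year=.../Month=.../day=.../hour=.../), mirroring the partitioning of the source table.

To also get fewer, larger Parquet files, you can reduce the number of Spark partitions before writing, for example by converting the dynamic frame to a DataFrame and coalescing it. A sketch (output_dynamic_frame stands for whatever you pass as frame= above, and the target of 1 partition is only an example to tune for your data volume):

from awsglue.dynamicframe import DynamicFrame

# Fewer Spark partitions -> fewer, larger Parquet files per output partition
coalesced = DynamicFrame.fromDF(
    output_dynamic_frame.toDF().coalesce(1),
    glueContext,
    "coalesced")

Then pass coalesced as the frame= argument to write_dynamic_frame.from_options above.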