6
votes

We are using AWS Glue to convert JSON files stored in our S3 data lake to Parquet.

Here are the steps that I followed,

  1. Created a crawler to generate a table in the Glue Data Catalog from our data lake bucket, which contains JSON data.

  2. The newly created table has the following partition keys:

    Name, Year, Month, day, hour

  3. Created a Glue job to convert the data to Parquet and store it in a different bucket.

With this process, the job runs successfully, but the data in the new bucket is not partitioned; it all ends up under a single directory.
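
Here is, roughly, what the job script looks like (simplified; the database, table, and bucket names below are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawled JSON table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_json_table")

# Write the data as Parquet to the output bucket (no partition settings specified)
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},
    format="parquet")

job.commit()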

What I want to achieve is for the converted Parquet files to have the same partitions as the source table/data lake bucket.

Also, I want to increase the size of the Parquet files (i.e., reduce the number of files).

Can anyone help me with this?

1
Can you please add your write_dynamic_frame code to your question, along with the path(s) within your bucket for the resulting files? Have you tried the example code from Managing Partitions for ETL Output in AWS Glue? – Steven Ensslen

1 Answer

3
votes

Try the following when writing the dynamic frame; the partitionKeys entry in connection_options tells Glue to partition the output by those columns.

glueContext.write_dynamic_frame.from_options(
    frame=<output_dataframe>,
    connection_type="s3",
    connection_options={
        "path": "s3://<output_bucket_path>",
        # Partition the output by the same keys as the source table
        "partitionKeys": ["Name", "Year", "Month", "day", "hour"]
    },
    format="parquet")
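
With partitionKeys set, Glue writes Hive-style key=value directories under the output path (for example .../Name=.../Year=.../Month=.../day=.../hour=.../), mirroring the partitioning of the source table.

To also get fewer, larger Parquet files, you can reduce the number of Spark partitions before writing, for example by converting the dynamic frame to a DataFrame and coalescing it. A sketch (output_dynamic_frame stands for whatever you pass as frame= above, and the target of 1 partition is only an example to tune for your data volume):

from awsglue.dynamicframe import DynamicFrame

# Fewer Spark partitions -> fewer, larger Parquet files per output partition
coalesced = DynamicFrame.fromDF(
    output_dynamic_frame.toDF().coalesce(1),
    glueContext,
    "coalesced")

Then pass coalesced as the frame= argument to write_dynamic_frame.from_options above.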