Combine multiple raw files into single parquet file

Question

I have a large number of events partitioned by yyyy/mm/dd/hh in S3. Every partition has about 80.000 of raw text files. Every raw file has about 1.000 Events in JSON format.

When I run the script to do my transformation:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database=from_database,
                                                                table_name=from_table,
                                                                transformation_ctx="datasource0")
map0 = Map.apply(frame=datasource0, f=extract_data)
applymapping1 = ApplyMapping.apply(......)
applymapping1.toDF().write.mode('append').parquet(output_bucket, partitionBy=['year', 'month', 'day', 'hour'])

I end up with a large number of small files across partitions named like:

part-00000-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet
part-00001-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet
part-00002-a5aa817d-482c-47d0-b804-81d793d3ac88.snappy.parquet

Each of them is 1-3KB in size. Number corresponds roughly to the number of raw files I have.

My impression was that the Glue will take all the events from catalog, partition them the way I want and store in a single file per partition.

How do I achieve that?

Sahil Desai Sahil Desai · Accepted Answer · 2017-11-07T05:04:02

You just need to set repartition(1) which will shuffle the data from all partitions to a single partition which will generate a single output file while writing.

applymapping1.toDF()
             .repartition(1)
             .write
             .mode('append')
             .parquet(output_bucket, partitionBy=['year', 'month', 'day', 'hour'])

Combine multiple raw files into single parquet file

1 Answers