
My use case is simple. I have 20 TB of raw, uncompressed CSV data in S3 with a partition folder structure by year (10 partitions for 10 years, each partition holding 2 TB). I want to convert this data to Parquet format (Snappy compressed) and keep a similar partition/folder structure. I want ONE Parquet table with TEN partitions in Athena, which I will use to query the data by partition, and maybe get rid of the raw CSV data later. With Glue, it seems like I will end up with 10 separate Parquet tables, which I can't use.

Is this doable in Glue? I was looking for a simple solution instead of using EC2 or Hive/Spark. Any recommendations? Any help is much appreciated.


1 Answer


Assuming you have a Glue Data Catalog table over that data, you can load it as a DynamicFrame and then write it back out as Parquet in the new location:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the existing CSV table from the Glue Data Catalog
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=glue_database_name,
    table_name=glue_table_name)

# Write Snappy-compressed Parquet (Spark's default codec), partitioned by year
data_frame = dynamic_frame.toDF()
data_frame.repartition("year") \
    .write \
    .partitionBy("year") \
    .parquet('s3://target-bucket/prefix/')
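
After the job finishes you still need one table in Athena that points at the Parquet output and knows about the year partitions. A minimal sketch of the Athena DDL, assuming the target location above and a placeholder column list (my_parquet_table and col1 are hypothetical; substitute your real table name and schema):

-- Register the Parquet output as a single partitioned table
CREATE EXTERNAL TABLE my_parquet_table (
    col1 string  -- placeholder; list your actual columns here
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://target-bucket/prefix/';

-- Spark's partitionBy writes Hive-style year=.../ folders,
-- so Athena can pick up all ten partitions with:
MSCK REPAIR TABLE my_parquet_table;

Alternatively, running a Glue crawler over s3://target-bucket/prefix/ should register the same single partitioned table in the Data Catalog for you.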