
I am trying to copy data from a source AWS S3 bucket to a destination AWS S3 bucket.

Current partition structure of the source bucket:

AccountId/Type/Role/Job/Name/RequestId/JobName/file1.csv

and I have many prefixes with this partition structure.

But in the destination bucket I want to change the partitioning to:

AccountId/Type/Role/Job/Name/file2.csv

and add RequestId and JobName as new columns (with their respective values) in the CSV file, merging all the CSV files into one file.

(Image: current file format vs. expected format)

I tried AWS S3 replication to replicate the data from the source to the destination bucket, but I did not find any feature in S3 replication that lets me modify the partitioning and merge files at the time of replication.

I want to use the destination bucket to query the data with AWS Athena. For that, I use an AWS Glue crawler to crawl the data and create the database and table that Athena will query.

(Update): These partitions are not stored as key=value and are too fine-grained, which leads to many small files. Having many small files to process hurts Athena performance.

Question: Is there a way to achieve this transformation using S3 replication? If not, is it possible to achieve it with AWS Glue jobs that transform the data as described above? Or is there any other way to achieve this transformation between S3 buckets? And how?

I think you don't need to do any special transformation if your S3 prefixes are already in partition_key=partition_value (key=value) format. Can you update your question with the exact S3 path so that I can confirm whether you need a Glue crawler/Glue ETL or a simple Athena query? - Prabhakar Reddy
Thanks for your comment. I have updated the question with the exact S3 path; the prefixes present in S3 are not in key=value format. - Shubham Mehta

1 Answer


This situation is a typical case for CTAS queries. On top of that, you would know exactly how much the operation costs, since Athena queries are charged based on the amount of data scanned.

CREATE TABLE new_table
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://my_athena_results/ctas_result_path/.../',
    -- Partition only by the prefixes you want in the destination layout
    partitioned_by = ARRAY['AccountId', 'Type', 'Role', 'Job', 'Name'],
    bucketed_by = ARRAY['some column with high cardinality'],
    -- Controls how many files you will get per partition
    bucket_count = 3
) AS
SELECT *
FROM table1
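
One caveat, hedged because the exact schema of table1 depends on how the crawler registered the source paths: Athena requires the columns named in partitioned_by to be the last columns in the SELECT list, so SELECT * only works if table1's column order already ends with AccountId, Type, Role, Job, Name. With an explicit column list (the non-partition column names here are illustrative), the AS SELECT part would look something like:

SELECT
    RequestId,   -- kept as a regular column, so it ends up inside the merged CSV files
    JobName,     -- same here
    col1,        -- ... the rest of your data columns ...
    AccountId,   -- the partitioned_by columns, last and in the same order
    Type,
    Role,
    Job,
    Name
FROM table1

That is also how RequestId and JobName become plain columns: anything selected but not listed in partitioned_by is written into the data files themselves.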

Notes

  1. Having 5 partition columns is still quite a lot, but of course this depends on how much data you have.
  2. A CTAS query can write only 100 partitions at a time, so you might need to run the statement multiple times with different filtering conditions, i.e. WHERE ... (see the sketch after these notes). More on limitations can be found here
  3. You can also convert your data from CSV to Parquet, which will most likely use less space and is far more speed- and cost-efficient than CSV. Here is a quite nice blog post on this topic
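
For point 2, a minimal sketch of the batching approach, assuming AccountId is a convenient column to filter on (the account IDs below are placeholders): the first CTAS run takes a WHERE filter that keeps it under 100 partitions, and later batches are loaded with INSERT INTO, which is subject to the same per-query limit. Note that INSERT INTO is not supported on bucketed tables, so for this pattern you would drop the bucketed_by / bucket_count lines from the CTAS above.

-- first batch: the CTAS above, restricted to a subset of partitions, e.g.
--   ... AS SELECT * FROM table1 WHERE AccountId IN ('acct-001', 'acct-002')
-- following batches: INSERT INTO the table that the CTAS created
INSERT INTO new_table
SELECT *    -- same column-order caveat as above: partition columns go last
FROM table1
WHERE AccountId IN ('acct-003', 'acct-004')

For point 3, switching format = 'TEXTFILE' to format = 'PARQUET' (and dropping field_delimiter, which only applies to text formats) in the WITH clause is enough to have the CTAS write Parquet instead of CSV.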