I am trying to copy around 50 million files, 15 TB in total, from one S3 bucket to another bucket. There are AWS CLI options to copy fast, but in my case I also need to apply a filter and a date range, so I decided to write code using boto3.
The source bucket input structure:
Folder1
    File1 - Date1
    File2 - Date1
Folder2
    File1 - Date2
    File2 - Date2
Folder3
    File1_Number1 - Date3
    File2_Number1 - Date3
Folder4
    File1_Number1 - Date2
    File2_Number1 - Date2
Folder5
    File1_Number2 - Date4
    File2_Number2 - Date4
The purpose is to copy, from each folder, all files whose names start with 'File1' and whose modified date falls within a date range (Date2 to Date4). The dates (Date1, Date2, Date3, Date4) are the files' last-modified dates.
The output is partitioned by a date key, and I append a UUID to each file name so it stays unique and never overwrites an existing file. Files with the same modified date therefore end up in the same folder.
The target bucket would have this output:
Date2
    File1_UUID1
    File1_Number1_UUID2
Date3
    File1_Number1_UUID3
Date4
    File1_Number2_UUID4
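To make that mapping concrete, here is a minimal sketch of how a source key plus its last-modified date becomes a target key (the helper name build_target_key and the example date are only for illustration):

import datetime
import uuid

def build_target_key(src_key, last_modified, trg_dir='api/requests'):
    # partition by the file's modified date and append a UUID so names never collide
    file_name = src_key.split('/')[-1]
    return '{0}/datekey={1}/{2}_{3}'.format(trg_dir, last_modified.date(), file_name, uuid.uuid4())

# e.g. 'Folder3/File1_Number1' modified on 2019-05-20 ends up under
# 'api/requests/datekey=2019-05-20/File1_Number1_<uuid>'
print(build_target_key('Folder3/File1_Number1', datetime.datetime(2019, 5, 20)))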
I have written the code with the boto3 API and run it on AWS Glue, but it only copies about 500 thousand files per day.
The code:
import datetime
import uuid

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# client/transfer settings (the values here are placeholders; tune for your workload)
boto_config = Config(retries={'max_attempts': 10, 'mode': 'standard'}, max_pool_connections=50)
transfer_config = TransferConfig(max_concurrency=10)
extra_args = {}

s3 = boto3.resource('s3', region_name='us-east-2', config=boto_config)

# source and target bucket names
src_bucket_name = 'staging1'
trg_bucket_name = 'staging2'

# source and target bucket pointers
s3_src_bucket = s3.Bucket(src_bucket_name)
print('Source Bucket Name : {0}'.format(s3_src_bucket.name))
s3_trg_bucket = s3.Bucket(trg_bucket_name)
print('Target Bucket Name : {0}'.format(s3_trg_bucket.name))

# target directory
trg_dir = 'api/requests'

# source objects (paginates through the whole bucket)
s3_src_bucket_objs = s3_src_bucket.objects.all()

# request file name prefix
file_prefix = 'File1'

# filter - start and end date (naive datetimes, compared against last_modified with tzinfo stripped)
start_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d").replace(tzinfo=None)
end_date = datetime.datetime.strptime("2020-06-15", "%Y-%m-%d").replace(tzinfo=None)

# iterate over every source object
for iterator_obj in s3_src_bucket_objs:
    file_path_key = iterator_obj.key
    date_key = iterator_obj.last_modified.replace(tzinfo=None)
    if start_date <= date_key <= end_date and file_prefix in file_path_key:
        # file name; it starts with the value of file_prefix
        uni_uuid = uuid.uuid4()
        src_file_name = '{}_{}'.format(file_path_key.split('/')[-1], uni_uuid)
        # construct target directory path, partitioned by modified date
        trg_dir_path = '{0}/datekey={1}'.format(trg_dir, date_key.date())
        # source file reference
        src_file_ref = {
            'Bucket': src_bucket_name,
            'Key': file_path_key
        }
        # target file path
        trg_file_path = '{0}/{1}'.format(trg_dir_path, src_file_name)
        # copy source file to target
        trg_new_obj = s3_trg_bucket.Object(trg_file_path)
        trg_new_obj.copy(src_file_ref, ExtraArgs=extra_args, Config=transfer_config)
# happy ending
Is there any other way to make this faster, or an alternative way to copy files into such a target structure? Do you have any suggestions to improve the code? I am looking for a faster way to copy the files. Your input would be valuable. Thank you!
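One idea I had for speeding this up is to parallelize the per-object copies, since the loop above is purely sequential. Below is a rough, untested sketch of what I mean, reusing the variables from the code above; the helper name copy_one and the worker count are arbitrary, and with 50 million objects the submissions would need to be batched rather than holding every future in memory:

import concurrent.futures
import uuid

import boto3

s3_client = boto3.client('s3', region_name='us-east-2')  # a single client can be shared across threads

def copy_one(src_key, trg_key):
    # server-side copy of a single object into the target bucket
    s3_client.copy({'Bucket': src_bucket_name, 'Key': src_key}, trg_bucket_name, trg_key)

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = []
    for obj in s3_src_bucket.objects.all():
        date_key = obj.last_modified.replace(tzinfo=None)
        if start_date <= date_key <= end_date and file_prefix in obj.key:
            trg_key = '{0}/datekey={1}/{2}_{3}'.format(
                trg_dir, date_key.date(), obj.key.split('/')[-1], uuid.uuid4())
            futures.append(pool.submit(copy_one, obj.key, trg_key))
    # surface any copy errors
    for future in concurrent.futures.as_completed(futures):
        future.result()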
aws s3 sync CLI as a command from your python script? - Marcin
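For reference, a minimal sketch of what that suggestion could look like from Python, reusing the bucket names from above. Note that aws s3 sync's --exclude/--include filters only match key patterns, not modification dates, and sync keeps the original key layout, so the date-range filter, datekey partitioning, and UUID renaming would still have to happen separately:

import subprocess

# requires the AWS CLI to be installed where this runs; copies only keys whose file name starts with File1
subprocess.run(
    [
        'aws', 's3', 'sync',
        's3://staging1', 's3://staging2/api/requests',
        '--exclude', '*',
        '--include', '*/File1*',
    ],
    check=True,
)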