0
votes

Please pardon my ignorance if this question sounds silly to the expert audience here.

Currently, as per my use case, I am performing some analysis on data present in AWS Redshift tables and saving the results as CSV files in S3 buckets (the operation is somewhat similar to a pivot on the Redshift data), and after that I am loading the data back into Redshift using the COPY command.

The analysis (done in Python 3) currently generates about 200 CSV files, which are loaded into 200 different tables in Redshift.

The count of CSVs will keep increasing over time. Currently the whole process takes about 50-60 minutes to complete:

25 minutes to generate the approx. 200 CSVs and upload them to the S3 buckets

25 minutes to load the approx. 200 CSVs into the 200 AWS Redshift tables

The size of the CSVs varies from a few MB to 1 GB.

I am looking for tools or AWS technologies that can help me reduce this time.

*additional info

The structure of the CSVs keeps changing, hence I have to drop and recreate the tables each time. This is a repetitive task and is executed every 6 hours.
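For clarity, the per-table refresh currently looks roughly like the sketch below (simplified; bucket, table, cluster and role names are placeholders, not my actual code):

```python
# Simplified sketch of one 6-hourly refresh for a single table; all names are placeholders.
import boto3
import psycopg2

# Upload the freshly generated CSV to S3.
s3 = boto3.client("s3")
s3.upload_file("pivot_result_001.csv", "my-bucket", "pivot/pivot_result_001.csv")

# Drop and recreate the table (the CSV structure changes between runs), then COPY from S3.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS pivot_result_001;")
    cur.execute("CREATE TABLE pivot_result_001 (col_a VARCHAR, col_b VARCHAR, val FLOAT);")
    cur.execute("""
        COPY pivot_result_001
        FROM 's3://my-bucket/pivot/pivot_result_001.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV IGNOREHEADER 1;
    """)
```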

2
Redshift probably isn't a good use case for what you're trying to do. Redshift is really for data warehousing. I have found from personal experience that Redshift is full of pitfalls that you'll run into unless you read through their best practices doc. If you want to do something like constantly generating reports, maybe look at using EMR. – user602525

2 Answers

2
votes

You can achieve a significant speed-up by:

  1. Using multi-part upload of the CSVs to S3. Instead of waiting for a single file to upload sequentially, a multi-part upload sends the file to S3 in parallel chunks, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
  2. Copying data into Redshift from S3 in parallel. If you split your file into multiple parts and then run the COPY command, the data will be loaded from the multiple files in parallel, instead of waiting for a single 1 GB file to load, which can be really slow. Read more about it here. (A short sketch covering both points follows this list.)
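A minimal sketch of both ideas using boto3 and psycopg2 (bucket, table, cluster and IAM role names are placeholders, and the chunk sizes are only examples):

```python
# Sketch only: bucket, table, cluster and IAM role names are placeholders.
import boto3
import psycopg2
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# (1) Multi-part upload: boto3 splits the local file into chunks and uploads them concurrently.
transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multi-part for files above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file("table_001_part_000.csv", "my-bucket",
               "pivot/table_001/part_000.csv", Config=transfer_config)

# (2) Parallel COPY: if the CSV is split into several objects under one prefix,
#     a single COPY pointed at that prefix loads them in parallel across the cluster's slices.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY table_001
        FROM 's3://my-bucket/pivot/table_001/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV;
    """)
```

The TransferConfig values control when boto3 switches to multi-part and how many chunks it uploads at once; splitting each CSV into a number of parts that is a multiple of the number of slices in your cluster generally gives the best COPY parallelism.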

Hope this helps.

1
vote

You should explore Athena. It's a tool within the AWS suite that gives you the flexibility to query CSV (or even gzipped) files directly.

It'll save you the time you currently spend copying the data into Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly from an S3 bucket.
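A rough sketch of what this could look like with boto3's Athena client (database, table and bucket names are placeholders; Athena queries run asynchronously, so in practice you would poll for completion):

```python
# Sketch only: database, table and bucket names are placeholders.
import boto3

athena = boto3.client("athena")

# Define an external table over the CSVs already sitting in S3 -- no COPY into Redshift needed.
# Assumes an Athena database named "analysis" (placeholder) already exists.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analysis.pivot_result_001 (
    col_a STRING,
    col_b STRING,
    val DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/pivot/table_001/'
TBLPROPERTIES ('skip.header.line.count' = '1');
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# After the DDL finishes (poll get_query_execution), query the CSVs in place.
athena.start_query_execution(
    QueryString="SELECT col_a, SUM(val) FROM analysis.pivot_result_001 GROUP BY col_a;",
    QueryExecutionContext={"Database": "analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```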

However, as it's still in the development phase, you'll have to spend some time with it, since it's not very user friendly. A syntax error in your query logs you out of your AWS session rather than throwing a syntax error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.

Athena charges you based on the data your query scans, and is thus more pocket friendly. If a query fails to execute, Amazon doesn't charge you.