0
votes

Please pardon my ignorance if this question sounds silly to the expert audience here.

Currently, as per my use case, I am performing some analysis on data present in AWS Redshift tables and saving the results as CSV files in S3 buckets (the operation is somewhat similar to a pivot on the Redshift data), and after that I am loading the data back into Redshift using the COPY command.

The analysis (done in Python 3) currently generates about 200 CSV files, which are loaded into 200 different tables in Redshift.

The count of CSVs will keep increasing over time. Currently the whole process takes about 50-60 minutes to complete:

25 minutes to generate the approx. 200 CSVs and upload them to the S3 buckets

25 minutes to load the approx. 200 CSVs into the 200 AWS Redshift tables

The size of the CSVs varies from a few MB to 1 GB.

I am looking for tools or AWS technologies that can help me reduce this time.

*additional info

The structure of the CSVs keeps changing, hence I have to drop and recreate the tables each time. This is a repetitive task and is executed every 6 hours.
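For clarity, the per-table refresh currently looks roughly like the sketch below (simplified; bucket, table, cluster and role names are placeholders, not my actual code):

```python
# Simplified sketch of one 6-hourly refresh for a single table; all names are placeholders.
import boto3
import psycopg2

# Upload the freshly generated CSV to S3.
s3 = boto3.client("s3")
s3.upload_file("pivot_result_001.csv", "my-bucket", "pivot/pivot_result_001.csv")

# Drop and recreate the table (the CSV structure changes between runs), then COPY from S3.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS pivot_result_001;")
    cur.execute("CREATE TABLE pivot_result_001 (col_a VARCHAR, col_b VARCHAR, val FLOAT);")
    cur.execute("""
        COPY pivot_result_001
        FROM 's3://my-bucket/pivot/pivot_result_001.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV IGNOREHEADER 1;
    """)
```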

2
Redshift probably isn't a good use case for what you're trying to do. Redshift is really for data warehousing. I have found from personal experience that Redshift is full of pitfalls that you'll run into unless you read through their best practices doc. If you want to do something like constantly generating reports, maybe look at using EMR. – user602525

2 Answers

2
votes

You can achieve a significant speed-up by:

  1. Using multi-part upload of the CSVs to S3. Instead of waiting for a single file to upload sequentially, a multi-part upload sends the file to S3 in parallel chunks, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
  2. Copying data into Redshift from S3 in parallel. If you split your file into multiple parts and then run the COPY command, the data will be loaded from the multiple files in parallel, instead of waiting for a single 1 GB file to load, which can be really slow. Read more about it here. (A short sketch covering both points follows this list.)
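A minimal sketch of both ideas using boto3 and psycopg2 (bucket, table, cluster and IAM role names are placeholders, and the chunk sizes are only examples):

```python
# Sketch only: bucket, table, cluster and IAM role names are placeholders.
import boto3
import psycopg2
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# (1) Multi-part upload: boto3 splits the local file into chunks and uploads them concurrently.
transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multi-part for files above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file("table_001_part_000.csv", "my-bucket",
               "pivot/table_001/part_000.csv", Config=transfer_config)

# (2) Parallel COPY: if the CSV is split into several objects under one prefix,
#     a single COPY pointed at that prefix loads them in parallel across the cluster's slices.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY table_001
        FROM 's3://my-bucket/pivot/table_001/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV;
    """)
```

The TransferConfig values control when boto3 switches to multi-part and how many chunks it uploads at once; splitting each CSV into a number of parts that is a multiple of the number of slices in your cluster generally gives the best COPY parallelism.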

Hope this helps.

1
vote

You should explore Athena. It's a tool within the AWS suite that gives you the flexibility to query CSV (or even gzipped) files directly.

It'll save you the time you currently spend copying the data into Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly from an S3 bucket.
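A rough sketch of what this could look like with boto3's Athena client (database, table and bucket names are placeholders; Athena queries run asynchronously, so in practice you would poll for completion):

```python
# Sketch only: database, table and bucket names are placeholders.
import boto3

athena = boto3.client("athena")

# Define an external table over the CSVs already sitting in S3 -- no COPY into Redshift needed.
# Assumes an Athena database named "analysis" (placeholder) already exists.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analysis.pivot_result_001 (
    col_a STRING,
    col_b STRING,
    val DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/pivot/table_001/'
TBLPROPERTIES ('skip.header.line.count' = '1');
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# After the DDL finishes (poll get_query_execution), query the CSVs in place.
athena.start_query_execution(
    QueryString="SELECT col_a, SUM(val) FROM analysis.pivot_result_001 GROUP BY col_a;",
    QueryExecutionContext={"Database": "analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```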

However, as it's still in the development phase, you'll have to spend some time with it, since it's not very user friendly. A syntax error in your query logs you out of your AWS session rather than throwing a syntax error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.

Athena charges you based on the data your query scans, and is thus more pocket friendly. If a query fails to execute, Amazon doesn't charge you.