1
votes

The Amazon Redshift documentation states that the best way to load data into the database is with the COPY command. How can I run it automatically every day against a data file uploaded to S3?

The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. From the Redshift Query editor I can easily run the COPY command manually. How do I automate this?

2
You have a few options! The easiest is to set up a cron job on an EC2 instance that runs every day at a certain time; the cron job would use psql to run your COPY command. – Jon Scott
Thanks, I'll look into psql. – allanth
You can write a Lambda function that a trigger runs every time a file is uploaded to the bucket. It is a few lines of code; I use Python and boto3 for these situations. – MiloBellano
Lambda is OK, but it will time out after 15 minutes. – Jon Scott
@MiloBellano where do I write those Lambda functions? – allanth

2 Answers

2
votes

Before you finalize your approach, you should consider the following important points:

  1. If possible, gzip-compress the CSV files before ingesting them into the corresponding Redshift tables. This reduces the file size considerably and improves overall data ingestion performance.

  2. Finalize the compression encodings for the table columns. If you want Redshift to do the job, automatic compression analysis can be enabled with "COMPUPDATE ON" in the COPY command. Refer to the AWS documentation.

Now, to answer your question:

Since you have already created an S3 bucket for this, create a prefix (directory) per table and place each table's files there. If your input files are large, split them into multiple files to enable better parallel ingestion; AWS recommends making the number of files a multiple of the number of slices in your cluster (see the AWS docs for details).
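As a rough illustration of the compression and splitting points above, here is a minimal Python/boto3 sketch; the file name, chunk size, bucket, and tables/orders/ prefix are assumptions you would adapt:

# Sketch: split a large CSV into a few gzipped parts and upload them under a
# per-table S3 prefix, so COPY ... GZIP can load them in parallel.
# big_table.csv, my-bucket, and tables/orders/ are placeholder names.
import gzip
from itertools import islice

import boto3

def split_gzip_upload(csv_path, bucket, prefix):
    s3 = boto3.client("s3")
    with open(csv_path, "rt") as src:
        header = src.readline()                # repeat the header in every part;
        part = 0                               # COPY ... IGNOREHEADER 1 skips it
        while True:
            rows = list(islice(src, 100_000))  # arbitrary chunk size for the sketch
            if not rows:
                break
            part_name = f"{csv_path}.part{part}.gz"
            with gzip.open(part_name, "wt") as dst:
                dst.write(header)
                dst.writelines(rows)
            s3.upload_file(part_name, bucket, f"{prefix}{part_name}")
            part += 1

split_gzip_upload("big_table.csv", "my-bucket", "tables/orders/")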

Your COPY command, run through psql, should look something like this:

PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> \
  -c "copy <table_name>
      from 's3://<bucket>/<table_dir_path>/'
      credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>'
      delimiter ',' region '<region>'
      GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"

The next step is to create a Lambda function and enable an SNS notification on the S3 bucket; the SNS topic should trigger the Lambda as soon as new files arrive in the bucket. An alternative is to set up a CloudWatch Events schedule to run the Lambda.

The Lambda function (Java, Python, or any supported language) reads the S3 event, connects to Redshift, and ingests the files into the tables using the COPY command.
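For the event-driven route, here is a minimal handler sketch, assuming the bucket invokes the Lambda directly rather than through SNS, and using boto3's Redshift Data API so no Postgres driver has to be packaged; the cluster, database, user, table, region, and IAM role below are placeholders:

# Sketch of an S3-triggered Lambda that runs COPY via the Redshift Data API.
# my-cluster, mydb, awsuser, my_table, and the role ARN are assumptions.
import urllib.parse

import boto3

redshift_data = boto3.client("redshift-data")

COPY_SQL = """
copy my_table
from 's3://{bucket}/{key}'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-s3-role'
delimiter ',' region 'us-east-1' GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1
"""

def lambda_handler(event, context):
    # The S3 event notification names the bucket and the (URL-encoded) object key.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Submit the COPY asynchronously; the Data API queues it against the cluster.
    return redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="mydb",
        DbUser="awsuser",
        Sql=COPY_SQL.format(bucket=bucket, key=key),
    )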

Lambda has a 15-minute limit; if that is a concern, Fargate would be a better fit. Running the job on EC2 will cost more than Lambda or Fargate (especially if you forget to turn the EC2 machine off).

0
votes

You could create an external table (Redshift Spectrum) over your bucket. Redshift would automatically scan all the files in the bucket. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.

Also, once you have an external table, you can load it into Redshift with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it's idempotent - you don't need to keep track of your files - it will always load all data from all files in the bucket.
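As a rough sketch of both steps, here is one way to run the SQL through boto3's Redshift Data API (the same statements could simply be pasted into the Redshift query editor); the external schema, Glue catalog database, IAM role, column list, bucket prefix, and table names are all assumptions to adapt:

# Sketch: define an external table over the bucket, then (re)build a local
# table from it with CREATE TABLE AS SELECT. All identifiers are placeholders.
import time

import boto3

client = boto3.client("redshift-data")

def run(sql):
    """Submit one statement and wait for it, since the Data API is asynchronous."""
    stmt_id = client.execute_statement(
        ClusterIdentifier="my-cluster", Database="mydb", DbUser="awsuser", Sql=sql
    )["Id"]
    while client.describe_statement(Id=stmt_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# One-time setup: an external schema backed by the Glue data catalog,
# plus an external table pointing at the CSV prefix in the bucket.
run("""
create external schema if not exists ext
from data catalog database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
create external database if not exists
""")
run("""
create external table ext.orders (
    id int,
    name varchar(100)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/tables/orders/'
table properties ('skip.header.line.count'='1')
""")

# Repeatable, idempotent load: rebuild the local table from whatever is in the bucket.
run("drop table if exists orders")
run("create table orders as select * from ext.orders")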