1
votes

The Amazon Redshift documentation states that the best way to load data into the database is with the COPY command. How can I run it automatically every day against a data file uploaded to S3?

The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. From the Redshift Query editor I can easily run the COPY command manually. How do I automate this?

2
You have a few options! The easiest is to set up a cron job on an EC2 instance that runs every day at a certain time; the cron job would use psql to run your COPY command. – Jon Scott
Thanks, I'll look into psql. – allanth
You can write a Lambda function that a trigger runs every time a file is uploaded to the bucket. It is a few lines of code; I use Python and boto3 for these situations. – MiloBellano
Lambda is OK, but it will time out after 15 minutes. – Jon Scott
@MiloBellano where do I write those Lambda functions? – allanth

2 Answers

2
votes

Before you finalize your approach, you should consider the following important points:

  1. If possible, gzip-compress the CSV files before ingesting them into the corresponding Redshift tables. This reduces the file size considerably and improves overall data ingestion performance.

  2. Finalize the compression encodings for the table columns. If you want Redshift to do the job, automatic compression analysis can be enabled with "COMPUPDATE ON" in the COPY command. Refer to the AWS documentation.

Now, to answer your question:

Since you have already created an S3 bucket for this, create a prefix (directory) per table and place each table's files there. If your input files are large, split them into multiple files to enable better parallel ingestion; AWS recommends making the number of files a multiple of the number of slices in your cluster (see the AWS docs for details).
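As a rough illustration of the compression and splitting points above, here is a minimal Python/boto3 sketch; the file name, chunk size, bucket, and tables/orders/ prefix are assumptions you would adapt:

# Sketch: split a large CSV into a few gzipped parts and upload them under a
# per-table S3 prefix, so COPY ... GZIP can load them in parallel.
# big_table.csv, my-bucket, and tables/orders/ are placeholder names.
import gzip
from itertools import islice

import boto3

def split_gzip_upload(csv_path, bucket, prefix):
    s3 = boto3.client("s3")
    with open(csv_path, "rt") as src:
        header = src.readline()                # repeat the header in every part;
        part = 0                               # COPY ... IGNOREHEADER 1 skips it
        while True:
            rows = list(islice(src, 100_000))  # arbitrary chunk size for the sketch
            if not rows:
                break
            part_name = f"{csv_path}.part{part}.gz"
            with gzip.open(part_name, "wt") as dst:
                dst.write(header)
                dst.writelines(rows)
            s3.upload_file(part_name, bucket, f"{prefix}{part_name}")
            part += 1

split_gzip_upload("big_table.csv", "my-bucket", "tables/orders/")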

Your COPY command, run through psql, should look something like this:

PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> \
  -c "copy <table_name>
      from 's3://<bucket>/<table_dir_path>/'
      credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>'
      delimiter ',' region '<region>'
      GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"

The next step is to create a Lambda function and enable an SNS notification on the S3 bucket; the SNS topic should trigger the Lambda as soon as new files arrive in the bucket. An alternative is to set up a CloudWatch Events schedule to run the Lambda.

The Lambda function (Java, Python, or any supported language) reads the S3 event, connects to Redshift, and ingests the files into the tables using the COPY command.
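For the event-driven route, here is a minimal handler sketch, assuming the bucket invokes the Lambda directly rather than through SNS, and using boto3's Redshift Data API so no Postgres driver has to be packaged; the cluster, database, user, table, region, and IAM role below are placeholders:

# Sketch of an S3-triggered Lambda that runs COPY via the Redshift Data API.
# my-cluster, mydb, awsuser, my_table, and the role ARN are assumptions.
import urllib.parse

import boto3

redshift_data = boto3.client("redshift-data")

COPY_SQL = """
copy my_table
from 's3://{bucket}/{key}'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-s3-role'
delimiter ',' region 'us-east-1' GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1
"""

def lambda_handler(event, context):
    # The S3 event notification names the bucket and the (URL-encoded) object key.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Submit the COPY asynchronously; the Data API queues it against the cluster.
    return redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="mydb",
        DbUser="awsuser",
        Sql=COPY_SQL.format(bucket=bucket, key=key),
    )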

Lambda has a 15-minute limit; if that is a concern, Fargate would be a better fit. Running the job on EC2 will cost more than Lambda or Fargate (especially if you forget to turn the EC2 machine off).

0
votes

You could create an external table (Redshift Spectrum) over your bucket. Redshift would automatically scan all the files in the bucket. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.

Also, once you have an external table, you can load it into Redshift with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it's idempotent - you don't need to keep track of your files - it will always load all data from all files in the bucket.
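As a rough sketch of both steps, here is one way to run the SQL through boto3's Redshift Data API (the same statements could simply be pasted into the Redshift query editor); the external schema, Glue catalog database, IAM role, column list, bucket prefix, and table names are all assumptions to adapt:

# Sketch: define an external table over the bucket, then (re)build a local
# table from it with CREATE TABLE AS SELECT. All identifiers are placeholders.
import time

import boto3

client = boto3.client("redshift-data")

def run(sql):
    """Submit one statement and wait for it, since the Data API is asynchronous."""
    stmt_id = client.execute_statement(
        ClusterIdentifier="my-cluster", Database="mydb", DbUser="awsuser", Sql=sql
    )["Id"]
    while client.describe_statement(Id=stmt_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# One-time setup: an external schema backed by the Glue data catalog,
# plus an external table pointing at the CSV prefix in the bucket.
run("""
create external schema if not exists ext
from data catalog database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
create external database if not exists
""")
run("""
create external table ext.orders (
    id int,
    name varchar(100)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://my-bucket/tables/orders/'
table properties ('skip.header.line.count'='1')
""")

# Repeatable, idempotent load: rebuild the local table from whatever is in the bucket.
run("drop table if exists orders")
run("create table orders as select * from ext.orders")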