
We're exploring using Redshift for our warehouse and we need to move new data from our existing on-premises postgres db to Redshift nightly. It looks like you can do this with Pipeline and a template if your primary db is in RDS, but can you do it with Pipeline if your database is on-premises?

You can do that, but Pipeline is not really made for that purpose ... - sandeep rawat

1 Answer


Since you are syncing in bulk on a periodic basis, you might want to consider pushing your data to S3, which is easy to do from on-premises, and then loading it into Redshift via the COPY command. It's fast and reliable, and you get a backup for free (well, cheap) in S3 as a side effect.

You can use aws s3 cp from the awscli Python tool to push up your gzipped CSV Postgres dumps (after putting your access keys in ~/.aws/config) with something like:

aws s3 cp current_dump.csv.gz s3://yourbucket/20170108/dump.csv.gz

And then use the regular Postgres psql utility to execute the COPY on Redshift, like so:

PGPASSWORD='YOURPASS' psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 -c "COPY yourtable FROM 's3://yourbucket/20170108/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' DELIMITER ',' NULL 'NULL' IGNOREBLANKLINES EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1 CSV GZIP;"

You could put your Postgres dump and these commands in a shell script, with a bit of scripting around the date values, and run it as a nightly cron job, for example as sketched below.
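Roughly like this (just a sketch: the script name, paths, schedule, and the updated_at filter are placeholders for your own setup):

# crontab entry: run the load every night at 2am
0 2 * * * /path/to/nightly_redshift_load.sh >> /var/log/nightly_redshift_load.log 2>&1

with nightly_redshift_load.sh sequencing the commands above:

#!/bin/bash
# Nightly on-prem Postgres -> S3 -> Redshift load (sketch only; adjust hosts,
# table names, credentials, and the new-row filter to your schema)
set -e
DATE=$(date +%Y%m%d)

# 1. Dump the new rows from the on-prem Postgres to gzipped CSV
psql -h your.onprem.host -U youruser -d yourdb \
  -c "\copy (SELECT * FROM yourtable WHERE updated_at >= CURRENT_DATE - 1) TO 'dump.csv' CSV"
gzip -f dump.csv

# 2. Push the dump to S3 under a dated prefix
aws s3 cp dump.csv.gz s3://yourbucket/$DATE/dump.csv.gz

# 3. COPY it into Redshift (same options as the COPY command above)
PGPASSWORD='YOURPASS' psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 \
  -c "COPY yourtable FROM 's3://yourbucket/$DATE/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' CSV GZIP;"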

As a best practice, you'll want to load into a staging table and then merge into your main table, so you can handle updates if needed and prevent duplicate rows, along these lines:
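In Redshift SQL that pattern looks roughly like this (sketch only; yourtable_staging and the id key column are assumptions about your schema, and TRUNCATE commits implicitly in Redshift, so it sits outside the transaction):

-- COPY into yourtable_staging instead of yourtable, then:
BEGIN;
DELETE FROM yourtable
USING yourtable_staging
WHERE yourtable.id = yourtable_staging.id;   -- drop rows that are being replaced
INSERT INTO yourtable
SELECT * FROM yourtable_staging;             -- insert the fresh copies
COMMIT;
TRUNCATE yourtable_staging;                  -- commits implicitly in Redshift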