0
votes

I have a S3 bucket stores daily log files, named "Log_YYYY_MM_DD". Also, I have a table in Redshift only stores data of the latest log. For example:

  • In S3, I have "Log_Date0.csv", "Log_Date1.csv", "Log_Date2.csv"
  • And the redshift table should only contains data in "Log_Date2.csv"

How can I setup a daily data pipeline so that redshift automatically copy the latest log contains and rewrite the table?

Thanks

1

1 Answers

0
votes

Amazon Redshift cannot "automatically copy" files. It will only load files when the SQL COPY command is issued.

Therefore, it is up to you to create the necessary commands to TRUNCATE the table and then COPY the data from Amazon S3.

You could write a cron job that uses psql v8.0.2 (to match Redshift) to run the SQL commands. Alternatively, you could use a third-party ETL product that can load the data for you.

To trigger the automatic load of information into Amazon Redshift when a new file is added to an Amazon S3 bucket, you could:

  • Create an AWS Lambda function that sends the COPY command to Redshift via a psql/JDBC connection
  • Configure the S3 bucket to trigger the Lambda function when ObjectCreated

See: A Zero-Administration Amazon Redshift Database Loader