2
votes

We have a use case where we process data in Redshift. I want to create a backup of these tables in S3 so that I can query them using Spectrum.

To move the tables from Redshift to S3 I am using a Glue ETL job. I have created a crawler for the Redshift tables. A Glue job converts the data to Parquet and stores it in S3, partitioned by date. Then another crawler crawls the S3 files to catalog the data again.

How can I eliminate the second crawler and do this in the job itself?


3 Answers

3
votes

There is no need to use AWS Glue or Athena to unload Redshift data to S3 in Parquet format. The feature to unload data in Parquet format is now supported in Redshift:

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT PARQUET

Documentation can be found at UNLOAD - Amazon Redshift
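If you also want to run the export from code (for example a scheduled script or a Glue Python shell job) and keep the date partitioning from your current setup, a minimal sketch using boto3 and the Redshift Data API could look like the following. The cluster, database, user, role ARN, bucket, table, and column names are placeholders, not values from the question:

# Minimal sketch: run UNLOAD programmatically via the Redshift Data API.
# All identifiers below are assumed placeholders.
import boto3

redshift = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://my-bucket/my_table_parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
PARTITION BY (event_date)
FORMAT PARQUET
"""

redshift.execute_statement(
    ClusterIdentifier="my-cluster",   # assumed cluster name
    Database="my_db",                 # assumed database
    DbUser="my_user",                 # user for temporary credentials
    Sql=unload_sql,
)

With PARTITION BY, the files land under date=... style prefixes in S3, so an external table pointing at that prefix can have its partitions registered as described in the other answers.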

0
votes

Yes, you can eliminate the second crawler, but not inside the job itself. Alternatively, if you know the schema of the files in S3, you can create the table directly in Athena with a CREATE EXTERNAL TABLE DDL statement. Then run MSCK REPAIR TABLE on it to pick up the data in all existing partitions. Once that is done, you can register each new partition in the Glue catalog using the Glue partition API calls.
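For example, a minimal boto3 sketch that registers the partition written by a job run, reusing the table's existing storage descriptor so the Parquet SerDe settings carry over. The database, table, bucket, and date values are assumptions, not the poster's actual names:

# Minimal sketch: add one new date partition to the Glue Data Catalog
# after a job run, instead of re-crawling the S3 location.
import boto3

glue = boto3.client("glue")

DATABASE = "my_spectrum_db"          # assumed Glue database name
TABLE = "my_table_parquet"           # assumed catalog table name
RUN_DATE = "2020-01-15"              # date partition written by this run

# Reuse the table's storage descriptor so the new partition inherits
# the Parquet SerDe and column definitions already in the catalog.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
sd = dict(table["StorageDescriptor"])
sd["Location"] = f"s3://my-bucket/my_table_parquet/date={RUN_DATE}/"

glue.create_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionInput={
        "Values": [RUN_DATE],        # one value per partition key (here: date)
        "StorageDescriptor": sd,
    },
)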

0
votes

Instead of creating another crawler to update the Glue catalog, create an external table on AWS Athena. For more details on the CREATE EXTERNAL TABLE command, refer to the following documentation from AWS:

AWS Athena - Create Table

Since your data is partitioned, you will need to update the partition information in the Glue catalog after every run. For that you can use the MSCK REPAIR command as follows:

MSCK REPAIR TABLE table_name

To automate this with code, you can use any AWS SDK from the following link, depending on the language of your choice: AWS SDKs.
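For example, with the Python SDK (boto3) the repair can be triggered through Athena after each job run. The table name, database, and query-result location below are placeholders:

# Minimal sketch: run MSCK REPAIR TABLE through Athena so newly written
# partitions become queryable without another crawler run.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table_parquet",        # assumed table name
    QueryExecutionContext={"Database": "my_spectrum_db"},    # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)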