How to create a Redshift table using Glue Data Catalog

Question

I'm developing an ETL pipeline using AWS Glue. I have a CSV file that is transformed in several ways with PySpark: duplicating columns, changing data types, adding new columns, and so on. I ran a crawler against the data store in S3, and it created a Glue table from the CSV file. When I add a new column to the CSV file and rerun the crawler, the Glue table changes accordingly.

Now I want to do the same with Amazon Redshift: create a table in Redshift that mirrors the Glue table mentioned above (the one created from the CSV). Many answers explain how to create Redshift schemas manually. I did that, but whenever a data type changes I have to update the table by hand. When the CSV file changes, the Redshift table must be updated accordingly.

Can I do the same using crawlers, that is, create a Redshift table that mirrors the Glue Catalog table? When a data type changes, or a column is added to or removed from the CSV file, can we simply run a crawler, or is there another method that meets my need? This should be a fully automated ETL pipeline.

Any help would be greatly appreciated!

fernolimits · Accepted Answer · 2021-03-02T07:52:04

Answering all of your questions is a big task. What I recommend is getting the concepts right for every piece of the puzzle you want to put together.

CSV files appear to have a flexibility that you will not get in Redshift. That is because the columns aren't really typed, it is all just text, and it is also very slow. I would recommend using Parquet files instead.
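To make the "it is just text" point concrete, here is a minimal sketch using only Python's standard library. The sample data and column names are made up for illustration; the point is only that a CSV reader hands back strings, and types appear only once a schema is applied on top.

```python
import csv
import io
from datetime import date

# Hypothetical sample data; the column names are invented for this example.
raw = "id,price,created\n1,9.50,2021-03-02\n2,12.00,2021-03-03\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every value comes back as a string, whatever it looks like.
all_text = all(isinstance(value, str) for row in rows for value in row.values())

# Types only exist once a schema is applied by hand (or inferred by a tool).
schema = {"id": int, "price": float, "created": date.fromisoformat}
typed_rows = [{name: schema[name](value) for name, value in row.items()} for row in rows]
```

A typed format such as Parquet stores that schema with the data, so nothing has to guess it on every read.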

Regarding Redshift: if your table doesn't exist yet, you can just use Spark to write it and it will be created, BUT you will not be able to set DISTKEY, SORTKEY, and so on; this approach is normally used for temp tables. If you have an additional column, you don't need to create it manually, Spark will do it. Changing a column's data type, however, is not simple, and you will not achieve it (easily) via ETL.
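If you would rather see schema drift explicitly before writing, a small comparison step can separate the easy case (added columns) from the hard one (changed types). This is only a sketch under assumptions: both schemas are plain `{column_name: redshift_type}` dicts that you have already fetched somehow, and `diff_columns`, `add_column_statements` and the table name `public.sales` are hypothetical names for illustration.

```python
def diff_columns(source_cols, redshift_cols):
    """Compare two {column_name: type} mappings.

    Returns (added, changed): columns missing in Redshift, and columns
    whose type differs as {name: (redshift_type, source_type)}.
    """
    added = {name: col_type for name, col_type in source_cols.items()
             if name not in redshift_cols}
    changed = {name: (redshift_cols[name], col_type)
               for name, col_type in source_cols.items()
               if name in redshift_cols and redshift_cols[name] != col_type}
    return added, changed


def add_column_statements(table, added):
    """Build one ALTER TABLE ... ADD COLUMN statement per new column."""
    return [f'ALTER TABLE {table} ADD COLUMN "{name}" {col_type};'
            for name, col_type in added.items()]


# Hypothetical schemas for illustration only.
source = {"id": "BIGINT", "price": "DOUBLE PRECISION", "region": "VARCHAR(256)"}
current = {"id": "INTEGER", "price": "DOUBLE PRECISION"}

added, changed = diff_columns(source, current)
statements = add_column_statements("public.sales", added)
```

Added columns turn into simple statements, while anything in `changed` is only reported: a type change usually means adding a new column, copying the data over and dropping the old one, which is why it is better handled as a deliberate migration than inside the ETL job.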

Finally, the Data Catalog is just a schema, the metadata. Mostly you use a table to create the metadata, not the metadata to create a table.
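That said, if you do want to go from the metadata to a table, it can be done by hand. The sketch below assumes the input is a list of `{"Name": ..., "Type": ...}` dicts, the shape of the `Columns` list under `Table.StorageDescriptor` in a Glue `get_table` response. The type mapping is an assumption covering only a few simple types (the `VARCHAR(256)` length is arbitrary), and `create_table_ddl` is a hypothetical helper, not part of any AWS library.

```python
# Assumed mapping from Glue (Hive-style) types to Redshift types; extend as needed.
GLUE_TO_REDSHIFT = {
    "string": "VARCHAR(256)",
    "smallint": "SMALLINT",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}


def create_table_ddl(table, glue_columns, distkey=None, sortkey=None):
    """Build a Redshift CREATE TABLE statement from Glue-style column metadata."""
    column_lines = []
    for col in glue_columns:
        rs_type = GLUE_TO_REDSHIFT.get(col["Type"].lower())
        if rs_type is None:
            # Complex types (struct, array, ...) have no obvious mapping here.
            raise ValueError(f"no Redshift mapping for Glue type {col['Type']!r}")
        column_lines.append(f'    "{col["Name"]}" {rs_type}')
    ddl = f"CREATE TABLE IF NOT EXISTS {table} (\n" + ",\n".join(column_lines) + "\n)"
    if distkey:
        ddl += f"\nDISTKEY({distkey})"
    if sortkey:
        ddl += f"\nSORTKEY({', '.join(sortkey)})"
    return ddl + ";"


# Hypothetical catalog columns for illustration only.
columns = [{"Name": "id", "Type": "bigint"}, {"Name": "region", "Type": "string"}]
ddl = create_table_ddl("public.sales", columns, distkey="id", sortkey=["id"])
```

Creating the table yourself like this is also the point where you can set DISTKEY and SORTKEY, which you cannot do when Spark creates the table for you.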
