0 votes

I am a newbie to AWS and Snowflake. I am looking to load CSV files from S3 into their respective Snowflake tables (about 100 tables) using AWS Glue. I was able to load data into one Snowflake table using the article below:

https://support.snowflake.net/s/article/How-to-Set-up-AWS-Glue-ETL-for-Snowflake

Is it possible to use one AWS Glue job to load a list of tables?

Inside AWS Glue, can we write logic to update or insert data in Snowflake based on the CSV files?

Please advise and share any sample code/solutions if available.

Thanks, Jo

3

I know you are asking about Glue specifically, but as someone else pointed out, you can use other tools that aren't so heavyweight. I would look into Snowflake's Snowpipe service. Basically, you set up a notification in S3 and some additional configuration in Snowflake; Snowflake will then auto-ingest new records from S3 without any jobs you need to maintain: docs.snowflake.com/en/user-guide/… – Brock

3 Answers

0 votes

First of all, if you do not need Spark to process or transform the data in your CSV files, using the Snowflake COPY command would be a better option. Under the hood, AWS Glue (Spark) also uploads the files to an internal stage and uses the COPY command to insert the data into the Snowflake database.

For using the COPY command to load data, see:

https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html

https://docs.snowflake.com/en/user-guide/data-load-external-tutorial.html
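For example, a minimal COPY statement might look like this (the table, stage, and file-format names here are illustrative assumptions, not from the question):

```sql
-- Illustrative names: MY_TABLE, my_stage, and my_csv_format are assumptions.
COPY INTO MY_TABLE
  FROM @my_stage/path/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'ABORT_STATEMENT';  -- stop the load on the first bad record
```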

About your questions:

Is it possible to use 1 aws glue to load a list of tables?

Yes, it's possible to use one AWS Glue job to load multiple tables. AWS Glue is a flexible tool in which you can write custom Spark code. That said, for simplicity, I recommend using one job per table.
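If you do go with a single job, one common pattern is to derive the target table name from the S3 object key and loop over the files. A minimal sketch, assuming a `<prefix>/<table_name>/<file>.csv` key layout (the layout and names are my assumptions, not from the question); the resulting name would be passed to the Snowflake Spark connector as the target table:

```python
def table_for_key(key: str) -> str:
    """Derive the target Snowflake table name from an S3 object key.

    Assumes keys look like "landing/customers/2023-01-01.csv", where the
    second-to-last path segment names the table (an illustrative layout).
    """
    parts = key.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"unexpected key layout: {key!r}")
    return parts[-2].upper()
```

For example, `table_for_key("landing/orders/part-0001.csv")` returns `"ORDERS"`, so one job can route all ~100 CSV prefixes without hard-coding each table.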

Inside AWS Glue - can we write logic to update or insert data in snowflake based on csv files ?

Yes, you can, but Spark is designed to process bulk data and Snowflake is a data warehouse; updating or inserting single rows will be inefficient for both. For running DML statements, check:

https://docs.snowflake.com/en/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
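An upsert is typically done in bulk: load the CSV data into a staging table, then run a single MERGE against the target using the connector's DML support linked above. A hedged sketch of building such a MERGE statement (the table and column names in the example are illustrative):

```python
from typing import Sequence

def build_merge_sql(target: str, staging: str,
                    key_cols: Sequence[str], cols: Sequence[str]) -> str:
    """Build a Snowflake MERGE that upserts staging rows into target.

    key_cols identify matching rows; cols is the full column list.
    """
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    updates = ", ".join(f"t.{c} = s.{c}" for c in cols if c not in key_cols)
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

The generated statement could then be executed through the mechanism described in the linked docs, so the update/insert happens as one set-based operation instead of row by row.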

0 votes

There is a simple process to load data into tables in Snowflake. Please refer to the video below.

https://www.youtube.com/watch?v=KslOVvXy1R4&feature=youtu.be

SELECT t.$1 AS MONTH_NUM, t.$2 AS MONTH_NAME
FROM @mys3stage (file_format => 'myfileformat') t;
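The query above assumes the stage and file format already exist. A minimal setup might look like this (the S3 URL is a placeholder I've made up, and a private bucket would additionally need credentials or a storage integration configured on the stage):

```sql
-- Illustrative setup; the URL is a placeholder, not from the answer.
CREATE FILE FORMAT myfileformat
  TYPE = CSV
  SKIP_HEADER = 1;

CREATE STAGE mys3stage
  URL = 's3://my-bucket/my-prefix/'
  FILE_FORMAT = myfileformat;
```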