0 votes

I am trying to build a pipeline that sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to Data Engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:

  1. I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?
  2. For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.
  3. All monthly data pulls should be stored in their own S3 subfolders, like this: query_01012020, query_01022020, query_01032020, etc.
  4. The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in SageMaker.
  5. I want to monitor the performance of the ML model in production over time to catch if the model is losing accuracy (some calibration-like graph perhaps).
  6. I want to get error notifications in real time when issues in the pipeline occur.

I hope you are able to guide me on relevant documentation/tutorials for this effort. I would truly appreciate the guidance.

Thank you very much.


2 Answers

1 vote

Snowflake does not have an orchestration tool like Airflow or Oozie. So you need to use, or think about using, one of the Snowflake Partner Ecosystem tools such as Matillion. Alternatively, you can build your own end-to-end flow using Spark, Python, or any other programming language that can connect to Snowflake via the JDBC/ODBC/Python connectors.
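
For the do-it-yourself route, here is a minimal sketch of the Python-connector approach (the connection parameters are placeholders, not values from the question):

# Minimal sketch: connect to Snowflake with the Python connector and run a query.
# All connection parameters below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="YOUR_WAREHOUSE",
    database="DB_NAME",
    schema="SCHEMA_NAME",
)
try:
    cur = conn.cursor()
    # Any of the CREATE TABLE / COPY INTO statements shown in this thread
    # can be executed the same way.
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())
finally:
    conn.close()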

To feed data to Snowflake from S3 in near real time, you can use the AWS SNS service to invoke a Snowpipe that feeds the data to the Snowflake stage environment, and then take it forward via the ETL process for consumption.
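
As an illustration, an auto-ingest pipe could be created along these lines. This is only a sketch: it assumes the external stage and the S3 event notification wiring to Snowflake are already in place, and the pipe, table, and stage names are made up:

# Sketch: create an auto-ingest Snowpipe through the Python connector.
# `cur` is a snowflake.connector cursor as in the connector sketch above;
# the pipe, table, and stage names are hypothetical.
def create_scored_results_pipe(cur):
    cur.execute("""
        CREATE PIPE IF NOT EXISTS scored_results_pipe
          AUTO_INGEST = TRUE
        AS
          COPY INTO DB_NAME.SCHEMA_NAME.SCORED_RESULTS
          FROM @DB_NAME.SCHEMA_NAME.SCORED_RESULTS_STAGE
          FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)
    """)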

Answers to each of your questions:

I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?

Ans: Snowflake's native scheduling is limited (Tasks only), so in practice you do this via AWS or some other orchestration tool.
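
On the AWS side, one way to get the monthly trigger is an EventBridge rule. This is a sketch that assumes the pull logic lives in a Lambda function; the rule name and Lambda ARN are made-up placeholders:

# Sketch: schedule a monthly trigger with Amazon EventBridge via boto3.
# The rule name and Lambda ARN are hypothetical placeholders.
import boto3

events = boto3.client("events")
events.put_rule(
    Name="monthly-snowflake-pull",
    ScheduleExpression="cron(0 0 1 * ? *)",  # 00:00 UTC on the 1st of every month
    State="ENABLED",
)
events.put_targets(
    Rule="monthly-snowflake-pull",
    Targets=[{
        "Id": "snowflake-pull-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:snowflake-pull",
    }],
)
# The Lambda also needs a resource-based permission allowing events.amazonaws.com
# to invoke it (added via the Lambda add-permission API).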

For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.

Ans: You can pull any amount of data, and you can also have some scripting on the Snowflake side to support that, but the invocation needs to be programmed.
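
For example, the pull query could be parameterised so that only the first run fetches 12 months and every later run fetches one. A sketch, with placeholder table and column names:

# Sketch: build the pull query with a 12-month window on the initial run and a
# 1-month window afterwards. Table and column names are placeholders.
def build_pull_query(is_initial_run: bool) -> str:
    months_back = 12 if is_initial_run else 1
    return f"""
        SELECT *
        FROM DB_NAME.SCHEMA_NAME.ORIGINAL_TABLE
        WHERE DATE_FIELD >= DATEADD(month, -{months_back}, CURRENT_DATE)
          AND DATE_FIELD < CURRENT_DATE
    """
# e.g. cur.execute(build_pull_query(is_initial_run=False))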

All monthly data pulls should be stored in their own S3 subfolders, like this: query_01012020, query_01022020, query_01032020, etc.

Ans: Feeding data into Snowflake is possible via AWS SNS (or the REST API) + Snowpipe, but vice versa is not; for the monthly export you run a COPY INTO <s3 location> yourself and can point each run at its own subfolder.
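
A sketch of that monthly export, deriving the query_ddmmyyyy subfolder name from the run date (the bucket, table, and credentials are placeholders):

# Sketch: unload a monthly pull into its own dated S3 subfolder, e.g.
# s3://your-bucket/query_01012020/ (ddmmyyyy, as in the question).
# `cur` is a snowflake.connector cursor as in the connector sketch above.
from datetime import date

def unload_monthly_pull(cur):
    subfolder = "query_" + date.today().strftime("%d%m%Y")
    cur.execute(f"""
        COPY INTO 's3://your-bucket/{subfolder}/'
        FROM DB_NAME.SCHEMA_NAME.TABLE_NAME
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')
        CREDENTIALS = (AWS_KEY_ID='your_aws_key_id' AWS_SECRET_KEY='your_aws_secret_key')
    """)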

The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in SageMaker.

Ans: This is possible via AWS SNS + Snowpipe.
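
For example, once the SageMaker job has written its scored output to S3, copying the file into the prefix watched by an auto-ingest pipe (like the one sketched above) is enough to trigger the load; the bucket names and keys here are made up:

# Sketch: after SageMaker scoring completes, move the scored file into the
# S3 prefix that the auto-ingest Snowpipe's external stage points at.
# Bucket names and keys are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.copy_object(
    Bucket="your-bucket",
    CopySource={"Bucket": "your-bucket", "Key": "sagemaker-output/scored.csv"},
    Key="snowpipe-stage/scored.csv",
)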

I want to monitor the performance of the ML model in production over time to catch if the model is losing accuracy (some calibration-like graph perhaps).

Ans: Not possible via Snowflake.

0 votes

I would approach the problem like this:

In a temporary table, hold 12 months of data (I am sure you know all the required queries; since you asked for tutorials, I am spelling them out in case it is helpful for you as well as others).

-- Initial Pull: Hold 12 months of Data ....
Drop table if exists <TABLE_NAME>;
Create Temporary Table <TABLE_NAME> as (
    Select *
    From <ORIGINAL_TABLE>
    Where date_field between current_date - 365 and current_date
);

-- Export data to S3 ... 
copy into 's3://path/to/export/directory'
from DB_NAME.SCHEMA_NAME.TABLE_NAME
file_format = (type = csv field_delimiter = '|' skip_header = 0)
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key');

Once your ML step is done, import the data back into Snowflake like this:

-- Import from S3 ...
copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
from 's3://path/to/your/csv_file_name.csv'
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
file_format = (type = csv field_delimiter = '|' skip_header = 1); 

I am not sure whether Snowflake has released any ML features, or how you will do the ML on your end, etc.

For the scheduling, I would suggest either:

  1. Place your code in a shell script or a Python script and schedule it to run once a month.

  2. Use Snowflake tasks as follows:

CREATE TASK monthly_task_1 WAREHOUSE = <WAREHOUSE_NAME> SCHEDULE = 'USING CRON 0 0 1 * * America/Chicago' AS <insert your create temporary table query here>;

CREATE TASK monthly_task_2 WAREHOUSE = <WAREHOUSE_NAME> AFTER monthly_task_1 AS <insert your S3 export query here>;

Note that tasks are created in a suspended state, so you need to resume them (ALTER TASK <task_name> RESUME) before they start running. You can read more about Snowflake tasks here: https://docs.snowflake.com/en/sql-reference/sql/create-task.html

For importing the results back into Snowflake from S3 after the ML step is done, you can add a few lines to your ML code (presumably in Python) to execute the COPY INTO statement from the "-- Import from S3" block written above.
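
A rough sketch of those few lines, assuming the Snowflake Python connector and placeholder connection details:

# Sketch: run the "-- Import from S3" COPY INTO from the ML/scoring script.
# Connection parameters, table, path, and credentials are placeholders.
import snowflake.connector

IMPORT_SQL = """
copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
from 's3://path/to/your/csv_file_name.csv'
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
"""

conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="YOUR_WAREHOUSE",
    database="DB_NAME",
    schema="SCHEMA_NAME",
)
try:
    conn.cursor().execute(IMPORT_SQL)
finally:
    conn.close()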