0 votes

Question: Can anyone help me find a solution for ingesting data from Azure Data Factory into a Snowflake table without using Azure Blob Storage?

Requirements: We currently have a set of customer IDs stored in a Snowflake table. We want to iterate through each customer ID, fetch the full customer details from Amazon S3 via a web API, and write them back to a Snowflake table. The current system uses Azure Databricks (PySpark) to POST a customer ID, GET the related JSON data from S3 via the web API, parse the JSON to extract the required information, and write it back to Snowflake. This process takes at least 3 seconds per record, which we cannot afford given the large data volume we have to process, and keeping the ADB cluster running for a long time costs more. The solution we are considering is to use Azure Data Factory, instead of the Python web API calls, to get the data from the S3 bucket and ingest it into the Snowflake table. Since this is customer data, privacy rules do not allow us to stage it in Azure Blob Storage before writing it to Snowflake. Is there any other method to write it to the Snowflake table directly from S3, or through ADF, without using Blob Storage?
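
For reference, the per-record pattern described above looks roughly like the sketch below; the API endpoint, field names, and customer IDs are placeholders, not the actual system. Each iteration pays a full HTTP round trip plus JSON parsing, which is where the roughly 3 seconds per record comes from.

```python
# Minimal sketch of the current per-record flow (placeholder endpoint and field names).
import requests

def fetch_customer(customer_id: str) -> dict:
    # Hypothetical web API call: POST the customer ID, receive the customer JSON.
    resp = requests.post("https://example-api/customer-details",
                         json={"customer_id": customer_id})
    resp.raise_for_status()
    return resp.json()

def extract_fields(payload: dict) -> dict:
    # Parse only the fields needed downstream (placeholder field names).
    return {"customer_id": payload.get("customer_id"),
            "name": payload.get("name"),
            "email": payload.get("email")}

# The customer IDs would come from the Snowflake staging table; shown here as a list.
customer_ids = ["C001", "C002"]
records = [extract_fields(fetch_customer(cid)) for cid in customer_ids]
# records would then be written back to Snowflake (e.g. via the Spark connector).
```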

You don't need expensive Databricks to read data from Snowflake, look up a web service, and write data back to Snowflake. You can use just about any scripting language to do this. How many customer IDs do you have? If you have anything under maybe 10 million or so, you could cache them somewhere outside Snowflake, e.g. an on-prem SQL Server (Express) - free, then use PowerShell to iterate through, call the web service (free), get the result and write it back to SQL Server (free). Once all of that is done, you can bulk upload the results to Snowflake (expensive). – Nick.McDermaid
@Nick.McDermaid Thank you very much for your suggestion. We are considering a daily batch ETL flow where our daily customer count varies from 10K to 15K, but before we productionise this system as a whole, we may have to load about 10 months of historical data. We are avoiding any on-prem database for now, as our systems are completely on Azure. As of now the customer data (only IDs) is staged in a Snowflake table, and we are trying to optimize the related data lookup/extraction for these IDs from the web APIs. – Nidheesh kumar
Have you considered leveraging Snowflake's internal stage? I realize that this is also Azure Blob, but the security and encryption on it is extremely high, and the data gets flushed as soon as the COPY INTO is successful. – Mike Walton
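A minimal sketch of that internal-stage approach, using the Snowflake Python connector, is below; the file path, stage, table, and connection details are placeholders. Files PUT into a named internal stage are encrypted by Snowflake, and you can REMOVE them right after the COPY INTO succeeds.

```python
# Minimal sketch: load via a Snowflake internal stage (placeholder names throughout).
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<db>", schema="<schema>",
)
cur = conn.cursor()

# Upload a locally written JSON file to a named internal stage (encrypted at rest).
cur.execute("CREATE STAGE IF NOT EXISTS customer_stage")
cur.execute("PUT file:///tmp/customer_details.json @customer_stage AUTO_COMPRESS=TRUE")

# Load from the stage into the target table, then clean up the staged files.
cur.execute("""
    COPY INTO customer_details
    FROM @customer_stage
    FILE_FORMAT = (TYPE = 'JSON')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
cur.execute("REMOVE @customer_stage")

cur.close()
conn.close()
```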
Well, you can build this in something like Azure Functions and Azure SQL if you like. The fact remains that there are much cheaper platforms to achieve this on. – Nick.McDermaid

2 Answers

0 votes

You can create a Databricks notebook, read all the data from S3, and temporarily store it on DBFS, which is cleaned up as soon as the cluster terminates.

ADF -> Databricks Notebook

Databricks:
Read from S3 -> create a PySpark DataFrame -> filter the data based on your condition -> write to Snowflake
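
A minimal sketch of that flow with the Snowflake Spark connector is below; the bucket path, table name, filter condition, and connection options are placeholders, and it assumes the connector (net.snowflake.spark.snowflake) is installed on the cluster.

```python
# Minimal sketch: read JSON from S3, filter, write to Snowflake (placeholder names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Snowflake connection options; fill in with real account details.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read the customer JSON files directly from S3 into a DataFrame.
customers = spark.read.json("s3a://<bucket>/customer-details/")

# Filter the data based on your condition (placeholder condition shown).
wanted = customers.filter(F.col("customer_id").isNotNull())

# Write the result straight to Snowflake; "snowflake" resolves to
# net.snowflake.spark.snowflake when the connector is installed on Databricks.
wanted.write.format("snowflake").options(**sf_options) \
    .option("dbtable", "CUSTOMER_DETAILS").mode("append").save()
```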

0 votes

Well, if your data is already on S3 you can just use the COPY INTO command. https://docs.snowflake.com/en/user-guide/data-load-s3.html
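
As a sketch only (the stage name, storage integration, table, and S3 path below are placeholders), a direct load from S3 through an external stage could look like this, run here via the Snowflake Python connector:

```python
# Minimal sketch: COPY INTO directly from S3 via an external stage (placeholder names).
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<db>", schema="<schema>",
)
cur = conn.cursor()

# External stage pointing at the S3 bucket; a storage integration (or AWS keys)
# must already be configured -- see the linked Snowflake docs.
cur.execute("""
    CREATE STAGE IF NOT EXISTS customer_s3_stage
    URL = 's3://<bucket>/customer-details/'
    STORAGE_INTEGRATION = <my_s3_integration>
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Load straight from S3 into the target table; no Azure Blob staging involved.
cur.execute("""
    COPY INTO customer_details
    FROM @customer_s3_stage
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

cur.close()
conn.close()
```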