2
votes

I am looking for recommendations on where to best store ETL metadata when building ETL/ELT processes that load data into a Snowflake data warehouse (running on Azure). By ETL metadata I mean tabular data required to orchestrate complex ETL workflows - e.g. dependencies between processes, current status info on each process, "control tables" that we can use within ETL to iterate over individual data sources to be loaded, etc.

In the legacy EDW that was built on SQL Server, we were able to use SQL Server for storing not only the actual EDW data but also the metadata described above, and we could use T-SQL to access and manipulate this metadata.

With the new implementation, Snowflake does not seem to be an appropriate place to store ETL metadata. However, spinning up another cloud database (e.g. Azure SQL Database) just for this purpose would require significant additional expense. We are planning to use Talend Cloud to build our new ETL, but I imagine the question about storing ETL metadata would come up regardless of the choice of the integration tool.

Any feedback is greatly appreciated!

2
I imagine the performance tier required for storing and using ETL metadata would be pretty low, in which case Azure SQL wouldn't be that expensive. Also, why does Snowflake not appear to be an appropriate place? – Nick.McDermaid
Store it in Snowflake, as Marcel has answered; there is no reason not to. – Simon D
Thanks for the feedback @Nick.McDermaid! One common use case for us is where we have to make thousands of API calls, persist the JSON responses locally, then upload all responses to the Snowflake stage. For each successful API call, we update the control table so that we do not have to repeat all the calls if the job fails. Putting the control table in Snowflake means keeping the compute running while we issue the calls, hence the extra cost. It will also increase the duration, since Snowflake is not meant for this type of workload. – Vladimir Kupchan
Running Azure SQL for this kind of workload is probably hundreds of dollars a month, which is likely nothing compared to Snowflake. – Nick.McDermaid
Thanks, @SimonDarr. Like I said in the other comment, the issue with storing it in Snowflake is having to keep the compute running (hence extra cost), plus extra time in scenarios where we have to issue frequent (e.g. every second) queries against the ETL metadata. The API-call scenario I described is one example; there will be others. – Vladimir Kupchan
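The resume-on-failure pattern described in the comments above (update a control record after each successful API call so a re-run skips completed work) can be sketched without touching the warehouse at all. This is a minimal illustration, not anyone's actual implementation: the state file name and the function names are invented, and the API call is a stub you would replace.

```python
import json
import os

STATE_FILE = "completed_calls.json"  # hypothetical local state file


def load_completed():
    """Return the set of source IDs whose API calls already succeeded."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()


def mark_completed(completed, source_id):
    """Record a successful call so a re-run can skip it."""
    completed.add(source_id)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(completed), f)


def run_batch(source_ids, call_api):
    """Issue only the calls that have not yet succeeded."""
    completed = load_completed()
    for source_id in source_ids:
        if source_id in completed:
            continue  # already done in a previous (possibly failed) run
        call_api(source_id)  # persist the JSON response locally here
        mark_completed(completed, source_id)
```

If the job dies mid-batch, the next run re-reads the state file and resumes after the last recorded success, so no Snowflake compute is needed while the calls are in flight.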

2 Answers

1
vote

I'm also using Snowflake for all the EDW metadata. In my opinion, there is no reason not to use Snowflake here.

Another thing I like about this approach is using Snowsight to visualize statistics about my metadata and build something like a "controlling dashboard".
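The kind of aggregate such a dashboard would chart is a plain status roll-up over the metadata tables. Here is a sketch using Python's built-in sqlite3 as a stand-in for Snowflake (the SQL shape is the same as what you would point a Snowsight chart at); the table and column names are made up for illustration.

```python
import sqlite3

# In-memory sqlite3 stands in for Snowflake; the query is the same shape
# a Snowsight dashboard tile would run. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_job_runs (
        job_name   TEXT,
        run_status TEXT,   -- e.g. 'SUCCEEDED', 'FAILED', 'RUNNING'
        started_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO etl_job_runs VALUES (?, ?, ?)",
    [
        ("load_orders",    "SUCCEEDED", "2023-01-01"),
        ("load_orders",    "FAILED",    "2023-01-02"),
        ("load_customers", "SUCCEEDED", "2023-01-01"),
    ],
)

# The kind of aggregate a "controlling dashboard" would visualize.
rows = conn.execute("""
    SELECT run_status, COUNT(*) AS runs
    FROM etl_job_runs
    GROUP BY run_status
    ORDER BY run_status
""").fetchall()
print(rows)  # [('FAILED', 1), ('SUCCEEDED', 2)]
```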

0
votes

I feel that this question is more about solution architecture than any technical problem, and is probably not a great fit for Stack Overflow. However...

I have had this exact problem, and in general it is an interesting problem that comes with serverless databases. Traditionally, in my experience at least, most of these ETL requirements are absorbed by the database you are loading into in the first place; this is usually a matter of cost and access.

With serverless computing, you are incentivized to batch your workloads as tightly as possible, giving the service as much downtime as possible so you aren't paying for wasted compute time.

Active ETL processes, especially near-real-time systems, are constantly checking the source system for updates to propagate to the target database, but they don't really have to talk to the target database at all unless there is new data to add or you are doing data validation of some description.

This is where your proposed ETL DB comes in: I don't want to hit Snowflake unless I have to, because it will cost me more. So how can I cheaply stand up a DB to meet that requirement without using Snowflake?

My suggestion is to think about what type of data you are going to be storing and what formats it is in.

If key-value pairs work, look at https://aws.amazon.com/dynamodb/ (price/performance-wise it is very hard to beat; this is how I solved my problem).
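For ETL state, the key-value shape is usually just "(job, source) → status + timestamp". The sketch below models that access pattern with a plain dict standing in for the DynamoDB table (with boto3 these would be `put_item`/`get_item` calls against a table keyed on job and source); all names are invented for illustration.

```python
from datetime import datetime, timezone

# A plain dict stands in for a DynamoDB table with a composite key of
# (job_id, source_id). Real code would use boto3 put_item/get_item.
etl_state = {}


def record_load(job_id, source_id, status):
    """Upsert the state item for one unit of work."""
    etl_state[(job_id, source_id)] = {
        "status": status,
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }


def already_loaded(job_id, source_id):
    """Cheap point lookup -- no warehouse compute involved."""
    item = etl_state.get((job_id, source_id))
    return item is not None and item["status"] == "LOADED"


record_load("daily_orders", "source_a", "LOADED")
print(already_loaded("daily_orders", "source_a"))  # True
print(already_loaded("daily_orders", "source_b"))  # False
```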

Also, if your data isn't complex in nature, consider flat files or something like SQLite on the ETL server.
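SQLite ships with Python, so a local control table on the ETL server costs nothing to run and can absorb per-call status updates without waking the warehouse. A minimal sketch, with a hypothetical schema:

```python
import sqlite3

# A control table kept locally on the ETL server, so frequent status
# updates never touch the warehouse. Schema and names are illustrative.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("""
    CREATE TABLE control_table (
        source_id TEXT PRIMARY KEY,
        status    TEXT NOT NULL DEFAULT 'PENDING'
    )
""")
conn.executemany(
    "INSERT INTO control_table (source_id) VALUES (?)",
    [("api_page_1",), ("api_page_2",), ("api_page_3",)],
)

# Mark one unit of work done; cheap enough to do after every API call.
conn.execute(
    "UPDATE control_table SET status = 'DONE' WHERE source_id = ?",
    ("api_page_1",),
)

# On a re-run, only the pending work remains.
pending = [row[0] for row in conn.execute(
    "SELECT source_id FROM control_table "
    "WHERE status = 'PENDING' ORDER BY source_id"
)]
print(pending)  # ['api_page_2', 'api_page_3']
```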