
We've used ADF's Copy activity to extract data from database tables into individual blob files. We regularly have files of 8-12 GB, with some occasionally much larger. When loading into Snowflake, it appears that Snowflake uses one thread per file, meaning a single large file can take over an hour to load. In contrast, 9 GB of blobs from an API source that forces us to paginate (resulting in hundreds of files of roughly 50 MB each) loads in around 2 minutes because the files are processed in parallel.
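For illustration, here is a minimal sketch of the kind of bulk COPY load we're describing, assuming an external stage over the blob container and using the snowflake-connector-python library; the account, stage, and table names are placeholders, not our real ones. Snowflake spreads the files in a stage across parallel load threads, so hundreds of ~50 MB files load concurrently while one 10 GB file is handled by a single thread.

    # Minimal sketch of a parallel bulk load (placeholder account/stage/table names).
    # Snowflake's COPY parallelizes across the *files* in the stage, so many
    # ~50 MB files load concurrently while one huge file uses a single thread.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",   # placeholder
        user="loader",          # placeholder
        password="...",         # placeholder
        warehouse="LOAD_WH",
        database="STAGE_DB",
        schema="RAW",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("""
                COPY INTO raw.big_table
                FROM @blob_stage/big_table/   -- external stage over the blob container
                FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
            """)
    finally:
        conn.close()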

What we really want to do is:

1. Perform a full extract so all data is loaded. (We looked into partitioning, but it looks like it would hit the table multiple times per run; we want all of the data as of a specific time and do not want the data changing between runs.)
2. Write multiple files from that extract, preferably split by file size. If we could at least choose the number of records per file, we could use trial and error to pinpoint the file sizes we're looking for (roughly the kind of chunked write sketched below).
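To make the second point concrete, below is a rough, hand-rolled sketch of the chunked write we'd like ADF to do for us, using pandas and the azure-storage-blob SDK. The connection strings, query, container name, and rows-per-file value are placeholder assumptions to be tuned, not working values.

    # Rough sketch only (placeholder connection strings and names): one full
    # extract, written out as many fixed-row-count CSV blobs so Snowflake can
    # load them in parallel.
    import pandas as pd
    import sqlalchemy
    from azure.storage.blob import BlobServiceClient

    ROWS_PER_FILE = 500_000  # tune by trial and error to land near the target file size

    engine = sqlalchemy.create_engine("mssql+pyodbc://...")         # placeholder source
    blob_service = BlobServiceClient.from_connection_string("...")  # placeholder
    container = blob_service.get_container_client("staging")        # placeholder container

    query = "SELECT * FROM dbo.big_table"  # one query, so the table is only hit once

    with engine.connect() as conn:
        # read_sql with chunksize executes the query once and fetches in batches
        for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=ROWS_PER_FILE)):
            data = chunk.to_csv(index=False).encode("utf-8")
            container.upload_blob(name=f"big_table/part_{i:05d}.csv", data=data, overwrite=True)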

We've tried Azure Table storage as a staging sink, but its limited set of data types conflicts with our source data types. Our sources are varied and on-premises: Oracle, MySQL, and SQL Server (so far). We looked at the new Data Flows in ADF, but they are in preview and only work with Azure SQL sources.

Is there any way to paginate your writes in ADF?

Thanks!


1 Answer


There is an alternative to Data Factory: Matillion, which is built specifically for Snowflake and has most of the connectors and capabilities of Data Factory. You can find "Matillion ETL for Snowflake" in the Azure Marketplace ("Create a resource" in the Azure portal).