
Architectural/perf question here.

I have an on-premises SQL Server database with ~200 tables totaling ~10 TB. I need to make this data available in Azure in Parquet format for data science analysis via HDInsight Spark.

What is the optimal way to copy/convert this data to Azure (Blob storage or Data Lake) in Parquet format?

Given the manageability aspect of the task (~200 tables), my best idea so far is: extract the data locally to a file share via sqlcmd, compress it as csv.bz2, and use Data Factory to copy the file share (with 'PreserveHierarchy') to Azure. Finally, run PySpark to load the data and save it as Parquet, as sketched below.
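For the final step, a minimal PySpark sketch of the load-and-convert stage might look like the following. The storage account, container names, and folder layout are placeholders, and it assumes the extracted CSVs have a header row and sit in one folder per table:

```python
from pyspark.sql import SparkSession

# Hypothetical paths: adjust storage account, containers, and folder layout.
SOURCE_ROOT = "wasbs://staging@mystorageacct.blob.core.windows.net/extract"
TARGET_ROOT = "wasbs://curated@mystorageacct.blob.core.windows.net/parquet"

spark = SparkSession.builder.appName("csv-bz2-to-parquet").getOrCreate()

# Assumed layout: one folder per table, e.g. extract/dbo.Customers/*.csv.bz2
tables = ["dbo.Customers", "dbo.Orders"]  # in practice, generated from the schema

for table in tables:
    df = (spark.read
          .option("header", "true")         # assumes the extract wrote a header row
          .option("inferSchema", "true")    # or supply an explicit schema per table
          .csv(f"{SOURCE_ROOT}/{table}/"))  # Spark decompresses .bz2 transparently
    (df.write
       .mode("overwrite")
       .parquet(f"{TARGET_ROOT}/{table}/"))
```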

Given the table schemas, I can auto-generate the SQL data-extract and Python scripts from the database via T-SQL.
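To illustrate that generation step (shown here in Python rather than T-SQL, since either can walk the catalog views), a sketch that lists the base tables and emits one sqlcmd export command per table could look like this; the connection string, file-share path, and delimiter are assumptions:

```python
import pyodbc

# Hypothetical connection string; adjust server, database, and authentication.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onprem-sql;DATABASE=SourceDb;Trusted_Connection=yes;"
)
SHARE_ROOT = r"\\fileserver\extract"  # assumed local file share used for staging

conn = pyodbc.connect(CONN_STR)
cursor = conn.cursor()
cursor.execute(
    "SELECT TABLE_SCHEMA, TABLE_NAME "
    "FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
)

for schema, table in cursor.fetchall():
    # Emit one sqlcmd invocation per table; -s sets the column separator,
    # -W trims trailing spaces (note sqlcmd output is not fully quoted CSV).
    print(
        f'sqlcmd -S onprem-sql -d SourceDb -E '
        f'-Q "SET NOCOUNT ON; SELECT * FROM [{schema}].[{table}]" '
        f'-o "{SHARE_ROOT}\\{schema}.{table}.csv" -s "," -W'
    )

conn.close()
```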

Are there faster and/or more manageable ways to accomplish this?


1 Answer


Azure Data Factory (ADF) matches your requirement well, supporting both one-time and scheduled data movement.

Try the ADF Copy Wizard. With it, you can move data directly from on-premises SQL Server to Blob storage/ADLS in Parquet format with just a couple of clicks.

Copy Activity Overview