3
votes

I'm in a position where we're reading from our Azure Data Lake using external tables in Azure Data Warehouse.

This enables us to read from the data lake, using well known SQL.

However, another option is using Data Lake Analytics, or some variation of HDInsight.

Performance wise, I'm not seeing much difference. I assume Data Warehouse is running some form of distributed query in the background, converting to U-SQL(?), and so why would we use Data Lake Analytics with the slightly different syntax of U-SQL?

With python script also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than the cost (pay per batch job, rather than constant up time of a database).

2

2 Answers

5
votes

If your main purpose is to query data stored in the Azure Data Warehouse (ADW) then there is not real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, like json documents or csv files for example, the benefit of ADLA becomes clear as U-Sql allows you to join your relational data stored in ADW with the (un)structured / nosql data stored in ADLS.

Also, it enables you to use U-Sql to prepare this other data for direct import in ADW, so Azure Data Factory is not longer required to get the data into you data warehouse. See this blogpost for more information:

A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a loading ready format. From there, the high value data can be imported into Azure SQL DW via PolyBase.

..

You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.

3
votes

Please note that the SQL statement in SQL Data Warehouse is currently NOT generating U-SQL behind the scenes. Also, the use cases between ADLA/U-SQL and SDW are different.

ADLA is giving you an processing engine to do batch data preparation/cooking to generate your data to build a data mart/warehouse that you then can read interactively with SQL DW. In your example above, you seem to be mainly doing the second part. Adding "Views" on top on these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you operating on big data (and not just a few 100k rows).