2 votes

What would cause Polybase performance to degrade when querying larger datasets in order to load records from Blob storage into Azure Data Warehouse?

For example, a few thousand compressed (.gz) CSV files with headers, partitioned by a few hours per day across 6 months' worth of data. Querying these files through an external table in SSMS is not exactly optimal, and it's extremely slow.

Ultimately, I'm loading data through Polybase in order to transfer it into Azure Data Warehouse. Except that, with large datasets, Polybase seems pretty slow.

What options are available to optimize Polybase here? Should I just wait out the query, or load the data incrementally after each upload to Blob storage?
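For context, the external objects are set up roughly like the sketch below; the credential, data source, file format, table and column names are illustrative placeholders, not the real ones:

    -- Illustrative setup only: names, location and columns are placeholders.
    CREATE MASTER KEY;  -- once per database, if not already present

    CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
    WITH IDENTITY = 'user', SECRET = '<storage account access key>';

    CREATE EXTERNAL DATA SOURCE AzureBlobStore
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
        CREDENTIAL = BlobStorageCredential
    );

    CREATE EXTERNAL FILE FORMAT GzipCsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2),  -- skip the header row
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
    );

    CREATE EXTERNAL TABLE dbo.EventsExternal
    (
        EventDate  date,
        EventHour  int,
        Payload    nvarchar(4000)
    )
    WITH (
        LOCATION = '/events/',            -- folder root; every .gz file underneath is read
        DATA_SOURCE = AzureBlobStore,
        FILE_FORMAT = GzipCsvFormat
    );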

When you say degrade, do you mean it gets slower over time, or is it just slow in general? – Murray Foxcroft
What resource class are you using? Consider using largerc to improve performance, at the cost of reduced concurrency. If you are connected as the default admin user, its resource class will be small by default and can't be changed. DWU400 is pretty low for doing anything; why not 1000, 2000 or 6000 temporarily, then lower it when your CTAS is done? This is one of the really useful features of Azure SQL Data Warehouse, along with pause. – wBob
We're on xlarge. We found out the admin was small the hard way (why is that?). We're limited to 400 DWU as part of the free credit. We're trying to revert this, but we now have to recreate to do pay-as-you-go. We're trying that next to see if scaling to 6000 DWU has any impact. – Fastidious
Some other best practices: do not put multiple files in a single zip file, and the number of files should be greater than or equal to the total number of readers of your service level objective. At DWU400 the max external readers is 32; at DWU6000, the max external readers is 480. You might also experiment with mediumrc for potentially increased concurrency. Please report back any findings you have as they will be useful for the thread! – wBob
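For anyone following along, a minimal sketch of the resource class change described in the comments above (the login and user names are placeholders; the built-in server admin is pinned to smallrc, which is why it can't be changed):

    -- In master: create a login for a dedicated loading user.
    CREATE LOGIN LoadUser WITH PASSWORD = '<strong password>';

    -- In the data warehouse database: create the user and move it into largerc.
    CREATE USER LoadUser FOR LOGIN LoadUser;
    EXEC sp_addrolemember 'largerc', 'LoadUser';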

1 Answer

1 vote

In your scenario, Polybase has to connect to the files in the external source, uncompress them, ensure they fit your external table definition (schema), and then allow the contents to be targeted by the query. When you are processing a large number of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
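If you want to confirm where the time is going, the request DMVs break a load down into steps; long-running DMS steps are the external readers doing that decompress-and-parse work. A sketch (the request_id filter is a placeholder you substitute from the first query):

    -- Recent requests: find your external query / CTAS here.
    SELECT TOP 10 request_id, submit_time, total_elapsed_time, command
    FROM sys.dm_pdw_exec_requests
    ORDER BY submit_time DESC;

    -- Step breakdown for one request.
    SELECT step_index, operation_type, status, total_elapsed_time, row_count
    FROM sys.dm_pdw_request_steps
    WHERE request_id = 'QID1234'    -- substitute the request_id from the query above
    ORDER BY step_index;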

Azure Blob Storage will (currently) max out at around 1,250 MB/sec, so if your throughput is nowhere near that limit, the best way to improve performance is to upgrade the DWU on your SQL Data Warehouse. In the background, this will spread your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled up or down in a matter of minutes.
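As a sketch (the database name is a placeholder), scaling is a single statement run against the master database of the logical server. In-flight queries are cancelled while the scale happens, so do it before kicking off the load, not during:

    -- Run in master on the logical server that hosts the warehouse.
    ALTER DATABASE MyDataWarehouse
    MODIFY (SERVICE_OBJECTIVE = 'DW1000');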

If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.

Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.

You could also consider using Azure Data Factory to do the work. See here for supported file formats; GZip is supported. Use the Copy Activity to copy from Blob storage into SQL DW.

Also look into:

  1. CTAS (Create Table As Select), the fastest way to move data from external tables into internal storage in Azure Data Warehouse (see the sketch after this list).
  2. Creating statistics on your external tables if you are going to query them repeatedly. Unlike SQL Server, SQL Data Warehouse does not create statistics automatically, so you need to do this yourself (also shown below).
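A minimal sketch of both; the table, column and distribution choices are illustrative, not prescriptive:

    -- 1. CTAS: land the external data in an internal, distributed table.
    CREATE TABLE dbo.Events
    WITH (
        DISTRIBUTION = ROUND_ROBIN,       -- reasonable default for a staging load
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT EventDate, EventHour, Payload
    FROM dbo.EventsExternal;

    -- 2. Statistics on the external table, if you will keep querying it directly.
    CREATE STATISTICS st_EventDate ON dbo.EventsExternal (EventDate);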