
I generate an ORC table (compressed with Snappy) with Spark (Databricks) on an Azure Storage Account (with the ADLS Gen2 feature enabled). The ORC data represents about 12 GB (1.2 billion rows), and the table has 32 columns.

Once it's generated, I load this file into an internal table in Synapse Analytics using PolyBase.

Here are my results with different configurations:

  • DW100c / smallrc = 3h52
  • DW400c / smallrc = 1h50
  • DW400c / xlargerc = 1h58
  • DW1000c / xlargerc = 0h50
  • DW1500c / xlargerc = 0h42

When I look at the Storage Account ingress/egress, I see activity for a few minutes (maybe copying the ORC files between Synapse nodes), then the Synapse resources start to be stressed: CPU activity for a while, then memory increasing slowly, slowly, ...

Here is an example of memory (red) and CPU max % (blue):

[screenshot: memory and CPU max % over time]

Do I need to scale up again? I don't think this is a network-throughput problem. Or maybe a configuration problem? Regarding PolyBase, I don't understand why this is so slow: PolyBase is supposed to ingest TBs of ORC data quickly!
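For reference, here is roughly the shape of the setup, as a sketch only: the object names, container path and column list are placeholders, not the real schema.

-- PolyBase setup using a Managed Service Identity credential (placeholder names).
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity';

CREATE EXTERNAL DATA SOURCE adls_src
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://yourcontainer@youraccount.dfs.core.windows.net',
    CREDENTIAL = msi_cred
);

CREATE EXTERNAL FILE FORMAT orc_snappy
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.ext_orc_table
(
    id      BIGINT,
    payload NVARCHAR(4000)   -- wide string column; the real table has 32 columns
)
WITH (
    LOCATION = '/folder1/',
    DATA_SOURCE = adls_src,
    FILE_FORMAT = orc_snappy
);

-- The load itself: SELECT INTO from the external table.
SELECT *
INTO dbo.internal_orc_table
FROM dbo.ext_orc_table;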

BR, A.

Edit: DWU usage

[screenshot: DWU usage over time]

Can you post your code? – GregGalloway
@GregGalloway it's a SELECT INTO from an external table. We are using a Managed Service Identity (DATABASE SCOPED CREDENTIAL) on an EXTERNAL DATA SOURCE. My EXTERNAL FILE FORMAT is ORC with Snappy. – alxsbn
So no CAST functions? What are the data types? Did you choose the minimum string widths possible? – GregGalloway
We are using NVARCHAR with max widths since this is unstructured data. – alxsbn
Meaning you are importing JSON data or something that may be arbitrary length and you will parse it later in a stored proc? It may be worth a test to profile the max actual length and see if shrinking the width of the string columns improves loading performance. I'm worried that the internal data movement buffers will be sized to allow wide data and will perform worse than with regularly sized columns. At least that is a common, well-documented best practice in Azure DW. – GregGalloway
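A minimal sketch of that profiling step, assuming the hypothetical external table and column name from the sketch in the question:

-- Measure the real maximum length of the wide string column(s) so the target
-- columns can be sized to fit. This scans the external data, so run it once.
SELECT MAX(LEN(payload)) AS max_len_payload
FROM dbo.ext_orc_table;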

1 Answer


There are a couple of things you can try. Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) has a concept of readers and writers tied to the DWU. I can't find a current version of this documentation, but some old Gen1 docs I have indicate that DWU1500 has 120 readers. This strongly suggests you should split your one big file up into many files.

I would do some experiments, starting at 10, i.e. 10 files of 1.2 GB each, and work up until you find an optimal setting for your workload. I should say I have not tested this with ORC files, nor is it clear to me whether the ORC format is already inherently partitioned. Try it and see.

You can also try CTAS (if you're not already using it). This will also take advantage of Synapse's ability to parallelise work.
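Something like the following, as a rough sketch only (object names and the distribution/index choices are assumptions, not a recommendation for your exact schema):

-- Load from the PolyBase external table into an internal staging table with CTAS.
CREATE TABLE dbo.stage_orc
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- common choice for staging/load tables
    HEAP                          -- defer the columnstore build until after the load
)
AS
SELECT *
FROM dbo.ext_orc_table;           -- the external table over the ORC files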

There is also a new feature currently in preview called COPY INTO. According to the documentation it is compatible with ORC files and does not require you to split them:

What is the file splitting guidance for the COPY command loading Parquet or ORC files? There is no need to split Parquet and ORC files because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be 256MB or larger for best performance.

https://docs.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#what-is-the-file-splitting-guidance-for-the-copy-command-loading-parquet-or-orc-files

COPY INTO test_orc
FROM 'https://yourAccount.blob.core.windows.net/yourBlobcontainer/folder1/*.orc'
WITH (
    FILE_FORMAT = yourFileFormat,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
);

Work out whether you are DWU-bound by viewing the DWU usage in the portal: see if it's maxed out / flatlined, which (based on your edit) I guess it isn't.
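You can also check from inside the pool whether the load request is actually running or sitting queued behind a concurrency or resource limit. A rough example using the standard dedicated SQL pool DMV (the filter is illustrative):

-- Show requests that have not finished yet, with their resource class.
SELECT request_id, [status], resource_class, submit_time, start_time,
       total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY submit_time DESC;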