I generate an ORC table (compressed with Snappy) with Spark (Databricks) on an Azure Storage Account (with the ADLS Gen2 feature enabled). The ORC data is about 12 GB (1.2 billion rows), and the table has 32 columns.
Once it's generated, I load it into an internal table in Synapse Analytics using PolyBase.
Here are my results with different configurations:
- DW100c / smallrc = 3h52
- DW400c / smallrc = 1h50
- DW400c / xlargerc = 1h58
- DW1000c / xlargerc = 0h50
- DW1500c / xlargerc = 0h42
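To put these timings in perspective, here is a rough sketch of the throughput they imply. It assumes the 12 GB figure is the compressed ORC size and that 1 GB = 1024 MB; the names `SIZE_MB`, `ROWS`, `RUNS` and `throughput` are just illustrative.

```python
# Rough throughput implied by the load times listed above.
# Assumption: 12 GB is the compressed ORC size, with 1 GB = 1024 MB.
SIZE_MB = 12 * 1024
ROWS = 1_200_000_000

# (configuration, hours, minutes) copied from the results list.
RUNS = [
    ("DW100c / smallrc", 3, 52),
    ("DW400c / smallrc", 1, 50),
    ("DW400c / xlargerc", 1, 58),
    ("DW1000c / xlargerc", 0, 50),
    ("DW1500c / xlargerc", 0, 42),
]


def throughput(hours, minutes):
    """Return (MB per second, rows per second) for one load duration."""
    seconds = hours * 3600 + minutes * 60
    return SIZE_MB / seconds, ROWS / seconds


for name, hours, minutes in RUNS:
    mb_s, rows_s = throughput(hours, minutes)
    print(f"{name:<20} {mb_s:6.2f} MB/s  {rows_s:12,.0f} rows/s")
```

Under those assumptions, even the fastest run (DW1500c) stays below 5 MB/s of compressed data, which is why I doubt the network is the limiting factor.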
When I look at the Storage Account ingress/egress, I see activity for a few minutes (maybe from copying the ORC files between Synapse nodes), then the Synapse resources start to be stressed. I see CPU activity for a while, then memory increases slowly, slowly...
Here is an example of memory (red) and max CPU % (blue):
Do I need to scale up again? I don't think this is a network throughput problem. Or maybe it's a configuration problem? I don't understand why PolyBase is so slow here. PolyBase is supposed to ingest TBs of ORC data quickly!
BR, A.
Edit: DWU usage