2
votes

In Azure SQL Data Warehouse, it is recommended to distribute tables with HASH on the columns you plan to JOIN on, and potentially even aggregate on. It is also recommended that you create statistics on the columns you plan to use.

Let's say you have two tables with the same number of records and columns. One table is hash-distributed on a fairly unique key, while the other is ROUND_ROBIN, where data is divided evenly (but randomly) across the 60 distributions.
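As a sketch, the two distribution styles are declared like this (table and column definitions are illustrative, not the actual schema):

```sql
-- Hash-distributed: rows with the same CustomerID always land in the same distribution
CREATE TABLE dbo.FactTable_hash
(
    ProductID  INT NOT NULL,
    CustomerID VARCHAR(20) NOT NULL  -- alphanumeric, per the question
)
WITH
(
    DISTRIBUTION = HASH ( CustomerID ),
    CLUSTERED COLUMNSTORE INDEX
);

-- Round-robin: rows are dealt evenly across all 60 distributions
CREATE TABLE dbo.FactTable_rr
(
    ProductID  INT NOT NULL,
    CustomerID VARCHAR(20) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);
```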

-- CustomerID is alphanumeric
SELECT
   [ProductID]
  ,COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable]
GROUP BY
   [ProductID]

On the hashed table, aggregating on the hash key returns the results within 0.05 seconds. On the round-robin table, the same aggregation takes 0.51 seconds.

CREATE STATISTICS [ProductID] ON [dbo].[FactTable] ([ProductID]);
CREATE STATISTICS [CustomerID] ON [dbo].[FactTable] ([CustomerID]);

When you create statistics on the columns you are aggregating, the hashed table still returns within 0.05 seconds (no change). When you do the same on the round-robin table, the results come back in the same 0.05 seconds as the hashed table.

Note: 2000 DWU, query running in the xlargerc resource class (maximum memory allocation).

When examining the distribution of the hash key, I found 140 million records stored in a single one of the 60 distributions. The other 50 million records are spread fairly evenly across the remaining 59 distributions. A default value is making my hash key a poor candidate for hashing.
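For reference, one quick way to see this kind of skew (assuming the table name used in the question's sample query) is:

```sql
-- Reports rows and space used per distribution for the table;
-- one distribution holding a large share of the rows indicates skew
DBCC PDW_SHOWSPACEUSED('dbo.FactTable');
```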

My question is: while the round-robin table with statistics performs great when aggregating on the same column I am hashing on in the other table, will that performance hold up when I use the round-robin table in JOINs on that key? I haven't tested it fully yet, but I'm looking for the best approach.

Do statistics help on joins? Some of the articles I have read say they do, but I'm hoping for more solid answers on choosing round robin over a distribution key. I don't really have any good candidates that don't cause data skew, as in the 140-million-record example above.

1
If you want the answers to be more concrete, it would help if you posted the explain plan for each of the queries you tested. Put the word EXPLAIN in front of the query, run it, then copy the XML into your question. – GregGalloway
Don't really have one, but I included a SQL sample for reference. It's just a basic SELECT COUNT(DISTINCT) GROUP BY query. – Fastidious
Can you get one? I would like to see whether stats change the explain plan, or whether the performance differences are just related to caching. – GregGalloway
I'll see if I can. I don't believe it's caching. I ran the query multiple times at the same speed (0.51 s) before applying the statistics, and it changed instantly. I don't really notice any caching in Azure DW. – Fastidious

1 Answer

1
votes

First of all, in the current version of Azure SQL Data Warehouse, you should always create statistics on columns you will be using in joins, GROUP BY, etc. Ignore the timings you got without statistics, as they are irrelevant.

One of the criteria for a good hash key is that it distributes the data evenly. If you don't have a good candidate, then ROUND_ROBIN is the alternative. The thing to bear in mind about ROUND_ROBIN is that you always get data movement on joins, but sometimes that is fine. It really depends on what your key queries are, and my advice would be to optimise for them.
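If you later find a better hash candidate, you can switch a table's distribution with the CTAS-and-rename pattern (a sketch; the `_new`/`_old` names and the ProductID choice are illustrative):

```sql
-- Rebuild the table with a different distribution via CTAS
CREATE TABLE dbo.FactTable_new
WITH
(
    DISTRIBUTION = HASH ( ProductID ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.FactTable;

-- Swap the new table in, keeping the old one as a fallback
RENAME OBJECT dbo.FactTable     TO FactTable_old;
RENAME OBJECT dbo.FactTable_new TO FactTable;
```

Remember to recreate statistics on the new table afterwards, since they do not carry over.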

For the examples you have given, the performance is so quick that it's really not worth spending too much time on them. Do you have some more realistic queries?

Azure SQL Data Warehouse does cache, similar to SQL Server, as described here.

I mocked up a test rig using the AdventureWorksDW sample (available when you provision a warehouse from the portal) and found that the hashed table performed significantly better for these sample queries, despite the large skew. Your mileage may vary (YMMV):

-- 603,980,000 rows
CREATE TABLE dbo.FactTable_rr
WITH ( 
    DISTRIBUTION = ROUND_ROBIN, 
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT a.ProductKey AS [ProductID], a.CustomerKey AS [CustomerID]
FROM [dbo].[FactInternetSales] a
    CROSS JOIN ( SELECT TOP 1000 1 FROM [dbo].[FactInternetSales] ) b(c)
    CROSS JOIN ( SELECT TOP 10 1 FROM [dbo].[FactInternetSales] ) c(c)
GO


CREATE STATISTICS st_dbo_FactTable_rr_ProductID ON dbo.FactTable_rr ( ProductID ) WITH FULLSCAN;
CREATE STATISTICS st_dbo_FactTable_rr_CustomerID ON dbo.FactTable_rr ( CustomerID ) WITH FULLSCAN;
GO


CREATE TABLE dbo.FactTable_hh
WITH ( 
    DISTRIBUTION = HASH( [ProductID] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM FactTable_rr
GO

CREATE STATISTICS st_dbo_FactTable_hh_ProductID ON dbo.FactTable_hh ( ProductID ) WITH FULLSCAN;
CREATE STATISTICS st_dbo_FactTable_hh_CustomerID ON dbo.FactTable_hh ( CustomerID ) WITH FULLSCAN;
GO



-- Find data skew for a distributed table
DBCC PDW_SHOWSPACEUSED('dbo.FactTable_rr');
DBCC PDW_SHOWSPACEUSED('dbo.FactTable_hh');
GO


--EXPLAIN
SELECT
   [ProductID],
   COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable_rr]
GROUP BY [ProductID]
OPTION ( LABEL = 'rr' );


--EXPLAIN
SELECT
   [ProductID],
   COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable_hh]
GROUP BY [ProductID]
OPTION ( LABEL = 'hh' );


-- Summary
SELECT [label], COUNT(*) records, CAST( AVG(total_elapsed_time) / 1000. AS DECIMAL(10,2) ) total_elapsed_time_s
FROM sys.dm_pdw_exec_requests 
WHERE [label] IS NOT NULL
  AND command LIKE 'select%'
GROUP BY [label];

My results, with a table of 600 million rows:

[screenshot: query timings]

If I look at the EXPLAIN plans for the two queries (rr, hh), I see a much simpler plan for the hash-distributed table, with no data movement. The 'rr' plan contains a SHUFFLE_MOVE operator, which redistributes the data of a distributed table.