0
votes

In Azure Data Lake Analytics (ADLA) you can only define one index on a table, and it has to be clustered. It you have two kind of analysis that requires partitioning on different keys to be efficient, you have to duplicate the table create scripts etc to get things running parallel.

E.g. The simple Order example

CREATE TABLE dbo.Orders
(
    OrderID int,
    CustomerID int,
    OrderDetailID int,
    OrderTotal double,
    OrderDate DateTime,
    INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(OrderId)
);

This works fine when your queries partitions/pivots around OrderId. But what if you have a query that is based on CustomerID or ranges of CustomerID?

What I do currently is to do stuff like this

CREATE TABLE dbo.Orders_ByCustomerId
(
    OrderID int,
    CustomerID int,
    OrderDetailID int,
    OrderTotal double,
    OrderDate DateTime,
    INDEX clx_CustomerId CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(CustomerId)

);

And repeat for many other scenarios (like per a calculated columns that contains the dayofyear)

Now the developers of the actual analytical scripts need to pick the table with partitioning scheme that fits their task, and go back and update their code with the new table name when we decide we need additional partitioning scenarios.

What I would like, and it seems like somethings that an optimizer could utilize very easily, is to allow the table/index DML contain multiple indexes like this

CREATE TABLE dbo.Orders
(
    OrderID int,
    CustomerID int,
    OrderDetailID int,
    OrderTotal double,
    OrderDate DateTime,
   INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(OrderId),
  INDEX clx_CustomerID_OrderID CLUSTERED(CustomerID, OrderID ASC) DISTRIBUTED BY HASH(CustomerID)
);

This way, we don't have to maintain two sets of table DDLs and the U-SQL developers do not have to update their code between dbo.Orders and dbo.Orders_ByCustomerId.

Example: Now we need to make different expressions, as this will be parallel

@res = REDUCE dbo.Orders_ByCustomerId  ON CustomerId
   PRODUCE CustomerId, CustomerClass int
   READONLY  CustomerId
   USING new Extension.R.Reducer(scriptFile:"CustomerClassifcation.R",rReturnType:"dataframe");

While this won't be run parallel

@res = REDUCE dbo.Orders  ON CustomerId
   PRODUCE CustomerId, CustomerClass int
   READONLY  CustomerId
   USING new Extension.R.Reducer(scriptFile:"CustomerClassifcation.R", rReturnType:"dataframe");

PS:(Scripts are examples. Not tested!)

Does anyone have suggestions on suitable workflow/pattern to overcome this limitation

1

1 Answers

2
votes

First a CLUSTERED INDEX (as in most databases) means that it represents the primary representation of the data. So having two clustered indices for the same table would not work per se (or would not be much different from you defining two tables).

We do have a work item on our backlog to add secondary (non-clustered) indices that would help with your scenario. Please add your vote to https://feedback.azure.com/forums/327234-data-lake/suggestions/10622475-multiple-indexes-on-adl-a-table.

For now the patterns are:

  1. Try to find a distribution and cluster that works for as many of your expensive queries as possible.

  2. If you are not able to find one, define multiple tables and then provide views or table-valued functions as abstractions that helps your users to choose the right one.