In Azure Data Lake Analytics (ADLA) you can only define one index on a table, and it has to be clustered. It you have two kind of analysis that requires partitioning on different keys to be efficient, you have to duplicate the table create scripts etc to get things running parallel.
E.g. The simple Order example
CREATE TABLE dbo.Orders
(
OrderID int,
CustomerID int,
OrderDetailID int,
OrderTotal double,
OrderDate DateTime,
INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(OrderId)
);
This works fine when your queries partitions/pivots around OrderId. But what if you have a query that is based on CustomerID or ranges of CustomerID?
What I do currently is to do stuff like this
CREATE TABLE dbo.Orders_ByCustomerId
(
OrderID int,
CustomerID int,
OrderDetailID int,
OrderTotal double,
OrderDate DateTime,
INDEX clx_CustomerId CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(CustomerId)
);
And repeat for many other scenarios (like per a calculated columns that contains the dayofyear)
Now the developers of the actual analytical scripts need to pick the table with partitioning scheme that fits their task, and go back and update their code with the new table name when we decide we need additional partitioning scenarios.
What I would like, and it seems like somethings that an optimizer could utilize very easily, is to allow the table/index DML contain multiple indexes like this
CREATE TABLE dbo.Orders
(
OrderID int,
CustomerID int,
OrderDetailID int,
OrderTotal double,
OrderDate DateTime,
INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC) DISTRIBUTED BY HASH(OrderId),
INDEX clx_CustomerID_OrderID CLUSTERED(CustomerID, OrderID ASC) DISTRIBUTED BY HASH(CustomerID)
);
This way, we don't have to maintain two sets of table DDLs and the U-SQL developers do not have to update their code between dbo.Orders and dbo.Orders_ByCustomerId.
Example: Now we need to make different expressions, as this will be parallel
@res = REDUCE dbo.Orders_ByCustomerId ON CustomerId
PRODUCE CustomerId, CustomerClass int
READONLY CustomerId
USING new Extension.R.Reducer(scriptFile:"CustomerClassifcation.R",rReturnType:"dataframe");
While this won't be run parallel
@res = REDUCE dbo.Orders ON CustomerId
PRODUCE CustomerId, CustomerClass int
READONLY CustomerId
USING new Extension.R.Reducer(scriptFile:"CustomerClassifcation.R", rReturnType:"dataframe");
PS:(Scripts are examples. Not tested!)
Does anyone have suggestions on suitable workflow/pattern to overcome this limitation