
We are inserting data into an Azure Data Lake (ADL) table using the round-robin distribution scheme. In another job, we extract data from the table for three different partitions and observed an uneven number of vertices across partitions. For example, one partition creates 56 vertices for 264 GB of data, while another creates only 2 vertices for 209 GB. The partition with few vertices took a very long time to complete. In the attached picture, I am not sure why SV5 and SV3 have only 2 vertices. Is there any way to optimize this and increase the number of vertices for these partitions?

Here is a script for table:

CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
abc string,
def string,
<Other columns>
xyz int,
INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
DISTRIBUTED BY ROUND ROBIN;

Update:

Here is a script for data insertion:

INSERT INTO dbo.<tablename>
    (
    abc,
    def,
    <Other columns>
    xyz
    )
    ON INTEGRITY VIOLATION IGNORE
SELECT *
FROM @logs;

I am doing multiple (at most 3) inserts into a partition. In another job, I also select the data, do some processing, truncate the partition, and then insert the data back into it. I want to know why the default Round Robin distribution scheme creates only 2 distributions for SV5 and SV3. I was expecting more distributions for this amount of data.

Comment: Your image is not showing. – Michael Rys

2 Answers


When you use ROUND ROBIN (and the same holds for HASH and DIRECT HASH) without specifying a distribution count, U-SQL picks one for you based on the compile-time estimated data size. Given that you insert in different ways, it looks like some of your scripts, such as the one that INSERTs the data SV1 is reading, get a good estimate, while others cause U-SQL to estimate very badly. The most rock-solid mitigation is to specify the number of distributions with the INTO clause whenever you have a good idea of what distribution count you want. Anything from 50 to 200 should keep you in the sweet spot.
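As a sketch, the question's table definition could pin the distribution count like this (INTO 100 is an illustrative value inside the suggested 50-200 range, and the table/column names are the placeholders from the question):

```sql
CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
    abc string,
    def string,
    <Other columns>
    xyz int,
    INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
// INTO fixes the distribution count instead of leaving it
// to the compile-time size estimate.
DISTRIBUTED BY ROUND ROBIN INTO 100;
```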


I see that you use both partitions and distributions inside the partitions.

Do you insert the data all at once into the partition or do you have multiple INSERT statements per partition?

If the latter, please note that each INSERT statement adds a new file to the partition, and each file then gets processed by its own vertex.

Also, the ROUND ROBIN distribution applies to each partition file individually.

So you may end up with a lot of distribution groups that are extracted.

If my interpretation of your scenario is correct, please use ALTER TABLE REBUILD to compact the partitions.
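A minimal sketch of the compaction step, assuming the placeholder table name from the question and an example partition key value of 42:

```sql
// Merge the per-INSERT files inside the partition into a
// compact representation so later reads get better parallelism.
ALTER TABLE dbo.<tablename> REBUILD PARTITION (42);
```

Run this after the repeated INSERT statements into a partition have completed, before the extraction job reads it.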