
We are inserting data into an Azure Data Lake (ADL) table using the round-robin distribution scheme. In another job, we extract data from the table for three different partitions and observed an uneven number of vertices across partitions. For example, one partition creates 56 vertices for 264 GB of data, while another creates only 2 vertices for 209 GB. The partition with few vertices took a very long time to complete. In the attached picture, I am not sure why SV5 and SV3 have only 2 vertices. Is there any way to optimize this and increase the number of vertices for these partitions?

Here is a script for table:

CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
abc string,
def string,
<Other columns>
xyz int,
INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
DISTRIBUTED BY ROUND ROBIN;

Update:

Here is a script for data insertion:

INSERT INTO dbo.<tablename>
    (
    abc,
    def,
    <Other columns>
    xyz
    )
    ON INTEGRITY VIOLATION IGNORE
SELECT *
FROM @logs;

I am doing multiple (at most 3) inserts into a partition. In another job, I also select the data, do some processing, truncate the partition, and then insert the data back into it. I want to know why the default Round Robin distribution scheme creates only 2 distributions for SV5 and SV3. I was expecting more distributions for this amount of data.

Comment: Your image is not showing. – Michael Rys

2 Answers


When you use ROUND ROBIN (and the same holds for HASH and DIRECT HASH) without specifying a distribution count, U-SQL picks one for you based on the compile-time estimated data size. Given that you insert in different ways, it looks like some of your scripts, such as the one that INSERTs the data SV1 is reading, get a good estimate, while others cause U-SQL to estimate very badly. The most rock-solid mitigation is to specify the number of distributions with the INTO clause whenever you have a good idea of what distribution count you want. Anything from 50 to 200 should keep you in the sweet spot.
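As a sketch, the question's table definition could pin the distribution count like this (INTO 100 is an illustrative value inside the suggested 50-200 range, and the table/column names are the placeholders from the question):

```sql
CREATE TABLE IF NOT EXISTS dbo.<tablename>
(
    abc string,
    def string,
    <Other columns>
    xyz int,
    INDEX clx_abc_def CLUSTERED(abc, def ASC)
)
PARTITIONED BY (xyz)
// INTO fixes the distribution count instead of leaving it
// to the compile-time size estimate.
DISTRIBUTED BY ROUND ROBIN INTO 100;
```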


I see that you use both partitions and distributions inside the partitions.

Do you insert the data all at once into the partition or do you have multiple INSERT statements per partition?

If the latter, please note that each INSERT statement adds a new file to the partition, and each file then gets processed by its own vertex.

Also, the ROUND ROBIN distribution applies to each partition file individually.

So you may end up with a lot of distribution groups that are extracted.

If my interpretation of your scenario is correct, please use ALTER TABLE REBUILD to compact the partitions.
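A minimal sketch of the compaction step, assuming the placeholder table name from the question and an example partition key value of 42:

```sql
// Merge the per-INSERT files inside the partition into a
// compact representation so later reads get better parallelism.
ALTER TABLE dbo.<tablename> REBUILD PARTITION (42);
```

Run this after the repeated INSERT statements into a partition have completed, before the extraction job reads it.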