In U-SQL, why is there a PARTITIONED BY clause when the distribution is ROUND ROBIN?

Question

If I specify partitioning by dateId and ROUND ROBIN distribution into 250, and then ingest data for more than 300 different dates, does U-SQL create 300 partitions or only 250? I am a little confused about how round robin interacts with the partition clause.

I am using the clause below.

PARTITIONED BY (dateId) DISTRIBUTED BY ROUND ROBIN INTO 250;

Michael Rys · Accepted Answer · 2018-01-14T07:42:40

Partition and Distribution are two different levels of partitioning. So you will get 300 partitions each with 250 distribution "buckets".
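To make the two levels concrete, here is a small conceptual model in Python. It is only a sketch of the logical layout, not of how the service physically stores data; the `place_rows` helper, the row counts, and the integer `dateId` values are all made up for illustration.

```python
from collections import defaultdict
from itertools import count

NUM_BUCKETS = 250  # mirrors "INTO 250" in the clause from the question


def place_rows(rows, num_buckets=NUM_BUCKETS):
    """Group rows by partition key, then deal each partition's rows
    into its own set of buckets in round-robin order (conceptual model)."""
    layout = defaultdict(lambda: defaultdict(list))
    next_slot = defaultdict(count)  # one independent counter per partition
    for date_id, payload in rows:
        bucket = next(next_slot[date_id]) % num_buckets
        layout[date_id][bucket].append(payload)
    return layout


# Hypothetical input: 300 distinct dateId values with 500 rows each.
rows = [(date_id, row_no) for date_id in range(300) for row_no in range(500)]
layout = place_rows(rows)

print(len(layout))                                 # 300 partitions
print(len(layout[0]))                              # 250 buckets in one partition
print(sum(len(b) for b in layout.values()))        # 75000 buckets in total
```

In this model the bucket count applies inside every partition, so 300 dates with `INTO 250` give 300 x 250 = 75,000 buckets rather than a cap of 250 partitions.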

Partitions are optional and are used for data life cycle management. They can be addressed, created and deleted individually, and they also support partition elimination during queries. Currently only single-value partitions are supported.

Distributions are primarily for query optimization and come in different schemes. The data within each partition is organized according to the distribution scheme. Specifying a distribution scheme is mandatory.

Normally you should use either HASH or RANGE distributions. ROUND ROBIN is only advisable in the context of high data skew where the skewed data would negate any benefits of the other schemes.
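The skew argument can be illustrated with another small Python sketch. It is a toy model under stated assumptions: `key % NUM_BUCKETS` stands in for a real hash function, and the bucket count and key values are hypothetical.

```python
from collections import Counter

NUM_BUCKETS = 4  # small hypothetical bucket count, for readability

# Hypothetical skewed data: 97 of 100 rows share the key 7.
keys = [7] * 97 + [1, 2, 3]

# HASH-style placement: a given key always lands in the same bucket.
# The modulo is a stand-in for a real hash function (assumption).
hash_sizes = Counter(key % NUM_BUCKETS for key in keys)

# ROUND ROBIN placement: rows are dealt out in turn, ignoring the key.
round_robin_sizes = Counter(i % NUM_BUCKETS for i in range(len(keys)))

print(sorted(hash_sizes.items()))         # [(1, 1), (2, 1), (3, 98)]
print(sorted(round_robin_sizes.items()))  # [(0, 25), (1, 25), (2, 25), (3, 25)]
```

With the skewed key, one hash bucket ends up holding 98 of the 100 rows, while round robin gives four buckets of 25. The price is that rows with the same key are no longer kept together in one bucket.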

More details can be found in https://www.slideshare.net/MichaelRys/tuning-and-optimizing-usql-queries-sqlpass-2016

I also recommend that you study the pointers you have received in an earlier, related question at https://stackoverflow.com/a/47562997/1318169
