0
votes

While improving the performance of a Kusto query, I came across the shuffle strategy for join/summarize. I can clearly see the performance benefits of this strategy for my query, which has high cardinality for the join/summarize key.

While reading the shuffle query Kusto documentation, it seemed that the strategy will be ignored when there are nested shuffle operators.

When the shuffled operator has other shuffle-able operators, like summarize or join, the query becomes more complex and then hint.strategy=shuffle won't be applied.

My query uses nested summarize and join (with shuffle) but I also clearly see performance gains. My query pattern:

Table1
| summarize hint.strategy=shuffle arg_max(Timestamp) by Device, Interface
| join hint.strategy=shuffle (Table2) on Device, Interface

Does a query like this benefit from shuffling?

Also, does the Kusto query planner avoid any problematic shuffle if present always? Basically I wanted to rest assured that there might only be perf issues with a wrongly used/authored shuffle and not data issues.

1

1 Answers

3
votes

Please note that the article of shuffle query suggests to use hint.shufflekey in case you have nested summarize/join operators but it requires that the nested summarize/join operators have the same group-by/join key.

so in your example above, apply the following (I'm assumging that Device has a high cardinality (and you can remove/keep the shuffle strategy from the summarize, keeping/removing it will be the same behavior as long as you specify the shuffle key on the join which wraps this summarize):

Table1 | summarize arg_max(Timestamp) by Device, Interface | join hint.shufflekey=Device (Table2) on Device, Interface