Do joins on multiple columns store the cartesian product?

Question

I transliterated a script from T-SQL to U-SQL and ran into an issue running the job, namely that it seemed to get "stuck" on one of the stages - after 2.5 hours the job graph showed it had read in 200MB and had written over 3TB but wasn't anywhere near finished. (Didn't take a screenshot, sorry.)

I tracked it down to one of the queries joining a table with 34 million rows twice to a table with 1600 rows:

@ProblemQuery = 
  SELECT
    gp.[Group],      // 16 groups
    gp.[Percentile], // 1-100
    my_fn(lt1.[Value], lt2.[Value], gp.[Value]) AS CalculatedNumber
  FROM
    @LargeTable AS lt1
    INNER JOIN @GroupPercent AS gp
      ON lt1.[Group] == gp.[Group]
      AND lt1.[Row ID] == gp.[Row ID 1]
    INNER JOIN @Large Table AS lt2
      ON gp.[Group] == lt2.[Group]
      AND gp.[Row ID 2] == lt2.[Row ID]
;

It seems that the full cartesian product (~2e18 rows) is being stored during the processing rather than just the filtered 1600 rows. My first thought was that it might be from using AND rather than &&, but changing that made no difference.

I've managed to work around this by splitting the one query with two joins in to two queries with one join each, and the whole job completed in under 15 minutes without a storage blowout.

But it's not clear to me whether this is fully expected behaviour when multiple columns are used in the join or a bug, and whether there's a better approach to this sort of thing. I've got another similar query to split up (with more joins, and more columns in the join condition) and I can't help but feel there's got to be a less messy way of doing this.

Michael Rys Michael Rys · Accepted Answer · 2018-05-23T05:37:08

U-SQL applies some join reorder heuristics (although I don't know how it deals with the apparent self-join). I doubt it is related to you using multiple columns in the join predicate. I assume that our heuristic may be off. Can you please either file an incident or send me the job link to [usql] at microsoft dot com? That way we can investigate what causes the optimizer to pick the worse plan.

Until then, splitting the joins into two statements and thus forcing the better join order is the best workaround.

Do joins on multiple columns store the cartesian product?

1 Answers