How to improve the performance of a SQL query in Snowflake, when joining multiple small tables?

Question

I am currently working with DBs in Snowflake, and I have a performance problem.

In my query, I need to retrieve 100 rows of data from a MAIN_TABLE which contains foreign keys to other tables. MAIN_TABLE size is between 50K-300K rows.

This MAIN_TABLE contains ids (foreign keys) of type NUMBER(38,0), and the description of those fields/columns is retrieved from other tables with a straightforward INNER JOIN.

The secondary tables (table_one, table_two...) are very small, all of them have less than 20 rows, except one, that has around 1500 rows (still very small)

The problem is:

Running the query, as shown below, takes an average of 1.7s to complete. If I delete all the DESC fields (retrieved using INNER JOINs), and all the INNER JOINs, the query takes an average of 300ms to complete.

The current solution my peers implemented:

They have cached the information of all the secondary tables in our app (made in Java, but that is irrelevant). For about 80% of the cases (see query below), that we do not need to filter by DESCription, this works, but on the other 20% of cases we still have to perform the INNER JOINs.

Downsizes of the solution:

Is not a bad solution, but it has two problems:

(not so bad) it bloats our codebase, is basically a "JOIN" made in Java
the 20% of the time we need to filter by DESCriptions, we still need to do the INNER JOINs, so we still have a performance problem.

One possibility that comes to mind is caching those tables in SF, but I haven't found a straightforward way to do it yet. Maybe there is a way to optimize the query, but I do not yet understand how Snowflake works internally, as far as I know it does not uses Indexes, at least not in the same way other platforms do.

So, is there a way to optimize the query in SF for 100% of the cases?

SELECT 
  main_table.ONE_ID, 
  main_table.TWO_ID, 
  main_table.THREE_ID, 
  main_table.FOUR_ID, 
  main_table.FIVE_ID, 
  main_table.SIX_ID, 
  main_table.SEVEN_ID, 
  field_one.ONE_DESC, 
  field_two.TWO_DESC, 
  field_three.THREE_DESC,
  field_four.FOUR_DESC, 
  field_five.FIVE_DESC, 
  field_six.SIX_DESC, 
  field_seven.SEVEN_DESC

FROM 
  SOME_DATABASE.MAIN_TABLE AS main_table 

  INNER JOIN SOME_DATABASE.TABLE_ONE    AS table_one    ON main_table.field_one_id =    table_one    .ONE_ID
  INNER JOIN SOME_DATABASE.TABLE_TWO    AS table_two    ON main_table.field_two_id =    table_two    .TWO_ID
  INNER JOIN SOME_DATABASE.TABLE_THREE  AS table_tree   ON main_table.field_tree_id =   table_tree   .THREE_ID
  INNER JOIN SOME_DATABASE.TABLE_FOUR   AS table_four   ON main_table.field_four_id =   table_four   .FOUR_ID
  INNER JOIN SOME_DATABASE.TABLE_FIVE   AS table_five   ON main_table.field_five_id =   table_five   .FIVE_ID
  INNER JOIN SOME_DATABASE.TABLE_SIX    AS table_six    ON main_table.field_six_id =    table_six    .SIX_ID
  INNER JOIN SOME_DATABASE.TABLE_SEVEN  AS table_seven  ON main_table.field_seven_id =  table_seven  .SEVEN_ID

WHERE 
  main_table.ONE_ID IN (25, 26) 
  AND main_table.TWO_ID IN (10, 12) 
  AND main_table.THREE_ID IN (1, 2, 3) 
  AND main_table.FOUR_ID IN (2, 3) 
  AND main_table.FIVE_ID IN (3) 
  AND main_table.SEVEN_ID IN (1) 

  -- The following WHERE clauses are present in about 20% of the queries
  AND table_one.ONE_DESC, 
  AND table_two.TWO_DESC, 
  AND table_three.THREE_DESC,

ORDER BY 
  main_table.ONE_ID, 
  main_table.TWO_ID, 
  main_table.THREE_ID

LIMIT 
  100 OFFSET 0

Small update:

I have been trying using WITH clauses, both wrapping the ids and the description, but there does not seem to be an improvement

When you are running these tests, are you seeing a lot of cache being used? What is the pruning of each table in the query profile? Have you tried creating a new table with the DESC fields in the fact and compared that, as well? — Mike Walton
Have you examined the query profile to see what's taking the most time? You're probably on the right track using a CTE, and should start by taking out the small tables/joins with long processes in the profile into their own sections of the CTE. — Greg Pavlik

Gordon Linoff Gordon Linoff · Accepted Answer · 2020-06-19T11:41:08

I'm just curious if it would make a difference if you phrased the query with the filtering on the main table in subqueries:

FROM (SELECT *
      FROM main_table
      WHERE main_table.ONE_ID IN (25, 26) AND
            main_table.TWO_ID IN (10, 12) AND
            main_table.THREE_ID IN (1, 2, 3) AND
            main_table.FOUR_ID IN (2, 3) AND
            main_table.FIVE_ID IN (3) AND
            main_table.SEVEN_ID IN (1) 
     ) main_table JOIN
    . . .

One of the downsides to tremendous scalability is that the methods used can add overhead to smaller queries. That is, if you multiplied your table sizes by 1000, the query might still only take a few seconds -- but that doesn't help you eke out faster times on the smaller data.

How to improve the performance of a SQL query in Snowflake, when joining multiple small tables?

3 Answers