0
votes

I am currently working with DBs in Snowflake, and I have a performance problem.

In my query, I need to retrieve 100 rows of data from a MAIN_TABLE which contains foreign keys to other tables. MAIN_TABLE size is between 50K-300K rows.

This MAIN_TABLE contains ids (foreign keys) of type NUMBER(38,0), and the description of those fields/columns is retrieved from other tables with a straightforward INNER JOIN.

The secondary tables (table_one, table_two...) are very small, all of them have less than 20 rows, except one, that has around 1500 rows (still very small)

The problem is:

Running the query, as shown below, takes an average of 1.7s to complete. If I delete all the DESC fields (retrieved using INNER JOINs), and all the INNER JOINs, the query takes an average of 300ms to complete.

The current solution my peers implemented:

They have cached the information of all the secondary tables in our app (made in Java, but that is irrelevant). For about 80% of the cases (see query below), that we do not need to filter by DESCription, this works, but on the other 20% of cases we still have to perform the INNER JOINs.

Downsizes of the solution:

Is not a bad solution, but it has two problems:

  1. (not so bad) it bloats our codebase, is basically a "JOIN" made in Java
  2. the 20% of the time we need to filter by DESCriptions, we still need to do the INNER JOINs, so we still have a performance problem.

One possibility that comes to mind is caching those tables in SF, but I haven't found a straightforward way to do it yet. Maybe there is a way to optimize the query, but I do not yet understand how Snowflake works internally, as far as I know it does not uses Indexes, at least not in the same way other platforms do.

So, is there a way to optimize the query in SF for 100% of the cases?

SELECT 
  main_table.ONE_ID, 
  main_table.TWO_ID, 
  main_table.THREE_ID, 
  main_table.FOUR_ID, 
  main_table.FIVE_ID, 
  main_table.SIX_ID, 
  main_table.SEVEN_ID, 
  field_one.ONE_DESC, 
  field_two.TWO_DESC, 
  field_three.THREE_DESC,
  field_four.FOUR_DESC, 
  field_five.FIVE_DESC, 
  field_six.SIX_DESC, 
  field_seven.SEVEN_DESC

FROM 
  SOME_DATABASE.MAIN_TABLE AS main_table 

  INNER JOIN SOME_DATABASE.TABLE_ONE    AS table_one    ON main_table.field_one_id =    table_one    .ONE_ID
  INNER JOIN SOME_DATABASE.TABLE_TWO    AS table_two    ON main_table.field_two_id =    table_two    .TWO_ID
  INNER JOIN SOME_DATABASE.TABLE_THREE  AS table_tree   ON main_table.field_tree_id =   table_tree   .THREE_ID
  INNER JOIN SOME_DATABASE.TABLE_FOUR   AS table_four   ON main_table.field_four_id =   table_four   .FOUR_ID
  INNER JOIN SOME_DATABASE.TABLE_FIVE   AS table_five   ON main_table.field_five_id =   table_five   .FIVE_ID
  INNER JOIN SOME_DATABASE.TABLE_SIX    AS table_six    ON main_table.field_six_id =    table_six    .SIX_ID
  INNER JOIN SOME_DATABASE.TABLE_SEVEN  AS table_seven  ON main_table.field_seven_id =  table_seven  .SEVEN_ID

WHERE 
  main_table.ONE_ID IN (25, 26) 
  AND main_table.TWO_ID IN (10, 12) 
  AND main_table.THREE_ID IN (1, 2, 3) 
  AND main_table.FOUR_ID IN (2, 3) 
  AND main_table.FIVE_ID IN (3) 
  AND main_table.SEVEN_ID IN (1) 

  -- The following WHERE clauses are present in about 20% of the queries
  AND table_one.ONE_DESC, 
  AND table_two.TWO_DESC, 
  AND table_three.THREE_DESC,

ORDER BY 
  main_table.ONE_ID, 
  main_table.TWO_ID, 
  main_table.THREE_ID

LIMIT 
  100 OFFSET 0

Small update:

I have been trying using WITH clauses, both wrapping the ids and the description, but there does not seem to be an improvement

3
When you are running these tests, are you seeing a lot of cache being used? What is the pruning of each table in the query profile? Have you tried creating a new table with the DESC fields in the fact and compared that, as well?Mike Walton
Have you examined the query profile to see what's taking the most time? You're probably on the right track using a CTE, and should start by taking out the small tables/joins with long processes in the profile into their own sections of the CTE.Greg Pavlik

3 Answers

1
votes

I'm just curious if it would make a difference if you phrased the query with the filtering on the main table in subqueries:

FROM (SELECT *
      FROM main_table
      WHERE main_table.ONE_ID IN (25, 26) AND
            main_table.TWO_ID IN (10, 12) AND
            main_table.THREE_ID IN (1, 2, 3) AND
            main_table.FOUR_ID IN (2, 3) AND
            main_table.FIVE_ID IN (3) AND
            main_table.SEVEN_ID IN (1) 
     ) main_table JOIN
    . . . 

One of the downsides to tremendous scalability is that the methods used can add overhead to smaller queries. That is, if you multiplied your table sizes by 1000, the query might still only take a few seconds -- but that doesn't help you eke out faster times on the smaller data.

0
votes

Have you by chance tried to redesign your dictionary tables? As in you have a:

  • DT1, DT2, DT3 with key, value each

and you can try make it into

  • DT dictionary_id, key, value

this time instead of 5 small tables you'll have 1 slightly bigger and a bit more complex join conditions because you'd have to add dictionary ID to each of them, but still all of them are likely going to be retrievable in a single block of data instead of 5 blocks

0
votes

My first though is, snowflake is not designed to ask tiny queries like this fast, it made to answer big queries. Thus to me 1.xs is fine response time.

But given you would like it to be faster..

I would first check to if the slow query's compile time is the same as the fast. Because if your tables are fragmented more meta data will need to be read.

Next I would look at profile of the execution, to see where the time is spent. In the past we have found things like

SELECT columnA, columnB FROM table where ID == 1
UNION ALL 
SELECT columnA, columnB FROM table where ID == 2 

to be faster than

SELECT columnA, columnB FROM table where ID in (1,2)

oh huge SQL statements we have found Gordon's point of explicitly putting WHERE clauses at the time of the SELECT from a table helps where the optimizer does get confused.

But another aspect might just be the extra data takes longer to transmit to you client (if you are using client time as the judge) or if you looking at the performance in the WebUI, the more data you read on the longer it takes to run the query. So even if you build a new table

CREATE TABLE testo AS
SELECT 
  main_table.ONE_ID, 
  main_table.TWO_ID, 
  main_table.THREE_ID, 
  main_table.FOUR_ID, 
  main_table.FIVE_ID, 
  main_table.SIX_ID, 
  main_table.SEVEN_ID, 
  table_one.ONE_DESC, 
  table_two.TWO_DESC, 
  table_three.THREE_DESC,
  table_four.FOUR_DESC, 
  table_five.FIVE_DESC, 
  table_six.SIX_DESC, 
  table_seven.SEVEN_DESC
FROM 
    SOME_DATABASE.MAIN_TABLE AS main_table 
INNER JOIN SOME_DATABASE.TABLE_ONE AS table_one    
    ON main_table.field_one_id = table_one.ONE_ID
INNER JOIN SOME_DATABASE.TABLE_TWO AS table_two
    ON main_table.field_two_id = table_two.TWO_ID
INNER JOIN SOME_DATABASE.TABLE_THREE AS table_tree
    ON main_table.field_tree_id = table_tree.THREE_ID
INNER JOIN SOME_DATABASE.TABLE_FOUR AS table_four
    ON main_table.field_four_id = table_four.FOUR_ID
INNER JOIN SOME_DATABASE.TABLE_FIVE AS table_five
    ON main_table.field_five_id = table_five.FIVE_ID
INNER JOIN SOME_DATABASE.TABLE_SIX AS table_six
    ON main_table.field_six_id = table_six.SIX_ID
INNER JOIN SOME_DATABASE.TABLE_SEVEN AS table_seven  
    ON main_table.field_seven_id = table_seven.SEVEN_ID

and then do

    ONE_ID, 
    TWO_ID, 
    THREE_ID, 
    FOUR_ID, 
    FIVE_ID, 
    SIX_ID, 
    SEVEN_ID, 
    ONE_DESC, 
    TWO_DESC, 
    THREE_DESC,
    FOUR_DESC, 
    FIVE_DESC, 
    SIX_DESC, 
    SEVEN_DESC
FROM testo
WHERE 
    ONE_ID IN (25, 26) 
    AND TWO_ID IN (10, 12) 
    AND THREE_ID IN (1, 2, 3) 
    AND FOUR_ID IN (2, 3) 
    AND FIVE_ID IN (3) 
    AND SEVEN_ID IN (1)   

will take longer than:

    ONE_ID, 
    TWO_ID, 
    THREE_ID, 
    FOUR_ID, 
    FIVE_ID, 
    SIX_ID, 
    SEVEN_ID
FROM testo
WHERE 
    ONE_ID IN (25, 26) 
    AND TWO_ID IN (10, 12) 
    AND THREE_ID IN (1, 2, 3) 
    AND FOUR_ID IN (2, 3) 
    AND FIVE_ID IN (3) 
    AND SEVEN_ID IN (1)   

and finally, I know it's example code, but there is little point aliasing a table from main_table to main_table as that was the alias/name already in place.