1
votes

I have a snowflake query with multiple ctes and inserting into a table using a Talend job. It takes more than 90 minutes to execute the query. It is multiple cascading ctes, one is calling other and other is calling the other.

I want to improve the performance of the query. It is like 1000 lines of code and I can't paste it here. As I checked the profile and it is showing all the window functions and aggregate functions which slows down the query.

For example, the top slower is,

ROW_NUMBER​(​​)​ OVER ​(​PARTITION BY LOWER​(​S​.​SUBSCRIPTIONID​)​ 
                   ORDER BY S​.​ISROWCURRENT DESC NULLS FIRST, 
                      TO_NUMBER​(​S​.​STARTDATE​)​ DESC NULLS FIRST, 
                      IFF​(​S​.​ENDDATE IS NULL, '29991231', S​.​ENDDATE​)​ DESC NULLS FIRST​)​ 

takes 7.3% of the time. Can you suggest an alternative way to improve the performance of the query please?

1
How are you using this row number after computing it?Elliott Brossard
Try to avoid transforming values on the fly in PARTITION, ORDER BY, JOIN, WHERE, etc. Start by eliminating LOWER().waldente
Elliott, row number is used for the subsequent ctes and to get row_number=1 to filter out againselva

1 Answers

2
votes

The problem is that 1000 lines are very hard for any query analyzer to optimize. It also makes troubleshooting a lot harder for you and for a future team member who inherits the code.

I recommend breaking the query up and these optimizations:

  1. Use CREATE TEMPORARY TABLE AS instead of CTEs. Add ORDER BY as you create the table on the column that you will join or filter on. The temporary tables are easier for the optimizer to build and later use. The ORDER BY helps Snowflake know what to optimize for with subsequent joins to other tables. They're also easier for troubleshooting.
  2. In your example, see if you can persist this data as permanent columns so that Snowflake can skip the conversion portion and have better statistics on it: TO_NUMBER​(​S​.​STARTDATE​)​ and IFF​(​S​.​ENDDATE IS NULL, '29991231', S​.​ENDDATE​)​.
  3. Alternatively to step 2, instead of sorting by startDate and endDate, see if you can add an IDENTITY, SEQUENCE, or populate an INTEGER column which you can use as the sortkey. You can also literally name this new column sortKey. Sorting an integer will be significantly faster than running a function on a DATETIME and then ordering by it.
  4. If any of the CTEs can be changed into materialized views, they will be pre-built and significantly faster.
  5. Finally stage all of the data in a temporary table - ordered by the same columns that your target table was created in - before you insert it. This will make the insert step itself quicker and Snowflake will have an easier time handling a concurrent change to that table.

Notes:

  1. To create a temporary table: create or replace temporary table table1 as select * from dual; After that you refer to table1 instead of your code instead of the CTE.
  2. Materialized views are documented here. They are an Enterprise edition feature. They syntax is: create materialized view mymv as select col1, col2 from mytable;