0
votes

We have had our ETL jobs running in Snowflake for some time. We have two warehouses, a Medium and a 2XL. One of the rules we follow is that if a query runs in less than 10 minutes we move it to the Medium warehouse, and anything longer than that goes to the 2XL. But is this the right way to segregate queries?

I know Snowflake parallelizes a query based on core availability. For example:

A 2XL cluster has 32 nodes with 8 cores each, and Snowflake tries to split every query into 256 parts. So, for example, if we run a query:

select sum(a) from mytable;

Then each of the 256 cores scans 1/256 of the table.

But how do I know whether a table can be split across 256 cores? It may not have enough data to be split up that finely, and in that case it does not make sense to run the query on the 2XL.
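For context, I can see a table's raw size with something like the query below (a rough sketch using INFORMATION_SCHEMA.TABLES; the schema and table names are placeholders), but that only tells me the volume of data, not how many pieces Snowflake will actually split the scan into:

-- Rough size check: shows data volume, not how many micro-partitions
-- (and therefore scan pieces) the table actually has.
select table_name, row_count, bytes
from information_schema.tables
where table_schema = 'MY_SCHEMA'   -- placeholder
  and table_name   = 'MYTABLE';    -- placeholder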

Thanks


2 Answers

0
votes

This is all a rather subjective question.

If you are running both the Medium and the 2XL at the same time, why not run everything on the 2XL and drop the Medium? Whereas if you spin the 2XL up and down, any run shorter than 60 seconds is still billed for the full 60 seconds; but queries that take more than 10 minutes on the Medium will, if they scale perfectly linearly, still take over a minute on the 2XL, so that minimum is not wasted.
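To make the maths concrete, here is a rough sketch assuming the standard credit rates (4 credits/hour for a Medium, 32 credits/hour for a 2XL), the 60-second billing minimum, and perfectly linear scaling, which real queries rarely achieve:

-- Credit cost of a query that takes 10 minutes (600 s) on a Medium,
-- versus the same query on a 2XL with 8x the nodes, scaling perfectly linearly.
-- Anything billed under 60 seconds is rounded up to 60 seconds.
select
    4  * (greatest(600, 60)       / 3600.0) as medium_credits,  -- 600 s -> ~0.67 credits
    32 * (greatest(600 / 8.0, 60) / 3600.0) as two_xl_credits;  -- 75 s  -> ~0.67 credits

Under those assumptions the credits spent are identical, and the scaled runtime (75 seconds) stays above the 60-second minimum; the 2XL only becomes wasteful once the scaled runtime drops below that minimum or the query stops scaling linearly.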

How you know whether a query can be split is partly theoretical: is it inherently parallel?

select id, count(*) from table group by id;

Where you have many values of id this is very parallel, and even if you have only one id it is still parallelizable, because the partial counts do not conflict. Whereas

select id, count(distinct column2) ...

needs to build the set of column2 values per id, so not much is to be gained from 32 nodes. But the IO load and transforms might still be the costly part.
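(As an aside, one rewrite that can sometimes recover parallelism for that shape, sketched below with mytable standing in for a real table, is to de-duplicate first and then count; Snowflake's optimizer may already do something equivalent internally, so measure before relying on it.)

-- count(column2) over de-duplicated (id, column2) pairs is equivalent to
-- count(distinct column2) per id, and the distinct step is a grouped
-- de-duplication that spreads across the cluster.
select id, count(column2) as distinct_column2
from (
    select distinct id, column2
    from mytable            -- placeholder table name
)
group by id;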

So it depends on the constraints of the query being run and the data it is working across, which means you should run the queries on different sized warehouses to see whether they scale for your data loads.
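A minimal sketch of such a test, assuming warehouse names like wh_medium and wh_2xl (placeholders) and tagging the runs so they are easy to find afterwards:

-- Run the same query on each warehouse size, tagged for later comparison.
alter session set query_tag = 'sizing_test_medium';
use warehouse wh_medium;   -- placeholder name
select sum(a) from mytable;

alter session set query_tag = 'sizing_test_2xl';
use warehouse wh_2xl;      -- placeholder name
select sum(a) from mytable;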

0
votes

If your ETL processes are part of a frequently executed workstream, then the best way to approach this is to test every query on both a Medium and a 2XL and determine whether you are getting the best bang for the buck. I assume you don't just have both warehouses running all the time, and want to figure out the best way to let these warehouses start up and shut down during your workstream. Generally, testing each query is the best way to go.

Otherwise, you can look at extreme cases (where a Medium is causing spillage, for example, you'd definitely want a larger warehouse). Also, the number of micro-partitions being read during your query execution will give you an idea of how many threads a warehouse might be using. If you have fewer micro-partitions than you have threads, then you're using more warehouse than you need.
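One place to see those signals after a test run is the ACCOUNT_USAGE query history; a sketch is below (the query tag filter is just an example, and this view lags behind real time by a while):

-- Compare runs of the same query: elapsed time, micro-partitions scanned,
-- and spillage (spillage is a strong hint the warehouse is too small).
select query_tag,
       warehouse_name,
       warehouse_size,
       total_elapsed_time / 1000 as elapsed_s,
       partitions_scanned,
       partitions_total,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
from snowflake.account_usage.query_history
where query_tag like 'sizing_test_%'   -- example tag
order by start_time desc;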

Lastly, it's unusual to have such a big gap between your "small" workload and your "large" workload. I'd recommend evaluating Large and XL size warehouses as well. It costs you NOTHING to configure more warehouses and have them available to you. Just make sure you are shutting them down when the queries are complete to avoid additional up-time from your auto-suspend.
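For example, the extra sizes can be created up front and left suspended, so they cost nothing until a query resumes them (names are placeholders; a 60-second AUTO_SUSPEND keeps idle time to a minimum):

-- Suspended warehouses consume no credits until resumed.
create warehouse if not exists wh_large
    warehouse_size = 'LARGE'
    auto_suspend = 60          -- suspend after 60 seconds of inactivity
    auto_resume = true
    initially_suspended = true;

create warehouse if not exists wh_xlarge
    warehouse_size = 'XLARGE'
    auto_suspend = 60
    auto_resume = true
    initially_suspended = true;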