3
votes

Being new to Snowflake, I am trying to understand how to write JavaScript-based stored procedures (SPs) that take advantage of multi-threaded/parallel processing. My background is SQL Server, where I wrote SPs that exploited performance features such as degrees of parallelism, worker threads, indexing, and columnstore segment elimination.

I am starting to get accustomed to setting up storage and using clustering keys, micro-partitioning, and the other performance features available, but I don't understand how Snowflake SPs break a given SQL statement down into parallel streams, and I am struggling to find any documentation that explains the internal workings. My concern is producing SPs that serialise everything onto one thread and become bottlenecks, so I am wondering whether I am applying the right techniques or need a different mindset for developing SPs. I hope I have explained my concern sufficiently. In essence, I am building a PoC to migrate an on-premises SQL Server DWH ETL solution to a Snowflake/Matillion ELT solution, one aspect being evaluating the virtual warehouse (compute) size I need.

1
Are you wanting the SP to execute multiple SQL statements in parallel, or are you simply interested in making sure the SQL statements themselves are multi-threaded? If the latter, then Snowflake handles that for you and you don't need to worry about it. – Mike Walton
Hi Mike, thanks for replying. Definitely interested in the latter. I want to find out more about how Snowflake achieves this, rather than just accepting it. There must be, for instance, best practice on how to write SPs: the trade-off between complexity and overuse of (temp) storage, and, moving from ETL to ELT, where control of the whole process is maintained (development frameworks). I have so many questions about a migration, test frameworks, and the CI/CD pipeline that it is hard to know where to start. – Paul Grubb
You may want to reach out to a sales rep from Snowflake. Snowflake has Sales Engineers who can assist you with your POC, as well as Professional Services who can come in and teach best practices or even assist in designing your eventual solution. – Mike Walton
Sounds like good advice, Mike. Will have to see if the purse strings can be loosened. – Paul Grubb

1 Answer

1
votes

Stateless UDFs run in parallel by default; this is what I observed when importing a large amount of binary data via base64 encoding.
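As a minimal sketch (the function, table, and column names are illustrative, not from the original post): because a stateless UDF depends only on its input arguments, Snowflake can evaluate rows across threads and nodes freely.

```sql
-- Hypothetical stateless scalar UDF: each call is a pure function of
-- its inputs, so Snowflake is free to evaluate rows in parallel.
-- Note that argument names are exposed to JavaScript in uppercase.
CREATE OR REPLACE FUNCTION add_tax(amount FLOAT, rate FLOAT)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS
$$
  // No state survives between calls.
  return AMOUNT * (1 + RATE);
$$;

-- Usage: the scan is parallelised by the warehouse without any
-- effort on your part.
SELECT add_tax(price, 0.2) FROM sales;
```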

Stateful UDFs run in parallel on the data as controlled by the PARTITION BY and ORDER BY clauses applied to it. The only trick to remember is to always explicitly initialize your state, because the JavaScript instance can be reused for subsequent PARTITION BY batches; thus, don't rely on a check for undefined to know whether you are on the first row.
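A sketch of that initialization pattern, assuming a hypothetical running-total table function over an illustrative `sales` table: Snowflake's JavaScript table functions accept an `initialize` callback that fires once per partition, which is the safe place to reset state.

```sql
CREATE OR REPLACE FUNCTION running_total(v FLOAT)
RETURNS TABLE (total FLOAT)
LANGUAGE JAVASCRIPT
AS
$$
{
  initialize: function (argumentInfo, context) {
    // Called at the start of each partition. Reset state here rather
    // than testing "typeof this.total === 'undefined'" in processRow,
    // because the same JavaScript instance may be reused for a later
    // PARTITION BY batch, where the old value would still be defined.
    this.total = 0;
  },
  processRow: function (row, rowWriter, context) {
    this.total += row.V;          // column names arrive in uppercase
    rowWriter.writeRow({TOTAL: this.total});
  }
}
$$;

-- Each PARTITION BY group can be processed on its own thread;
-- ORDER BY fixes the row order seen within a partition.
SELECT t.total
FROM sales,
     TABLE(running_total(price) OVER (PARTITION BY region ORDER BY sale_date)) t;
```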