
I am using the dbt CLI to regularly update data via dbt run. However, this materializes several tables, and can take 20+ hours to do a full refresh.

I am currently running this from my PC/cloud VM, but I don't want to keep my PC on or a VM running just to run the dbt CLI. Moreover, I've been burned before doing this (a brief Wi-Fi issue interrupted a dbt operation 10 hours into a 12-hour table materialization).

Are there any good patterns for this? Note that I'm using SQL Server, which is not supported by dbt Cloud.

I've considered:

  1. Setting up Airflow/Prefect
  2. Having a small VM just for dbt to run on
  3. Moving to a faster database (e.g. from Azure SQL to Azure Synapse)

Any ideas?

What do your models look like in terms of complexity? If they're particularly large tables, are you running full refreshes, or updating them incrementally? If there is any room for query optimization, I'd want to start there - you can always move to a faster database or increase compute/RAM, but that comes with its cost. To the point of the dbt operation interrupted after 10h or so, you could split the dbt run commands so that you're only running perhaps a schema at a time. That way you won't lose everything in the case of a Wi-Fi interruption or similar. – Branden Ciranni

1 Answer


I would agree here with Branden. Throwing resources at the problem should be the last resort. The first thing you should do is try optimizing your SQL queries. Once the queries are optimized, the time a full refresh takes will depend on data volume. If the volume is high, you should be doing incremental runs rather than full refreshes. You can schedule incremental runs with something like cron or Airflow.
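As a minimal sketch of the cron approach (the paths, target name, and schedule below are placeholders, and it assumes dbt is on cron's PATH):

    # Nightly incremental run at 02:00 (project path and target are hypothetical)
    0 2 * * * cd /home/dbt_user/my_dbt_project && dbt run --target prod >> /var/log/dbt_nightly.log 2>&1

    # Reserve the expensive full refresh for a less frequent slot, e.g. Sunday 03:00
    0 3 * * 0 cd /home/dbt_user/my_dbt_project && dbt run --full-refresh --target prod >> /var/log/dbt_weekly.log 2>&1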

Another thing to note: you don't need to run your whole project with a plain dbt run if you only want selected models. You can always do dbt run -m +model, where:

+model -> run the model with all upstream dependencies

model+ -> run the model with all downstream dependencies
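A few examples of how these graph operators can be combined to split up a long run (fct_orders is a placeholder model name, not one from your project):

    dbt run -m +fct_orders                  # fct_orders and all of its upstream dependencies
    dbt run -m fct_orders+                  # fct_orders and all of its downstream dependencies
    dbt run -m +fct_orders+                 # the model plus both its parents and children
    dbt run --full-refresh -m +fct_orders   # limit an expensive full refresh to one lineage

Splitting the project into several such commands also limits how much you lose if any single run is interrupted.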

Another aspect: since you're using SQL Server, which is a row store (more suited to ETL), you may also need some dimensional modeling. dbt is the T of ELT, where data has already been loaded into a powerful column-store warehouse (like Snowflake/Redshift) and dimensional modeling isn't needed because queries already leverage the columnar storage. That doesn't mean dbt can't work with row stores, but dimensional modeling may be needed.

Second, you can always have a small VM just for dbt, or run it on something like ECS Fargate, which is serverless so you're charged only while dbt runs.
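If you go the small-VM route, one simple way to avoid losing a long run to a dropped SSH/Wi-Fi connection is to detach the process from your terminal session, for example with nohup (a sketch; the paths are placeholders):

    cd /home/dbt_user/my_dbt_project
    nohup dbt run --full-refresh > full_refresh.log 2>&1 &   # keeps running after you disconnect
    tail -f full_refresh.log                                  # check progress from any later session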

Finally, if nothing else works, you could consider moving to something like Synapse, which can throw more compute-intensive resources at your queries to run them faster.