This question builds on How to run DBT in airflow without copying our repo. I am currently running Airflow and syncing the DAGs via git, and I am considering different options for including DBT in my workflow. One suggestion, by louis_guitton, is to Dockerize the DBT project and run it in Airflow via the Docker Operator.

I have no prior experience with the Docker Operator in Airflow, or with DBT in general. Has anyone tried this setup, and can you share your experience with it? My main questions are:

  1. Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run in a separate container from dbt tasks?)
  2. Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
  3. How would partial pipelines be run? (example: wanting to run only a part of the pipeline)

1 Answer

Judging by your questions, you would benefit from trying to dockerise dbt on its own, independently from airflow. A lot of your questions would disappear. But here are my answers anyway.

  1. Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run in a separate container from dbt tasks?)

I suggest you build one Docker image for the entire project. The image can be based on the Python image, since dbt is a Python CLI tool. You then use the command arguments of docker run, which override the image's default CMD, to run any dbt command you would run outside Docker. Remember the syntax of docker run (which has nothing to do with dbt): you can specify any COMMAND you want to run at invocation time.

$ docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
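
To make this concrete, here is a minimal Python sketch of how one image can serve every dbt command by changing only the trailing COMMAND. The image name my-dbt-image is borrowed from the example at the end of this answer, and the helper docker_run_argv is hypothetical. The sketch only builds and prints the command lines; it does not invoke Docker.

```python
import shlex


def docker_run_argv(image, dbt_args, options=()):
    """Build the argv for `docker run [OPTIONS] IMAGE [COMMAND] [ARG...]`.

    Hypothetical helper for illustration; it does not call Docker itself.
    """
    return ["docker", "run", *options, image, "dbt", *dbt_args]


# Same image, different dbt commands: models and tests become separate
# container invocations (and could become separate Airflow tasks).
run_cmd = docker_run_argv("my-dbt-image", ["run"])
test_cmd = docker_run_argv("my-dbt-image", ["test"], options=["--rm"])

print(shlex.join(run_cmd))
print(shlex.join(test_cmd))
```

On a machine with Docker installed, either list could be handed to subprocess.run(run_cmd, check=True). Whether tests get their own container is then just a matter of which command you pass, not of building a second image.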

Also, the first hit on Google for "docker dbt" is this Dockerfile, which can get you started.

  2. Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?

Again, it's not a dbt question but rather a docker question or an airflow question.

Can you see the logs in the Airflow UI when using a DockerOperator? Yes, see this how-to blog post with screenshots.

Can you access logs from a Docker container? Yes, Docker containers emit logs to the stdout and stderr output streams (which you can see in Airflow, since Airflow picks them up). With Docker's default json-file logging driver, the logs are also stored as JSON files on the host machine under /var/lib/docker/containers/. If you have more advanced needs, you can pick up those logs with a tool (or a simple BashOperator or PythonOperator) and do what you need with them.
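
As a rough idea of what "picking up those logs" could look like inside a PythonOperator callable, here is a small sketch. It assumes the json-file logging driver, where each line of the log file is a JSON object with log, stream, and time keys. The function name parse_docker_json_log and the sample lines are made up for illustration; they are not real dbt output.

```python
import json


def parse_docker_json_log(lines):
    """Yield (stream, message) pairs from json-file log lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        yield entry.get("stream", ""), entry.get("log", "").rstrip("\n")


# Placeholder lines in the json-file layout (assumed, not captured from dbt).
sample = [
    '{"log":"model run started\\n","stream":"stdout","time":"2021-01-01T00:00:00.000000000Z"}',
    '{"log":"something failed\\n","stream":"stderr","time":"2021-01-01T00:00:01.000000000Z"}',
]

# Keep only what the container wrote to stderr.
errors = [msg for stream, msg in parse_docker_json_log(sample) if stream == "stderr"]
print(errors)
```

In practice you would pass an open file handle for the container's log file instead of the sample list; reading under /var/lib/docker usually requires elevated permissions on the host.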

  3. How would partial pipelines be run? (example: wanting to run only a part of the pipeline)

See answer 1: you would run your dbt Docker image with a command that selects only the models you want, for example

$ docker run my-dbt-image dbt run -m stg_customers
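
Carried over to Airflow, the same idea might look like the sketch below: each partial run is one task whose command is the dbt invocation. This is an assumption-laden outline, not a tested DAG. The helper dbt_task_spec and the task ids are hypothetical; only task_id, image, and command are set, which are parameters of Airflow's DockerOperator. Note that newer dbt versions prefer --select over -m, so check which one your version expects.

```python
def dbt_task_spec(task_id, dbt_command, image="my-dbt-image"):
    """Return keyword arguments intended for Airflow's DockerOperator.

    Only task_id, image, and command are set; anything else (network,
    credentials, volumes) depends on your setup and is left out.
    """
    return {"task_id": task_id, "image": image, "command": dbt_command}


# One task per dbt invocation, all sharing the same image.
specs = [
    dbt_task_spec("dbt_run_stg_customers", "dbt run -m stg_customers"),
    dbt_task_spec("dbt_test_stg_customers", "dbt test -m stg_customers"),
]

# Inside a DAG file, with the Docker provider installed, each spec could be
# turned into a task, e.g. DockerOperator(**spec), and chained run >> test.
for spec in specs:
    print(spec["task_id"], "->", spec["command"])
```

Keeping the specs as plain dictionaries makes it easy to see which part of the pipeline each task covers before wiring them into a DAG.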