
I'm trying to set up a simple dbt pipeline that reads a parquet table stored on Azure Data Lake Storage and creates another table that will be stored in the same location.

Under my models/ directory (which is defined as my source path) I have two files, datalake.yml and orders.sql. datalake.yml looks like this:

version: 2
sources:
  - name: datalake
    tables:
      - name: customers
        external:
          location: path/to/storage1 # taken from the file properties in Azure
          file_format: parquet
        columns:
          - name: id
            data_type: int
            description: "ID"
          - name: ...

My orders.sql model looks like this:

{{ config(materialized='table', file_format='parquet', location_root='path/to/storage2') }}
select name, age from {{ source('datalake', 'customers') }}

I'm also using the dbt-external-tables package. Also note that when I run dbt debug everything is fine and I can connect to my database (which happens to be Databricks).
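
For reference, the package is pulled in through packages.yml and installed with dbt deps; the version pin below is just an example:

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0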

I tried running dbt run-operation stage_external_sources, which returns Error: staging external sources is not implemented for the default adapter. When I run dbt run, I get Error: UnresolvedRelation datalake.customers.

Or perhaps I could make use of the Hive metastore instead somehow (see the rough sketch below)? Any tips on how I could fix this would be highly appreciated!
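
What I mean by the metastore route is manually registering the parquet files as a table with Spark SQL before dbt runs -- an untested sketch, with the database name just illustrative:

-- register the existing parquet files in the metastore by hand
CREATE DATABASE IF NOT EXISTS datalake;

CREATE TABLE datalake.customers
USING PARQUET
LOCATION 'path/to/storage1';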


2 Answers


This might not yet be available. Looks like it's still an open issue with no dev work so far.

Related issue in the dbt-external-tables package repo: Support Spark external tables

Do you have the dependencies from dbt-spark installed?
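
If not, it's worth setting that up first. A minimal dbt-spark profile for Databricks looks roughly like this -- every value below is a placeholder, and the exact fields vary by dbt-spark version:

my_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: http                          # or odbc, depending on version
      schema: analytics                     # placeholder
      host: mycompany.azuredatabricks.net   # placeholder
      organization: "1234567890123456"      # Azure workspace id, placeholder
      port: 443
      token: dapiXXXXXXXXXXXX               # placeholder personal access token
      cluster: 0123-456789-abcdef           # placeholder cluster id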

Here are some relevant issues there:

Spark_connection_url do not contain workspace_id while connecting to databricks

Support Delta Lake format

I realize those don't exactly help with the simple dbt-external-tables use-case, but it looks like development is still ongoing to support the Azure Databricks / Data Lake stack.

I'm going to dig into this a bit more later, because this use-case is relevant to me as well.


I help maintain both the dbt-spark plugin and the dbt-external-tables package. I can confirm that their interoperability is still preliminary, and I'd strongly welcome contributions to improve it. I don't think it's a big lift, though one of the challenges is the fact that Spark/Databricks supports two different create external table syntaxes (as noted in that issue).
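
For reference, the two flavors look roughly like this (schema, table, and path names are placeholders):

-- Hive-style external table
CREATE EXTERNAL TABLE my_schema.customers (id INT, name STRING)
STORED AS PARQUET
LOCATION 'path/to/storage1';

-- Spark datasource-style (also external when a LOCATION is given)
CREATE TABLE my_schema.customers (id INT, name STRING)
USING PARQUET
LOCATION 'path/to/storage1';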

FWIW, I see that you specified path/to/storage1 as the external location of your source and path/to/storage2 as the location_root config of your orders model. The former is the place to read data from, and the latter is the place to materialize the model as a table. I wasn't sure if you meant those to be two genuinely different locations or variations on the same placeholder.
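
Concretely -- and this is from memory, so double-check against your dbt-spark version -- the source's external location is read as-is, while location_root has the model's identifier appended to it, so the orders table would land at path/to/storage2/orders.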

Edit: TIL that you can select from some file types directly in Spark SQL as select * from filetype.`filepath`. I believe you could register a source like:

version: 2
sources:
  - name: datalake
    schema: parquet
    tables:
      - name: customers
        identifier: "path/to/storage1"
        quoting:
          identifier: true

This means you could have templated code like:

select * from {{ source('datalake', 'customers') }}

Which will resolve to:

select * from parquet.`path/to/storage1`
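
The quoting: identifier: true part is what produces the backticks around the path; without it, the relation would render as parquet.path/to/storage1, which Spark can't parse as a file path.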