
I'm trying to set up a simple dbt pipeline that reads a parquet table stored on Azure Data Lake Storage and creates another table that will be stored in the same location.

Under my models/ directory (which is defined as my source path) I have two files, datalake.yml and orders.sql. datalake.yml looks like this:

version: 2
sources:
  - name: datalake
    tables:
      - name: customers
        external:
          location: path/to/storage1 # taken from the file properties in Azure
          file_format: parquet
        columns:
          - name: id
            data_type: int
            description: "ID"
          - name: ...

My orders.sql model looks like this:

{{ config(materialized='table', file_format='parquet', location_root='path/to/storage2') }}
select name, age from {{ source('datalake', 'customers') }}

I'm also using the dbt-external-tables package. Also note that when I run dbt debug everything is fine and I can connect to my database (which happens to be Databricks).
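
For reference, the package is pulled in through packages.yml and installed with dbt deps; the version pin below is just an example:

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0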

I tried running dbt run-operation stage_external_sources, which returns Error: staging external sources is not implemented for the default adapter. When I run dbt run, I get Error: UnresolvedRelation datalake.customers.

Or perhaps I could make use of the Hive metastore instead somehow (see the rough sketch below)? Any tips on how I could fix this would be highly appreciated!
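
What I mean by the metastore route is manually registering the parquet files as a table with Spark SQL before dbt runs -- an untested sketch, with the database name just illustrative:

-- register the existing parquet files in the metastore by hand
CREATE DATABASE IF NOT EXISTS datalake;

CREATE TABLE datalake.customers
USING PARQUET
LOCATION 'path/to/storage1';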


2 Answers


This might not yet be available. Looks like it's still an open issue with no dev work so far.

Related issue in the dbt-external-tables package repo: Support Spark external tables

Do you have the dependencies from dbt-spark installed?
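
If not, it's worth setting that up first. A minimal dbt-spark profile for Databricks looks roughly like this -- every value below is a placeholder, and the exact fields vary by dbt-spark version:

my_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: http                          # or odbc, depending on version
      schema: analytics                     # placeholder
      host: mycompany.azuredatabricks.net   # placeholder
      organization: "1234567890123456"      # Azure workspace id, placeholder
      port: 443
      token: dapiXXXXXXXXXXXX               # placeholder personal access token
      cluster: 0123-456789-abcdef           # placeholder cluster id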

Here are some relevant issues there:

Spark_connection_url do not contain workspace_id while connecting to databricks

Support Delta Lake format

I realize those don't exactly help with the simple dbt-external-tables use-case, but it looks like development is still ongoing to support the Azure Databricks / Data Lake stack.

I'm going to dig into this a bit more later, because this use-case is relevant to me as well.


I help maintain both the dbt-spark plugin and the dbt-external-tables package. I can confirm that their interoperability is still preliminary, and I'd strongly welcome contributions to improve it. I don't think it's a big lift, though one of the challenges is the fact that Spark/Databricks supports two different create external table syntaxes (as noted in that issue).
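
For reference, the two flavors look roughly like this (schema, table, and path names are placeholders):

-- Hive-style external table
CREATE EXTERNAL TABLE my_schema.customers (id INT, name STRING)
STORED AS PARQUET
LOCATION 'path/to/storage1';

-- Spark datasource-style (also external when a LOCATION is given)
CREATE TABLE my_schema.customers (id INT, name STRING)
USING PARQUET
LOCATION 'path/to/storage1';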

FWIW, I see that you specified path/to/storage1 as the external location of your source and path/to/storage2 as the location_root config of your orders model. The former is the place to read data from, and the latter is the place to materialize the model as a table. I wasn't sure if you meant those to be two genuinely different locations or variations on the same placeholder.
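
Concretely -- and this is from memory, so double-check against your dbt-spark version -- the source's external location is read as-is, while location_root has the model's identifier appended to it, so the orders table would land at path/to/storage2/orders.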

Edit: TIL that you can select from some file types directly in Spark SQL as select * from filetype.`filepath`. I believe you could register a source like:

version: 2
sources:
  - name: datalake
    schema: parquet
    tables:
      - name: customers
        identifier: "path/to/storage1"
        quoting:
          identifier: true

This means you could have templated code like:

select * from {{ source('datalake', 'customers') }}

Which will resolve to:

select * from parquet.`path/to/storage1`
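
The quoting: identifier: true part is what produces the backticks around the path; without it, the relation would render as parquet.path/to/storage1, which Spark can't parse as a file path.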