I have the following issue:
- I have an AWS S3 pipeline that drops a single json.gz file on a daily basis.
- I wish to take that file with dbt and load it into Snowflake (no Snowpipe use at the moment).
I have managed to do this by creating a storage integration, and with my role (the one used for running dbt) I manually created a schema and granted usage on it. So far so good.
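For context, the manual setup looked roughly like this (a sketch; the schema and stage names match what I use below, but the integration name, role, and bucket path are illustrative, and the storage integration itself already exists):

-- a sketch of the manual setup described above;
-- integration name, role and bucket path are illustrative
CREATE SCHEMA IF NOT EXISTS RAW_DEV.S3_MIXPANEL;
GRANT USAGE ON SCHEMA RAW_DEV.S3_MIXPANEL TO ROLE DBT_ROLE;

CREATE STAGE IF NOT EXISTS RAW_DEV.S3_MIXPANEL.my_s3_stage
  STORAGE_INTEGRATION = my_s3_integration   -- assumed integration name
  URL = 's3://my-bucket/mixpanel/'          -- assumed bucket path
  FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP);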
Then I read about this:
https://github.com/fishtown-analytics/dbt-external-tables
The problem is that the only way this runs properly is if I alter my dbt profiles.yml: set the default schema to S3_MIXPANEL with default database RAW_DEV, and run a different target and role via the --target 'ingest_dev' parameter.
I keep thinking there should be a more sophisticated solution, where I can create schemas, query metadata, and use something like {{ source() }} so my documentation can show that this is an external source. I don't think dbt-external-tables is really well explained for my case.
Please, can anyone help me and share how to create schemas and query from external stages properly, without changing the default schema macro and profiles.yml each time?
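To make the ask concrete: ideally a downstream model could just do something like this (a sketch, assuming the external table ends up registered as source s3_mixpanel with table events, as in the YAML further below):

-- hypothetical downstream model; assumes s3_mixpanel.events exists
-- as an external table registered in a sources YAML
SELECT
    event_date,
    file_name,
    properties
FROM {{ source('s3_mixpanel', 'events') }}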
I have succeeded in running the following code:
{{
    config(
        materialized = 'incremental',
        schema = generate_schema_name('S3_MIXPANEL')
    )
}}

SELECT
    metadata$filename AS file_name,
    TO_DATE(SUBSTR(metadata$filename, 16, 10), 'yyyy/mm/dd') AS event_date,
    $1 AS payload,
    CONVERT_TIMEZONE('Europe/London', TO_TIMESTAMP_TZ($1:properties:mp_processing_time_ms::int / 1000)) AS event_timestamp_converted,
    CONVERT_TIMEZONE('Europe/London', CURRENT_TIMESTAMP) AS ingested_at
FROM @my_s3_stage

{% if is_incremental() %}
    -- this filter will only be applied on an incremental run
    WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}

{{ row_limit() }}
EDIT 22-06-20:
I have added the src_mixpanel.yml file to my models and ran the dbt command. I also had to specify the data_types, so I added those, and then I apparently had to add the macros to my own macros directory as well (by the way, maybe a stupid question, but I don't really know how to install your package, so I manually copied all of its macros into mine).
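(For reference, I believe the usual way to install a dbt package is a packages.yml at the project root followed by dbt deps; a sketch, with the revision pin being illustrative:)

# packages.yml at the project root; the revision pin is illustrative
packages:
  - git: "https://github.com/fishtown-analytics/dbt-external-tables.git"
    revision: master

Running dbt deps then pulls the package into the project, instead of copying macros by hand.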
Now when I run this code:
dbt run-operation stage_external_sources
with this src_mixpanel.yml:
version: 2

sources:
  - name: s3_mixpanel
    database: RAW_DEV
    tables:
      - name: events
        external:
          location: '@my_s3_stage'
          auto_refresh: false  # depends on your S3 setup
          partitions:
            - name: event_date
              expression: to_date(SUBSTR(metadata$filename, 16, 10), 'yyyy/mm/dd')
              data_type: date
            - name: file_name
              expression: metadata$filename
              data_type: string
        columns:
          - name: properties
            data_type: variant
I get an error:
Encountered an error while running operation: Compilation Error in macro stage_external_sources (macros/stage_external_sources.sql)
'dict object' has no attribute 'sources'