4 votes

Background

I'm using Azure Data Factory v2 to load data from on-prem databases (for example SQL Server) to Azure Data Lake Gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is from the source, based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. That is fine, since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.

Based on these ORC files, I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as the source tables, which I'm going to use in a later stage. I want to use the information schema of these tables to dynamically generate the DDL statements for the external tables.
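For illustration, a minimal sketch of that generation step against a hypothetical table PRODUCT (the table name, the stage name @mystage and the direct reuse of data_type in the cast are my assumptions, not part of the original setup). Note that this is exactly the kind of statement that produces the casing problem described below, because column_name comes back from the Snowflake information schema in its stored, upper-cased form:

    -- build one CREATE EXTERNAL TABLE statement per table from the column metadata
    SELECT 'CREATE OR REPLACE EXTERNAL TABLE ext_' || LOWER(table_name) || ' ('
           || LISTAGG(column_name || ' ' || data_type
                      || ' AS (value:' || column_name || '::' || data_type || ')', ', ')
              WITHIN GROUP (ORDER BY ordinal_position)
           || ') WITH LOCATION = @mystage/' || LOWER(table_name) || '/'
           || ' FILE_FORMAT = (TYPE = ORC);' AS ddl
    FROM information_schema.columns
    WHERE table_name = 'PRODUCT'   -- hypothetical table name
    GROUP BY table_name;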

The issue

Since unquoted column names are always stored in UPPER case in Snowflake, and identifiers are case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement: the definitions of the virtual columns no longer match the casing of the source column names. For example, it will generate a virtual column as -> ID NUMBER AS (value:ID::NUMBER)

This returns NULL, because the column is named "Id" with a lower-case d in the source database, and therefore also in the ORC file in the data lake.
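The underlying behavior is easy to reproduce in isolation; a minimal sketch, using a JSON literal as a stand-in for the ORC content:

    -- variant paths are matched case-sensitively against the field name in the file
    SELECT v:Id AS correct_case,   -- returns 1
           v:ID AS wrong_case      -- returns NULL, there is no field named "ID"
    FROM (SELECT PARSE_JSON('{"Id": 1}') AS v);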

This feels like a major drawback of Snowflake. Is there any reasonable way around this issue? The only options I can think of are:

1. Load the information schema from the source database to Snowflake separately, and use that data to build virtual column definitions with correctly cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.

Both options add a lot of complexity or even mess up the data. Is there any straightforward way to get only the column names from an ORC file? Ultimately I would need something like Snowflake's DESCRIBE TABLE, but run against the file in the data lake.

Can you change the script that generates the DDL to put quotes around the column names? One word of caution: if you create columns with mixed or lower case, you will always have to quote the column names in Snowflake. – Mike Gohl

It doesn't matter if I quote the column name in this case. The column is named dataValue with a capital V in the ORC file: create or replace external table ext_tablename ( datavalue timestamp as (value:"datavalue"::timestamp) ) with location = @stg file_format = (FORMAT_NAME = 'formatname' type = ORC) – Cedersved

Does this give you a null? create or replace external table ext_tablename ( "dataValue" timestamp as (value:dataValue::timestamp) ) with location = @stg file_format = (FORMAT_NAME = 'formatname' type = ORC) – Mike Gohl

No, that works, since it uses the correct casing in the value: path of the virtual column. But the point is that I want to generate it dynamically based on the information schema of a table, not write all the virtual column definitions manually like in the 1980's :) – Cedersved

2 Answers

0 votes

Unless you have set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE, you can declare your columns in whatever casing you want:

CREATE TABLE "MyTable" ("Id" NUMBER);

If your dynamic SQL carefully uses "Id" and not just Id, you will be fine.
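Applied to the external table case, that means quoting both the column name and the path element. A minimal sketch (ext_mytable and @mystage are placeholder names):

    CREATE OR REPLACE EXTERNAL TABLE ext_mytable (
        -- the quoted path "Id" must match the ORC field casing exactly
        "Id" NUMBER AS (value:"Id"::NUMBER)
    )
    WITH LOCATION = @mystage/mytable/
    FILE_FORMAT = (TYPE = ORC);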

0 votes

Found an even better way to achieve this, so I'm answering my own question.

With the query below we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. It filters out columns that contain only NULL values. I will most likely create some kind of data type ranking table for the final data type determination for the virtual columns we're aiming to define dynamically for the external tables; a sketch of that ranking idea follows after the query.

SELECT f.path AS "ColumnName",
       TYPEOF(f.value) AS "DataType",
       COUNT(1) AS NbrOfRecords
FROM (
        SELECT $1 AS value
        FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
     ),
     LATERAL FLATTEN(input => value, recursive => TRUE) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1;
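As a follow-up, the data type ranking mentioned above could also be done directly in the query instead of in a separate table; a hypothetical sketch, where the CASE ranking and the listed TYPEOF values are my assumptions:

    -- when a path shows up with several non-null ORC types, keep the one
    -- that ranks as most general, so the virtual column cast is safe
    WITH observed AS (
        SELECT f.path AS column_name, TYPEOF(f.value) AS data_type
        FROM (
                SELECT $1 AS value
                FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
             ),
             LATERAL FLATTEN(input => value, recursive => TRUE) f
        WHERE TYPEOF(f.value) != 'NULL_VALUE'
        GROUP BY f.path, TYPEOF(f.value)
    )
    SELECT column_name, data_type
    FROM observed
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY column_name
        ORDER BY CASE data_type    -- hypothetical ranking, widest type first
                     WHEN 'VARCHAR' THEN 1
                     WHEN 'DOUBLE'  THEN 2
                     WHEN 'DECIMAL' THEN 3
                     WHEN 'INTEGER' THEN 4
                     WHEN 'BOOLEAN' THEN 5
                     ELSE 99
                 END
    ) = 1;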