1 vote

My use case is as below. I need to load data from a source into a Snowflake stage and, from the stage, load the data into Snowflake tables. Before loading into the Snowflake table, I need to read the columns of the files (basically the header) loaded in the stage. My understanding is that reading even a single row of a staged file (using SELECT) will not be efficient, because the file actually lives in the cloud provider's underlying object store. If I can load the file into a Snowflake table with COPY INTO and then find a way to read the column names, that will be much more efficient. I wanted to check if my understanding is correct.
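For reference, reading just the header row directly from the stage would look something like this (the stage and file format names are placeholders):

-- Placeholder names: @MY_STAGE and my_csv_format are just examples.
-- Reads only the first line of the staged file, i.e. the header row,
-- assuming the file format does not skip the header.
SELECT $1, $2, $3
FROM @MY_STAGE/my_file.csv (FILE_FORMAT => 'my_csv_format')
LIMIT 1;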

1
Should the title of the question be "How to read the headers from a staged file"? – Felipe Hoffa
You are right Felipe, I just changed the title. Do you think we should read it from the stage, or should we copy into a table and then try to read it? Since the stage is an object store, if I have many files in the stage, won't it be inefficient? – Rajib Deb
Do you plan to load each file in the stage individually or load many/all of them at the same time? Do all the files have the same headers? If you can explain why you need to read the header and what you are going to do with the information, someone may be able to come up with a better solution. – NickW
Hi Nick, the columns in the file change often, so I am trying to find a way to read the header, see if a column got added, and then alter the table to add the column before loading. I have two options. Option 1: use a JSON stage and a RAW table with a VARIANT column, copy the stage into RAW, and if a column was added, add that column to the core table and then load core from RAW. Option 2: read the header in the stage and add the column to RAW. I feel option 1 is a better choice, but wanted to get a second opinion. – Rajib Deb
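Option 1, sketched roughly (all table and column names here are placeholders):

-- Placeholder names: RAW_TABLE, CORE_TABLE, v, existing_col and new_col do not exist yet.
-- RAW_TABLE is assumed to have a single VARIANT column named v.
COPY INTO RAW_TABLE FROM @MY_JSON_STAGE FILE_FORMAT = (TYPE = 'JSON');
-- If a new key shows up in the data, add the matching column to the core table...
ALTER TABLE CORE_TABLE ADD COLUMN new_col STRING;
-- ...then load core from raw, projecting the keys out of the VARIANT column.
INSERT INTO CORE_TABLE
SELECT v:existing_col::STRING, v:new_col::STRING FROM RAW_TABLE;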

1 Answer

0 votes

This article by Craig Warman at Snowflake describes how you can automatically create a view over a JSON document/file. It can be adapted so that you read the JSON once and then automatically build up a table and a view based on the content of that file. You will need a custom JavaScript stored procedure.

With the stored procedure created, you can simply call:

call create_view_over_json(
  '@EXT_STAGE/sample.json', 
  'DB_NAME.MY_SCHEMA.sample', 
  'DB_NAME.MY_SCHEMA.sample_vw');

This generates only one call to the external stage, and then two more statements: one to query the table that contains the raw JSON, and one to create the view over it.
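For the example call above, the first statement the procedure builds and runs looks roughly like this:

CREATE OR REPLACE TABLE DB_NAME.MY_SCHEMA.sample AS
SELECT
   parse_json($1) AS json_data
FROM @EXT_STAGE/sample.json;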

After that you have the table containing the data, and a view to query directly:

SELECT * FROM DB_NAME.MY_SCHEMA.sample_vw;

And the stored procedure you need for all this:

create or replace procedure create_view_over_json (STAGE_FILE_NAME varchar, TABLE_NAME varchar, VIEW_NAME varchar)
returns varchar
language javascript
as
$$
// Attribution:
// Craig Warman for the original code who also leveraged code developed by Alan Eldridge.

// This generates paths with levels enclosed by double quotes (ex: "path"."to"."element").  It also strips any bracket-enclosed array element references (like "[0]")
var path_name = "regexp_replace(regexp_replace(f.path,'\\\\[(.+)\\\\]',''), '(\\\\w+)','\"\\\\1\"')";
var attribute_type = "DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING')";    // This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
var alias_name = "REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\\\[(.+)\\\\]'),'[^a-zA-Z0-9]','_')";       // This generates column aliases based on the path
var col_list = "";

var json_col_name = "json_data";
// Create or replace the table based on the file
var table_ddl = "CREATE OR REPLACE TABLE " + TABLE_NAME + " AS \n" +
                "SELECT \n" +
                "   parse_json($1) AS " + json_col_name + " \n" +
                "FROM " + STAGE_FILE_NAME + "; \n"
var table_stmt = snowflake.createStatement({sqlText:table_ddl});
var table_res = table_stmt.execute();

// Build a query that returns a list of elements which will be used to build the column list for the CREATE VIEW statement
var element_query = "SELECT DISTINCT \n" +
                    path_name + " AS path_name, \n" +
                    attribute_type + " AS attribute_type, \n" +
                    alias_name + " AS alias_name \n" +
                    "FROM \n" + 
                    TABLE_NAME + ", \n" +
                    "LATERAL FLATTEN(" + json_col_name + ", RECURSIVE=>true) f \n" +
                    "WHERE TYPEOF(f.value) != 'OBJECT' \n" +
                    "AND NOT contains(f.path,'[') ";      // This prevents traversal down into arrays;

// Run the query...
var element_stmt = snowflake.createStatement({sqlText:element_query});
var element_res = element_stmt.execute();

// ...And loop through the list that was returned
while (element_res.next()) {

// Add elements and datatypes to the column list
// They will look something like this when added: 
//    col_name:"name"."first"::STRING as name_first, 
//    col_name:"name"."last"::STRING as name_last   

   if (col_list != "") {
      col_list += ", \n";}
   col_list += "\t" + json_col_name + ":" + element_res.getColumnValue(1);   // Start with the element path name
   col_list += "::" + element_res.getColumnValue(2);             // Add the datatype
   col_list += " as " + element_res.getColumnValue(3);           // And finally the element alias 
}

// Now build the CREATE VIEW statement
var view_ddl = "CREATE OR REPLACE VIEW " + VIEW_NAME + " AS \n" +
               "SELECT \n" + col_list + "\n" +
               "FROM " + TABLE_NAME;

var view_stmt = snowflake.createStatement({sqlText:view_ddl});
var view_res = view_stmt.execute();
view_res.next();

return view_ddl;   // Return the view DDL that was executed
$$;
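For a simple nested document, the view DDL that comes back looks something like this (the actual columns depend entirely on the content of your JSON):

-- Illustrative output only; column names here come from a hypothetical sample document.
CREATE OR REPLACE VIEW DB_NAME.MY_SCHEMA.sample_vw AS
SELECT
    json_data:"name"."first"::STRING as name_first,
    json_data:"name"."last"::STRING as name_last
FROM DB_NAME.MY_SCHEMA.sample;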

Some points:

  • If the json is complex, you need to adapt the stored procedure to take care of that. Again, Craig to the rescue in part 2
  • If the json changes significantly from time to time, consider ensuring your schema does not break. Some light reading on Schema Evolution here.
  • This example is for a JSON file; the same solution can be adapted for Avro, Parquet, CSV, and even XML by changing how the CREATE TABLE statement is generated and how the later query is constructed for the respective format (a rough sketch for Parquet is below).
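For example, Parquet and Avro files also land in a single VARIANT column when queried from a stage, so only the CREATE TABLE part of the procedure needs to change, roughly like this (the file format name is a placeholder):

-- 'my_parquet_format' is a placeholder; for Parquet/Avro, $1 is the whole record as a VARIANT.
CREATE OR REPLACE TABLE DB_NAME.MY_SCHEMA.sample AS
SELECT
   $1 AS json_data
FROM @EXT_STAGE/sample.parquet (FILE_FORMAT => 'my_parquet_format');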