I'm doing a POC to load a Snowflake table from a DataFrame in Databricks. The table loads successfully, but the write changes its structure.

For example, in Snowflake I created this table:

CREATE OR REPLACE TABLE FNT_DAYS
(
    FNT_DT_PK TIMESTAMP_NTZ NOT NULL,
    OPEN_DT_FLG VARCHAR(1),
    HOLIDAY_DT_FLG VARCHAR(1),
    LOAD_USR VARCHAR(10)
);

ALTER TABLE FNT_DAYS ADD CONSTRAINT FNT_DAYS_PK PRIMARY KEY (FNT_DT_PK);

When I run my Python code in Databricks, the table is loaded successfully, but its structure changes to this:

CREATE OR REPLACE TABLE FNT_DAYS
    (
        FNT_DT_PK TIMESTAMP_NTZ,
        OPEN_DT_FLG VARCHAR(16777216),
        HOLIDAY_DT_FLG VARCHAR(16777216),
        LOAD_USR VARCHAR(10)
    );

Note that the primary key constraint is gone, the FNT_DT_PK field is no longer NOT NULL, and every VARCHAR field's length has been changed to 16777216.

My Python code in Databricks is very straightforward:

%python

options = dict(sfUrl="mysnflk.snowflakecomputing.com",
               sfUser="me",
               sfPassword="******",
               sfDatabase="SNF_DB",
               sfSchema="PUBLIC",
               sfWarehouse="SNF_WH")

df = spark.sql("select * from exc.test")

df.write \
  .format("snowflake") \
  .mode("overwrite") \
  .options(**options) \
  .option("dbtable", "FNT_DAYS") \
  .save()

Do you have an idea of why the table structure is changed in Snowflake?

1 Answer

If you look at the query_history in Snowflake, do you see the table being recreated by the df.write command? It appears the connector is dropping and recreating the table, using the data types of the DataFrame to define the new table. I don't know exactly what is causing that, but I do see that the Snowflake example (https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#id1) uses a slightly different syntax for the mode.
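
If you want the table definition you created to survive the load, one thing worth trying is the connector's truncate_table option, which is meant to make an overwrite truncate the existing table rather than drop and recreate it. A rough sketch, reusing the options dict from the question and assuming your connector version supports the truncate_table and usestagingtable options:

# Sketch: overwrite by truncating the existing table instead of dropping it,
# so the column types, NOT NULL, and PK constraint you defined are kept.
# truncate_table/usestagingtable are Spark-Snowflake connector options;
# verify that your connector version supports them.
df.write \
    .format("snowflake") \
    .mode("overwrite") \
    .options(**options) \
    .option("dbtable", "FNT_DAYS") \
    .option("truncate_table", "on") \
    .option("usestagingtable", "off") \
    .save()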

I should also note that the length of those VARCHAR fields will not hurt you in any way in Snowflake. Length does not affect storage or performance, and a length of 16777216 just means the connector declared the columns as VARCHAR with no length specified. Also, primary key constraints are not enforced in Snowflake, so I'm not sure how important that is to you. The only thing I'd be concerned about is the NOT NULL, which Snowflake does enforce.
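
If the table does end up being recreated, you can always re-apply the constraints after the load with something like this (using the column and constraint names from your original DDL):

-- Re-apply NOT NULL and the primary key if the load recreated the table
ALTER TABLE FNT_DAYS ALTER COLUMN FNT_DT_PK SET NOT NULL;
ALTER TABLE FNT_DAYS ADD CONSTRAINT FNT_DAYS_PK PRIMARY KEY (FNT_DT_PK);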