4
votes

I'm trying to insert data from a Pandas dataframe into a table in Snowflake, and I'm having trouble figuring out how to do it properly. To start with, I have created a table in Snowflake that has some columns of type VARIANT. For example:

CREATE OR REPLACE TABLE
    mydatabase.myschema.results(
        DATE date, 
        PRODUCT string, 
        PRODUCT_DETAILS variant, 
        ANALYSIS_META variant,
        PRICE float
)

Then in Pandas, I have a dataframe like this:

import pandas as pd
record = {'DATE': '2020-11-05',
 'PRODUCT': 'blue_banana',
 'PRODUCT_DETAILS': "{'is_blue': True, 'is_kiwi': nan}",
 'ANALYSIS_META': "None",
 'PRICE': 13.02}
df = pd.DataFrame(record, index=[0])

As you see, I've encoded the VARIANT columns as strings, since that's what I understood from the snowflake-connector documentation: a Snowflake VARIANT type maps to the str dtype in Pandas and vice versa.

So, what I've tried so far is the following:

from snowflake.connector import pandas_tools

pandas_tools.write_pandas(
                conn=conn,
                df=df,
                table_name="results",
                schema="myschema",
                database="mydatabase")

And this does work, returning

(True,
 1,
 1,
 [('czeau/file0.txt', 'LOADED', 1, 1, 1, 0, None, None, None, None)])

However, the results I get in Snowflake are not of the proper VARIANT type. The field ANALYSIS_META is correctly NULL, but the field PRODUCT_DETAILS is stored as a plain string. (screenshot of the table preview omitted)

Also, for example, this query throws an error:

SELECT * FROM
MYDATABASE.MYSCHEMA.RESULTS
WHERE PRODUCT_DETAILS:is_blue -- should work for json/variant fields

So with all that, my question is: how should I properly format my Pandas dataframe in order to insert the VARIANT fields correctly as nested fields into a Snowflake table? I thought that casting a dictionary into a string would do the trick, but apparently it doesn't work as I expected. What am I missing here?

What is the error message you are getting from SELECT * FROM MYDATABASE.MYSCHEMA.RESULTS? – demircioglu

@demircioglu It says SQL compilation error: Invalid data type [VARIANT] for predicate [GET(TESTING.PRODUCT_DETAILS, 'is_blue')] – tania

Try this, it should work: SELECT * FROM MYDATABASE.MYSCHEMA.RESULTS WHERE parse_json(PRODUCT_DETAILS):is_blue – demircioglu

@demircioglu I'm still getting the same error as above (SQL compilation error: Invalid data type [VARIANT] for predicate [GET(PARSE_JSON(TO_CHAR(TESTING.PRODUCT_DETAILS)), 'is_blue')]), but even if it worked, I'm interested in making sure the type is properly inserted as VARIANT. Some other tables in Snowflake that I can see are correctly inserted as VARIANT type are highlighted blue and you can click on the values when you preview, as well as extract the nested subfields with the : notation. I would like to achieve that without having to cast into json again when selecting. – tania

2 Answers

1
votes

After some investigation, I found the following solution to work:

1. Ensure that the columns are json-compliant

The key here is that json.dumps will transform your data into valid JSON: double quotes around keys and string values, null instead of None, and true/false instead of True/False.

import pandas as pd
import json
record = {'DATE': '2020-11-05',
 'PRODUCT': 'blue_banana',
 'PRODUCT_DETAILS': json.dumps({'is_blue': True, 'is_kiwi': None}),
 'ANALYSIS_META': json.dumps(None),
 'PRICE': 13.02}
df = pd.DataFrame(record, index=[0])
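To make the difference concrete, here is a minimal check showing why the Python repr of a dict (as used in the original question) is not valid JSON, while the json.dumps output is:

```python
import json

# Python's repr of a dict is NOT valid JSON: single quotes, True/None
python_repr = str({'is_blue': True, 'is_kiwi': None})
print(python_repr)   # {'is_blue': True, 'is_kiwi': None}

# json.dumps produces valid JSON: double quotes, true/null
as_json = json.dumps({'is_blue': True, 'is_kiwi': None})
print(as_json)       # {"is_blue": true, "is_kiwi": null}

json.loads(as_json)  # round-trips fine
try:
    json.loads(python_repr)  # this is also what Snowflake's parse_json chokes on
except json.JSONDecodeError as e:
    print("not valid JSON:", e)
```

Snowflake's parse_json applies the same strictness as json.loads here, which is why the original string-cast dict failed.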

2. Ensure you use parse_json and INSERT iteratively

Instead of using write_pandas as tried originally, we can INSERT into the table row by row, wrapping the columns of desired VARIANT type in parse_json and quoting each value as a string literal (by putting ' marks around it). The caveat is that this solution will be very slow if you have large amounts of data.

sql = """INSERT INTO MYDATABASE.MYSCHEMA.RESULTS
SELECT
 to_date('{DATE}'),
 '{PRODUCT}',
 parse_json('{PRODUCT_DETAILS}'),
 parse_json('{ANALYSIS_META}'),
 {PRICE}
"""
### CREATE A SNOWFLAKE CONN...

for i, r in df.iterrows():
    conn.cursor().execute(sql.format(**dict(r)))
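Note that str.format interpolation is fragile if a value contains a quote character (and it is open to SQL injection). As a sketch of an alternative, the Snowflake Python connector's default pyformat parameter binding can pass the values separately; the commented-out execution at the end assumes a live connection named conn:

```python
import json
import pandas as pd

record = {'DATE': '2020-11-05',
          'PRODUCT': 'blue_banana',
          'PRODUCT_DETAILS': json.dumps({'is_blue': True, 'is_kiwi': None}),
          'ANALYSIS_META': json.dumps(None),
          'PRICE': 13.02}
df = pd.DataFrame(record, index=[0])

# %(NAME)s placeholders are filled by the connector, not by Python
sql = ("INSERT INTO MYDATABASE.MYSCHEMA.RESULTS "
       "SELECT to_date(%(DATE)s), %(PRODUCT)s, "
       "parse_json(%(PRODUCT_DETAILS)s), parse_json(%(ANALYSIS_META)s), %(PRICE)s")

params = [dict(r) for _, r in df.iterrows()]
# with a live connection:
# for p in params:
#     conn.cursor().execute(sql, p)
```

This keeps quoting concerns on the connector's side instead of hand-escaping strings.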
0
votes

I think there are two problems coming together here. Firstly, the syntax for inserting variants is a bit odd. Secondly, it's important to realise that single quotes ' and double quotes " are not the same thing in Snowflake (or in SQL in general).

Problem statement

Instead of a pandas dataframe, we can solve this for a dictionary first; it is then straightforward to extend to a row of a pandas dataframe.

The goal is to write this query:

"""
INSERT INTO mydatabase.myschema.results (
     date, product, product_details, analysis_meta, price
)
SELECT
    '2018-01-01',
    'my_new_products',
    parse_json('{"is_blue": true, "is_kiwi": null}'),
    parse_json('{}'),
    13.02
"""

To achieve this, you have to make sure that the " and ' don't get mixed up, and that the dictionary values are wrapped in parse_json().


Proposed solution

A general solution could be based on (un)zipping the items in record and manipulating them a bit along the way. First, we need a function that will process our (to be inserted) values, based on their data type:

import json

def process_val(v) -> str:
    """Wrap dicts and string values for an SQL INSERT statement."""
    if isinstance(v, dict):
        # single quotes so Snowflake receives the JSON as a string literal
        return f"parse_json('{json.dumps(v)}')"
    elif isinstance(v, str):
        return f"'{v}'"
    else:
        return str(v)  # str() so that ", ".join() works for numbers too

If you stumble upon new types of values, you can add them to process_val(). We then use process_val() while joining the processed values together:

from typing import Any, Dict

def create_insert_statement(table: str, record: Dict[str, Any]) -> str:
    columns, values = zip(*record.items())

    col_string = ", ".join(columns)
    val_string = ", ".join(process_val(v) for v in values)

    return f"INSERT INTO {table} ({col_string}) SELECT {val_string}"

You can now iteratively insert your records:

table = "mydatabase.myschema.results"

for i, r in df.iterrows():
    query = create_insert_statement(table=table, record=dict(r))
    conn.cursor().execute(query)
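As a sanity check, the helpers applied to the example record reproduce the target statement from the problem statement. This standalone sketch redefines the helpers so it runs on its own (with the JSON string wrapped in single quotes, and non-string values passed through str() so the join works):

```python
import json
from typing import Any, Dict

def process_val(v) -> str:
    """Wrap dicts and string values for an SQL INSERT statement."""
    if isinstance(v, dict):
        return f"parse_json('{json.dumps(v)}')"
    elif isinstance(v, str):
        return f"'{v}'"
    else:
        return str(v)

def create_insert_statement(table: str, record: Dict[str, Any]) -> str:
    columns, values = zip(*record.items())
    col_string = ", ".join(columns)
    val_string = ", ".join(process_val(v) for v in values)
    return f"INSERT INTO {table} ({col_string}) SELECT {val_string}"

query = create_insert_statement(
    "mydatabase.myschema.results",
    {"date": "2018-01-01",
     "product": "my_new_products",
     "product_details": {"is_blue": True, "is_kiwi": None},
     "price": 13.02})
print(query)
# INSERT INTO mydatabase.myschema.results (date, product, product_details, price)
# SELECT '2018-01-01', 'my_new_products',
#        parse_json('{"is_blue": true, "is_kiwi": null}'), 13.02
```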