
TL;DR

How can I make sure the datetime values in my parquet file are copied into a Snowflake table properly?

Description

I am using a parquet file to upsert data to a stage in Snowflake. That file is then used to COPY INTO a Snowflake table.

Everything works well except the datetime values: depending on whether I use fastparquet or pyarrow to save the parquet file locally, the datetime values come out correct or not (the column's data type in Snowflake is TIMESTAMP_NTZ(9)):

# With fastparquet:
ID    SOME_TIMESTAMP
----------- 
1     2020-02-02 00:35:28.000
2     2020-02-02 00:35:31.857
3     2020-02-02 00:35:21.000

# With pyarrow:
ID    SOME_TIMESTAMP
----------- 
1     52057-05-03 07:06:40.000
2     52057-05-03 08:10:57.000
3     52057-05-03 05:10:00.000

In pandas, however, the dataframes are identical, with both having the same dtype: datetime64[ns]

1 2020-02-02 00:35:28.000
2 2020-02-02 00:35:31.857
3 2020-02-02 00:35:21.000

Therefore, I believe the problem is rooted in the way fastparquet and pyarrow deal with datetime values and, subsequently, in how Snowflake reads the parquet files.
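For reference, a minimal sketch of how the two files are produced (file names are placeholders; the values are the ones from above):

import pandas as pd

# Same dataframe, written once with each engine
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'SOME_TIMESTAMP': pd.to_datetime([
        '2020-02-02 00:35:28.000',
        '2020-02-02 00:35:31.857',
        '2020-02-02 00:35:21.000',
    ]),
})
print(df.dtypes)  # SOME_TIMESTAMP is datetime64[ns] either way

df.to_parquet('with_fastparquet.parquet', engine='fastparquet')
df.to_parquet('with_pyarrow.parquet', engine='pyarrow')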

Solution ideas

At this point, I can think of three ways to approach this problem:

  1. Make sure that the parquet files are saved with fastparquet. Even though I cannot control this directly, I could always open each parquet file with pd.read_parquet() and re-save it with pd.to_parquet(fname, engine='fastparquet') (see the sketch after this list). Even though this works, I would really like to avoid it, since the files can get very large.
  2. Try to figure out how the files were saved, using meta information. If the engine used is not fastparquet, open & re-save the file with fastparquet. This would drastically reduce the overhead of option 1, because I would only open & re-save files that I know will cause trouble.
  3. Make sure that Snowflake handles the timestamp values correctly.
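For option 1, the round-trip would look roughly like this (fname is a placeholder path):

import pandas as pd

fname = 'some_file.parquet'  # placeholder

# Round-trip: read with whatever engine pandas picks, rewrite with fastparquet
df = pd.read_parquet(fname)
df.to_parquet(fname, engine='fastparquet')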

To go with option 2, I somehow need to distinguish (in Python) which engine saved the parquet file.
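One candidate is the created_by field in the parquet footer metadata, which records the writing library. A rough sketch (the exact created_by strings vary by library version, so the substring check is an assumption):

import pyarrow.parquet as pq

def written_by_fastparquet(fname):
    # The footer records the writer, e.g. something like
    # 'fastparquet-python version 0.8.3 (build 0)' for fastparquet or
    # 'parquet-cpp-arrow version 11.0.0' for pyarrow (strings vary by version)
    created_by = pq.ParquetFile(fname).metadata.created_by
    return 'fastparquet' in created_by.lower()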

Other alleys I went down

Snowflake allows me to specify the compression used, like COMPRESSION = AUTO (default) or COMPRESSION = SNAPPY, as described here. However, AFAIK both pyarrow and fastparquet write snappy-compressed files by default, and the AUTO option should handle that, according to the documentation.
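For completeness, this is how the compression can be pinned down explicitly on the Python side (pandas defaults to snappy for to_parquet anyway):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3]})  # stand-in dataframe

# Explicit snappy compression with either engine
df.to_parquet('with_pyarrow.parquet', engine='pyarrow', compression='snappy')
df.to_parquet('with_fastparquet.parquet', engine='fastparquet', compression='snappy')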


1 Answer


Although I'm not 100% sure what pandas or Python is doing, I've seen this before, and for us it was due to epoch times being converted to timestamps. When this data gets passed in, it seems to be treated as an epoch value and then CAST (::TIMESTAMP_NTZ) to TIMESTAMP_NTZ, which for some reason gives it a scale of 0 instead of 9.

I recommend using a function to CONVERT ([TRY_]TO_TIMESTAMP(<value>, [scale/format])) the value into a timestamp. This gives you the control to tell the function what type of data is being passed in and how it should be interpreted. With CAST you're leaving it to Snowflake to decide.

Here is some example code that hopefully helps to visualize the situation. I used your timestamp examples, which start out as strings. I then converted (and, for completeness, cast) them to timestamps, then to milliseconds, and then used the milliseconds to get back to timestamps.

SELECT $1 AS C1
    ,$2 AS C2
    // strings cast or converted to timestamps both result in the same output, so we can use either to calculate the MS from EPOCH start
    ,$2::TIMESTAMP_NTZ AS STRING_CAST_TO_DTTM
    ,TO_TIMESTAMP_NTZ($2) AS STRING_CONVERT_TO_DTTM
    ,DATEDIFF('MS','0',STRING_CAST_TO_DTTM) AS MS
//    ,DATEDIFF('MS','0',STRING_CONVERT_TO_DTTM) AS MS
    // the lines below are integers being CONVERTED to timestamp and the scale is an explicit parameter of the function; note that everything converts correctly BASED ON SCALE
    ,TO_TIMESTAMP_NTZ(MS,0) AS CONVERT_NTZ0
    ,TO_TIMESTAMP_NTZ(MS,3) AS CONVERT_NTZ3
    ,TO_TIMESTAMP_NTZ(MS,6) AS CONVERT_NTZ6
    ,TO_TIMESTAMP_NTZ(MS,9) AS CONVERT_NTZ9
    // the lines below are integers being CAST to timestamp; note that the scale does not matter: everything is returned as TIMESTAMP_NTZ(0)
    ,MS::TIMESTAMP_NTZ(0) AS CAST_NTZ0
    ,MS::TIMESTAMP_NTZ(3) AS CAST_NTZ3
    ,MS::TIMESTAMP_NTZ(6) AS CAST_NTZ6
    ,MS::TIMESTAMP_NTZ(9) AS CAST_NTZ9
FROM VALUES 
(1,'2020-02-02 00:35:28.000')
,(2,'2020-02-02 00:35:31.857')
,(3,'2020-02-02 00:35:21.000')
;

// use this to see the resulting data types for each column
DESC RESULT LAST_QUERY_ID();

Comment if you have any questions about the above code.