TLDR
How can I make sure the datetime values in my parquet file are copied into a Snowflake table correctly?
Description
I am using a parquet file to upsert data into Snowflake: the file is uploaded to a stage, and COPY INTO then loads it into a Snowflake table.
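For context, the load is roughly the following (a minimal sketch using the Python connector; the stage, table, and credentials are placeholders, not my real setup):

```python
import snowflake.connector

# Placeholder credentials / object names.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)
cur = conn.cursor()

# Upload the locally written parquet file to an internal stage
# (AUTO_COMPRESS=FALSE so the file is staged as-is).
cur.execute("PUT file:///tmp/data.parquet @MY_STAGE AUTO_COMPRESS=FALSE")

# Load the staged file into the target table.
cur.execute("""
    COPY INTO MY_TABLE
    FROM @MY_STAGE/data.parquet
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

conn.close()
```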
All works well except for the datetime values: depending on whether I use fastparquet or pyarrow to save the parquet file locally, the datetime values arrive correct or wrong (the data type in Snowflake is TIMESTAMP_NTZ(9)):
# With fastparquet:
ID  SOME_TIMESTAMP
---------------------------
1   2020-02-02 00:35:28.000
2   2020-02-02 00:35:31.857
3   2020-02-02 00:35:21.000

# With pyarrow:
ID  SOME_TIMESTAMP
---------------------------
1   52057-05-03 07:06:40.000
2   52057-05-03 08:10:57.000
3   52057-05-03 05:10:00.000
In pandas, however, the dataframes are identical, both with the same dtype, datetime64[ns]:
1 2020-02-02 00:35:28.000
2 2020-02-02 00:35:31.857
3 2020-02-02 00:35:21.000
Therefore, I believe the problem is rooted in the way fastparquet and pyarrow write datetime values, and consequently in how Snowflake reads the parquet files.
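Interestingly, the wrong values are off by roughly a factor of 1000 (year 2020 becomes year ~52057), which looks like a unit mismatch, e.g. nanoseconds being read as microseconds, though I have not verified this. To see what actually lands on disk, the same frame can be written with both engines and the resulting parquet schemas compared (a small sketch; file names are just examples):

```python
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "SOME_TIMESTAMP": pd.to_datetime([
        "2020-02-02 00:35:28.000",
        "2020-02-02 00:35:31.857",
        "2020-02-02 00:35:21.000",
    ]),
})

# Write the identical frame with both engines.
df.to_parquet("with_pyarrow.parquet", engine="pyarrow")
df.to_parquet("with_fastparquet.parquet", engine="fastparquet")

# Print each file's parquet schema to see how SOME_TIMESTAMP is annotated
# (physical type and timestamp unit / logical type).
for fname in ("with_pyarrow.parquet", "with_fastparquet.parquet"):
    print(fname)
    print(pq.ParquetFile(fname).schema)
```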
Solution ideas
At this point, I can think of three ways to approach this problem:
1. Make sure that the parquet files are saved with fastparquet. Even though I cannot control this right away, I could always open each parquet file with pd.read_parquet() and save it again with to_parquet(fname, engine='fastparquet'). Even though this works, I would really like to avoid this option, since the files can get very large.
2. Try to figure out how a file was saved, using its meta information. If the engine used is not fastparquet, open & save the file with fastparquet. This would drastically reduce the problem pointed out above, because I would only open & save the files that I know will cause trouble.
3. Make sure that Snowflake handles the timestamp values correctly.
To go with option 2, I somehow need to distinguish (in Python) which engine a given parquet file was saved with.
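A sketch of what option 2 could look like, assuming the parquet footer's created_by field mentions fastparquet for fastparquet-written files (something I would still need to verify):

```python
import pandas as pd
import pyarrow.parquet as pq

def ensure_fastparquet(fname: str) -> None:
    """Rewrite fname with fastparquet unless it was already written by it."""
    created_by = pq.ParquetFile(fname).metadata.created_by or ""
    if "fastparquet" not in created_by.lower():
        # Round-trip through pandas so the file on disk is fastparquet-written.
        pd.read_parquet(fname).to_parquet(fname, engine="fastparquet")

ensure_fastparquet("some_file.parquet")  # hypothetical file name
```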
Other alleys I went down
Snowflake allows me to specify the compression used, e.g. COMPRESSION = AUTO (the default) or COMPRESSION = SNAPPY, as described here. However, AFAIK both pyarrow and fastparquet use this (snappy) compression by default, and the AUTO option should cover it anyway, according to the documentation.
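On the Python side, the compression can at least be pinned explicitly when writing (a small sketch; fastparquet also needs a snappy codec installed for this):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1],
    "SOME_TIMESTAMP": pd.to_datetime(["2020-02-02 00:35:28"]),
})

# Request snappy explicitly with either engine; pandas passes it through.
df.to_parquet("out_pyarrow.parquet", engine="pyarrow", compression="snappy")
df.to_parquet("out_fastparquet.parquet", engine="fastparquet", compression="snappy")
```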