
I'm getting some unexpected behavior from Snowflake and I'm hoping someone could explain what's happening and the best way to handle it. Basically, I'm trying to do a nightly refresh of an entire dataset, but truncating the table and staging/copying data into it results in old data being loaded.

I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.

Step 1: Data is loaded into an empty table.

put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)

Step 2: Data is truncated

TRUNCATE test_db.test_schema.test_table

Step 3: New data is loaded into the now empty table (same filename, but overwrite set to True).

put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)

At this point, if I query the data in the table, I see that it is the data from Step 1 and not Step 3. If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3 as expected. I'm trying to understand what is happening. Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE? What's the best way to achieve the behavior that I want? Thank you!


2 Answers

1 vote

I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.

Are you certain you are manually committing each step, with the connection.commit() API call returning successfully?

Running your statements in the following manner reproduces your issue, understandably so because the TRUNCATE and COPY INTO TABLE statements are not auto-committed in this mode:

<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>

<BEGIN_SCRIPT 2>
[Step 2]
[Step 3]
<END_SCRIPT 2>


SELECT * FROM test_table; -- Prints rows from file in Step 1
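Snowflake isn't needed to see why the first script behaves this way; any transactional database shows the same read isolation. The sketch below uses sqlite3 from the Python standard library as a stand-in (with DELETE standing in for TRUNCATE): one connection clears and reloads the rows without committing, and a second connection still sees the old data.

```python
# Minimal stand-in for the uncommitted script above, using sqlite3 instead
# of Snowflake. The table and values are illustrative only.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (v TEXT)")
writer.execute("INSERT INTO t VALUES ('step 1')")
writer.commit()                                # SCRIPT 1 ends with COMMIT

writer.execute("DELETE FROM t")                # stand-in for TRUNCATE (Step 2)
writer.execute("INSERT INTO t VALUES ('step 3')")  # stand-in for the reload (Step 3)
# ...SCRIPT 2 never commits

# A separate connection only sees the last committed state: the Step 1 rows.
reader = sqlite3.connect(path)
rows = reader.execute("SELECT v FROM t").fetchall()
print(rows)  # [('step 1',)]
```

Once `writer.commit()` is called, a fresh connection sees the Step 3 rows instead.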

However, modifying it to always commit changes the behaviour to the expected one:

<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>

<BEGIN_SCRIPT 2>
[Step 2]
COMMIT
[Step 3]
COMMIT
<END_SCRIPT 2>

SELECT * FROM test_table; -- Prints rows from file in Step 3
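With the Python connector, the always-commit version above amounts to calling connection.commit() after every step. This is a minimal sketch assuming an already-open snowflake.connector connection with autocommit disabled; the table, stage, and format names come from the question, and csv_path is a placeholder.

```python
def nightly_refresh(conn, csv_path="file://test.csv"):
    """Truncate and reload the table, committing after every step.

    `conn` is assumed to be an open snowflake.connector connection
    with autocommit disabled.
    """
    cur = conn.cursor()

    # Step 2: empty the table, then commit so the truncation is
    # actually persisted and visible to subsequent transactions.
    cur.execute("TRUNCATE TABLE test_db.test_schema.test_table")
    conn.commit()

    # Step 3: re-stage the file (overwriting the previous copy) and
    # load it, committing after the COPY so the new rows persist.
    cur.execute(f"PUT {csv_path} @test_db.test_schema.%test_table overwrite=true")
    cur.execute(
        "COPY INTO test_db.test_schema.test_table "
        "file_format=(format_name=my_format)"
    )
    conn.commit()
```

If either commit is skipped (or raises), a later session will still read the previously committed data, which matches the behaviour described in the question.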

If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3

This is expected because CREATE is a DDL statement, and DDL statements are always auto-committed in Snowflake (regardless of the autocommit setting). Using CREATE in place of TRUNCATE therefore causes an implicit commit at that step, which further suggests that your tests aren't actually committing at Step 2 and Step 3.

Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE?

No. If the PUT succeeds, it has performed an overwrite as instructed (assuming the filename stays the same). Older versions of the staged file no longer exist after they have been overwritten.

0 votes

Can you check whether the steps below fit your requirement:

  1. Create a staging table.
  2. Truncate the staging table.
  3. Load the nightly refresh of the entire dataset into the staging table.
  4. Use a MERGE statement to copy the data from the staging table into the target table. (To merge the two tables you need a primary key or keys.)

Make sure the staging table is truncated successfully before proceeding to the next step.
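The steps above can be sketched in SQL. This assumes a staging table test_table_stage with the same columns as the target; the key column id and the value column val are placeholders, not names from the question.

```sql
-- Steps 2-3: empty the staging table, then stage and load the nightly file.
TRUNCATE TABLE test_db.test_schema.test_table_stage;

PUT file://test.csv @test_db.test_schema.%test_table_stage overwrite=true;
COPY INTO test_db.test_schema.test_table_stage
  file_format=(format_name=my_format);

-- Step 4: merge into the target on the primary key (placeholder columns).
MERGE INTO test_db.test_schema.test_table t
USING test_db.test_schema.test_table_stage s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
```

Note that this approach updates or inserts rows but never deletes rows that have disappeared from the source file; if the nightly file is a full snapshot, the truncate-and-reload flow from the question (with proper commits) may be simpler.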

Hope this helps.