1
votes

Currently, we are using Hadoop and Snowflake for storing our data.

The process is Copy the Hadoop ORC files to Snowflake s3 location using DISTCP and then run the Copy into Snowflake table from S3. This will copy everything that is there in the Hadoop ORC table to Snowflake table.

Now, I have a new requirement wherein My Hadoop table is a transactional table and existing entries are getting updated every hour. If I copy the ORC files to S3 and run the Copy command, it adds up more entries to the existing table and not update the existing 1s.

How can I solve this problem in Snowflake?

3

3 Answers

1
votes

Typically three strategies are used for something like this:

  • reload everything periodically - depending on your amount of data, it might be possible and fast (Snowflake is known to be able to load multiple TBs per hour). You can use a combo of CREATE TABLE LIKE, COPY and SWAP to do it transactionally.
  • identify and export only the rows that are new - not sure what's your data source, but you might be able to do it
  • on loading into Snowflake, do not insert existing rows. You need to have some way to determine which rows are the same, e.g. an ID field (or a combination of fields). Then you can COPY into a temp table, and then use MERGE to merge new/modified rows (see the documentation for various options possible)
0
votes

It's bit late but one possible and more accurate solution would be :

  1. Create transient table on fly and copy your hourly transition data to this transient tables.
  2. Merge into final table : From the transient table created in step-1, MERGE into final table based in your PK.

This is typical case of maintaining snapshot tables vs journal tables. To avoid step-1 manually, you can also take advantage of SNOWPIPE.

-1
votes

In such cases use Snowflake MERGE command to update/insert rows to target table. In addition to that, you can explore Snowflake CDC to load only delta of changes between previous and current moment: https://docs.snowflake.net/manuals/user-guide/streams.html