0
votes

I want to use AWS S3 as external storage and use Snowflake external tables to query the data. To not lose/overwrite any data, I have enabled versioning on the S3 bucket. When multiple versions of a file exists, they will show as duplicates in Snowflake and I could find an option to hide them from external stages/external tables.

LIST @my_stage; -- shows duplicate files

SELECT $1, $2, $3 FROM @my_stage; -- returns duplicate records

CREATE OR REPLACE EXTERNAL TABLE my_external_table (
    Name STRING AS (value:c1::STRING),
)
with location = @my_stage;
SELECT * FROM my_external_table; -- shows duplicate records

SELECT DISTINCT row1, row2, row3 from my_external_table; -- manually hide duplicates

Is there any way to only select rows from the latest version of a file without the need to use DISTINCT? Thanks in advance

1
I don't know S3 versioning very well, but are you able to reference each version of a file via filename or do you have to leverage filename + file id in order to get a specific file? The reason I ask is because you could create a stored procedure that removes older file versions from the metadata of the external table, if you can reference them directly by filename. - Mike Walton
It turned out there were actual duplicate files on S3 as some of them had an additional whitespace in their name, which was hard to spot: stackoverflow.com/questions/66198840/… - cedricbastin
Ah, that makes sense. I didn't think Snowflake could reference old versions of files, but wasn't sure. Good to know. - Mike Walton

1 Answers

0
votes

I turns out that there were actual duplicate files on S3, they have exactly the same name except that one had a whitespace attached at the end of the filename. I could not easily see this in the AWS Console nor via the AWS S3 CLI, but this command shows the file names with " around them:

aws s3api list-objects --bucket my_bucket