1
votes

We have an aws s3 bucket where same file is getting rewritten every day. And we need to load that file in snowflake which is only loaded today.

Copy command is just working fine. however when a new file is not loaded it just load the previous one which is not needed. instead it should throw an error for the team to know.

We can query the last_modified date:

ls @stage_name

or we can get the stage metadata information.

By query

select metadata@filename from @stage_name

But I am not able to find any way of querying or filtering last_modified column of stage.

1
Can't you remove the stage file after it is loaded to snowflake?Rajib Deb
What is the COPY INTO command you are using? I don't think your issue is with the last_modified date, but rather how you are handling errors in the COPY command.Mike Walton
loading same file isn't a duplicate loading, something snowflake should throw error while copy command ?PIG
@RajibDeb thats the challenge, we dont have permission to remove or move file in S3danD
@PIG copying the same file does not throw error. it just dont loads file unless we use force=true; if we get error message instead it will solve the issue, as the team will get notified.danD

1 Answers

1
votes

The ls command is a metadata query. While you can't filter on that, you can filter on the result set that it generates:

ls @MYSTAGE;
-- You must run this immediately after the previous query in the same session
select * from table(result_scan(last_query_id())); 

There are two complications though. First, the name of the metadata columns is lowercase, so be sure to use lower case names in your filters and wrap them in double quotes. The second problem is the date field is not in a format that lends itself well to filtering in a query. To address this problem, I wrote a UDF to convert last_modified fields to a timestamp for easier filtering:

-- Convert the last modified value from the Snowflake LIST
-- command into a timestamp.
create or replace function LAST_MODIFIED_TO_TIMESTAMP(LAST_MODIFIED string) 
returns timestamp_tz
as
$$
    to_timestamp_tz(left(LAST_MODIFIED, len(LAST_MODIFIED) - 4) || ' ' || '00:00', 'DY, DD MON YYYY HH:MI:SS TZH:TZM')
$$;

Then you can filter the results like this:

ls @TESTSTAGE;

select * from table(result_scan(last_query_id())) 
where LAST_MODIFIED_TO_TIMESTAMP("last_modified") < '2020-11-01' ;