2
votes

Problem summary

I use a COPY query to load data from S3 to a Redshift table t1.

99% of the time there's no errors, data is correctly loaded. So I know there's no mistake on the loaded table name.

But when there are errors, depending on the table I loaded, I can't always correctly track which error happened during loading of this specific table because, for some loaded tables like t1, the table identifier tbl found in stl_load_errors doesn't correspond to the one expected for table t1.

I filter stl_load_errors both on filename and tbl, because a same S3 file may be loaded to different tables, so filtering on filename only isn't a safe way to check if there were errors during one specific COPY query.

Details

The loaded table name doesn't appear in this table, instead, we get an integer identifier 1234567 in the column tbl. We have to join with another table stv_tbl_perm that contains both name and identifier, to get the name. This "trick" is shown in the Redshift documentation.

For some tables, like t1, the error row I may found in stl_load_errors after using COPY t1 ... will have an identifier 1234567 in tbl column that doesn't correspond to anything found in stv_tbl_perm. As if it was the identifier of a temporary table.

Then, when I look in stv_tbl_perm for"name"='t1', I do find an identifier 10111213 in column id, but it's not the one I saw in stl_load_errors.

What makes it even more confusing is that my method of filtering errors on file and table, works perfectly fine for some Redshift tables, the id found is the one expected, it matches with the correct name in stv_tbl_perm.

Why would my COPY to a table go through what seems like a temporary table for some table, and not for others?

Reproduction

I tried to simplify my process as much as possible for sharing and managed to reproduce my issue as follows.

1. Create table, insert rows so it's not empty and check it.

CREATE TABLE IF NOT EXISTS public.t1
(
    id INTEGER,
    name VARCHAR(36),
    price NUMERIC(6,2)
);
insert into t1 VALUES (1, 'paul', 10.50);
select * from t1;

2. Create an S3 file with a deliberate type error

I write a character in the column "id" so it will cause a type error during loading.

unload ('select ''a'' as "id", ''pierre'' as "name", 2.50 as "price"')   
to 's3://my-bucket/redshift-load-error-table-id/unload/t1_' 
iam_role 'arn:aws:iam::1111111111:role/my-user'
parallel off 
delimiter ','; 

Make sure to replace my-bucket, 1111111111 and my-user with your own values.

3. Load this file into the table using COPY

COPY t1
from 's3://my-bucket/redshift-load-error-table-id/unload/t1_000'
iam_role 'arn:aws:iam::1111111111:role/my-user'
CSV;

Query fails as expected, generating a row in table stl_load_errors.

4. Check for the table id in the new error row in stl_load_errors

SELECT * 
FROM stl_load_errors 
where trim(filename) = 's3://my-bucket/redshift-load-error-table-id/unload/t1_000';

We find the row corresponding to the file and keep the table identifier found in the column tbl, let's say "tbl" = 1234567 (you'll get something different).

5. Look up this table identifier in stv_tbl_perm to get the table name

select *
from stv_tbl_perm 
where id='1234567';

This returns nothing, the identifier found doesn't correspond to any permanent table listed in stv_tbl_perm. As if it was the identifier of a temporary table.

6. Look up the expected identifier of the table we loaded to

select *
from stv_tbl_perm 
where name='t1';

This returns 10111213 (you'll get something different), and this is the identifier I should have gotten in the tbl column of table stl_load_errors.

If I try another COPY with an error, another line will appear in the error table, with again another identifier different from both previous identifier. This also indicates it could be the identifier of a temporary table.

Consequence

I therefore cannot filter stl_load_errors for the specific table I applied the COPY query on.

Surprisingly, this process works perfectly fine for some of my tables and I can't find any difference in the way I do it between the tables for which it works and the tables for which it doesn't.

I haven't found any reference to that issue on the web yet. It requires a lot of details so it's hard to search too.

Any idea?

2
I have posted my issue on the AWS Developer forum here. - NicolasPA

2 Answers

3
votes

This was identified as an actual Redshift issue due to a recent Redshift feature by an AWS engineer on the AWS developer forum where I posted my problem.

@joeharris76-AWS:

OK, thank you for bringing this to our attention. We've identified the source of this issue and will investigate a fix. When it is fixed a note will to appear in our regular maintenance announcements at the top of the forum.

This is related to the automatic DISTSTYLE feature we introduced recently. For DISTSTYLE AUTO tables that are currently ALL, we load new data into a temp table and then check if the new rows added require the table to be converted from ALL to EVEN. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

DISTSTYLE { AUTO | EVEN | KEY | ALL } The default is AUTO. … AUTO: Amazon Redshift assigns an optimal distribution style based on the table data. For example, if AUTO distribution style is specified, Amazon Redshift initially assigns ALL distribution to a small table, then changes the table to EVEN distribution when the table grows larger. The change in distribution occurs in the background, in a few seconds. …

There are 2 ways work around this:

  1. Define an explicit DISTSTYLE of EVEN, KEY, or ALL on the table being loaded. I recommend this option primarily for tables that can use a KEY diststyle. If you are not using KEY then the AUTO diststyle will be the most efficient.

  2. Retrieve the query ID from the stl_load_errors, retrieve the SQL from stl_querytext for the query, and then look up the table by name from pg_class, svv_table_info, or stv_tbl_perm.

https://forums.aws.amazon.com/thread.jspa?messageID=897976

Specifying a DISTSTYLE in my little reproduction does solve the problem and I'll apply the same solution to my production tables because they seem to fit with the EVEN style (hundreds of millions of rows, no joins).

So in my reproduction case, I add a line in the table creation:

CREATE TABLE IF NOT EXISTS public.t1
(
    id INTEGER,
    name VARCHAR(36),
    price NUMERIC(6,2)
)
DISTSTYLE EVEN;

And now the table identifier found in stl_load_errors does correspond to the expected table in stv_tbl_perm.

I have also tested the method with the querying of the stl_querytext table, and it works too in my reproduction case, but I don't find it very clean nor effective to parse a string to find the table name inside, so I'll stick with adding DISTSTYLE EVEN in my current cases.

0
votes

Instead of stv_tbl_perm use pg_class. You can use the oid (database object id) columns in the Postgres catalog tables as joining columns

select c.relname table_name, s.* 
  from stl_load_errors s, pg_class c 
 where c.oid = s.tbl
   and c.relname = '<your table name>'

For more info

https://docs.aws.amazon.com/redshift/latest/dg/c_join_PG.html

Please let me know if this solves your issue.