5
votes

We're in the process of running a handful of hourly scripts on our Redshift cluster which build summary tables for data consumers. After assembling a staging table, the script then runs a transaction which deletes the existing table and replaces it with the staging table, as such:

BEGIN;

DROP TABLE IF EXISTS public.data_facts;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;

COMMIT;

The problem with this operation is that long-running analysis queries will place an AccessShareLock on public.data_facts, preventing it from being dropped and thrashing our ETL cycle. I'm thinking a better solution would be one which renames the existing table, as such:

ALTER TABLE public.data_facts RENAME TO data_facts_old;
ALTER TABLE public.data_facts_stage RENAME TO data_facts;
DROP TABLE public.data_facts_old;

However, this approach presupposes that 1) public.data_facts exists, and 2) public.data_facts_old does not exist.

Do you know if there's a way to conduct this operation safely in SQL, without relying on application logic? (eg. something like ALTER TABLE IF EXISTS).

2
doesn't AccessShareLock prevent renaming the table as well?AlexYes
Ugh, yes, that appears to be the case.Ross W.
do you have rewrite the table completely or the historical part of your table is stable?AlexYes
As it stands, the current process involves rewriting the table.Ross W.
Another down side on this approach is that views seem to stick to the table as it gets renamed, and then you can't drop it. (Look at views without schema binding it you're in this space). Personally, I haven't found anything better than the OP's approach + without schema binding. Still hoping...dsz

2 Answers

2
votes

I haven't tried it but looking at the documentation of CREATE VIEW it seems that this can be done with late-binding views.

The main idea would be a view public.data_facts that users interact with. Behind the scenes, you can load new data and then swap the view to “point” to the new table.

Bootstrap

-- load data into public.data_facts_v0
CREATE VIEW public.data_facts AS
SELECT * from public.data_facts_v0 WITH NO SCHEMA BINDING;

Update

-- load data into public.data_facts_v1
CREATE OR REPLACE VIEW public.data_facts AS
SELECT * from public.data_facts_v1 WITH NO SCHEMA BINDING;
DROP TABLE public.data_facts_v0;

The WITH NO SCHEMA BINDING means the view will be late-binding. “A late-binding view doesn't check the underlying database objects, such as tables and other views, until the view is queried.” This means the update can even introduce a table with renamed columns or a completely new structure.

Notes:

  1. It might be a good idea to wrap the swap operations into a transaction to make sure we don't drop the previous table if the VIEW swap failed.
1
votes

You can add a new load time timestamp encode runlength default getdate() column to your target table, and make your ETL do this:

INSERT INTO public.data_facts
SELECT * FROM public.data_facts_staging;
DELETE FROM public.data_facts
WHERE load_time<(select max(load_time) from public.data_facts);
DROP TABLE public.data_facts_staging;

note: public.data_facts_staging should have exactly the same structure as public.data_facts except that the last column of public.data_facts is load_time, so that on insert it will be populated with the current timestamp.

The only implication is that it would require extra disk space for a moment between you insert new rows and delete the old rows, and load_time has to be always the last column. Also you have to vaccum table every time you do this.

Another good thing about this is that if your ETL fails and staging table is empty or there is no staging table you won't lose your data. In the pure SQL scenario of swapping tables with DDL you're not protected from dropping the target table when staging table is missing. In the suggested scenario if no new rows are inserted the delete statement deletes nothing (there are no rows less than max load time), so worst case is just having the old version of data.

p.s. there is a command that instead of insert ... select ... just changes the pointer from staging to target table (alter table ... append from ...) but it requires the same type of lock as alter table I guess, so I don't suggest this