Load CSV into Redshift, injecting additional information

Question

I have some exported_data.csv, and I have some corresponding Redshift table, and I have a COPY:

COPY(
  c1,
  c2,
  c3,
  ...
) from "s3://bucket/exported_data.csv"

What I want to do is add a column source_file to the table, and populate it with arbitrary information defined in the COPY statement (or some other set of statements); in this example, the name of the file the row came from.

I can't use UPDATE due to how much data I'm interacting with (takes just plain too long).

It doesn't look like I can just use DEFAULT, since I need to be able to change the information when the name of the source file is different, but Redshift doesn't let you change the default value of a column. (Otherwise, I'd just change the default val for the column, do the copy, change the default val, do the copy from the next CSV file...)

I'm exploring "staging tables", but figured I'd ask here if someone has a ready-made answer while I explore.

I would suggest using a staging table as well. Redshift doesn't allow to add columns in the copy statements. So you need to copy the csv file as is to a staging table, and then insert from there to your final table with the additional column. — Jon Ekiz
@JonEkiz - you are right. You should post this as an answer. — ketan vijayvargiya

Rahul Gupta Rahul Gupta · Accepted Answer · 2017-03-30T15:03:15

This is not possible. But I can suggest you 2 workaround.

Pre-Process: You would have to add that extra column with data before pushing the data into redshift which sounds like a difficult solution for your business case.

Post-Process: Staging table is the best way to do this. Push your data into redshift staging table - Process data - then push into redshift final table.
Note - Running update on redshift gets expensive as data grows, try to avoid as much as possible or else you will end up vacumming a lot which uses a lot of redshift processing resources.

If you want all your rows added in a single COPY command to have the same value of source_file, then you may COPY data into staging table, then add source_file column into that table, then insert all data from the staging table into final table like:

CREATE TABLE destination_staging (LIKE destination);
ALTER TABLE destination_staging DROP COLUMN source_file ;
COPY destination_staging FROM 's3://data/destination/(...)' (...)
ALTER TABLE destination_staging ADD COLUM source_file VARCHAR(99);
INSERT INTO destination SELECT * FROM destination_staging;
DROP TABLE destination_staging;

If I’ve made a bad assumption please comment and I’ll refocus my answer.

Load CSV into Redshift, injecting additional information

2 Answers