8 votes

We have a set of tables in Redshift whose columns have the IDENTITY property for sequence generation. During the testing phase we need to take a backup and restore, and this is a repetitive activity for each cycle of testing. We followed the processes below to take a backup and then restore, and faced the following issues:

  1. Traditional way: Created backup tables in another backup schema with CREATE TABLE XYZ_BKP AS SELECT * FROM XYZ. But doing that we lost the IDENTITY and other attributes of the table. So during restore, if you try to create the table directly from the backup you lose those attribute properties, and you can't ALTER the table afterwards to add the IDENTITY constraint.
  2. Traditional way backup with a different restore method: This time we dropped and recreated the table from its DDL first, and then tried to INSERT INTO it from the backup. But it can't insert values into the IDENTITY columns (the commands for points 1 and 2 are sketched after this list).
  3. UNLOAD and COPY: We also tried Redshift utilities such as UNLOAD to take a backup of the table to S3 and then restore using COPY. It worked fine, but then we faced other issues: a. DATE fields with leading zeros didn't get extracted properly in the UNLOAD output. Example: the date '0001-01-01' was extracted as '1-01-01', which then fails during the COPY as not a valid date. Several other errors are thrown during the restore (COPY), such as missing data for NOT NULL fields or invalid values for the INT data type. This means the UNLOAD and COPY commands together don't work in sync and values change.
  4. Table restore from snapshot: I haven't tried this, but I understand AWS now supports table-level restore. Again, it's a tedious job to set this up individually for 500 tables, and you also have to keep and track snapshots for a long time.
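
For reference, a minimal sketch of the commands described in points 1 and 2 (XYZ is the table name from above; the backup schema name and column definitions are only illustrative):

-- Point 1: CTAS backup -- the backup table does not inherit IDENTITY or other column attributes
create table bkp_schema.XYZ_BKP as select * from XYZ;

-- Point 2: drop and recreate the table from its DDL, then try to reload it from the backup.
-- The INSERT fails because values can't be inserted into the IDENTITY column this way.
drop table XYZ;
create table XYZ(
    id integer not null identity(1, 1),
    some_value varchar(255));
insert into XYZ select * from bkp_schema.XYZ_BKP;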

It would be very helpful if you could suggest the best possible way to backup and restore in my scenario, or the best practices organizations follow.


1 Answer

3 votes

I would like to answer here point by point, so it will be a bit long; please excuse me for that ;). In my opinion, the best option is UNLOAD to S3 and COPY back to the table from S3. (Here, S3 could be replaced with EC2.)

  1. Traditional way: we prefer this if we need to do some data alteration and want to dry-run our queries.
  2. Traditional way backup with a different restore method: same issues as #1; we don't use it.
  3. UNLOAD and COPY: This is the most convenient method, and even IDENTITY values can be retained, hence it is always the preferred method.

There are some problems listed in the question, but most of them are incorrect or can be avoided by supplying the proper export/import parameters. I would like to provide all the necessary steps, with data, to prove my point that there are no issues with dates and timestamps during load and unload.

Here I'm covering most of the data types to prove my point.

-- Sample table with an IDENTITY column plus decimal, date, varchar, and timestamp columns
create table sales(
    salesid integer not null identity(1, 1),
    commission decimal(8,2),
    saledate date,
    description varchar(255),
    created_at timestamp default sysdate,
    updated_at timestamp);

Content of the sample file (sales-example.txt); note the fields are pipe-delimited, which is the default delimiter for both COPY and UNLOAD:

salesid,commission,saledate,description,created_at,updated_at
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

COPY command that will import the dates and timestamps, as well as the IDENTITY values (note the EXPLICIT_IDS option):

copy sales(salesid,commission,saledate,description,created_at,updated_at)
from 's3://****/de***/sales-example.txt'
credentials 'aws_access_key_id=************;aws_secret_access_key=***********'
IGNOREHEADER 1
EXPLICIT_IDS;

This will copy 5 records. In the UNLOAD below I'm using PARALLEL OFF to get the data in a single file, just to prove the point; it's not required and in general should be avoided.

unload ('select salesid,commission,saledate,description,created_at,updated_at from sales')
to 's3://assortdw/development/sales-example-2.txt'
credentials 'aws_access_key_id=***********;aws_secret_access_key=***********'
parallel off;

And below is the unloaded content, which is exactly the same as the import, meaning that if I run the COPY command in any other environment, say dev or QA, I will get exactly the same records as in the source Redshift cluster (a sketch of such a COPY follows the listing below).

5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
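
To restore into another environment, a minimal sketch of the corresponding COPY, assuming the target cluster already has the sales table created with the same DDL as above; the bucket path and credentials are the same placeholders as before, and with PARALLEL OFF the unloaded object carries a numeric suffix (e.g. sales-example-2.txt000), which this prefix still matches:

-- No IGNOREHEADER needed: the unloaded file has no header row,
-- and the pipe delimiter is the default for both UNLOAD and COPY.
copy sales(salesid,commission,saledate,description,created_at,updated_at)
from 's3://assortdw/development/sales-example-2.txt'
credentials 'aws_access_key_id=***********;aws_secret_access_key=***********'
EXPLICIT_IDS;
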
  4. Table restore from snapshot: This requires our networking/infrastructure group, hence we avoid it, though I'm less sure about this option. Other experts are most welcome to comment/share details about it.

I hope this answers the question as well as provides a starting point to discuss/summarize/conclude. All are most welcome to add your points.