Postgres restore/update with WAL on cloned VM instead of using basebackup

Question

Enviroment: 800GB Postgres-Database (OpenSuse)

Normal Restore-Process:

You have pg_basebackup to restore (from let's say: every Saturday)
You have WAL files from last Saturday to today
First: Restore with pg_basebackup
Then: Update database with WAL-files to have newest data. (with recovery.conf)

My Idea:
Why do every week big pg_basebackup and copy 800GB over Internet to NAS, when you have everyday incremental Backups with some backup-software.

Restore complete database-vm (stand yesterday)
Add WAL-files (restore) to bring this vm-clone up to date.

Now i've done:

I restored a vm
create recovery.conf

restore_command = 'cp /.../%f %p'
rcpostgresql start

I get following errors:

2017-05-09 16:46:07.780 CEST [2938]: [1-1] user=,db=,app=,client= LOG:  database system was shut down at 2017-05-09 16:45:47 CEST
2017-05-09 16:46:07.780 CEST [2938]: [2-1] user=,db=,app=,client= LOG:  starting archive recovery
2017-05-09 16:46:08.588 CEST [2952]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=[local] LOG:  connection received: host=[local]
2017-05-09 16:46:08.588 CEST [2952]: [2-1] user=postgres,db=postgres,app=[unknown],client=[local] FATAL:  the database system is starting up
2017-05-09 16:46:09.391 CEST [2938]: [3-1] user=,db=,app=,client= LOG:  restored log file "000000010000070D0000008A" from archive
2017-05-09 16:46:09.434 CEST [2938]: [4-1] user=,db=,app=,client= LOG:  contrecord is requested by 70D/8A000028
2017-05-09 16:46:09.434 CEST [2938]: [5-1] user=,db=,app=,client= LOG:  invalid primary checkpoint record
2017-05-09 16:46:09.434 CEST [2938]: [6-1] user=,db=,app=,client= LOG:  invalid secondary checkpoint link in control file
2017-05-09 16:46:09.434 CEST [2938]: [7-1] user=,db=,app=,client= PANIC:  could not locate a valid checkpoint record
2017-05-09 16:46:09.434 CEST [2936]: [4-1] user=,db=,app=,client= LOG:  startup process (PID 2938) was terminated by signal 6: Aborted
2017-05-09 16:46:09.434 CEST [2936]: [5-1] user=,db=,app=,client= LOG:  aborting startup due to startup process failure

After pg_resetxlog the next WAL-file was restored. and i get same error (with next wal-file-name)

Is there any way to get this working?

So long as you are calling pg_start_backup() and pg_stop_backup(), your incremental backup covers all database files and you have all the WAL files from pg_start_backup onwards this should work. — Richard Huxton

Vao Tsun Vao Tsun · Accepted Answer · 2017-05-09T20:36:58

From your error I assume you skipped pg_start_backup. Otherwise you should have missing checkpoint:

pg_start_backup accepts an arbitrary user-defined label for the backup. (Typically this would be the name under which the backup dump file will be stored.) When used in exclusive mode, the function writes a backup label file (backup_label) and, if there are any links in the pg_tblspc/ directory, a tablespace map file (tablespace_map) into the database cluster's data directory, performs a checkpoint, and then returns the backup's starting transaction log location as text.

Following the logic the sequence should be this:

backup:
1. the day before - just before VM copy, run select pg_start_backup('some label') (make sure it returned location - it can take long time to make a savepoint, or force fast creation at IO spike price)
2. VM backup
3. select pg_stop_backup()
restore:
1. I restored a vm
2. create recovery.conf with restore_command = 'cp /.../%f %p'
3. rcpostgresql start
4. let people know if it worked

Also yo might want to read about the pg_control, chechpoints and recover sequence here.

Postgres restore/update with WAL on cloned VM instead of using basebackup

2 Answers