2
votes

I’m trying to set up daily backups (using Persistent Disk snapshots) for a PostgreSQL instance running on Google Compute Engine, with its data directory on a Persistent Disk.

Now, according to the Persistent Disk Backups blog post, I should:

  • stop my application (PostgreSQL)
  • fsfreeze my file system to prevent further modifications and flush pending blocks to disk
  • take a Persistent Disk snapshot
  • unfreeze my filesystem
  • start my application (PostgreSQL)
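For concreteness, the five steps above could be scripted roughly as follows. This is only a sketch under assumptions: the service name `postgresql`, the disk, zone, mount point and snapshot names are placeholders I made up, and the mount point is assumed to be the data disk only (not a filesystem that `gcloud` itself needs to write to while it is frozen).

```python
import subprocess


def snapshot_plan(disk, zone, mount_point, snapshot_name, service="postgresql"):
    """Return the five steps as argv lists, in the order they should run.

    All arguments are placeholders; the systemd unit name is an assumption.
    """
    return [
        ["systemctl", "stop", service],
        ["fsfreeze", "--freeze", mount_point],
        ["gcloud", "compute", "disks", "snapshot", disk,
         f"--snapshot-names={snapshot_name}", f"--zone={zone}"],
        ["fsfreeze", "--unfreeze", mount_point],
        ["systemctl", "start", service],
    ]


def run_plan(plan, runner=subprocess.check_call):
    """Run the steps, undoing the freeze and the stop even if a later step fails."""
    stop, freeze, snapshot, unfreeze, start = plan
    runner(stop)
    try:
        runner(freeze)
        try:
            runner(snapshot)
        finally:
            # Never leave the filesystem frozen, even if the snapshot failed.
            runner(unfreeze)
    finally:
        # Always bring the database back up once it has been stopped.
        runner(start)
```

Something like `run_plan(snapshot_plan("pg-data", "us-central1-a", "/mnt/pgdata", "pg-data-daily"))` would then run the whole sequence as root. The nested `try`/`finally` is there so a failed snapshot does not turn into extended downtime.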

This obviously brings with it some downtime (each of the steps took from seconds to minutes in my tests) that I’d like to avoid or at least minimize.

The blog post labels these steps as necessary to ensure the snapshot is consistent (I’m assuming at the filesystem level). But I don’t need a clean filesystem; I need to be able to restore all the data that’s in my PostgreSQL instance from such a snapshot.

PostgreSQL calls fsync when committing, so all data that PostgreSQL acknowledges as committed has already reached the disk.
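To illustrate what I mean by "acknowledged only after fsync", here is a toy stand-in for a commit log. It is not PostgreSQL's actual code; the function name and file name are invented for the example.

```python
import os
import tempfile


def durable_append(path, record):
    """Append a record and return only after fsync has completed.

    A caller that treats the return as the "commit acknowledged" point
    never acknowledges data that is still sitting only in memory buffers.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the storage device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "toy-wal.log")
        durable_append(log, b"COMMIT 1\n")
        with open(log, "rb") as f:
            print(f.read())
```

Whether a snapshot (or a power cut) taken right after the function returns contains the record then depends only on the storage honouring fsync, which is the assumption my question rests on.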

For the purpose of this discussion, I think it makes sense to compare a Persistent Disk snapshot taken without stopping PostgreSQL and without fsfreeze to a filesystem on a disk that has just experienced an unexpected power outage.

After reading https://wiki.postgresql.org/wiki/Corruption and http://www.postgresql.org/docs/current/static/wal-reliability.html, my understanding is that all committed data should survive an unexpected power outage.

My questions are:

  1. Is my comparison with an unexpected power outage accurate or am I missing anything?

  2. Can I take snapshots without stopping PostgreSQL and without using fsfreeze or am I missing some side-effect?

  3. If the answer to the above is that I shouldn’t just take a snapshot, would it be idiomatic to create another Persistent Disk, periodically use pg_dumpall(1) to dump the entire database and then snapshot that other Persistent Disk?

2 Answers

2
votes

1) Yes, though taking a snapshot should be even safer than a power outage. The fsfreeze step is really there to be 100% safe (anecdotally, I never use fsfreeze on my PDs and have not run into issues).

2) Yes, but there is no 100% guarantee that it will always work. The paranoid solution: take a snapshot, spin up a temporary VM from that snapshot, check that the disk is OK, and delete the VM. This can be automated.
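A rough sketch of automating that check, assuming `gcloud` is installed and authenticated. The resource naming scheme and the `check_command` are placeholders of mine; what counts as "the disk is OK" is left to you (for example, mounting the disk and starting PostgreSQL against the data directory so it replays its WAL).

```python
import subprocess


def verification_plan(snapshot, zone, check_command, suffix="verify"):
    """Build the gcloud commands for a throwaway disk and VM (names are made up)."""
    disk = f"{snapshot}-{suffix}-disk"
    vm = f"{snapshot}-{suffix}-vm"
    create = [
        ["gcloud", "compute", "disks", "create", disk,
         f"--source-snapshot={snapshot}", f"--zone={zone}"],
        ["gcloud", "compute", "instances", "create", vm,
         f"--zone={zone}", f"--disk=name={disk}"],
    ]
    check = ["gcloud", "compute", "ssh", vm, f"--zone={zone}",
             f"--command={check_command}"]
    cleanup = [
        ["gcloud", "compute", "instances", "delete", vm, f"--zone={zone}", "--quiet"],
        ["gcloud", "compute", "disks", "delete", disk, f"--zone={zone}", "--quiet"],
    ]
    return create, check, cleanup


def verify_snapshot(snapshot, zone, check_command, runner=subprocess.check_call):
    """Restore the snapshot, run the check, and always attempt cleanup.

    Raises subprocess.CalledProcessError if creation or the check fails.
    """
    create, check, cleanup = verification_plan(snapshot, zone, check_command)
    try:
        for cmd in create:
            runner(cmd)
        runner(check)
    finally:
        for cmd in cleanup:
            try:
                runner(cmd)
            except subprocess.CalledProcessError:
                # Best effort: the resource may never have been created.
                pass
```

A failing check surfaces as an exception after the temporary VM and disk have been removed, so a cron job wrapping `verify_snapshot(...)` can alert on a non-zero exit.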

3) No, I would not recommend this over snapshots. It will take a lot more time, might degrade your DB performance, and what happens if something goes wrong in the middle of a dump? Also, keeping full PDs around is an expensive way to do incremental backups. Snapshots are differential: you pay for the whole disk only for the first snapshot, and after that only for the changes.

Possible recommendation:

Do #3, but then create a snapshot of the new PD and then delete the PD.
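That recommendation could look roughly like this, assuming the extra PD has already been created, attached and mounted. Instance, disk, zone and path names are placeholders, and the unmount and detach steps are my addition, on the assumption that a disk still attached to a VM cannot be deleted.

```python
import subprocess


def dump_and_snapshot_plan(instance, dump_disk, zone, mount_point, snapshot_name):
    """Dump to the extra disk, snapshot it, then delete it (all names hypothetical)."""
    dump_path = f"{mount_point}/dumpall.sql"
    return [
        ["pg_dumpall", "-f", dump_path],
        ["umount", mount_point],  # also flushes the dump to the disk
        ["gcloud", "compute", "instances", "detach-disk", instance,
         f"--disk={dump_disk}", f"--zone={zone}"],
        ["gcloud", "compute", "disks", "snapshot", dump_disk,
         f"--snapshot-names={snapshot_name}", f"--zone={zone}"],
        ["gcloud", "compute", "disks", "delete", dump_disk,
         f"--zone={zone}", "--quiet"],
    ]


def run_in_order(plan, runner=subprocess.check_call):
    """Stop at the first failure so the disk is never deleted without a snapshot."""
    for cmd in plan:
        runner(cmd)
```

Because `subprocess.check_call` raises on a non-zero exit, a failed dump or snapshot leaves the PD in place for inspection instead of deleting the only copy.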

0
votes

https://cloud.google.com/compute/docs/disks/persistent-disks#creating_snapshots has recently been updated and now includes this new paragraph:

If you skip this step, only data which was successfully flushed to disk by the application will be included in the snapshot. The application experiences this scenario as if it was a sudden power outage.

So the answers to my original questions are:

  1. Yes
  2. Yes
  3. N/A, since the answer to question 2 is Yes.