
I am trying to build a backup system for some important data in my AWS S3 bucket. One of the options I explored is versioning, which allows individual objects to be recovered to an earlier state. That would definitely help in the case of accidental deletions.

The problem is that versioning alone is not enough when data corruption is introduced by buggy code or something similar: to restore the system to an earlier state, a proper snapshot-based backup solution is needed in addition to versioning. That would also cover the case where the whole bucket is deleted accidentally, or where versioning is turned off and some data is deleted afterwards.

The option I am currently considering is to use an EC2 instance to copy the data daily, or at predefined intervals, to a local drive (using aws s3 sync or aws s3 cp) and then upload it under that day's folder in another S3 bucket, with a lifecycle rule to expire the backups after, say, a week. I don't think this is very efficient though, because the buckets could hold around 100 GB of data later as traffic into the application grows.
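Roughly, the daily job I have in mind would look something like the sketch below (the bucket names, staging path, and schedule are placeholders):

```bash
#!/usr/bin/env bash
# Daily backup job run on the EC2 instance (e.g. via cron).
# Bucket names and paths are placeholders for illustration.
set -euo pipefail

SOURCE_BUCKET="s3://my-app-data"            # bucket to back up
BACKUP_BUCKET="s3://my-app-data-backups"    # separate backup bucket
STAGING_DIR="/mnt/backup-staging"
TODAY="$(date +%F)"                         # e.g. 2024-05-01

# 1. Pull the current state of the source bucket to the local drive.
mkdir -p "${STAGING_DIR}/${TODAY}"
aws s3 sync "${SOURCE_BUCKET}" "${STAGING_DIR}/${TODAY}"

# 2. Push it into a per-day folder in the backup bucket.
aws s3 sync "${STAGING_DIR}/${TODAY}" "${BACKUP_BUCKET}/${TODAY}/"

# 3. Clean up the local copy; old day folders in the backup bucket
#    would be expired by the lifecycle rule, not by this script.
rm -rf "${STAGING_DIR}/${TODAY}"
```

(As far as I can tell, aws s3 sync can also copy directly from one bucket to another, which would avoid the local staging step, but the overall approach is the same.)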

I would like some validation from someone who has done something similar: is this the right way to proceed, or is there an S3 or AWS feature that would make this simpler?

You can use S3 replication to replicate objects into another bucket, either within the same region or cross-region. - jellycsc
With replication, I think data corruption would also get replicated in real time. - HariJustForFun
I think your problem is more about detecting corruption than recovering from it. The earlier you can detect corruption, the faster and simpler the recovery will be. Try to implement a quick way to detect when the data is corrupted; then, using S3 versioning, you can go back to an earlier, stable version. Are you expecting to have the same files over time in the bucket? If not, versioning may not be the best fit: you keep versions for things that change over time, not for things that get created and removed on a time basis. - Perimosh
I don't think detecting corruption is that straightforward; it's hard to identify all possible use cases and implement checks for them. It could be just an untested scenario in the code that messes up the data. But even so, a defensive strategy for critical data is good to have, albeit expensive, in case the need arises. - HariJustForFun

1 Answer


Traditionally, backups are used in case a storage device is corrupted. However, Amazon S3 replicates data automatically to multiple storage devices, so this takes care of durability.

For data corruption (e.g. an application destroys the contents of a file), Versioning is the best option, because S3 retains previous versions of an object whenever it is updated (overwritten). Object Lifecycle Management can be used to delete old versions after a certain number of newer versions accumulate or after a given period of time.
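For example, a lifecycle rule along these lines would expire noncurrent (previous) versions 30 days after they are superseded; the bucket name and retention period here are only illustrative:

```bash
# Expire noncurrent object versions 30 days after they are superseded.
# Bucket name and retention period are illustrative only.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-app-data \
  --lifecycle-configuration file://lifecycle.json
```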

If you are concerned that versioning might be turned off (suspended) or that a whole bucket might be accidentally deleted, you can use S3 replication to duplicate the contents of the bucket to another bucket. The other bucket can even be in a different region or a different AWS account, which means nobody in the primary account would have permission to delete data in the secondary (replication) account. This is a common practice to ensure critical business data is not lost.
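As a rough sketch, cross-account replication can be configured with something like the following. The role ARN, account ID, and bucket names are placeholders; versioning must already be enabled on both buckets, and the IAM role must grant S3 the permissions needed for replication:

```bash
# Replicate new objects from the source bucket into a backup bucket
# owned by a second AWS account. All ARNs, account IDs and bucket
# names below are placeholders.
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
  "Rules": [
    {
      "ID": "backup-to-secondary-account",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-app-data-backup",
        "Account": "222222222222",
        "AccessControlTranslation": { "Owner": "Destination" }
      }
    }
  ]
}
EOF

aws s3api put-bucket-replication \
  --bucket my-app-data \
  --replication-configuration file://replication.json
```

Note that replication only applies to objects created after the rule is in place; existing objects would need to be copied across separately (S3 Batch Replication can do this).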

If you want the ability to restore multiple objects to a point in time ("retrieve the system to an earlier state"), you can use traditional backup software that is S3-aware. For example, MSP360 Backup (formerly CloudBerry Backup) can move data between S3 buckets and on-premises storage (or just within S3), with normal point-in-time restore capabilities.