Some people may plant in your head the idea that taking snapshots frees you from the burden of doing proper backups. While this may sound good in theory, in practice there are a bunch of caveats. Certain technologies do use snapshots at their core, but they also make sure that your data isn’t corrupted. Some even provide access to individual file revisions.
Data corruption is precisely the thing snapshots don’t care about, at least in Amazon’s way of doing things. For EC2 this isn’t exactly Amazon’s fault. EBS stands for Elastic Block Store: they provide you with block storage, and you do whatever you want with it. For RDS they should do a better job, though, since it’s a managed service where you don’t have access to the underlying instance. The issue is those ‘specialists’ who push the ‘easy, cloud-ish way’ of doing backups with snapshots. If you’re new to the ‘cloud’ stuff, as I used to be, you may actually believe that crap. I did.
A couple of real-life examples:
- An EBS-backed instance suffered some filesystem-level corruption. Since EXT3 is not as smart as ZFS when it comes to silent data corruption, you may never know until it’s too late, and every snapshot taken in the meantime faithfully preserves the damage. Going back through snapshot revisions to find the last good copy of the data is a pain. I managed to fix the filesystem corruption and retrieve the lost data, but it took a lot of work. Luck is an important skill, but I’d rather not put all my eggs into the luck basket.
- An RDS instance ran out of space. There was no notification to tell me: ‘yo dumbass, ya ran out of space’. Going by the usage statistics it shouldn’t have happened, but a huge data import proved me wrong. I increased the allocated storage. Problem solved. A day later, somebody accidentally dropped a couple of tables, and I had to restore them. How? Take the latest snapshot, spin up a new instance from it, dig through the data. The latest snapshot contained a couple of databases corrupted by the space issue, one of them being the database I needed to restore from. I had to spend a bunch of time repairing that database before I could even start the restore. Fortunately nothing really bad happened. But it was a signal that the RDS snapshot methodology is broken by design.
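The missing ‘ya ran out of space’ notification from the second example is easy to approximate yourself. Below is a minimal sketch of the threshold check only. The function name and the 10% default are made up for illustration; for RDS the free-space figure would have to come from somewhere like the CloudWatch `FreeStorageSpace` metric, and that wiring, along with the actual alerting, is left out.

```python
def storage_alert(free_bytes, allocated_bytes, min_free_ratio=0.10):
    """Return a warning string when free space is below the threshold, else None.

    Hypothetical helper: the name and the 10% default are assumptions,
    not anything provided by AWS.
    """
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    ratio = free_bytes / allocated_bytes
    if ratio < min_free_ratio:
        return f"low storage: {ratio:.1%} free (threshold {min_free_ratio:.0%})"
    return None


if __name__ == "__main__":
    gib = 1024 ** 3
    # 3 GiB free out of 100 GiB allocated is below the 10% threshold.
    print(storage_alert(3 * gib, 100 * gib))
```

Running a check like this before and during a big import would at least have turned a silent failure into a loud one.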
Lesson learned. The way I do backups now puts the data, not the block storage, first. If EBS snapshots are your sole backup method, you may want to rethink your strategy.
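To make ‘data first’ a bit more concrete, here is a rough sketch of one piece of such a setup, not a description of my exact tooling. It assumes you already have a logical dump file on disk (produced by something like `mysqldump` or `pg_dump`, depending on your engine), compresses it, then reads the archive back and compares checksums before calling it a backup. The function names and paths are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large dumps don't need to fit in RAM."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(dump_path, backup_dir):
    """Compress a dump into backup_dir and verify the archive reads back intact.

    Returns (archive_path, sha256_of_original). Raises OSError on mismatch.
    """
    dump_path = Path(dump_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (dump_path.name + ".gz")

    # Write the compressed copy.
    with open(dump_path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Read it back through gzip and compare against the original.
    with open(dump_path, "rb") as src:
        expected = sha256_of(src)
    with gzip.open(target, "rb") as restored:
        actual = sha256_of(restored)

    if actual != expected:
        target.unlink()
        raise OSError(f"backup verification failed for {target}")
    return target, expected
```

Note that this only proves the archive matches the dump. It says nothing about whether the dump itself is sane, so a periodic test restore into a scratch database is still the real check.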