The following is a cautionary tale – please learn from it.
Last Friday was a fun day. In the late morning a router upstream from our server went out. When it was restored, all of our sites came back online except the server we used for the Crowd Favorite web site and email relay. The hard drive in that server had failed.
The server was pretty old (I’d been using it for at least 5 years); we had been planning to migrate off of it for a while but just hadn’t gotten around to it. Obviously I wish I’d made this a higher priority and done the move earlier.
So we lost the hard drive; it was completely dead. The data might be recoverable, but recovering it would take a lot of time. Here’s the kicker: we didn’t have a complete, recent backup. More on that in a minute.
The immediate problems were getting our email working again and getting the web site back up and running. We immediately moved our mail to Google Apps (something else that had been on the list for a while) and set up the web site on another web server. Mail was back up and running within an hour or so; the web site took about 4 hours (including DNS migration).
Those 4 hours didn’t amount to a full recovery; more time went into that this weekend. Our backups were outdated, but almost all of the site components were versioned in SVN, so very little, if anything, was lost from the actual files. What we did lose was data from some of the databases. That data was backed up, but not to a separate location. As you may suspect, having database backups on the same hard drive doesn’t help much when the hard drive dies. We were covered in the event of data corruption, but not for total hard drive failure.
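The “same hard drive” trap is easy to check for mechanically. Below is a minimal Python sketch, not something we actually ran on that server; the function names are hypothetical. It compares the device ID (`st_dev` from `os.stat`) of the data directory and the backup directory. Matching IDs mean the backup lives on the same filesystem as the data and will die with it. Different IDs are not proof of safety, though: two partitions on one physical disk report different IDs, so only a copy on another machine covers total drive failure.

```python
import os


def on_same_device(data_path: str, backup_path: str) -> bool:
    """Return True if both paths report the same filesystem device ID.

    Hypothetical helper. A True result means the backup shares a
    filesystem (and therefore a drive) with the data it protects.
    A False result does NOT prove the backup is safe: separate
    partitions on the same physical disk have different st_dev values.
    """
    return os.stat(data_path).st_dev == os.stat(backup_path).st_dev


def check_backup_location(data_path: str, backup_path: str) -> str:
    # Hypothetical helper: turn the device comparison into a readable message.
    if on_same_device(data_path, backup_path):
        return "WARNING: backup is on the same device as the data"
    return "backup is on a different device (still confirm it is off-machine)"
```

For example, `check_backup_location("/var/lib/mysql", "/var/backups/mysql")` (hypothetical paths) would return the warning if the database dumps sit on the same filesystem as the live database files.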
I’ve had data loss before. It sucks. I’m pretty paranoid about backups now as a result. While I’m annoyed that a failure happened in one of the only places that would cause real pain, it’s also a cold slap in the face – in a good way.
The irony, of course, is that Crowd Favorite runs a backup service that would have avoided all of this pain: BackupMoxie.
Not only would BackupMoxie have ensured that we had recent backups, it would also have ensured that we had backups of things properly configured on the server for easy restoration. Time to recovery is important, and it is a big part of why we built BackupMoxie the way we did. It’s taken a good bit of time to get all the parts of the site back in place.[1]
So yes, BackupMoxie would have saved me a good number of hours this weekend, along with some data. And I have no excuses.
[1] This is also one of the reasons we delayed moving off and decommissioning the server in the first place: the new server is set up differently and required reconfiguring a bunch of file paths, etc.