Posted by rah003
on May 7, 2009 at 6:51 AM PDT
Can there ever be too many backups? I don't think so. On the other hand, I have often seen people underestimate testing their backup solutions and restore procedures, discovering issues only in the middle of a crisis, when a restore of a previously made backup is desperately (and quickly) needed.
Just a few examples:
- The backup seemingly runs, but the backup media is corrupted in a way that makes reading the backed-up data impossible. (No, I'm not making this one up; I've actually seen it happen. For nearly six months everybody was happy that a database backup was running, and they discovered the issue only when the backup could not be recovered from the tape.)
- The backup is made of corrupted data. The typical scenario here is an automated backup that keeps a history of a few days and automatically replaces old data snapshots with new ones. If the issue is not in plain sight and is discovered too late, chances are all the existing backups already contain the corrupted data. (And yes, I saw this one happen too, and more than once.)
- The backup is made periodically and runs without any issues, but no one ever tried to do a restore, so there is no restore procedure in place for when the time comes to actually restore the data; in the worst case, some element of the given configuration is missing or not functional, which makes it impossible to perform the restore. (In one case I saw, the backed-up data didn't include binaries stored outside of the database, so all the metadata and the relations between binaries were backed up, but not the actual content of the files.) And even if the backup is alright, this is a dangerous scenario. When you are restoring a backup, it usually means something went wrong, and everybody is, if not nervous, then at the very least tense and prone to oversights and mistakes. In such a case, a tested procedure with clearly outlined steps describing what to do, when, and how is worth all the money.
- Surely there are more scenarios you can think of. My general thinking about backup & restore procedures goes along the lines of "If something can go wrong, it will, and you'd better be ready for it."
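The common thread in all of these scenarios is that a backup only counts once you have actually restored it. Here is a minimal sketch of an automated restore drill; the function names and the archive-based "backup" are my own illustration, not taken from any particular backup product. The idea: back up a directory tree, restore the archive to a scratch location, and verify byte-for-byte that the restored data matches the original.

```python
import hashlib
import pathlib
import shutil
import tempfile

def checksums(root: pathlib.Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def restore_drill(source: pathlib.Path) -> bool:
    """Back up `source`, restore it elsewhere, and verify the contents match."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = pathlib.Path(tmp)
        # "Backup": archive the source tree.
        archive = shutil.make_archive(str(scratch / "backup"), "gztar", str(source))
        # "Restore": unpack to a fresh location, just as a real crisis would require.
        restored = scratch / "restored"
        restored.mkdir()
        shutil.unpack_archive(archive, str(restored))
        # The drill: restored data must be identical to the original.
        return checksums(source) == checksums(restored)
```

Run something like this on a schedule and alert when it returns False; a corrupted tape or a missing piece of the configuration then shows up as a failed drill, not as a failed recovery.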
Why did I bring this topic up at all? Recently I was party to one such recovery. The public web site was running, but it was corrupted: one subtree of the site could not be updated any more due to data corruption. There were multiple public instances, but all of them suffered from the same problem. The last good backup was way too old (a couple of weeks is way too much for a site updated on a daily basis).
The only good thing was that the authoring instance was still intact and working. Piece of cake, you might think: just set up a new clean public instance, republish everything, and you are done. Yeah, you might think that, but ... there were several "but"s in this case. To name just a few:
- The existing configuration, while only partially updatable, had to be kept working, running, and updated until the replacement was ready.
- The ongoing editorial process and automated data publishing made it impossible to disconnect the existing public instances while the new public instance was being created. Content had to keep being pushed to the still-working majority of the site.
- The site was moderately big: a couple of tens of thousands of assets (pages, images, proprietary data, etc.), and republishing all of that takes some time.
- And, most importantly, a lot of pages had already been edited after the last publishing but had not yet finished another editorial & approval round, so they were not ready to be pushed to the new public instance.
What was needed was a way to create a new public instance, as close to the existing ones as possible, while pulling the data from the author instance. Fortunately for all involved, Magnolia is configured by default to version existing content upon publishing it, so we could use the last versions of all previously activated content to recreate the public instance. This, together with some magic to avoid changing the activation status of the content in the author instance (as the normal publishing process would), gave birth to the synchronization module. Which, as you see, is just another way of restoring your public site in Magnolia, using the authoring instance as a living source of backup data. There are limitations to this approach: if you switch off the versioning feature on the author instance, the module can synchronize only unmodified activated content. And if you deleted a piece of content, it can't be synchronized since it is completely gone (but this is fine, since deletions are always synced to the public instances immediately).
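To make the selection rule concrete, here is a purely illustrative sketch; the node model and field names are invented for this post and are not Magnolia's actual API. The point it shows: for content that was activated and then edited, the module pushes the last activated version rather than the current working copy, and without versioning it can only fall back to unmodified activated content.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """Toy model of a content node on the author instance (invented, not Magnolia's API)."""
    path: str
    current: str                    # working copy on the author instance
    activated: bool = False         # was this node ever published?
    versions: list = field(default_factory=list)  # snapshots taken at each activation

def content_to_sync(node: Node) -> Optional[str]:
    """Pick what (if anything) to push to a fresh public instance."""
    if not node.activated:
        return None                 # never published, so not part of the public site
    if node.versions:
        return node.versions[-1]    # last activated version, even if edited since
    # Versioning switched off: we can only trust unmodified activated content.
    return node.current

# Deleted nodes need no handling here: deletions are propagated to the
# public instances immediately, so they are simply absent from the tree.
```

Note how the half-edited pages from the story above fall out naturally: their unfinished working copy never leaves the author instance, only their last approved, activated version does.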
Anyway, the module was written, tested, and successfully used to create new public instances that were already in sync with the existing public instances and could serve as instant replacements. Once all was done, we asked ourselves a question: was it just a one-time job, or can we turn it into something more beneficial for everybody? Are there more use cases than the one we just used it for?
Here's my list of answers. If you are using Magnolia, I would be interested to hear whether you, as a customer of the CMS, see these as useful use cases for yourself, whether you think there are better ways to perform these tasks, or (even better) whether you see some more use cases that I missed.
- The public instance usually sits in the open net environment, making it a potential target for the kind of people who try to break everything. In case someone hacks your public instance in any way, you can use the synchronization module to get back to a reasonable state without resorting to emergency measures like activating everything you have on your author instance, even content that was modified or is not yet ready.
- A hardware or software failure causes you to lose an instance. It might happen that for some reason your backup is not recoverable; this gives you another safety net.
- Due to a sudden increase in popularity (e.g. the Slashdot effect), your public server is under heavy load and you want to add another public instance. Replicating a public instance from one that is already under load is out of the question, as is republishing everything to the existing public instances together with the new one.
- A variation of the above: your public instances are already moderately busy with the current amount of visitors, you are preparing the launch of a new product, and you expect a spike in load and want to be ready for it. Again, it is easier to push from the author instance than from a public one.
- You want to create simple replicas of your public instance to test variations of a new design/theme for your site. Again, using the current public instance for that is inadvisable if it is under moderate load, and in general, since making any mistake there might render your current web site inaccessible. It is much safer to do this from the author instance.
- For technical reasons (a hardware upgrade), one of your public instances was out of business for a while. To keep everything running, you disabled or removed this instance from the list of subscribers, but now, to get it back into your pool of public instances, you need to re-synchronize it to ensure that all the content published to the other public instances in the meantime gets to this one as well. With XA you only ensure that content ends up on all or none of the active subscribers; it does not re-synchronize newly added (or re-enabled) subscribers with all the other instances.
- And of course, the usual ones:
- "because s**t happens" (TM)
- "because we can" (TM)