Putting the 'D' Back into Disaster Recovery Plans
By John Parkinson
Over the past 25 years I must have reviewed more than 200 Business Continuity and Disaster Recovery Plans for IT departments of all sizes and companies in all kinds of businesses. I would guess that about half of them were either fundamentally unworkable, unaligned with the economics of the business or largely unnecessary.
These plans had several things in common, all of which contributed to that judgment:
- There was no business impact analysis (BIA). When I do a DR Plan (and I'm not an expert, just a practitioner of applied common sense) I start with assessing what I want to protect (generally: people, revenue, reputation, assets - in that order). Not all revenue is worth protecting - why would you spend an extra $2 million a year to protect the last $1 million of revenue? But that's not usually how the plans operate. In one extreme case, I pointed out that it would be better to buy business continuity insurance to cover three years of anticipated profits and if the worst happened, cash the check, pay off the owners and go do something else. Economically that made much better sense than attempting to survive anything that might happen. A business impact analysis lets you make these choices in a rational way.
- There's no point in planning to be the only survivor. A lot of plans cover eventualities that would wipe out their suppliers or customers or both.
- Not all disasters have equal impact. There's a world of difference between losing a SAN disk shelf, losing the data center or network or losing an entire building from which everyone accesses their technology. Yet many plans respond essentially the same way to every disaster.
- There is incomplete understanding of the complex dependencies between infrastructure and applications and amongst the applications themselves. The more heterogeneous the environment, the worse this is. And you'd be surprised how much essential information lives on PCs and laptops that isn't replicated and might not be accessible in a major disaster.
- There was no plan to come back, even though there was a presumption that there would be a place to come back to. And coming back is hard - maybe harder than leaving in the first place.
- The DR plan didn't match the current operational environment in critical ways. Production environments change every day in a myriad of ways, but how do you know that these changes are reflected in your DR plan? My experience tells me they too often won't be.
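As a rough illustration of the economic test in the first point above, here is a minimal Python sketch. The `RevenueStream` type, the `worth_protecting` rule and the "core order processing" figures are hypothetical; only the $1 million of revenue against $2 million a year of protection comes from the example in the text. The rule deliberately ignores the likelihood of a disaster and the non-revenue priorities (people, reputation, assets), which a real BIA would have to weigh.

```python
from dataclasses import dataclass


@dataclass
class RevenueStream:
    name: str
    annual_revenue: float          # revenue at risk per year if left unprotected
    annual_protection_cost: float  # yearly cost of DR coverage for this stream


def worth_protecting(stream: RevenueStream) -> bool:
    """Assumed rule: protect only if revenue at risk exceeds the yearly protection cost."""
    return stream.annual_revenue > stream.annual_protection_cost


def split_by_economics(streams):
    """Separate streams worth protecting from those better left to insurance or accepted risk."""
    protect = [s.name for s in streams if worth_protecting(s)]
    accept_risk = [s.name for s in streams if not worth_protecting(s)]
    return protect, accept_risk


streams = [
    # Hypothetical figures, for illustration only.
    RevenueStream("core order processing", 40_000_000, 1_500_000),
    # The article's example: $2 million a year to protect the last $1 million.
    RevenueStream("last $1M of revenue", 1_000_000, 2_000_000),
]

protect, accept_risk = split_by_economics(streams)
print("protect:", protect)
print("accept or insure:", accept_risk)
```

Even a crude comparison like this makes the choice explicit, where many plans simply protect everything by default.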
I could go on. Fundamentally, however, these were IT asset protection plans - ways for IT to survive even if the business didn't, which clearly makes no sense. And, in many cases, there was no way to actually test the plan in total - too complex, too expensive, too disruptive to everyone's day job.
Yet most real disasters are much less well-structured than a test - so if you can't make the test work when you can plan for it in advance and stage everything just right, what chance will you have if the big one hits?
One way to get a workable DR plan (there are several options) is to do some up-front scenario analysis after the BIA is done and build up a set of layered responses to incidents of increasing severity. For the least serious impacts you can engineer high availability solutions - essentially disaster avoidance strategies. For disasters you can't avoid, you can build routine operational processes (things like rolling cluster upgrades, managed application failover, deliberate load shifting) that let you practice for a real problem, so your people are familiar with most of the work they'll need to do in a disaster. That will also exercise most of the technologies you'll need and ensure they're working reliably - and that the disaster won't be their first use.
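The idea of layered responses can be sketched as a simple lookup from incident severity to response. This is only one way to model it: the `Severity` levels borrow the examples given earlier (a SAN disk shelf, the data center or network, an entire building), while the response descriptions and the assumption that each higher tier builds on the ones below it are illustrative choices, not a prescribed structure.

```python
from enum import IntEnum


class Severity(IntEnum):
    COMPONENT = 1  # e.g. losing a SAN disk shelf
    FACILITY = 2   # e.g. losing the data center or network
    SITE = 3       # e.g. losing the building everyone works from


# Hypothetical response tiers, ordered from least to most severe.
RESPONSES = {
    Severity.COMPONENT: "high availability: redundant components absorb the fault",
    Severity.FACILITY: "managed application failover and deliberate load shifting",
    Severity.SITE: "full DR plan: relocate people and restore services elsewhere",
}


def layered_response(severity: Severity) -> list:
    """Return every response layer up to and including the given severity.

    Assumption: layers are cumulative, so a site loss also relies on the
    failover and high availability layers beneath it.
    """
    return [RESPONSES[level] for level in Severity if level <= severity]


for step in layered_response(Severity.FACILITY):
    print(step)
```

Because the lower tiers are exercised as routine operations, most of what the top tier depends on has already been practiced before a real disaster calls for it.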
One final thought: Few companies exist in isolation. They'll have connections to suppliers and service providers, maybe to customers. In a "my-DR"-centric plan, you'll have ways to maintain these connections.
Now, hold up the mirror: You're in everyone else's DR plans too (or you should be). So, if several of your partners are caught up in the same disaster, can your DR plan work with theirs? Can your DR site connect to their DR sites? How do you test that scenario?
And let's not forget that, first and foremost, we have to protect our people.
About the Author
John Parkinson is the head of the Global Program Management Office at AXIS Capital. He has been a technology executive, strategist, consultant and author for 25 years.