The Disaster Recovery Conundrum
By John Parkinson
If you're like me, you spend a decent amount of time worrying about whether your disaster recovery processes and plans would really work.
Sure, you actually do have plans. Sure, you practice parts of them periodically. But a full "like-it-was-for-real" DR test isn't usually feasible unless you run active/active with rapid reconfiguration/failover capabilities, and most people don't do that.
When you look at the people who do (I know a few) you realize why most organizations don't even try: It's a LOT of work.
Looked at subjectively, DR is an essential component of a rational operational continuity plan for the business. Customers expect you to have this--and might not be willing to do business with you if you don't. It's the kind of thing that the board expects executive management to be on top of. It makes everyone feel better that someone is thinking about the problem.
Almost everyone I know in IT (including me) knows of or has experienced some flavor of "disaster" at least once in their career. So a DR plan seems like a prudent part of business management. An entire industry has grown up around responding to the needs of various DR scenarios. And disasters do happen.
Look at it objectively, however, which requires us to look at the real probability of a disaster that's severe enough to destroy all of our IT capacity (or at least render enough of it inoperable for long enough that the business would be unable to function), and you might get a different picture.
Do some high availability engineering and take a few simple precautions to "harden" your infrastructure and processes, and the range of disasters that can take you out entirely becomes much reduced. That makes you "disruption resistant" rather than disaster proof--because disruptions are much more likely than disasters to actually happen.
Then eliminate the disasters that you wouldn't survive as a business no matter what your IT DR capability was. What's left is usually a pretty short list of very low probability events, which in aggregate aren't worth providing against--at least not at the level of expense that would be required to guarantee you would actually survive.
Now I accept that "Black Swans" do happen. We can all list a few from the past decade. But for many businesses it would be cheaper to buy catastrophe insurance and bank the check if the worst happens than to spend the significant amounts necessary to provide for a potentially problematic recovery capability.
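To make that comparison concrete, here is a minimal sketch in Python of the back-of-the-envelope arithmetic involved. Every name and figure in it (the scenarios, probabilities, losses, DR spend, premium and coverage fractions) is a hypothetical assumption chosen for illustration, not data from any real business, and a real analysis would need your own numbers.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # assumed chance of the event in a given year
    loss: float                # assumed business loss if it hits unprotected


def expected_annual_loss(scenarios):
    """Sum of probability times loss over the residual disaster list."""
    return sum(s.annual_probability * s.loss for s in scenarios)


def annual_cost_with_dr(scenarios, dr_spend, loss_avoided_fraction):
    """Yearly DR spend plus the expected loss that DR fails to prevent."""
    residual = expected_annual_loss(scenarios) * (1.0 - loss_avoided_fraction)
    return dr_spend + residual


def annual_cost_with_insurance(scenarios, premium, payout_fraction):
    """Yearly premium plus the expected loss the policy does not cover."""
    residual = expected_annual_loss(scenarios) * (1.0 - payout_fraction)
    return premium + residual


# Hypothetical residual list: what is left after hardening the
# infrastructure and discarding events the business would not survive anyway.
scenarios = [
    Scenario("regional outage", 0.01, 20_000_000),
    Scenario("total data center loss", 0.002, 50_000_000),
]

# Hypothetical costs: full DR at 2,000,000 a year that avoids 80% of the
# loss, versus a 400,000 premium on a policy that pays out 90% of the loss.
dr_cost = annual_cost_with_dr(scenarios, 2_000_000, 0.8)
insurance_cost = annual_cost_with_insurance(scenarios, 400_000, 0.9)

print(f"Expected annual loss, unprotected: {expected_annual_loss(scenarios):,.0f}")
print(f"Expected annual cost with DR:        {dr_cost:,.0f}")
print(f"Expected annual cost with insurance: {insurance_cost:,.0f}")
```

With these made-up inputs the insurance route comes out cheaper, but the result flips easily if the probabilities or losses are higher, which is exactly why the "critical" businesses are a different case. Note also that expected-value arithmetic ignores anything an insurance check can't replace, such as customers who leave during the outage.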
I also accept that if you work in safety critical or economically critical or a few other "critical" areas, this flavor of thinking doesn't apply to you. But for everyone else, we have to ask ourselves, "Is it actually worth it?" After all, the data indicates that quite a lot of people who declare a disaster and move to their DR site never come back.
Which brings me to my final rant, aimed at some of the "critical" folks.
Too often I see DR plans that just wouldn't work in a real disaster because they depend on things that probably won't go right. In real disasters, things always go wrong. People aren't available; their backups aren't available. The documentation at the DR site isn't quite up to date. Confusion is rampant, maybe even panic. Communications are overloaded. Key people (and keys even) go missing just when you need them. Passwords have just expired...
Practically no one practices these things--they're too disruptive and expensive, with real danger of a business interruption if (when) things go wrong. So organizations spend a lot of money on something that would never actually work--or would require heroic efforts and levels of luck that can't be guaranteed.
We should either spend what it takes to ensure operational continuity (which means at least an active hot site in a distant location with redundant connections and periodic switches between sites so we know it will work every time) or admit we can't recover, buy some catastrophe insurance and get on with life--and focus on the things we can actually make work.
John Parkinson, the former CTO of TransUnion LLC, has been a technology executive and consultant for over 30 years, advising many of the world's leading companies on the issues associated with the effective use of IT.