Understanding the Disaster Recovery TLAs

Understanding the Disaster Recovery TLAs – DR, BCP, RPO, RTO – too many TLAs?

Intro

When I started my previous engagement, I was joining a huge global organization through an acquisition, and I was constantly bombarded with acronyms, mostly Three Letter Acronyms (TLAs). I was drinking from the fire hose, confused and struggling to keep up. I imagine some of you are facing the same challenge when looking at data protection, disaster recovery, or a business continuity plan for the first time, or even when revisiting one. Hopefully what I have to say will make things a little less scary for you.

In a previous life, working for a bank, I was instrumental in designing and setting up our Business Continuity Plan (BCP). By far the biggest challenge was bringing together the business units and the IT department to classify which applications needed protection and which were essential for continued business. Finalizing the list of applications and their relative priority took an inordinate amount of time.

Together we had to decide on the best approach given the infrastructure we had at the time, the cost, and the complexity of any potential solutions. This was a heterogeneous environment with a mix of virtualized and bare-metal servers, including HP-UX and AS/400.

Cost-Driving Decisions to Be Made

So, what are some of the factors that drive these decisions, especially in today’s fast-paced, time-sensitive world?

RPO and RTO drive a lot of the project decisions. Recovery Point Objective (RPO) tells you how much data you can afford to lose, measured as time back from the moment of failure, whereas Recovery Time Objective (RTO) tells you how quickly you need to be up and running again. However, understanding RPO and RTO may not be so straightforward, and getting consensus on them is even harder.
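To make the two terms a little more concrete, here is a minimal Python sketch of how you might sanity-check a protection scheme against agreed targets. Everything in it is my own illustration: the names (`RecoveryTargets`, `worst_case_data_loss`, `meets_targets`) are hypothetical, and it assumes point-in-time copies, where a copy only becomes usable once it has finished, so the worst-case loss is the copy interval plus the time a copy takes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Targets agreed between the business and IT (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore service


def worst_case_data_loss(copy_interval: timedelta, copy_duration: timedelta) -> timedelta:
    # Assumption: each copy captures the data as of the moment it starts and
    # is only usable once it completes. A failure just before the next copy
    # completes loses one full interval plus the copy duration.
    return copy_interval + copy_duration


def meets_targets(targets: RecoveryTargets,
                  copy_interval: timedelta,
                  copy_duration: timedelta,
                  restore_time: timedelta) -> dict:
    """Compare a protection scheme against the RPO and RTO targets."""
    return {
        "rpo_met": worst_case_data_loss(copy_interval, copy_duration) <= targets.rpo,
        "rto_met": restore_time <= targets.rto,
    }


targets = RecoveryTargets(rpo=timedelta(hours=1), rto=timedelta(hours=4))

# Nightly backup that takes 2 hours to run and 6 hours to restore.
nightly = meets_targets(targets, timedelta(hours=24), timedelta(hours=2), timedelta(hours=6))

# Snapshot every 15 minutes, 5 minutes to copy, 1 hour to bring up.
snapshots = meets_targets(targets, timedelta(minutes=15), timedelta(minutes=5), timedelta(hours=1))

print(nightly)
print(snapshots)
```

With these made-up numbers the nightly backup misses both targets while the frequent snapshots meet both, which is exactly the kind of gap the business and IT need to talk through.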

The CEO, or business head, may say, “I don’t want to lose any data and I don’t want services to be unavailable at all”. The CEO may not be aware of the implications that statement has for the budget dollars required to make it happen, with seemingly no cost recovery over time.

Conversely, the CTO will have a better grasp of what is achievable within the budget, and will need to collaborate with the business to set expectations for priorities and recovery times.

DR is an insurance policy: the ability to continue doing business after an internally or externally forced issue that necessitates a ‘failover’.

Timely Decisions

Application failover can be instant or near instant, or it can take an hour or even a day. It all depends on your budget and the type of business you are in. In most cases, the smaller the RPO / RTO, the more it is going to cost you.

Consider the following scenarios:

Intra-Site Failover

A failover may be to another system in the same data center or building, for a highly available application or solution (manufacturing is a good example) leveraging an active / active scenario.

Inter-Site Failover

It can also be between data centers in different locations due to external threats (flood plains / earthquake zones / airport proximity) yet still be active / active (a banking system is a good example).

Both of these solutions typically require the same specifications at each location to give like-for-like performance regardless of which data center is hosting the application. Often the financial cost makes this option prohibitive.

Link Requirements and Considerations

Geographical distance between locations may limit your ability to achieve a zero RPO / RTO, purely because of the physics of synchronizing data between two sites. It is not that it cannot be done, but that the latency involved can render the application too slow or unusable.
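A rough way to see why latency hurts is to look at what synchronous replication does to a single stream of writes. The Python sketch below is only an illustration under my own assumptions: each write is acknowledged only after the remote site confirms it, writes are issued one after another, and the 0.5 ms local write time is a made-up figure. The function names are hypothetical.

```python
def effective_write_latency_ms(local_write_ms: float, round_trip_ms: float) -> float:
    # Assumption: with synchronous replication the application waits for the
    # local write plus one full round trip to the remote site.
    return local_write_ms + round_trip_ms


def max_serial_writes_per_second(local_write_ms: float, round_trip_ms: float) -> float:
    # Upper bound for one stream of writes issued strictly one after another.
    return 1000.0 / effective_write_latency_ms(local_write_ms, round_trip_ms)


local_only = max_serial_writes_per_second(local_write_ms=0.5, round_trip_ms=0.0)
with_remote = max_serial_writes_per_second(local_write_ms=0.5, round_trip_ms=4.5)

print(local_only)   # 2000.0
print(with_remote)  # 200.0
```

In this toy case a 4.5 ms round trip cuts the serial write rate by a factor of ten. Real applications issue writes in parallel, so the impact varies, but the added wait on every acknowledged write does not go away.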

Creating a low-latency link between geographically separated locations for today’s volumes of data is rarely cheap. Everything is getting bigger: drive sizes, data sizes. Meanwhile, our hunger for instant responses and easy operations is as ravenous as ever. This translates to higher-bandwidth, low-latency links, and today low latency means a round trip of less than 5 ms.
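To put a distance on that kind of latency budget, here is a back-of-the-envelope Python sketch. It assumes light travels through optical fiber at roughly 200 km per millisecond (about two thirds of its speed in a vacuum) and ignores switching, routing, and the fact that fiber rarely runs in a straight line, so treat the results as best-case figures. The function names are my own.

```python
# Assumption: propagation speed in optical fiber, roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(fiber_km: float) -> float:
    """Best-case round-trip time over a fiber path of the given length."""
    return 2 * fiber_km / FIBER_KM_PER_MS


def max_fiber_km(round_trip_budget_ms: float) -> float:
    """Longest fiber path that still fits inside a round-trip budget."""
    return round_trip_budget_ms * FIBER_KM_PER_MS / 2


print(min_round_trip_ms(100))   # 1.0
print(min_round_trip_ms(4000))  # 40.0
print(max_fiber_km(5))          # 500.0
```

So even before any equipment adds delay, a 5 ms round-trip budget caps the fiber path at around 500 km, and a cross-continent link lands in the tens of milliseconds.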

Don’t despair, though. For businesses that don’t need, or are not in the market for, the Rolls-Royce solution, there are other options. Alternative approaches with more relaxed RPO / RTO targets can reduce both the OPEX and the CAPEX of your DR project.

Next Time

In the next episode we will look at how other organizations establish their DR / BCP projects. These projects are living, breathing things that need monitoring, nurturing, and updating over time to adapt to changes in your business processes and the technology underlying them.

Please comment and leave feedback, and tune in for episode 2 very soon.

Thanks for reading,

Twister

@twister68

Caveat

I work in the storage industry, and while the overall aim of this series of blogs is to help simplify the process of looking at DR, I am speaking initially from a data-specific standpoint. There are a lot of application and networking layer considerations that will not be addressed in these blog episodes. I will build on this blog based on feedback and comments.

I am English living in Canada, but as we are really polite here I have tried to stick to U.S. style grammar and spelling where possible. Excuse any lapses in punctuation :)