Welcome back. In this video, we're gonna be all about business continuity: dealing with failure in the cloud, specific things you'll want to take into account, and the particular strategies you want to employ to keep your systems up and running in the cloud.
In the cloud, you need to architect for failure.
The thing is, single assets aren't as reliable in the cloud as they were in the traditional data center model.
The cloud provider's environment is highly complicated and virtualized. You often have many different tenants. If you can imagine, you have these virtual machines running on physical machines, and it's an environment where there's an expectation that things are running 24/7. So when the cloud provider needs to make changes to a physical machine, they need to take actions to
divert and move virtual machines off of that physical machine and onto other machines within the cluster.
In early 2020, daredevil Nik Wallenda traversed a tightrope across Nicaragua's Masaya Volcano with lava boiling beneath him at 2,300 degrees Fahrenheit, wind blowing, and toxic gas in his face. He made his way across the rope, step by step.
You may have noticed in the picture that he had an emergency cord set up to catch him if he completely fell off the line. Additionally, he had this very large balancing pole to adjust to the wind drafts that he was encountering during his trip. And then, finally, he had a special mask on to deal with that volcanic
and toxic gas coming up at him,
because, as you can imagine, if you pass out on a tightrope, it doesn't go well for you. Now, even with all those safety precautions, myself, could I do this? No way. Could you do this? I'm not sure. Unless you are a professional daredevil yourself, likely not. Just as I am scared to walk across that kind of a tightrope,
Nik may be scared of the cloud. Fortunately,
the cloud is your profession. So be like Nik, but in the cloud: prepare for failure. If failure happens, your systems don't have to be running at 100%. In fact, it's usually not cost effective to do that, except for the most critical workloads. But assuming failure is the way to build reliable cloud native systems.
Before moving forward, let's define a few terms. First, business continuity, or business continuity planning.
This is a playbook to address large-scale failures. We're talking about buildings collapsing or being untenable. We're talking about electricity outages, maybe for short periods of time where you kick on a backup generator, maybe for really long periods of time. We're talking about natural disasters. I live in California. Everybody talks about earthquakes out here.
A few years back, the island of Puerto Rico encountered a massive hurricane,
and the company I was working for had a large presence there. So we really had to do a lot of disaster recovery in its most extreme form. And a new area, where we're just figuring out how that will work, is pandemics: large-scale viral or bacterial outbreaks. The goal of business continuity planning
is to get the people and processes
critical for the business working within an acceptable amount of time. Now let's talk about disaster recovery. That's similar, but a little bit different. Disaster recovery is more the tactical plan that you're going to use to restore the technology systems that are critical to those key people and processes of your business.
When you consider backup strategy, in its most basic form it comes down to three major methods. First, you have the hot backup strategy. This is the highest cost, but it has the least downtime. This is where you have hardware, software, data, people, everything ready to cut over to this new location at a moment's notice.
And it's kind of tricky to manage, both in the sense of
replicating data and in the sense of ensuring there is enough capacity at the failover site.
If you were managing all of this yourself in IT, in the traditional data center model, it was very difficult to manage something like this. You had to have separate facilities. You had to make sure that you had not only data replicating, but also enough hardware at both locations, often hardware of similar models and similar natures, to make sure that everything,
when it would fail over, would operate at a reasonable level.
Going down the chain, we have the warm site. This is a compromise between hot and cold. You don't have everything running at an alternative site, up and ready to just fail over and go. But you do have the necessary servers, applications, operating systems, and even some ongoing data replication. So if things do go
south and your primary region goes down, then with some manual or automated effort, and in a little bit of time, you can get that new location up and running and hosting the role of the region that went down.
And then, last, we have the cold site. This is the lowest cost, but also the most downtime.
In its most extreme form, you have a data center room with some racks and Internet connections, maybe sitting there with the necessary electricity and cooling, but you haven't even installed the servers. Oftentimes companies will still have a cold site, but they don't want it to be quite that cold.
They'll have a handful of servers sitting there.
They have virtualization technologies, but there are really no active efforts underway to replicate any of the servers and applications that are going to be assumed by this cold site when failover does happen. It's not until the main
site fails that the efforts start being made to restore backups
from a certain location, rebuild all of those machines, restore the data from cross-site backups, and so forth. So it just takes a lot longer. In traditional IT, when you managed the data centers,
any of these strategies really involved a lot of overhead, hot being highest, obviously. In the cloud, the resiliency trade-off still exists in terms of hot being more expensive than cold, but the cost to achieve even the most basic of these strategies is significantly less,
and we're gonna talk about how you achieve these different things.
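As a rough illustration of the hot, warm, and cold trade-off, here is a minimal Python sketch that picks the cheapest strategy whose recovery time still fits the downtime a system can tolerate. The `Strategy` class, the cost ranks, and the recovery hours are all hypothetical placeholders chosen for the example, not figures from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int             # hypothetical rank: higher means more expensive
    typical_recovery_hours: float  # hypothetical placeholder, not a measured value


# Illustrative numbers only: hot costs the most and recovers the fastest.
STRATEGIES = [
    Strategy("hot", relative_cost=3, typical_recovery_hours=0.1),
    Strategy("warm", relative_cost=2, typical_recovery_hours=4.0),
    Strategy("cold", relative_cost=1, typical_recovery_hours=72.0),
]


def cheapest_strategy(tolerable_downtime_hours: float) -> Strategy:
    """Return the lowest-cost strategy that recovers within the tolerable downtime."""
    candidates = [
        s for s in STRATEGIES
        if s.typical_recovery_hours <= tolerable_downtime_hours
    ]
    if not candidates:
        raise ValueError("no strategy meets the requested downtime")
    return min(candidates, key=lambda s: s.relative_cost)


print(cheapest_strategy(8).name)    # warm
print(cheapest_strategy(100).name)  # cold
```

The point of the sketch is only that you pay for hot when the tolerable downtime demands it, and fall back to cheaper options otherwise.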
But before we dive into those specifics, it's worth noting that not all systems are equal, and you want to focus your efforts on those that are the most critical. A business impact assessment, a BIA, is a questionnaire-based tool that you can create to capture and set expectations for each system within your business.
Different companies make these. I'm sure you could even find a simple one if you were to just type a little Google search string. And what it does is allow you to justify the high costs associated with those systems that really are system critical and demand resiliency.
The recovery time objective, the RTO, is the amount of time between when a system goes down and when it needs to be back up and running after a disaster, and it's really all to avoid unacceptable consequences associated with a break in business continuity.
So RTO is the answer to the question: how much time should it take to recover after notification of a business process disruption?
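To make the RTO idea concrete, here is a small Python sketch that compares an actual recovery time against the objective. The function name `met_rto` and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def met_rto(disruption_notified: datetime,
            service_restored: datetime,
            rto: timedelta) -> bool:
    """True if the actual recovery time is within the recovery time objective."""
    if service_restored < disruption_notified:
        raise ValueError("restore time precedes the notification time")
    return service_restored - disruption_notified <= rto


# Hypothetical incident: notified at 09:00, restored at 12:30 (3.5 hours).
notified = datetime(2024, 1, 10, 9, 0)
restored = datetime(2024, 1, 10, 12, 30)

print(met_rto(notified, restored, rto=timedelta(hours=4)))  # True
print(met_rto(notified, restored, rto=timedelta(hours=2)))  # False
```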
A close cousin to the RTO is the recovery point objective, the RPO.
This is where you define the amount of data that you can lose in the event of some sort of a disaster. So let's say you are doing some sort of a data backup procedure that's backing up all databases every night and then taking those backups offsite, or using the cloud provider's functions and having the backups
moved to a different region. In that circumstance, it would be just fine as long as the RPO
is 24 hours, because you could lose up to one day's worth of data. But for some systems, that's not acceptable. In the event of a disaster, there's a lot less data that they're willing to lose. And so in those circumstances, you're going to need to have a much shorter RPO, and the most critical traffic sites
can even reach down to five-minute RPO expectations.
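Here is a minimal Python sketch of the nightly backup example. It assumes, as a simplification, that the worst-case data loss equals one full backup interval, ignoring how long the backup itself takes to run or to copy to another region. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next periodic backup loses one full interval.

    Simplifying assumption: backup duration and offsite transfer time are ignored.
    """
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss fits within the recovery point objective."""
    return worst_case_data_loss(backup_interval) <= rpo


nightly = timedelta(hours=24)

print(meets_rpo(nightly, rpo=timedelta(hours=24)))   # True
print(meets_rpo(nightly, rpo=timedelta(minutes=5)))  # False
```

A five-minute RPO rules out nightly backups on this simple model and pushes you toward much more frequent or continuous replication.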
In a SaaS model, you don't have as active a role in controlling the technical implementation regarding RTOs and RPOs. However, we talked about the key tool of the contract. These contractual agreements are an awesome place to put RTO and RPO expectations, as well as remedies in the case that the SaaS provider
fails to meet those expectations.
The geo-redundancy capabilities of cloud providers are extremely powerful, especially when we're talking about disaster recovery.
In the next video, we will continue this discussion and look at different mechanisms cloud providers give that allow you to achieve the necessary disaster recovery. We'll also talk about the reality of cloud provider outages as well as options for portability across different cloud providers.