In this video, we're going to pick up right where we left off, talking about business continuity and disaster recovery in the cloud. We're gonna look at some of the strategies and tactics for doing this, specifically focusing on continuity and recovery within a given cloud provider,
preparing for and managing cloud provider outages. And then we'll talk about other options for portability.
In the first domain, we discussed a logical stack, which you see on the right-hand side of the slide.
It provides a good way to structure our discussion around continuity within the cloud provider,
and we'll start out by looking at the things you can do at the metastructure layer.
First and foremost, you want to be sure you can back up your cloud configurations. This includes how you've configured identity and access management and other policies that you've created, which apply to your cloud account regardless of the region of the resources.
This meta information is frequently replicated to numerous domains within a cloud provider, so it's not likely you will lose those cloud configurations if a specific region goes down. But when all is said and done,
an outage has a negative impact on your business, and you want to make sure you've taken every precaution possible to minimize that impact.
When you're using a cloud provider's IaaS or PaaS capabilities, you can also take advantage of software-defined infrastructure.
This is often referred to as infrastructure as code. In this paradigm, you declare your cloud resources and how they're configured, and you do it all (giving you air quotes here) "in code."
We'll be talking more about infrastructure as code in future sections, and it brings a lot of advantages. But the important point for this discussion is the ability to programmatically regenerate your IaaS and PaaS cloud environment in a different region of the cloud provider. It could take some time to regenerate large amounts of infrastructure,
and that rules out this approach for a hot recovery strategy. Depending on your RTO, it could work for a warm recovery strategy. It is definitely a good cold recovery strategy. These tactics are more difficult with the SaaS model because you have less insight into the details, and as is the case with so many other things in the SaaS model,
you want to rely heavily on contractual obligations with your SaaS provider.
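To make the infrastructure-as-code idea a little more concrete, here is a minimal Python sketch. It does not use any real provider's tooling: the `Resource` type, the `retarget` and `render_plan` helpers, and the region names are all hypothetical. It only illustrates that when resources are declared as data, the same declaration can be pointed at a different region and regenerated there.

```python
from dataclasses import dataclass, replace

# Hypothetical, provider-neutral resource declaration (not a real SDK type).
@dataclass(frozen=True)
class Resource:
    name: str
    kind: str    # e.g. "vm" or "database"
    region: str
    size: str

# The environment is declared once, as data. Names and regions are made up.
ENVIRONMENT = [
    Resource("web-1", "vm", "west-us", "medium"),
    Resource("orders-db", "database", "west-us", "large"),
]

def retarget(resources, new_region):
    """Return copies of the same declarations pointed at a different region."""
    return [replace(r, region=new_region) for r in resources]

def render_plan(resources):
    """Describe what a deployment tool would create; nothing is provisioned here."""
    return [f"create {r.kind} '{r.name}' ({r.size}) in {r.region}" for r in resources]

if __name__ == "__main__":
    for step in render_plan(retarget(ENVIRONMENT, "central-us")):
        print(step)
```

In a real setup the plan would be handed to whatever deployment tooling you use, and the time that tooling needs to rebuild everything is what you compare against your RTO.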
Moving down the stack, let's talk about infrastructure.
One of the cool things about cloud providers is the geo-redundant capabilities that they provide. It makes it very easy for you to set things up so that if one region of the cloud provider fails, you can move over to another region.
However, the specific ways you configure the different IaaS and PaaS services of a cloud provider are gonna vary, and you need to spend some extra time understanding how they work. You may even need to adjust the architecture of your application to leverage this provider resiliency.
And when you're configuring this redundancy, be very mindful of cost relative to the risk of the system outage. It's not something you're gonna wanna blindly do for all assets that you have running in the cloud, as the costs can definitely add up.
The infostructure layer is all about the data, and in that category, cloud providers also have great methods to replicate data across different regions.
The cloud provider mechanisms to replicate data between regions can make it much quicker, more secure, and more cost-effective than if you were to attempt to roll your own method of data replication.
When creating storage backups in different regions, also be mindful of the storage tiers that cloud providers have. For example, they will have a hot tier, which is usually the most expensive, a cold tier, an archive tier, and maybe other tiers. The tier is often determined by how frequently you're going to access the data.
The less frequently you need to access the data,
the cheaper the storage tier will be.
And in many situations, it makes sense to have data replication go to a different region where you are using a much lower tier for the data access.
This way, you're not paying premium prices to have data just sit there, and only when the failover happens do you increase the storage tier.
This is a great mechanism for handling warm failover scenarios and even cold failover scenarios.
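As a rough illustration of the tiering trade-off, here is a small Python sketch that compares the monthly cost of keeping a replica in different tiers. The per-GB prices are invented for the example and do not come from any provider's price list; real prices vary by provider and region.

```python
# Hypothetical per-GB monthly prices, for illustration only.
PRICE_PER_GB_MONTH = {"hot": 0.020, "cold": 0.010, "archive": 0.002}

def monthly_storage_cost(size_gb, tier):
    """Monthly cost of keeping size_gb of data in the given tier."""
    return size_gb * PRICE_PER_GB_MONTH[tier]

def replica_savings(size_gb, primary_tier, replica_tier):
    """Monthly saving from holding the replica in replica_tier
    instead of the same tier as the primary copy."""
    return (monthly_storage_cost(size_gb, primary_tier)
            - monthly_storage_cost(size_gb, replica_tier))

if __name__ == "__main__":
    size_gb = 10_000
    for tier in PRICE_PER_GB_MONTH:
        print(f"{tier:8s} replica: {monthly_storage_cost(size_gb, tier):8.2f} per month")
    saving = replica_savings(size_gb, "hot", "archive")
    print(f"saved by replicating to archive instead of hot: {saving:.2f} per month")
```

Lower tiers often come with slower access and retrieval charges, so the tier you pick for the replica still has to fit the recovery time you're aiming for.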
And at the bottom of it all, we have the applistructure layer. This is where you're going to be using platform as a service components quite a bit. The specifics of each service will vary by cloud provider, and within each cloud provider, different platform services may have other limitations and lock-ins that you need to be aware of, but in many circumstances,
the PaaS services
allow your application to fail over gracefully. And when you're really serious about disaster recovery, you will apply the concept of chaos engineering.
You may not have heard of chaos engineering before.
This was popularized by Netflix and their use of Chaos Monkey. Chaos Monkey would run in the production environment, and it would kill random services. This allowed them to continually be testing for resiliency and disaster recovery. Now, it's very bold to be doing this in a production environment, but it does bring some benefits.
The Netflix developers and engineers had to assume failure because they knew that even if what they wrote was stable,
Chaos Monkey may come along and knock it over, and as a result, it really required that failure be considered and addressed as part of the software design itself. It wasn't an afterthought. In a way, you can consider chaos engineering as continuous resiliency testing. Chaos engineering is not for the faint of heart,
but for those who are determined to create highly resilient cloud native systems,
it's definitely something you want to add to your toolkit.
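The idea behind chaos engineering can be sketched in a few lines of Python. This is a toy simulation, not Netflix's Chaos Monkey and not its API: the `Service` class, `chaos_step`, and `still_available` are hypothetical names made up for this example. It kills one random replica and then checks whether every service can still answer a request.

```python
import random

class Service:
    """A toy service made of interchangeable replicas (hypothetical model)."""
    def __init__(self, name, replicas):
        self.name = name
        self.healthy = [True] * replicas

    def handle_request(self):
        """Succeed as long as at least one replica is still healthy."""
        return any(self.healthy)

def chaos_step(services, rng):
    """Kill one randomly chosen replica of one randomly chosen service."""
    service = rng.choice(services)
    victim = rng.randrange(len(service.healthy))
    service.healthy[victim] = False
    return service.name, victim

def still_available(services):
    """True if every service can still answer a request."""
    return all(s.handle_request() for s in services)

if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    services = [Service("api", 3), Service("checkout", 2)]
    name, victim = chaos_step(services, rng)
    print(f"killed replica {victim} of {name}; available: {still_available(services)}")
```

A service with a single replica fails this check as soon as it is picked, which is exactly the kind of design weakness this style of testing is meant to surface early.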
So now that we've talked about some of the strategies and tactics, let's get to the reality. A cloud provider does not wanna have outages, but you do need to prepare in advance for an outage.
This requires that you decide on the acceptable levels of risk for those outages and the services that are most critical to your business. It's also valuable to share your disaster recovery plans with your provider, especially when you have large workloads. Imagine if your disaster recovery plan for an outage in the West US region
was to fail over and move everything into the Central US region.
The cloud provider is going to have a good overview of the capacity of that region. So if you have a very large workload in the West US region, and there's a whole lot of other people who have their own failover plan to move into the Central US region, and that Central US region does not have the capacity to accommodate all those workloads, you may be setting yourself up for failure.
And by sharing this with your provider,
they may give you the option to reserve capacity in that failover region or take other steps as part of your disaster recovery planning. Ultimately, the goal is to guarantee that you have space in the target region that you're failing over to.
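The capacity concern can be expressed as a simple check. The Python sketch below is only an illustration: the "capacity units", customer names, and numbers are all hypothetical, and in practice only the provider can see these figures, which is the reason for sharing your plan with them.

```python
def failover_shortfall(target_capacity, target_in_use, failover_demands):
    """Return how many capacity units short the target region would be
    if every listed workload failed over at once (0 if everything fits)."""
    free = target_capacity - target_in_use
    needed = sum(failover_demands.values())
    return max(0, needed - free)

if __name__ == "__main__":
    # Hypothetical numbers: the target region has 1000 units, 700 already in use.
    demands = {"your-workload": 200, "customer-b": 150, "customer-c": 100}
    short = failover_shortfall(1000, 700, demands)
    print(f"target region would be short by {short} units")
```

Reserved capacity in the failover region amounts to taking your own workload out of that contest ahead of time.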
All of the disaster recovery strategies and tactics we've talked about so far are focused on cross-region recovery within the same provider.
And if you're really, really compelled to think, "What if the cloud provider as a whole goes down?", that's when you can start thinking about cross-provider disaster recovery.
This is not a cheap undertaking. It is not easy. There are gonna be lots of particulars with one cloud provider that just don't map directly to another cloud provider. For example, Amazon has their Elastic Container Service, and that's very specific to Amazon.
It's definitely more complicated to manage a Kubernetes cluster than to take advantage of Amazon's Elastic Container Service. However, if you really wanna have that cross-provider capability, you need to take the abstraction up another level and avoid lock-in by staying away from the provider's platform as a service,
thereby ensuring a little more portability in what you're creating and moving.
Once again, this really is not recommended, and it should only be for the most critical workloads, because it takes a lot of planning. It takes a lot of expertise across the different cloud providers, and what works in one place may not work with the other cloud provider.
On a final note, let's talk about community and private clouds. In this circumstance, the cost of disaster recovery is comparable to the traditional data center model. In the private cloud, you are responsible for the physical layer, for the machines, for the premises, all the way up. In the traditional data center model, you're responsible for the physical facilities all the way up,
and the same rings true in the private cloud circumstance.
And when your private cloud or community cloud is being managed by a third party, be aware of the contractual obligations that you have with them. You might even want to modify that contract just to offload the effort that it will take to design and orchestrate these kinds of disaster recoveries, and have the third party take care of that for you.
And as a final point, consider any geography-based obligations when determining your failover location. If you have a private cloud or community cloud and there's certain information there that needs to remain within a particular state or country, and your disaster recovery plan has you moving into a completely different country,
that could put you in a really bad spot, failing to meet those obligations.
So if we were to summarize the key points about business continuity in the cloud:
Architect for failure. For sure, individual resources are less resilient, but the providers themselves have lots of capabilities to handle failover within a particular region or across regions. Consider disaster recovery for the scope of specific application components or systems.
Think through situations where the cloud provider's region as a whole goes down,
and then you can even contemplate total cloud provider outages.
Disaster recovery planning covers the entire stack: metastructure, infrastructure, infostructure, and applistructure.
Be sure to leverage the cloud provider capabilities for IaaS and PaaS failover across geographies. These really will make your life a lot easier and will hopefully give you the confidence so you don't have to open up the book on having cross-provider failover. But last but not least, and just as important, take cost into account
and prioritize the critical services first.
I've been working in IT for many years. I totally get it. Every user thinks every system should just be up all the time, but that's just not economically feasible. The cost of high availability goes up significantly the more nines you tack on to it. And in some circumstances, the cloud providers themselves can't even meet
the expectations that your end users have.
But that's why we employ the business impact assessment process to objectively evaluate the service and how critical it is to the business, and to convey to the business users: here's where you sit, and here are the RTOs and RPOs that we can achieve within reasonable cost bounds.
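The point about tacking on nines can be made concrete with a little arithmetic. This Python sketch converts an availability percentage into the downtime it allows per year; the function name is made up for the example, and the calculation assumes a 365-day year and ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, from roughly 8.8 hours per year at three nines to a little over 5 minutes at five nines, which is why the cost climbs so steeply.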
And that wraps up our deep dive into business continuity. It also closes out our discussion around the control plane.
Please continue to the next video as we do a knowledge check on the many concepts learned in these last few videos.