Let's switch gears a little bit and start talking about what happens when things go bad for good network admins.
Regardless of what incidents happen on the network, fault management and fault tolerance are always going to be essential.
When we talk about fault management, we're really talking about being able to withstand failures of one or multiple devices and continue to move forward.
Do we have the redundancy in place? Do we have the capabilities to keep moving forward as an organization?
First, when we're planning for redundancy, we have to have some idea of how much redundancy we need.
There are a couple of terms, a couple of metrics that might be helpful.
The first is a key performance indicator.
The key performance indicator will give us an idea of whether or not a device is performing to its expectations, because rarely does the device just stop working.
Often service starts to degrade, and we start to see some issues come up.
If a specific device isn't meeting its standard performance, that might be an indicator that we're looking at a device that's getting ready to fail.
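As a rough illustration of the KPI idea, here is a minimal sketch that flags a device when its measurements drift past a baseline. The metric names, the baseline values, and the 20 percent tolerance are all made up for the example; real thresholds would come from your own monitoring data.

```python
# Hypothetical baseline for one device; these numbers are illustrative only.
BASELINE = {"latency_ms": 5.0, "packet_loss_pct": 0.1}
TOLERANCE = 0.20  # assumed: flag anything more than 20% worse than baseline


def degraded_metrics(measured, baseline=BASELINE, tolerance=TOLERANCE):
    """Return the names of metrics that are worse than baseline by more than the tolerance.

    Both example metrics are "lower is better", so worse means higher.
    """
    flagged = []
    for name, expected in baseline.items():
        value = measured.get(name)
        if value is not None and value > expected * (1 + tolerance):
            flagged.append(name)
    return flagged


# The device still works, but latency has crept well above its baseline.
sample = {"latency_ms": 9.5, "packet_loss_pct": 0.1}
print(degraded_metrics(sample))  # ['latency_ms']
```

The device in the sample has not failed, but the latency flag is the kind of early warning the KPI is meant to give.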
There are also metrics that the vendors give to us.
One is mean time to failure and mean time between failures.
These are pretty comparable, but the idea is that mean time to failure indicates a device that we don't repair.
This is the lifespan of the device. We're going to buy it. Three years later, it's going to fail.
Mean time between failures gives the indication that the device can be repaired.
We buy the device, it goes three years. It fails, we repair it. Three years later, it fails, we repair it, and so on.
Then we've also got to consider how long it takes us to repair the device, which is mean time to repair.
If it's going to take us a long time to repair that device, is it possible that we can just replace it?
We don't really repair a whole lot of devices today, because most of them can be replaced for much less than the time and effort it would take to repair them.
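To see how these metrics fit together, one common way to combine them is the steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below uses the three-year figure from the example above; the 4-hour and 72-hour repair times are assumptions picked just to show the contrast between a quick swap and a slow repair.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the device is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


three_years = 3 * 365 * 24  # MTBF of three years, expressed in hours

# Assumed repair times: a quick swap versus a slow repair.
quick_swap = availability(three_years, 4)
slow_repair = availability(three_years, 72)

print(f"4-hour MTTR:  {quick_swap:.5f}")
print(f"72-hour MTTR: {slow_repair:.5f}")
```

The same device looks noticeably more available when it can be swapped quickly, which is the argument for replacing rather than repairing.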
The last are service level agreements.
Vendors are going to provide us with commitments as to performance and availability.
A lot of times that revolves around uptime.
For instance, if I'm storing my data with a cloud service provider, the redundancy is up to them, so that way they can meet their metrics.
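To make an uptime commitment concrete, here is a small sketch that converts an uptime percentage into the downtime it allows per year. The percentages are common illustrative values, not figures from any particular provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period by an uptime commitment."""
    return period_hours * (1 - uptime_pct / 100)


# Illustrative uptime levels only.
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why higher commitments need more redundancy behind them.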
With redundancy, we want it to be comprehensive and not focused on just redundant data or just redundant drives.
We really want to make sure that all of our areas are redundant because the chain is only as strong as the weakest link.
We start off by talking about redundancy in servers.
That's just multiple servers performing the same role.
This is not the same as a cluster because with redundant servers,
each server is its own unique device.
I have domain controller A and domain controller B, or DNS 1 and DNS 2.
If one fails, the second is up and running and is available.
With a cluster, I have multiple physical nodes acting as a single logical entity.
They are very tightly coupled.
For instance, if I have a cluster or a server farm, as you sometimes hear them called, you're not going to be able to differentiate between the servers.
When I go to amazon.com, there are many hundreds of machines responding to web queries.
I don't know which one is which, because they are functioning as part of the same cluster.
Clusters also usually do provide load balancing.
But you can also have clusters that don't load balance.
They're just simply working for fault tolerance.
In a smaller environment, I might have multiple servers.
They could either both be responding to requests, or one could be passive, while the other is the primary responding server.
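The active/passive arrangement can be sketched in a few lines. This is only an illustration of the idea: the server names are hypothetical, and the health flags stand in for whatever health check a real deployment would run.

```python
def pick_responder(servers):
    """Return the first healthy server in priority order, or None if all are down.

    `servers` is a list of (name, is_healthy) pairs, primary first.
    """
    for name, is_healthy in servers:
        if is_healthy:
            return name
    return None


# Hypothetical pair: dc-a is the primary, dc-b is the passive standby.
print(pick_responder([("dc-a", True), ("dc-b", True)]))   # dc-a
print(pick_responder([("dc-a", False), ("dc-b", True)]))  # dc-b
```

In an active/active setup both servers would take requests at once; here the standby only answers when the primary is marked unhealthy.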
There are all sorts of configurations. The real purpose and the real important piece is that we need our fault tolerance for those servers.
That's where services run.
That's where our resources are.
Many times we have spare equipment.
I have a switch in the closet, and when we talk about cold spares, that's exactly it.
I've got a device somewhere in the closet.
I can find it. I can install it.
We often have warm spares, which are ready to go very quickly.
Hot spares are already installed, and it's just a matter of switching over to them.
Depending on the value of what's being protected, how much downtime we can tolerate is going to determine what types of spares we have.