Fault Management

Time
8 hours 19 minutes
Difficulty
Beginner
CEU/CPE
8
Video Transcription
00:00
Let's switch gears a little bit and start talking about what happens when things go bad for good network admins.
00:07
Regardless of what incidents happen on the network, fault management and fault tolerance are always going to be essential.
00:13
When we talk about fault management, we're really talking about being able to withstand failures of one or multiple devices and continue to move forward.
00:21
Do we have the redundancy in place? Do we have the capabilities to keep moving forward as an organization?
00:27
At first, when we're planning for redundancy, we have to have some idea of how much redundancy we need.
00:33
How much is enough?
00:34
There are a couple of terms, a couple of metrics that might be helpful.
00:38
The first is a key performance indicator.
00:41
KPI.
00:42
The key performance indicator will give us an idea of whether or not a device is performing to its expectations, because rarely does the device just stop working.
00:52
Often service starts to degrade, and we start to see some issues come up.
00:56
If a specific device isn't meeting its standard performance, that might be an indicator that we're looking at a device that's getting ready to fail.
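As a rough sketch of that idea (not something shown in the video), a KPI check can compare recent measurements against an expected baseline. The function name, the 20% tolerance, and the throughput figures are all hypothetical.

```python
from statistics import mean

def is_degrading(samples, baseline, tolerance=0.2):
    """Return True if the average of `samples` is more than
    `tolerance` (as a fraction) below the expected `baseline`."""
    return mean(samples) < baseline * (1 - tolerance)

# Hypothetical throughput samples (Mbps) for a link expected to deliver 1000 Mbps.
healthy_link = [950, 940, 960]
struggling_link = [700, 650, 720]

print(is_degrading(healthy_link, baseline=1000))     # False
print(is_degrading(struggling_link, baseline=1000))  # True
```

A device that trips a check like this has not failed yet, but it is a candidate for closer monitoring or early replacement.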
01:03
There are also metrics that the vendors give to us.
01:07
One pair is mean time to failure (MTTF) and mean time between failures (MTBF).
01:11
These are pretty comparable, but the idea is that mean time to failure indicates this is a device that we don't repair.
01:19
This is the lifespan of the device. We're going to buy it. Three years later, it's going to fail.
01:23
Mean time between failures gives the indication that the device can be repaired.
01:29
We buy the device, it goes three years. It fails, we repair it. Three years later, it fails, we repair it, and so on.
01:37
Then we've also got to consider how long it takes us to repair the device, which is mean time to repair (MTTR).
01:42
If it's going to take us a long time to repair that device, is it possible that we can just replace it?
01:49
We don't really repair a whole lot of devices today, because most of them can be replaced for much less than the time and effort it would take to repair them.
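To put numbers on that trade-off, a commonly used approximation (not stated in the video) estimates availability as MTBF / (MTBF + MTTR). The sketch below assumes the three-year figure from the example above and two hypothetical repair times.

```python
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours, mttr_hours):
    """Approximate steady-state availability as uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumption: the device runs about three years between failures.
mtbf = 3 * HOURS_PER_YEAR

# Hypothetical repair times: a 4-hour swap versus a 72-hour repair.
for mttr in (4, 72):
    print(f"MTTR {mttr:>2} h -> availability {availability(mtbf, mttr):.5%}")
```

The shorter the time to repair or replace, the closer availability gets to 100%, which is why a quick swap often beats a slow repair.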
01:57
The last are service level agreements (SLAs).
02:00
Vendors are going to provide us with commitments as to performance and availability.
02:05
A lot of times that revolves around uptime.
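An uptime commitment is easier to judge once it is converted into allowed downtime. This is a minimal sketch; the percentages are illustrative and are not taken from any particular vendor's SLA.

```python
def allowed_downtime_minutes(uptime_percent, period_days=365):
    """Minutes of downtime permitted over the period at a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

# Illustrative SLA tiers (assumed values, not from the video).
for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows about {allowed_downtime_minutes(sla):.1f} minutes of downtime per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is usually reflected in the price of the service.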
02:07
For instance, if I'm storing my data with a cloud service provider, the redundancy is up to them, so that they can meet their metrics.
02:16
With redundancy, we want it to be comprehensive and not focused on just redundant data or just redundant drives.
02:24
We really want to make sure that all of our areas are redundant because the chain is only as strong as the weakest link.
02:30
We start off by talking about redundancy in servers.
02:34
That's just multiple servers performing the same role.
02:37
This is not the same as a cluster because with redundant servers,
02:40
each server is its own unique device.
02:44
I have domain controller A and domain controller B, or DNS1 and DNS2.
02:50
If one fails, the second is up and running and is available.
02:55
In a cluster, I have multiple physical nodes acting as a single logical entity.
03:00
They are very tightly coupled.
03:02
For instance, if I have a cluster or a server farm, as you sometimes hear them called, you're not going to be able to differentiate between the servers.
03:10
When I go to amazon.com, there are many hundreds of machines responding to web requests.
03:16
I don't know which one is which, because they are functioning as part of the same cluster.
03:21
Clusters also usually provide load balancing.
03:23
But you can also have clusters that don't load balance.
03:27
They're just simply working for fault tolerance.
03:30
In a smaller environment, I might have multiple servers.
03:34
They could either both be responding to requests, or one could be passive, while the other is the primary responding server.
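The active/passive arrangement can be sketched as a simple selection rule: use the primary while it is healthy, otherwise fail over to the standby. The `Server` class and `pick_responder` function are hypothetical names for illustration, and the server names reuse the DNS example from earlier.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    healthy: bool = True

def pick_responder(primary, standby):
    """Active/passive: answer from the primary while it is healthy,
    otherwise fail over to the standby."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy server available")

dns1 = Server("DNS1")
dns2 = Server("DNS2")

print(pick_responder(dns1, dns2).name)  # DNS1
dns1.healthy = False                    # simulate a failure of the primary
print(pick_responder(dns1, dns2).name)  # DNS2
```

In an active/active setup, both servers would answer requests at the same time and the selection rule would spread load across them instead.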
03:39
There are all sorts of configurations. The really important piece is that we need fault tolerance for those servers.
03:47
That's where services run.
03:49
That's where our resources are.
03:51
Many times we have spare equipment.
03:53
I have a switch in the closet, and when we talk about cold spares, that's exactly it.
03:58
I've got a device somewhere in the closet.
04:00
I can find it. I can install it.
04:03
We often have warm spares, which are ready to go very quickly.
04:08
Hot spares are already installed, and it's just a matter of switching over to them.
04:13
Depending on the value of what's being protected, how much downtime we can tolerate is going to determine what types of spares we have.