Fault Management

Video Transcription
00:00
>> Let's switch gears a little
00:00
bit and start talking about what
00:00
happens when things go bad for good network admins.
00:00
Regardless of what incidents happen on the network,
00:00
fault management and fault tolerance
00:00
are always going to be essential.
00:00
When we talk about fault management,
00:00
we are really talking about
00:00
being able to withstand failures
00:00
of one or multiple devices and
00:00
continue to move forward.
00:00
Do we have the redundancy in place?
00:00
Do we have the capabilities to keep
00:00
moving forward as an organization?
00:00
First, when we're planning for redundancy,
00:00
we have to have some idea
00:00
of how much redundancy we need.
00:00
How much is enough?
00:00
There are a couple of terms,
00:00
a couple of metrics that may be helpful.
00:00
The first is a key performance indicator, KPI.
00:00
The key performance indicator will give
00:00
us an idea of whether or not a device is
00:00
performing to its expectations
00:00
because rarely does a device just stop working.
00:00
Often, the service starts to
00:00
degrade and we start to see some issues come up.
00:00
If a specific device isn't
00:00
meeting its standard performance,
00:00
that might be an indicator that we're looking
00:00
at a device that's getting ready to fail.
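To make the KPI idea concrete, here is a minimal sketch of how we might flag a device whose performance is drifting away from its expected baseline. The function name, the 20 percent tolerance, the five-sample window, and the latency figures are all assumptions made up for illustration; they are not values from any vendor or monitoring product.

```python
from statistics import mean


def is_degrading(samples, baseline, tolerance=0.2, window=5):
    """Return True when the average of the most recent `window` samples
    is worse (higher) than `baseline` by more than `tolerance`.

    `samples` is a list of KPI readings where higher means worse,
    for example response latency in milliseconds (an assumption here).
    """
    if len(samples) < window:
        # Not enough data yet to say anything about a trend.
        return False
    recent_average = mean(samples[-window:])
    return recent_average > baseline * (1 + tolerance)


# Hypothetical latency readings: the device still answers,
# but the service is slowly getting worse.
latencies_ms = [10, 11, 10, 12, 14, 15, 17, 18]
print(is_degrading(latencies_ms, baseline=10))
```

The point of the sketch is that the check looks at a trend over several readings rather than waiting for the device to stop responding.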
00:00
There are also metrics that the vendors give to us.
00:00
Two of these are mean time to fail and
00:00
mean time between failures.
00:00
These are pretty comparable,
00:00
but the idea is mean time to fail
00:00
indicates this is a device that we don't repair.
00:00
This is the lifespan of the device.
00:00
We're going to buy it,
00:00
three years later it's going to fail.
00:00
Mean time between failures gives
00:00
the indication that the device can be repaired.
00:00
We buy the device,
00:00
it goes three years, it fails,
00:00
we repair it, three years later,
00:00
it fails, we repair it, and so on.
00:00
Then we've also got to consider how
00:00
long it takes us to repair the device,
00:00
which is mean time to repair.
00:00
If it's going to take us a long time
00:00
to repair the device,
00:00
is it possible that we can just replace it?
00:00
In practice, we don't really repair
00:00
a whole lot of devices today because
00:00
most of them can be replaced for much less
00:00
than the time and effort it would take to repair them.
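As a rough sketch of how these metrics fit together, steady-state availability is commonly estimated as MTBF divided by the sum of MTBF and MTTR. The second function is a simple repair-versus-replace cost comparison. The function names, the eight-hour repair time, and the cost figures are assumptions for illustration only.

```python
def availability(mtbf_hours, mttr_hours):
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def repair_or_replace(repair_hours, hourly_labor_cost, parts_cost, replacement_cost):
    """Compare the cost of repairing a device with the cost of replacing it."""
    repair_cost = repair_hours * hourly_labor_cost + parts_cost
    return "replace" if replacement_cost < repair_cost else "repair"


# A device that fails about every three years and takes
# a hypothetical eight hours to repair.
three_years_in_hours = 3 * 365 * 24
print(f"{availability(three_years_in_hours, 8):.5%}")

# Hypothetical costs: ten hours of labor plus parts versus a new unit.
print(repair_or_replace(repair_hours=10, hourly_labor_cost=80,
                        parts_cost=50, replacement_cost=400))
```

With numbers like these, a long repair time both lowers availability and pushes the decision toward replacement.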
00:00
The last are service level agreements.
00:00
Vendors are going to provide us with
00:00
commitments as to performance and availability.
00:00
A lot of times that revolves around up-time.
00:00
For instance, if I'm storing
00:00
my data with a cloud service provider,
00:00
the redundancy is up to them
00:00
so that they can meet their metrics.
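Since service level agreements often revolve around uptime, it can help to translate an uptime percentage into the downtime it actually allows. This is a minimal sketch; the 99.9 percent figure and the function names are assumptions chosen for the example, not terms from any particular provider's agreement.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return how many hours of downtime an uptime commitment permits."""
    return period_hours * (1 - uptime_percent / 100)


def sla_met(observed_downtime_hours, uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return True when observed downtime stays within the commitment."""
    return observed_downtime_hours <= allowed_downtime_hours(uptime_percent, period_hours)


# A hypothetical 99.9 percent yearly uptime commitment.
print(round(allowed_downtime_hours(99.9), 2))
print(sla_met(observed_downtime_hours=5, uptime_percent=99.9))
```

A commitment of 99.9 percent over a year works out to a little under nine hours of permitted downtime, which is a useful number to compare against how much downtime the organization can tolerate.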
00:00
With redundancy, we want
00:00
it to be comprehensive
00:00
and not focused on
00:00
just redundant data or just redundant drives.
00:00
We really want to make sure that all of our areas are
00:00
redundant because a chain is
00:00
only as strong as its weakest link.
00:00
We start off by talking about redundancy in servers.
00:00
That's just multiple servers performing the same role.
00:00
This is not the same as a cluster
00:00
because with redundant servers,
00:00
each server is its own unique device.
00:00
I have domain controller A,
00:00
and domain controller B,
00:00
or DNS 1, DNS 2.
00:00
If one fails, the second is
00:00
up and running and is available.
00:00
With a cluster, I have
00:00
multiple physical nodes acting
00:00
as a single logical entity.
00:00
They are very tightly coupled.
00:00
For instance, if I have a cluster or a server farm,
00:00
as you sometimes hear them called,
00:00
you're not going to be able to differentiate
00:00
between the servers.
00:00
When I go to amazon.com,
00:00
there are many hundreds of
00:00
machines responding to web queries.
00:00
I don't know which one is which because they are
00:00
functioning as part of the same cluster.
00:00
Clusters usually provide load balancing as well,
00:00
but you can also have clusters that don't load balance.
00:00
They're just simply working for fault tolerance.
00:00
In a smaller environment,
00:00
I might have multiple servers.
00:00
They can either both be responding
00:00
to requests or one could be
00:00
passive while the other is
00:00
the primary responding server.
00:00
There are all sorts of configurations,
00:00
but the real purpose and the
00:00
really important piece is that
00:00
we need fault tolerance for those servers.
00:00
That's where our services run.
00:00
That's where our resources are.
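The two small-environment configurations described above, one server passive behind a primary versus both servers responding, can be sketched roughly as follows. The server names, the health dictionary, and both helpers are hypothetical; a real deployment would rely on its clustering or load-balancing product rather than code like this.

```python
def pick_active_passive(servers, is_healthy):
    """Active/passive: always use the first healthy server in priority order."""
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy server available")


class RoundRobin:
    """Active/active: rotate requests across the healthy servers."""

    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy
        self._next = 0

    def pick(self):
        # Try each server at most once, starting where we left off.
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy server available")


# Hypothetical health state for two redundant DNS servers.
health = {"dns1": True, "dns2": True}
servers = ["dns1", "dns2"]

print(pick_active_passive(servers, health.get))  # primary answers
health["dns1"] = False                           # simulate a failure
print(pick_active_passive(servers, health.get))  # secondary takes over
```

In both cases the failure of one server leaves the service reachable; the difference is only whether the second server was already sharing the load.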
00:00
Many times we use spare equipment.
00:00
I have a switch in the closet.
00:00
When we talk about cold spares, that's exactly it.
00:00
I've got a device somewhere in the closet.
00:00
I can find it.
00:00
I can install it.
00:00
We often have warm spares, which
00:00
are ready to go very quickly.
00:00
Hot-swap or hot spares are
00:00
already installed and it's just a matter
00:00
of switching over to them.
00:00
The value of what's being protected and
00:00
how much downtime we can tolerate are going
00:00
to determine what types of spares we have.
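One way to picture that decision is a simple mapping from tolerable downtime to a spare type. The thresholds below (five minutes and one hour) are purely illustrative assumptions, and the function name is hypothetical; each organization would set its own cutoffs based on what is being protected.

```python
def choose_spare(tolerable_downtime_minutes):
    """Suggest a spare type from how much downtime can be tolerated.

    Thresholds are illustrative assumptions, not industry standards.
    """
    if tolerable_downtime_minutes < 5:
        return "hot"   # already installed, just switch over
    if tolerable_downtime_minutes < 60:
        return "warm"  # ready to go very quickly
    return "cold"      # in the closet, must be found and installed


for minutes in (1, 30, 240):
    print(minutes, choose_spare(minutes))
```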