Service Design KPIs

3 hours 49 minutes
Video Transcription
Welcome to the ITIL framework updated course from Cybrary IT.
My name is Daniel Reilly. I'm your subject matter expert on the ITIL framework.
In this video, I'm going to discuss the KPIs used for service design and monitoring.
So the first one we want to talk about is the percent of services covered by a service level agreement. This is a pretty straightforward one. We're going to take the number of services that we've implemented which have a service level agreement covering them,
and then we're gonna divide that by the total number of services that we have
running and available.
Now, this is a constrained percentage between zero and one,
where we're aiming to get as close to 100% coverage, or a value of 1.0, as we can get.
The next one is the percent of services with
underpinning contracts or operational level agreements. These typically come after
service level agreements and will not be necessary for all services,
but we're going to take the count of services that do have these contracts and divide that by the total number of services that are using them. Again, we'll come up with a constrained percentage between zero and one where, ideally, we would want 100% coverage.
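As a rough illustration of the two coverage KPIs described so far, here is a minimal Python sketch. The function name `coverage_ratio` and all of the counts are made up for the example; the denominator for the second KPI is assumed to be the services that need underpinning contracts or OLAs.

```python
def coverage_ratio(covered: int, total: int) -> float:
    """Return covered / total as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= covered <= total:
        raise ValueError("covered must be between 0 and total")
    return covered / total


# Hypothetical counts, for illustration only.
sla_coverage = coverage_ratio(covered=18, total=24)  # services with an SLA
ola_coverage = coverage_ratio(covered=9, total=12)   # services with OLAs or underpinning contracts

print(f"SLA coverage: {sla_coverage:.0%}")
print(f"OLA/UC coverage: {ola_coverage:.0%}")
```

Both values stay between zero and one, and the closer to one, the better the coverage.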
We're gonna keep a count of the service level agreements which we're reviewing,
and then we're going to take a percentage of our fulfilled SLAs. This
is a count of all the service level agreements that we've met divided by the total number of service level agreements that we made.
This is a constrained metric again, and we're looking for a percentage closer to 100%.
Any SLAs that are out there that are not fulfilled would definitely want to be coming under review.
So here's kind of a graphic of the relationship of
the different service level states that a service can be in. The outer ring is all of the service level agreement states; that's all of the total services.
Inside that we have the covered services, which is the blue circle, the slightly larger one. These are services which have SLAs. Then there's the slightly smaller green circle. These are services which have all of their SLAs fulfilled.
Now the purple circle is services with OLAs and underpinning contracts. And finally, the yellow circle at the bottom is services which are under review.
Now, if I remove the colors from these, you can see they start to form a typical Venn diagram,
and the overlapping sections of the circles define services which have multiple
features that we can describe. For instance, this area describes services which have all of their service level agreements and their underpinning contracts fulfilled.
So these are really good services. However, they're outside the under review circle, so we're not currently looking into these.
This section here is all of those which have fulfilled all of their requirements, have both underpinning contracts and SLAs, and are currently under review. So we're probably looking into these as models for future services.
These are our success stories.
This section, however, is services that are covered by certain SLAs
and OLAs.
However, they have not fulfilled all of their service level agreements. These are trouble areas, and they are inside the under review circle. So these are the trouble areas that we're currently aware of and trying to improve.
Now, ideally,
we would like all of these circles
to be totally filling the total circle.
It would then be very hard to see any relationships, because there wouldn't be any unfulfilled agreements and very little need to be reviewing, so that wouldn't make as good a diagram. So ultimately, we want these circles to be roughly the same size
and completely encompassing each other.
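The regions of the Venn diagram can be sketched with ordinary set operations. The service names and set memberships below are invented for illustration; only the circle definitions come from the discussion above.

```python
# Hypothetical service inventory; every name here is made up.
all_services = {"email", "vpn", "wiki", "payroll", "crm"}
covered = {"email", "vpn", "wiki", "payroll"}   # have an SLA
fulfilled = {"email", "vpn"}                    # all SLAs met
underpinned = {"email", "vpn", "payroll"}       # have OLAs / underpinning contracts
under_review = {"vpn", "payroll", "crm"}        # currently being reviewed

# Fulfilled and underpinned, but nobody is looking at them right now.
healthy_not_reviewed = (fulfilled & underpinned) - under_review

# Fulfilled, underpinned, and under review: the success stories.
success_stories = fulfilled & underpinned & under_review

# Covered by SLAs and OLAs, not fulfilled, and under review: known trouble areas.
known_trouble = ((covered & underpinned) - fulfilled) & under_review

print(sorted(healthy_not_reviewed), sorted(success_stories), sorted(known_trouble))
```

In the ideal case described above, every one of these sets would equal `all_services`, and the intersections would stop telling you anything interesting.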
For capacity management, we're gonna want to keep a count of the capacity shortages. These are the number of times that an automatic
or manual process was triggered to deal with demand on a service that we weren't capable of handling initially.
We're also going to want to keep a count of the deviation from our expectation. Essentially, for different types of capacity, you will have set an expectation of your requirements, and over time you'll measure your actual capacity requirements.
Then expectation minus actual
will give you your utilization score. Now, if you get a number less than zero, that means you're overutilizing that service on a regular basis, and you need to up your capacity expectations accordingly. Zero means you've hit your target exactly. Congratulations.
Anything over
zero is an underutilization, and that means that you had some percentage of your capacity in reserve.
This is typically the case, and generally the one I would recommend aiming for.
Rather than being exactly on target, keep a little bit in reserve.
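The sign convention for this deviation is easy to get backwards, so here is a small sketch of it. The function names and the numbers are hypothetical; the units could be anything you measure capacity in (requests per second, gigabytes, seats).

```python
def capacity_deviation(expected: float, actual: float) -> float:
    """Expectation minus actual, as described above."""
    return expected - actual


def classify(deviation: float) -> str:
    """Interpret the sign of the deviation."""
    if deviation < 0:
        return "over-utilized"
    if deviation == 0:
        return "on target"
    return "under-utilized (capacity in reserve)"


# Hypothetical figures: we planned for 100 units.
print(classify(capacity_deviation(expected=100, actual=120)))  # demand exceeded the plan
print(classify(capacity_deviation(expected=100, actual=80)))   # some capacity held in reserve
```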
We're gonna want to keep a count of the capacity adjustments that we've had to make. This is very similar to the number of capacity shortages.
We're gonna take a percentage of the unplanned capacity adjustments. Of the total capacity adjustments, some percentage may have had to be made ad hoc due to emergency capacity demands.
To find that, we're gonna take the count of unplanned adjustments and divide it by our count of capacity adjustments as a whole.
This will give us a constrained percentage between zero and one where, in this case, we actually want a lower number. We would like to have closer to zero unplanned adjustments, if possible.
Now. Meantime, to shortage recovery is a way of describing the lifetime of capacity management shortages and supposing that you had several events that you want to monitor in terms of shortages, those events will happen.
A system will go down or a demand will rise
and then some amount of time will pass before you can recover and operationally
service those requests for each time this happens, you're going to take the time it takes you to recover and you're going to addle of those times together and then you're going to divide it by the number of times that you had capacity shortages.
Um, now this is an unconstrained number mathematically, but it is one of the things that is often referenced in the service level agreements,
and therefore you aren't constrained to make sure that you can meet your promised goals of recovering in times.
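A minimal sketch of the mean time to shortage recovery calculation, assuming each shortage is logged as a (start, recovered) pair of timestamps. The record layout, the function name, and the sample events are assumptions for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_shortage_recovery(events):
    """Average of (recovered - started) over all logged shortages."""
    if not events:
        raise ValueError("no shortages recorded")
    total = sum((recovered - started for started, recovered in events), timedelta())
    return total / len(events)


# Hypothetical shortage log: (shortage started, service recovered).
shortages = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),     # 30 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0)),  # 60 minutes
    (datetime(2024, 2, 1, 8, 0), datetime(2024, 2, 1, 8, 45)),     # 45 minutes
]

mttsr = mean_time_to_shortage_recovery(shortages)
print(mttsr)
```

The result can then be compared against whatever recovery time the SLA promises.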
So we've got some things that we should talk about in capacity management because there have been misconceptions on the way that the cloud has affected capacity management for organizations.
First off, the auto scaling features that we all have come to know and love today remove the manual work, but not the need to measure the environment. I have seen organizations
put monitoring by the wayside simply because they wrote it in the code and their code is always going to work, and we all know that that's not correct.
So understanding what is happening with our capacity demands in our environment overall will allow us to do better horizontal scaling. Even in a cloud environment where the scaling can be automated very easily, we really want to
understand and analyze what is happening
so that we can optimize.
So instead of removing KPIs, we actually want to add two KPIs, particularly for capacity management in cloud environments. The first is the percent of capacity in reserve when the utilization is at its average, or mean, utilization. There are two ways you can think about the mean
utilization of a service. The first one
is to take the count of the reserve that you have for your service
when the service is at its mean utilization. You can think of this computationally: if your computer is sitting doing its average amount of work on a day, we would take that
amount of computational resources and then divide it by the number of days that we watched. This would give us an average amount of reserves that we have in our resources.
Another way, and oftentimes a simpler number to get at,
is to look at this as the resources reserved at time X, where time X is figured out by the average number of users using a service.
So you look for
a time when
you're not at your maximum usage: you don't have the maximum number of users, and you don't have no users or the minimal number, but a nice middle ground. And you look at your reserve resources at that time.
Oftentimes, this is an easier way to calculate it. They both give you slightly different views,
but they can both be used to help inform you of what you have on hand, on average.
The next one is a related
KPI: the percent of capacity in reserve when your utilization is at max.
Again, this is measured in the same way, only instead of looking at the resources when your service is at its mean utilization, we're gonna find the maximum point of utilization and look at the reserves we have at that point.
We're going to use this to inform our budgeting. Based on the use of the auto scaling, we can learn to optimize
how much we need to keep in reserve. So if we are overutilizing the service, we may need to make a budget request to get more resources on hand. Or if we're underutilizing a service, perhaps we can tweak our auto scaling so that it's more efficient.
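One way to sketch the two reserve KPIs, under the assumption that monitoring gives you periodic samples of provisioned capacity and used capacity. The sample data and names are hypothetical. The sketch shows both readings of "reserve at mean utilization" (the average reserve over the whole observation window, and the reserve at the sample closest to average usage) plus the reserve at peak usage.

```python
from statistics import mean

# Hypothetical monitoring samples: (provisioned capacity, used capacity).
samples = [(100, 40), (100, 60), (120, 90), (100, 50)]


def reserve_fraction(provisioned, used):
    """Share of provisioned capacity that was sitting idle."""
    return (provisioned - used) / provisioned


# Reading 1: average reserve across every sample we watched.
average_reserve = mean(reserve_fraction(p, u) for p, u in samples)

# Reading 2: reserve at "time X", the sample whose usage is closest to the mean usage.
mean_used = mean(u for _, u in samples)
typical_sample = min(samples, key=lambda s: abs(s[1] - mean_used))
reserve_at_mean = reserve_fraction(*typical_sample)

# Reserve at the point of maximum utilization.
peak_sample = max(samples, key=lambda s: s[1])
reserve_at_max = reserve_fraction(*peak_sample)

print(f"{average_reserve:.1%} {reserve_at_mean:.1%} {reserve_at_max:.1%}")
```

As noted above, the two mean-based readings generally give slightly different numbers, which is fine as long as you use one of them consistently.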
Security is going to use our capacity management information to look for malicious use. Oftentimes, spikes in capacity usage
can be indicators of certain types of security risks, such as denial of service attacks, or also just bad guys storing bad stuff in your corporate network where you don't want them. Operations is going to use this information, of course, to
inform their operational capacity adjustments
and to plan for future capacity needs.
So when we talk about on-site capacity planning, this oftentimes comes in some form of server or VM server automation,
and when we talk about off-site, this is in the cloud and out of our control. These oftentimes come in the form of automatic instantiation of instances
or microservice clusters that you can use specifically, and then they'll shut down when their utilization drops off.
Now, as I mentioned for security, we're going to use this capacity management to plan for attacks like denial of service attacks, the specifics of which we won't get into in this class. But it's important to understand that in our capacity management planning,
we need to be aware of these types of attacks, and we have to have a plan to deal with them.
There are a lot of third-party services, and one that just comes to mind is Cloudflare. They can help mitigate those types of attacks on your capacity.
Now, the first KPI we're gonna look at for availability management is the count of service interruptions. These are going to be due to technical failings on our part and the inability to deliver a service.
We're going to take the mean time to recovery. This is the time it takes from the noticing of a failure in the availability of a system until we're able to bring that system back up to normal operations.
We're going to sum these times, and we're gonna divide by the number of incidents that have occurred. This will give us the average time that it takes us to recover from an availability incident.
Now, this is oftentimes detailed in the service level agreement, so we will very often have a mean time to recovery goal that we're trying to beat.
We're going to keep a deviation count from our availability agreement. We take the actual availability that we have recorded and subtract the amount of availability that we promised. This is going to give us some number in the negative
range if we underdelivered the service, so this would be indicative of not meeting our service level agreements for availability, whereas a number
greater than zero indicates an overdelivery, and this is generally going to be the goal. We would like to have our service available for more than the minimum that we have promised.
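Here is a small sketch of the availability deviation. It is computed as actual minus promised so that a negative result means under-delivery and a positive result means over-delivery, matching the interpretation given above. The downtime figure and SLA targets are invented for the example.

```python
def availability(total_minutes: int, downtime_minutes: int) -> float:
    """Fraction of the period during which the service was up."""
    return (total_minutes - downtime_minutes) / total_minutes


def availability_deviation(promised: float, actual: float) -> float:
    """Negative means we under-delivered; positive means we over-delivered."""
    return actual - promised


# Hypothetical 30-day month with 90 minutes of recorded downtime.
minutes_in_month = 30 * 24 * 60
actual = availability(minutes_in_month, downtime_minutes=90)

strict = availability_deviation(promised=0.999, actual=actual)   # a "three nines" promise
relaxed = availability_deviation(promised=0.99, actual=actual)   # a "two nines" promise

print(f"actual={actual:.4%} strict={strict:+.4%} relaxed={relaxed:+.4%}")
```

With these numbers the same month of operation misses the stricter promise and beats the looser one, which is why the promised figure matters as much as the measured one.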
We're going to keep a count of the percentage of components that are monitored. Now, the number of components in a service or in a network delivery of a service varies greatly, but they're all going to be defined in the service design plan that comes out of the
service design processes.
We're going to keep a count of the number of monitored components for a service, and then we're gonna divide that by the total number of components. Ideally, we're going to want this constrained percentage between zero and one to be as close to one as possible.
Now, talking about the life cycle of availability events, this graphic shows some different measurements that you might see. Time is moving along the blue arrow from left to right, and at some point in time there is a system failure.
We notice it through our processes, we fix whatever caused the failure, and the service resumes normal operation.
The time between the system failure and the system resuming normal operations is the time to repair, or the time to recovery. This is what feeds our mean time to recovery KPI later.
Now, as the service starts normal operation and goes along in time again, it may experience its next system failure.
The time between resuming normal operations and the time that it fails is called the time to failure. Another KPI that may be monitored is the mean time to failure. This is the average amount of time that the service will run
before expecting or experiencing some sort of system failure.
Now, looking at the large time gap between these two system failure events, the time from the start of one system failure to the start of the next system failure is known as the time between failures. For a lot of components with a known lifespan,
we will measure the mean time between failures. This is
the average amount of time in between failures
of that particular service.
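The three lifecycle measurements can be derived from a single incident timeline. This is a sketch under the assumption that each incident is logged as a (failed at, restored at) pair, here in hours since monitoring began; the numbers are made up.

```python
from statistics import mean

# Hypothetical incident log: (hour the failure started, hour normal operation resumed).
incidents = [(100, 104), (300, 302), (620, 626)]

# Time to repair/recovery: failure start until normal operation resumes.
time_to_repair = [restored - failed for failed, restored in incidents]

# Pair each incident with the one that follows it.
consecutive = list(zip(incidents, incidents[1:]))

# Time to failure: normal operation resumes until the next failure starts.
time_to_failure = [nxt[0] - cur[1] for cur, nxt in consecutive]

# Time between failures: start of one failure until the start of the next.
time_between_failures = [nxt[0] - cur[0] for cur, nxt in consecutive]

mttr = mean(time_to_repair)
mttf = mean(time_to_failure)
mtbf = mean(time_between_failures)

print(mttr, mttf, mtbf)
```

Note that the repair list has one entry per incident, while the other two lists have one entry per gap between incidents, so they are averaged over different counts.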
For service continuity management, we're gonna want to keep a percentage of the services which are covered by some sort of continuity plan in the case of serious disasters. This is a fairly straightforward one. We're going to keep a count of the services that have a plan,
and we're gonna divide it by the total number of services
that we have.
This is, of course, a constrained percentage between zero and one where, ideally, we'd like to be as close to one as we can get.
We're gonna keep a count of the continuity plan gaps that we've identified. These are major perceived incident threats that we haven't defined any kind of countermeasures or mitigations for. These refer to major events,
earthquakes and tornadoes and things like that,
generally not considered to be malicious in nature, but still very severe and needing to be dealt with.
We're gonna take a percentage of those which are considered critical gaps. This is just the count of gaps which have been defined as critical, divided by the total count of gaps that we've identified.
Now, the mean time to plan is a life cycle measurement, which is the sum of the days
for which a major event
has been identified but left unprotected,
and then we divide that by the number of risks that we have planned
for over those days.
This describes the average life span that a risk is allowed to remain unmitigated in our environment, so we would like our mean time to plan to be lower if possible.
We're gonna keep a count of the disaster drills that we perform. These could be anything from fire drills to tornado drills,
depending on what you specifically need
awareness for in your organization and your area.
But we're going to keep a count of those.
And then for information security management, it's important to understand
that the KPIs that we talk about here
only represent a small fraction.
There are several schools of thought on measuring security,
and these are just a few of the ones that I see fairly commonly.
There are books and entire classes taught on this subject.
So the first we're gonna talk about is the count of identified vulnerabilities. As a best practice, you should be using vulnerability identification in your network, usually in the form of some vulnerability scanner, and you're gonna want to keep a count of the identified vulnerabilities.
They can also come through other sources, like vulnerability testing, which we'll talk about in a moment as well.
Of that count, we're going to keep the percentage which are unpatched vulnerabilities. It may not always be possible to patch every known security vulnerability, and it also might not be cost effective even if a patch is known to exist.
So we're going to take the count of vulnerabilities without a mitigation for them, and we're going to divide that, of course, by the total known vulnerabilities.
We're gonna keep a percentage of unpatched critical vulnerabilities, which is another sub-percentage. Some percentage of our unpatched vulnerabilities
will also be marked as critical vulnerabilities. This is a count of the critical vulnerabilities without mitigations divided by the total count of unpatched vulnerabilities.
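The two percentages nest, and the denominators differ, which a short sketch makes explicit. The `Vulnerability` record and the sample findings are hypothetical; a real scanner export will look different.

```python
from dataclasses import dataclass


@dataclass
class Vulnerability:
    ident: str
    critical: bool
    mitigated: bool


# Hypothetical scanner findings.
known = [
    Vulnerability("V-1", critical=True, mitigated=True),
    Vulnerability("V-2", critical=True, mitigated=False),
    Vulnerability("V-3", critical=False, mitigated=False),
    Vulnerability("V-4", critical=False, mitigated=True),
    Vulnerability("V-5", critical=False, mitigated=False),
    Vulnerability("V-6", critical=False, mitigated=True),
    Vulnerability("V-7", critical=False, mitigated=False),
    Vulnerability("V-8", critical=True, mitigated=True),
]

unpatched = [v for v in known if not v.mitigated]
unpatched_critical = [v for v in unpatched if v.critical]

# Unpatched as a share of everything we know about.
pct_unpatched = len(unpatched) / len(known)

# Critical as a share of the unpatched ones only (not of all known vulnerabilities).
pct_unpatched_critical = len(unpatched_critical) / len(unpatched) if unpatched else 0.0

print(f"{pct_unpatched:.0%} unpatched, {pct_unpatched_critical:.0%} of those critical")
```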
The mean time to patch is again a life cycle management KPI, which is the sum of the days that
a vulnerability is left unpatched, from the time it's known to exist in the network to the time a mitigation is reached for it. Now, "patch" is kind of a misnomer here.
A mitigation strategy can involve just accepting a risk. So if you
go three days and then you decide that you can't patch or you don't need to patch a particular vulnerability,
then that would go in here as reaching its mitigation, or being quote-unquote patched.
So we're gonna take the sum of those days unpatched, and we're going to divide it by the number of services that we've patched vulnerabilities
on. Again, this describes the life cycle of a vulnerability within your network, and we want to get our mean time to patch lower, using automatic patching frameworks and other things to help us.
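A sketch of mean time to patch that treats an accepted risk as a mitigation, as described above. Two assumptions are baked in and are not from the lecture: the average is taken per resolved vulnerability rather than per service, and vulnerabilities that are still open are left out of the average. The record layout and identifiers are made up.

```python
from datetime import date

# Hypothetical vulnerability records.
records = [
    {"id": "VULN-1", "found": date(2024, 3, 1), "resolved": date(2024, 3, 4), "resolution": "patched"},
    {"id": "VULN-2", "found": date(2024, 3, 2), "resolved": date(2024, 3, 5), "resolution": "risk accepted"},
    {"id": "VULN-3", "found": date(2024, 3, 10), "resolved": date(2024, 3, 19), "resolution": "patched"},
    {"id": "VULN-4", "found": date(2024, 3, 15), "resolved": None, "resolution": None},  # still open
]

# Anything with a resolution date counts as "patched", including accepted risks.
closed = [r for r in records if r["resolved"] is not None]
days_unpatched = [(r["resolved"] - r["found"]).days for r in closed]

mean_time_to_patch = sum(days_unpatched) / len(days_unpatched)
print(f"Mean time to patch: {mean_time_to_patch:.1f} days")
```

Leaving open items out flatters the number, so it is worth tracking the count of open vulnerabilities alongside it.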
Now we're gonna keep a count of the security-related incidents that we've detected. This can come in many forms, either reported incidents from users or the number of incidents logged by our security incident and event management software.
Then we're gonna keep a count of those related incidents that we've responded to, since it is not practical to respond to all incidents and not all reported incidents will in fact be valid security incidents.
We're going to keep a count of the security tests that we perform. These are the malicious activity events where we're going to
try to break into our own software and pretend to be the bad guys.
There is a whole separate set of KPIs that are used for
monitoring and leveling security tests, which we're not going to get into here, but overall, you'll use the results of those metrics to inform other metrics for this section of information security.
We're going to take a percentage of the staff that's been certified in some form of security knowledge. Now, this doesn't have to be
a big
information security manager certification or anything of that sort. Simple knowledge courses and exposure to ideas is plenty.
So we're gonna take the count of our staff that have some kind of security awareness training and divide that by the total number of staff that we have in general. That will give us a constrained percentage between zero and one where, ideally, we would want this again to be as close to one as possible,
because the more people with security awareness training in our organization, the better our security stance will be.
So again, I would like to bring up Goodhart's law and show a more concrete example of how it can affect our best intentions with KPIs.
Suppose for a moment that you want to make one of these measurements, the count of critical vulnerabilities discovered, a KPI target to help drive the number of discovered vulnerabilities in one way or another. What might happen?
Pause for a moment
and think about that.
So what I came up with is, if we're aiming to lower the number of critical vulnerabilities, then what we're actually doing is discouraging developers from classifying bugs or vulnerabilities that they find as critical in the first place.
To lower the number, they simply stop reporting things as critical.
This makes it hard to find actual critical bugs, as everything will come in under the next level. So maybe if it's not critical, it's medium, and a lot of your critical bugs will now get mislabeled as medium and be hidden amongst a lot of noise.
On the opposite side, if you're aiming to raise the number of critical vulnerabilities that have been discovered,
this encourages overreaction and over-adjusting, so that you now mislabel even minor vulnerabilities as critical.
These are real-world problems that are faced by bug bounty programs. Several companies have tried to drive users by rewarding them for the severity of bugs that they are able to uncover and report,
and this has not always worked in their favor.
This cartoon here is a fun little example.
Now, for supplier management,
we're going to keep a count of the signed supplier contracts that we have. This is pretty self-explanatory,
along with a count of the contracts that we have under review. These are contracts that we currently have signed that we're reviewing either for renewal or termination.
Now we're going to keep count of the identified breaches of contract, and we'll usually keep this by service provider. That way, if we have multiple contracts with a single provider,
we're totaling the number of identified breaches. Ideally, we'd like to keep this at zero.
We're going to keep a mean count of the vendors considered per contract, which is just the total number of providers that we've considered, divided by the total number of contracts that we have active.
And the reason that we're gonna want to keep that is it helps ensure that we're considering our options when we're bringing on a service. We don't want to get tunnel vision and continue to go with one service provider simply because we've gone with them in the past.
It is important to keep vendor relationships in mind when you're making these decisions.
But you would like to consider an average number of providers before making any contractual obligations.
With that, we've come to the end of this video. I'd like to thank you for watching. And as always, if you have any questions, you can contact me on Cybrary.it. My username is twarter, T-W-A-R-T-E-R.
Axelos ITIL Foundations

This ITIL Foundation training course is for beginners and provides baseline knowledge for IT service management. It is taught by Daniel Reilly, one of our many great cyber security knowledge instructors who contribute to our digital library.