Balancing Data Requirements

Time
3 hours 22 minutes
Difficulty
Intermediate
CEU/CPE
4
Video Transcription
00:00
>> Welcome to Module 3, determining data requirements.
00:00
In this module, we'll describe
00:00
various data types and originating data sources,
00:00
as well as how to map analytics to required data.
00:00
We will also demonstrate how to research
00:00
logging data sources and
00:00
how to create a data collection plan.
00:00
Previously, we studied the behaviors
00:00
that adversaries exhibit and how to create
00:00
hypotheses and abstract analytics
00:00
based on those behaviors in Modules 1 and 2.
00:00
This step of the methodology uses those hypotheses and
00:00
abstract analytics to determine the data that
00:00
we need to collect in order to test those hypotheses.
00:00
This step will result in a set of
00:00
data collection requirements linked
00:00
to the behaviors they help to detect,
00:00
which will be used later on in the methodology.
00:00
Welcome to Lesson 3.1, balancing data requirements.
00:00
In this lesson, we will discuss
00:00
implications for analytic development when
00:00
determining data requirements and discuss
00:00
data collection considerations with
00:00
respect to time, terrain, and behavior.
00:00
When developing data collection requirements,
00:00
we will need to consider recall and
00:00
precision as discussed in previous modules.
00:00
As a reminder, precision is a measure of how
00:00
few false positives are generated using our approach.
00:00
Recall, on the other hand,
00:00
is a measure of how many true positives we
00:00
detect relative to
00:00
how many truly malicious events there are.
00:00
The ideal outcome would be perfect precision,
00:00
which means no false positives,
00:00
and that every selected item is
00:00
relevant and perfect recall,
00:00
which means no false negatives as
00:00
we're detecting all of the malicious activity.
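The two definitions above can be turned into a small calculation. This is a minimal sketch with made-up counts; the function name and the numbers are illustrative assumptions, not part of the course material.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) computed from raw detection counts."""
    flagged = true_positives + false_positives     # everything the analytic alerted on
    malicious = true_positives + false_negatives   # everything that was truly malicious
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / malicious if malicious else 0.0
    return precision, recall


# Hypothetical counts: 6 malicious events caught, 2 benign events flagged,
# 4 malicious events missed.
precision, recall = precision_recall(6, 2, 4)
print(precision, recall)  # 0.75 0.6
```

Perfect precision corresponds to zero false positives, and perfect recall to zero false negatives.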
00:00
These two aspects of detection
00:00
are often in tension with one another.
00:00
Improving precision often comes
00:00
at the expense of recall,
00:00
as we're narrowing the scope of what is detected.
00:00
Doing that can help reduce false positives,
00:00
but may also cause
00:00
some false negatives or missed detections.
00:00
On the other hand, improving recall often causes
00:00
reduced precision as we are increasing
00:00
the detection scope in order to minimize false negatives.
00:00
Although we often think about precision and recall in
00:00
terms of analysis and specific analytics,
00:00
it's also a significant factor in data collection.
00:00
Essentially, configuring a sensor
00:00
is like creating a first-stage analytic.
00:00
The configuration ends up filtering out
00:00
some data while sending
00:00
the rest to the analytic platform.
00:00
It suffers from the same tension
00:00
between precision and recall that analytics do.
00:00
If the data collection is too broad,
00:00
it could cause an overload at the system level,
00:00
such as exceeding the bandwidth
00:00
available to send the data back for
00:00
analysis or at the cognitive level for the analyst.
00:00
For example, simply collecting
00:00
all possible Sysmon events without
00:00
any filtering is likely to generate
00:00
much more data than it's worth for detection.
00:00
However, filtering out too much data through
00:00
a narrow configuration might reduce false positives,
00:00
but also blind the analytics to
00:00
important information needed for detection.
00:00
Every exclusion in the data collection configuration
00:00
is potentially a place for an adversary to hide.
00:00
Similarly, an overly specific collection scope can create
00:00
brittle instrumentation that allows
00:00
an adversary to more easily evade detection.
00:00
As another example,
00:00
configuring a sensor to only collect process
00:00
access events when the source image
00:00
is exactly one specific path to
00:00
an executable would be brittle to an adversary
00:00
changing the name or location of the executable they use.
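The brittleness described above can be illustrated with a small sketch. The event records, field names, paths, and the choice of target process are all hypothetical and do not reflect a real sensor schema; the point is only to contrast an exact-path filter with one keyed on the behavior.

```python
import ntpath

# Hypothetical process-access events (illustrative field names and paths).
EVENTS = [
    {"source_image": r"C:\Tools\dumper.exe",
     "target_image": r"C:\Windows\System32\lsass.exe"},
    {"source_image": r"C:\Users\Public\renamed.exe",
     "target_image": r"C:\Windows\System32\lsass.exe"},
    {"source_image": r"C:\Windows\System32\svchost.exe",
     "target_image": r"C:\Windows\System32\notepad.exe"},
]


def brittle_filter(event):
    """Collect only when the source image is one exact path."""
    return event["source_image"].lower() == r"c:\tools\dumper.exe"


def behavior_filter(event):
    """Collect based on what is accessed, whatever the accessing tool is called."""
    return ntpath.basename(event["target_image"]).lower() == "lsass.exe"


brittle = [e for e in EVENTS if brittle_filter(e)]
robust = [e for e in EVENTS if behavior_filter(e)]
print(len(brittle), len(robust))  # 1 2
```

In this toy data, the renamed executable slips past the exact-path filter but is still collected by the filter keyed on the accessed target.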
00:00
It helps to really understand
00:00
the data that you're working with and
00:00
what implications for precision and
00:00
recall your proposed data collection scheme may have.
00:00
In addition to good insight into
00:00
the malicious behavior you're hunting,
00:00
it helps to be familiar with
00:00
the common data sources
00:00
available for the platform you're hunting on.
00:00
We will review a few of these later in this module.
00:00
It will also help to keep
00:00
the three dimensions of time, terrain, and behavior
00:00
in mind while determining
00:00
your data collection requirements.
00:00
We need to be sure our data collection plan will
00:00
ensure the right information is
00:00
collected while it is available,
00:00
keeping in mind that an adversary may try to
00:00
erase their tracks and delete their activity from logs.
00:00
We will also need to think
00:00
through all the terrain that needs
00:00
monitoring across different levels of granularity.
00:00
Are we monitoring the entire enterprise?
00:00
Are we seeing network traffic between hosts,
00:00
even if it doesn't cross the enterprise boundary?
00:00
Will we get insight into
00:00
the interprocess communication on the endpoints?
00:00
Are we covering all the different device types
00:00
and technology types within our network?
00:00
These are examples of key questions
00:00
to consider in this process.
00:00
Finally, we need to ensure that
00:00
we are collecting the data associated
00:00
with the behaviors that we're seeking to
00:00
detect along with whatever contextual data we will
00:00
need in order to differentiate malicious from
00:00
benign behavior and to trace causal chains of events.
00:00
First, let's consider
00:00
the time-based elements of data collection.
00:00
We need to ensure that data is
00:00
collected when the adversary acts.
00:00
This will help detect the activity as soon as
00:00
possible and increase our chances of
00:00
taking effective action before
00:00
the adversary has a chance to accomplish their end goals.
00:00
Collecting during the event will
00:00
also mitigate the risk that
00:00
the adversary will go back and
00:00
erase their activity from logs.
00:00
In addition to collecting when
00:00
the malicious activity happens,
00:00
we will also want to collect during
00:00
benign operations to tune
00:00
our analytics and reduce
00:00
false positives as well as establish trends.
00:00
Since we don't know in advance
00:00
when the adversary will act,
00:00
this usually means we will need
00:00
to continuously collect data.
00:00
Continuously collecting data comes
00:00
with some implementation cost,
00:00
and it is sometimes impractical if, for example,
00:00
some of the components of the enterprise are disconnected
00:00
from central log collection for periods of time.
00:00
Within the constraints of your environment though,
00:00
continuous collection is preferable.
00:00
In addition to collecting at the right time,
00:00
we need to ensure that we're also collecting
00:00
data from the right places in our network.
00:00
Consider where in the network the behavior will be
00:00
visible and plan to collect data there.
00:00
Certain techniques are more visible within
00:00
an enclave than at the network perimeter,
00:00
while other techniques will only
00:00
be visible at the endpoints.
00:00
You should also think about what types of devices and
00:00
operating systems are relevant
00:00
for the behavior in question.
00:00
As an example, detecting
00:00
task scheduling will be optimized
00:00
by collecting host-based event logs
00:00
from all enterprise machines,
00:00
including both servers and endpoints,
00:00
since that's where the behavior can occur.
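One way to check terrain coverage for an example like this is to compare an asset inventory against the hosts that are actually sending logs. The sketch below assumes a hypothetical inventory and host naming scheme; neither comes from the course.

```python
def coverage_gaps(inventory, reporting):
    """Return inventoried hosts that are not sending logs, grouped by device type."""
    gaps = {}
    for host, device_type in inventory.items():
        if host not in reporting:
            gaps.setdefault(device_type, []).append(host)
    return {device_type: sorted(hosts) for device_type, hosts in gaps.items()}


# Hypothetical inventory (host -> device type) and set of hosts forwarding event logs.
INVENTORY = {"ws-01": "endpoint", "ws-02": "endpoint",
             "srv-01": "server", "srv-02": "server"}
REPORTING = {"ws-01", "srv-01", "srv-02"}

print(coverage_gaps(INVENTORY, REPORTING))  # {'endpoint': ['ws-02']}
```

Any host listed in the result is a place where task scheduling could occur without being collected.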
00:00
Finally, think through the behaviors and
00:00
contextual data required to detect the activity,
00:00
distinguish malicious from benign behavior,
00:00
and connect the detection to the larger adversary campaign.
00:00
In addition to event types,
00:00
which fields will be required or useful?
00:00
Focusing on low-variance behaviors
00:00
will help balance precision and
00:00
recall during collection and
00:00
create a more robust instrumentation plan.
00:00
However, the behaviors may
00:00
expand based on actual detection requirements.
00:00
In summary, balancing recall
00:00
and precision when developing
00:00
data requirements calls for a strong understanding
00:00
of your data environment.
00:00
This understanding will provide a good foundation as you
00:00
consider variables such as time, terrain, and behavior.