Diving into Data Sources

Video Transcription
Hello and welcome to lesson 3.2: Diving into Data Sources. In this lesson, we will list and compare common data and collection sources, as well as analyze data sources for analytic suitability. We will also discuss considerations for making data easier to work with. Although our methodology is driven by analysis of adversary behaviors, it's very useful to have a solid foundational understanding of the common data sources associated with the technologies and platforms that you're hunting on. This makes it much faster and easier to determine a collection strategy. Keep in mind, though, that you may be biased toward data sources that you're more familiar with. Be sure to fully explore all of your options when developing your collection strategy.

To facilitate the examples in this course, we will look at data sources relevant to Windows and IP networks. If you're looking at macOS, Linux, ICS, or mobile environments, you should get familiar with the typical data sources available for those technologies. Given the concepts developed when thinking through time, terrain, and behavior considerations, you can start mapping those onto the data sources available. What's available can change over time and differs across technologies. We will cover just a few representative examples in this lesson. Keep in mind that you may have data sources that seem to provide essentially the same kind of data, but often there are differences that matter, such as the metadata fields they provide, or whether they offer continuous monitoring or are designed for snapshots. You will have to apply the concepts from this module to your situation and adapt them as needed.
For a typical enterprise network, there are some common data collection sources that we can leverage to build our detection analytics. From various devices in our environment, we can collect data about items such as processes and files, system and application logs, and network traffic. ATT&CK data sources represent the various subjects and topics of information that can be collected by sensors and logs, which can be helpful in mapping out what data collection is possible for a given technique. ATT&CK data sources also include data components, which identify specific properties and values of a data source relevant to detecting a given technique or sub-technique. As you continue to research data sources, you'll find that for each of these data types, there may be one or more collection sources for that type of information.
For example, Windows Security event logging and Sysmon both provide information about process and file activity for Windows. For Linux, you can configure auditd to collect similarly detailed host-based information, and there are some open-source repositories of auditd configurations, already mapped to ATT&CK techniques, that you can use as a starting point. You can use a library like Npcap or libpcap for packet capture on Linux, macOS, or Windows, and you can use a tool like Zeek to ingest that PCAP, run analyses on it, and produce a log of events.
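As a quick illustration (not from the course), here is a minimal Python sketch of that PCAP-to-events pipeline; it assumes `zeek` is on the PATH and a capture file named `traffic.pcap` exists in the working directory:

```python
import subprocess
from pathlib import Path

# Run Zeek against an existing capture; by default it writes ASCII logs
# (conn.log, dns.log, ...) into the current working directory.
subprocess.run(["zeek", "-r", "traffic.pcap"], check=True)

# Zeek logs are tab-separated; the '#fields' header line names the columns.
columns = []
for line in Path("conn.log").read_text().splitlines():
    if line.startswith("#fields"):
        columns = line.split("\t")[1:]
    elif not line.startswith("#"):
        record = dict(zip(columns, line.split("\t")))
        print(record["id.orig_h"], "->", record["id.resp_h"], ":", record["id.resp_p"])
```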
Here's an example of one of the more useful events produced by Sysmon: Event Code 1, process creation. This event is triggered whenever a process is created, as the name implies, and it provides valuable information about that new process. Within the event data section of Sysmon events such as this one, there are a few fields that are commonly present. The RuleName field is defined in the config file and can be used to label the event, providing some context about why the configuration was set for this event to fire. One example use for this field would be to include the ATT&CK technique or techniques that drove the inclusion of this event in the data collection strategy.

Each event will also record the UTC time that the event was triggered, as well as the ID of the process that created the event. Bear in mind that the process ID is the ID used by the operating system and can be recycled once the Windows process terminates; the ProcessId field, by itself, is not guaranteed to be unique. However, Sysmon also provides a ProcessGuid, or globally unique ID, derived from the process ID, the machine GUID, the process start time, and the token ID associated with the process. It is extremely unlikely for this ProcessGuid to be repeated accidentally, so it is a more robust way to track processes across time and machines.
Sysmon also provides information about the path to the file in question in the Image field; in the case of an executable, it also provides the original file name. The Image field provides some additional context about where the file is stored on disk, but it will change if the file or executable is moved. On the other hand, OriginalFileName is derived from the portable executable header and will stay consistent even if the file is moved to a new location. Similar to ProcessGuid and ProcessId, these two fields seem alike, but they have different attributes and each is more appropriate for different use cases.

Finally, Sysmon can be configured to compute and log various hash values of the image associated with an event. Monitoring these hash values can be combined with the use of OriginalFileName, for example, to make it much more difficult for an adversary to modify legitimate images or rename their own to blend in.
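Here's a minimal sketch of that rename check (not from the course), assuming the Sysmon event has already been parsed into a dict keyed by Sysmon's field names:

```python
from pathlib import PureWindowsPath

def looks_renamed(event: dict) -> bool:
    """Flag executables whose on-disk name differs from the PE header's OriginalFileName."""
    on_disk = PureWindowsPath(event["Image"]).name.lower()
    original = event.get("OriginalFileName", "").lower()
    return bool(original) and on_disk != original

# Hand-made example: an executable dropped under a look-alike name.
event = {
    "Image": r"C:\Users\victim\AppData\Local\Temp\svch0st.exe",
    "OriginalFileName": "schtasks.exe",
}
if looks_renamed(event):
    print("Possible renamed binary:", event["Image"],
          "claims to be", event["OriginalFileName"])
```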
Events specific to process creation include the full command line associated with the process creation, which can provide great insight into the command-line activity and help distinguish malicious from benign usage. There's also valuable information about the parent process that created the process, including its globally unique ID and its command-line arguments. Sysmon also provides useful information about the session and integrity level associated with the process. Each of these fields can help distinguish malicious activity from benign and provide context during an investigation.
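To make the field layout concrete, here is a minimal sketch of pulling a few of these fields out of a Sysmon event rendered as Windows event XML; the sample XML is abbreviated and hand-made, but the namespace and the EventData/Data structure match how Windows renders event XML:

```python
import xml.etree.ElementTree as ET

# Windows renders events with this default namespace; Sysmon data lives
# under EventData as <Data Name="..."> elements.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

raw = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="CommandLine">schtasks /create /tn backup /tr calc.exe /ru SYSTEM</Data>
    <Data Name="ParentImage">C:\\Windows\\System32\\cmd.exe</Data>
    <Data Name="IntegrityLevel">High</Data>
  </EventData>
</Event>"""

root = ET.fromstring(raw)
fields = {d.get("Name"): d.text for d in root.findall(".//e:Data", NS)}
print(fields["CommandLine"], "| parent:", fields["ParentImage"])
```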
In addition to Sysmon, Windows has extensive native logging capability with a lot of different and valuable event types, which tend to provide complementary value to what Sysmon provides. There are far too many event types to review in this course; we recommend you read the documentation on Sysmon, Windows event logging, and any other logging capabilities available in your environment. A sample of interesting native Windows log events includes logon events, registry activity, process creation, network activity, service creation, and events specific to scheduled task creation, which we'll use in our examples later. Just like with Sysmon, each of these events is populated with additional fields and values, providing more context about the event to help with analysis.
In step two of this methodology, we investigated the technique through research and emulation of the behavior, during which we observed several data and log events that could be potentially useful for detection. Now, let's look more closely at the logs generated by this activity. Windows event code 4698 will be triggered whenever a task is scheduled, including in benign situations. If we break up the technique like we did in hypothesis decomposition, we can analyze the differences in the 4698 field values for the various use cases.
As an example, if we're looking at task scheduling that specifies a different user to run the task as, we can look at the 4698 fields that change when that occurs, relative to scheduling a task to run as the current user. After implementing the behavior a few times in both scenarios, we notice that the RegistrationInfo/Author field seems to document the user scheduling the task, while the Principal XML element seems to document the user to run the task as. Based on this insight, we can refine our abstract analytic to focus on detecting when those two specific XML values are different.
So far, our investigation has been based on our own experimentation and implementation of the behavior. Before moving forward with this theory, let's validate it further by looking at the behavior of tasks that are scheduled by others, such as those scheduled by Microsoft applications. It turns out that the RegistrationInfo/Author attribute is not always set, and it is sometimes set to the name of the vendor rather than the username. If we write a detection analytic that alerts on the RegistrationInfo/Author field being different from the Principal/UserID, it will alert on these typical benign activities and cause false alarms.
Further investigation reveals two other relevant fields in the 4698 event, called SubjectUserName and SubjectDomainName, which seem to consistently represent the user and domain scheduling the task, both for our experimental implementations and for benign tasks scheduled by Windows applications. Based on this further research, we can update our abstract analytic again to focus on SubjectDomainName\SubjectUserName being different from the user value within the Principal XML element.
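Here's a minimal sketch (not from the course) of that refined analytic, assuming the 4698 event's fields have been parsed into a dict and that TaskContent holds the scheduled task's XML, as it does in the 4698 EventData:

```python
import xml.etree.ElementTree as ET

# Namespace used by the scheduled task XML embedded in the 4698 TaskContent field.
TASK_NS = {"t": "http://schemas.microsoft.com/windows/2004/02/mit/task"}

def mismatched_runas(event: dict) -> bool:
    """Alert when the account scheduling the task differs from the run-as user."""
    scheduler = f'{event["SubjectDomainName"]}\\{event["SubjectUserName"]}'.lower()
    task = ET.fromstring(event["TaskContent"])
    # In real events UserId may be a SID rather than DOMAIN\user, so a production
    # analytic would normalize both sides before comparing.
    user_id = task.findtext(".//t:Principals/t:Principal/t:UserId",
                            default="", namespaces=TASK_NS)
    return bool(user_id) and user_id.lower() != scheduler

# Hand-made example record; real 4698 events carry the full task XML in TaskContent.
event = {
    "SubjectDomainName": "CORP",
    "SubjectUserName": "alice",
    "TaskContent": """<Task xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
      <Principals><Principal><UserId>NT AUTHORITY\\SYSTEM</UserId></Principal></Principals>
    </Task>""",
}
print(mismatched_runas(event))  # True: alice scheduled a task to run as SYSTEM
```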
In Module 2, we identified several low-variance behaviors associated with task scheduling on Windows, such as job file creation, task registry key creation, network traffic to the Task Scheduler service for remote scheduling, and certain DLL loads, summarized in the table shown. Now, in step three, we can add data sources to this view. Each of these low-variance behaviors is associated with one or more log events or network activity sequences. We can start by listing those potential data sources and then documenting which fields and values we'd want to access for analytic filtering. For example, if we're monitoring process creation events, we would probably want to filter on a field like Image or OriginalFileName to narrow results down to just those executables associated with task scheduling, like at or schtasks. Similarly, we have insight into specific DLL names, port numbers, registry key names, and directory names associated with these low-variance behaviors. We want to be sure that we are collecting the required fields for each of these events.
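A minimal sketch of that kind of filter, assuming pre-parsed process creation events (the two executable names are the real Windows task scheduling binaries; the helper itself is illustrative):

```python
from pathlib import PureWindowsPath

# Executables associated with Windows task scheduling.
TASK_SCHED_IMAGES = {"schtasks.exe", "at.exe"}

def is_task_scheduling(event: dict) -> bool:
    """Match the on-disk name or the PE header name, so a rename alone doesn't evade the filter."""
    image = PureWindowsPath(event.get("Image", "")).name.lower()
    original = event.get("OriginalFileName", "").lower()
    return image in TASK_SCHED_IMAGES or original in TASK_SCHED_IMAGES

events = [
    {"Image": r"C:\Windows\System32\schtasks.exe", "OriginalFileName": "schtasks.exe"},
    {"Image": r"C:\Windows\System32\notepad.exe", "OriginalFileName": "NOTEPAD.EXE"},
]
print([e["Image"] for e in events if is_task_scheduling(e)])
```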
Finally, as we investigate potential data sources and review the events fired during the sequence, we observe that the 4698 event is triggered after the final registry modifications, which occur after the job file is written.
As discussed in the previous lesson, each of these data collection possibilities comes with pros and cons, often associated with precision and recall, and documenting those can help with data collection prioritization. For example, if we plan to collect process creation events and filter them based on Image, how will we detect task scheduling done via the API using a custom executable? If we collect image load events and filter on those loading mswsock.dll, how will we filter out what is likely to be many other such events that are not related to remote task scheduling? For event 4698, could an adversary directly write the appropriate job file and set the registry keys in such a way that a task is scheduled without triggering the event? These are all additional considerations to keep in mind while determining your detection approach and data collection requirements.
A key enabler for increasing the usability of your data is a common data schema. Having a unified data schema is critical, as it allows you to combine logs from varying tools while avoiding inconsistencies and duplicative work in analytics and searches. It also provides the means for easier sharing of data with the broader security community. Your data schema should specify field names and types, as well as element relationships. Many data schemas have already been developed and widely adopted by the security community, including the examples listed on this slide.
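As a sketch of what that buys you, here is an illustrative normalization step; the target names loosely follow Elastic Common Schema conventions, the source names are Sysmon's, and the mapping itself is something each team defines:

```python
# Illustrative normalization into a common schema: target names loosely follow
# ECS-style dotted fields; source names are Sysmon's.
SYSMON_TO_COMMON = {
    "UtcTime": "@timestamp",
    "Image": "process.executable",
    "CommandLine": "process.command_line",
    "ProcessGuid": "process.entity_id",
}

def normalize(raw: dict, mapping: dict) -> dict:
    """Rename known fields; unmapped fields are dropped here for brevity."""
    return {mapping[key]: value for key, value in raw.items() if key in mapping}

sysmon_event = {
    "UtcTime": "2024-01-01 00:00:00.000",
    "Image": r"C:\Windows\System32\schtasks.exe",
    "CommandLine": "schtasks /create /tn backup /tr calc.exe",
    "ProcessGuid": "{b3400000-0000-0000-0000-000000000000}",
}
print(normalize(sysmon_event, SYSMON_TO_COMMON))
```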
A data dictionary builds upon the structure provided by a data schema to make your data more actionable for your organization's security personnel. Having a data dictionary will help answer questions such as: which tools populate which fields, and which values can appear in a field? It also helps provide additional context about the data. Your data dictionary should be a living document that improves as your threat hunting progresses, customizable in whatever way works for your team.
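For example, a single (entirely illustrative) data dictionary entry for the normalized field above might capture exactly those questions:

```python
# Entirely illustrative data dictionary entry; the structure and values are
# things each team defines for itself.
DATA_DICTIONARY = {
    "process.entity_id": {
        "description": "Globally unique process identifier (Sysmon ProcessGuid).",
        "type": "keyword",
        "populated_by": ["Sysmon Event ID 1"],
        "allowed_values": "GUID string, e.g. {b3400000-0000-0000-0000-000000000000}",
        "notes": "Preferred over the OS process ID, which can be recycled.",
    }
}
print(DATA_DICTIONARY["process.entity_id"]["populated_by"])
```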
In summary, it is important to familiarize yourself with the common data sources that you expect to see in your environment. But remember, not all data sources are created equal, so they should be evaluated for suitability for your analytic needs. Items such as a common data schema and a data dictionary can improve the usability of your data and make it more actionable for hunting purposes.