By: Pierluigi Riti
September 17, 2020
Integrating Security with Data Science
Data Science becomes more important every day. More projects are created, and more steps are taken toward making use of all the available data. Cyber Security can gain a huge advantage from Data Science, and two of the main areas that benefit are Network analysis and Spam identification. In this article, we present some uses of Data Science in Cyber Security.
The first use I want to describe is Phishing Detection. Phishing is one of the most common threats to Cyber Security, and the 2018 Internet Crime Report shows its effect in the real world. An FBI announcement on business email compromise, a scam closely tied to phishing, put the exposed losses from domestic and international incidents between June 2016 and July 2019 at around 26 billion dollars. These numbers show that phishing is a real problem for the Cyber Security expert. Data Science can help reduce phishing attacks in one “simple” way: by using the data collected from similar attacks, it is possible to create a dataset for predicting whether a site or an email address is used for phishing.
For a Data Scientist, the first job is understanding and cataloging the data. This is a crucial part of identifying whether a site is a phishing site or a normal one. The dataset is a “live” dataset, which means it should be updated every time a new phishing site is discovered.
The main challenge is collecting the data and identifying which of it is correct. Another challenge is identifying the address, so that its domain can then be designated as phishing.
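The step of going from an address to a domain can be sketched in a few lines of Python. This is only a minimal illustration, not a complete detector: the blocklist contents and the helper names `extract_domain` and `is_known_phishing` are assumptions made for the example, and a real dataset would be far larger and updated continuously.

```python
from urllib.parse import urlparse

# Hypothetical blocklist built from previously observed attacks.
KNOWN_PHISHING_DOMAINS = {"login-example-bank.test", "free-prizes.test"}


def extract_domain(address: str) -> str:
    """Return the lowercase host part of a URL or an email address."""
    address = address.strip().lower()
    if "://" in address:
        # URL: urlparse exposes the host name without port or credentials
        return urlparse(address).hostname or ""
    if "@" in address:
        # Email address: the domain is everything after the last "@"
        return address.rsplit("@", 1)[1]
    # Bare host name, possibly followed by a path
    return address.split("/", 1)[0]


def is_known_phishing(address: str) -> bool:
    """Check the domain and its parent domains against the blocklist."""
    labels = extract_domain(address).split(".")
    # Test "a.b.c", then "b.c", so subdomains of a listed domain also match
    candidates = {".".join(labels[i:]) for i in range(len(labels) - 1)}
    return not candidates.isdisjoint(KNOWN_PHISHING_DOMAINS)


print(is_known_phishing("http://mail.free-prizes.test:8080/win"))
print(is_known_phishing("https://example.org"))
```

A lookup like this only catches domains that are already cataloged, which is why the dataset has to stay “live”.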
Another area that can gain huge improvement from Data Science is Network analysis. This type of analysis can serve two different goals. One is to improve network availability. The other, more important for our scope, is the identification of network attacks.
In this case, the datasets needed are varied. The main dataset can be built from public datasets that describe the main network activity or adversarial BitTorrent network activity. Such datasets are easy to find, and a site like https://vizsec.org/data/ can give further information for finding the correct dataset for our use.
Another important part of the dataset is the data collected on our own network. This data can be collected using a sensor on the network. The goal of this “sensor” is to collect data about the normal usage of the network. All of this data can be used to create a model of the network's normal usage in terms of speed, the sites and ports used, and the servers that users normally communicate with.
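As a rough illustration of such a model, the sketch below summarizes normal traffic and flags records that fall outside it. The record layout (port, bytes per second), the sample values, the three-standard-deviation threshold, and the function names are all assumptions made for the example; a real sensor would collect many more attributes.

```python
from statistics import mean, stdev

# Hypothetical sensor records: (destination port, bytes transferred per second)
normal_traffic = [(443, 1200), (443, 1350), (80, 900), (443, 1100), (53, 150), (80, 1000)]


def build_baseline(records):
    """Summarize normal usage: the ports seen and the typical throughput."""
    rates = [rate for _, rate in records]
    return {
        "ports": {port for port, _ in records},
        "mean_rate": mean(rates),
        "std_rate": stdev(rates),
    }


def is_unusual(record, baseline, threshold=3.0):
    """Flag a record with an unseen port or a rate far from the mean."""
    port, rate = record
    if port not in baseline["ports"]:
        return True
    deviation = abs(rate - baseline["mean_rate"])
    return deviation > threshold * baseline["std_rate"]


baseline = build_baseline(normal_traffic)
print(is_unusual((443, 1250), baseline))   # known port, typical rate
print(is_unusual((6881, 1250), baseline))  # port never seen in normal usage
print(is_unusual((443, 50000), baseline))  # known port, extreme rate
```

A simple statistical baseline like this is only a starting point, but it shows how “normal usage” can be turned into something a program can compare new traffic against.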
Analysis of the data
With the data collected, it’s time to analyze it. The resulting analysis is used to create a model for identifying when a cyberattack occurs. To conduct a correct data analysis, we need to follow some basic steps:
- Clean and normalize the dataset. The collected data should first be cleaned, with the incorrect data removed, and then normalized to be useful. This means creating a dictionary that can be used in the analysis. For example, we can mark a phishing website with the number 1 in the column “is Phishing.”
- Build a basic understanding of the dataset. To perform a correct analysis, we first need to understand the data. This understanding is important for analyzing the data correctly and then training the correct model.
- Define the result report. With a basic understanding of the data, we need to consider how to present it and how to use it to demonstrate an attack.
- Identify the best algorithm for the analysis. With the data understood, we now need to identify the best algorithm for conducting the analysis. This is a crucial decision because choosing the wrong algorithm can lead to a wrong decision.
- Train the algorithm. After choosing the algorithm, we need to train it so that the system can identify each situation and respond to it correctly.
The steps for cleaning and normalizing the data require some basic Python skill and some basic library knowledge. Imagine, for example, that we have a dataset containing a set of sites or addresses we identify as spam. Any email from one of these domains should be considered spam. The dataset can use the value “true” or “false” to indicate whether the site is spam or not. Working with the text values “true” and “false” may not be the best choice for our algorithm, so we need to “clean” the data; essentially, we need to replace “true”/“false” with 1/0. With a library like Pandas, we can then easily identify the value and work with it.
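The cleaning step described above might look like the following with Pandas. This is a minimal sketch: the column names `domain` and `is_spam` and the sample rows are invented for the example, and a real dataset would need more checks than these.

```python
import pandas as pd

# Hypothetical sample data; the column names and domains are illustrative only.
raw = pd.DataFrame({
    "domain": ["login-example-bank.test", "example.org", "free-prizes.test", "example.org"],
    "is_spam": ["true", "false", "TRUE", "false"],
})

# Remove exact duplicate rows before normalizing
cleaned = raw.drop_duplicates().copy()

# Normalize the text labels and replace them with 1/0
cleaned["is_spam"] = (
    cleaned["is_spam"].str.strip().str.lower().map({"true": 1, "false": 0})
)

# Labels that were neither "true" nor "false" become NaN; drop those rows
cleaned = cleaned.dropna(subset=["is_spam"])
cleaned["is_spam"] = cleaned["is_spam"].astype(int)

print(cleaned)
```

Lowercasing the labels before mapping them means variants such as “TRUE” end up with the same numeric value instead of being silently lost.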
A critical part is identifying the algorithm. This decision should be made with some specific points in mind. The questions we want to answer are along the lines of “What problem do I need to solve? What data have I collected?” Answering these questions is the key to choosing the algorithm. For example, an algorithm we can use to analyze the Network and identify Phishing sites is Logistic Regression.
A brief introduction to the Logistic Regression
Logistic Regression is a classification algorithm that is trained by solving an optimization problem. The reason for choosing it for analyzing Phishing sites and, for example, classifying adversarial network activity is that it works well when the answer needed is one of two classes and we want the model that best fits the data. Logistic Regression estimates the probability that an item belongs to a class and uses that probability to classify the data.
This classification is used by the system to understand whether the site or the network activity is malicious. To get the best response from the algorithm, an important part of the job is its training. To be effective, this type of algorithm needs training over the long term. A benefit of this algorithm is that it doesn’t require high CPU usage, making it quite “cheap” in that respect.
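To make the idea concrete, here is a from-scratch sketch of Logistic Regression trained with plain gradient descent. It is meant only to show the mechanics; the two features, the tiny training set, the learning rate, and the number of epochs are all assumptions chosen for the example, and in practice a library such as scikit-learn would normally be used instead.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this sample drives the gradient
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Average the gradients over the dataset and take one step
        weights = [w - learning_rate * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict(x, weights, bias):
    """Return 1 (phishing) when the estimated probability is at least 0.5."""
    probability = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if probability >= 0.5 else 0


# Hypothetical features per site: [URL contains an IP address, number of subdomains]
X = [[1, 3], [1, 4], [0, 1], [0, 0], [1, 2], [0, 1]]
y = [1, 1, 0, 0, 1, 0]

weights, bias = train(X, y)
print(predict([1, 3], weights, bias), predict([0, 0], weights, bias))
```

Each training pass is just a few multiplications and additions per sample, which is consistent with the low CPU cost mentioned above.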
Data Science becomes more important every day, and Cyber Security can gain huge benefits from it. Imagine how an IPS (Intrusion Prevention System) or an IDS (Intrusion Detection System) works. Both systems use a kind of “intelligence” to detect network activity and then respond to it. The limitations of both systems are well known. Using ML software can improve the detection of network activity and make the decision faster and more efficient. The combination of Security and AI is nothing new. Think, for example, of the systems used to detect a Deepfake. In the years ahead, we can expect to see more ways to use AI to enhance the security of our systems.