Extracting and Cleaning Data Using Python Lab

Infosec Learning
Virtual Lab

Time: 1 hour 30 minutes
Difficulty: Intermediate

Overview

Extracting and cleaning data are two components of the data wrangling process (gathering, extracting, cleaning, and storing data). Extracting data is the process of drawing out only the data relevant to answering the fundamental question or questions of an analysis. Data cleaning involves removing data that may distort the behavior of the true data. Such data includes missing or deleted values, unexpected character types (commas, semicolons, numbers, etc.), outliers, unexpected values, and inconsistent formats (for example, US versus European). In this lab, we will work with the kddcup.data.corrected dataset to prepare it for analysis. First, we will use Python to separate the data based on its classification (normal or abnormal). Then we will use Python to clean the data by removing the flow labels and punctuation marks that may cause problems with our model. Last, we will import the data into Pandas and explore how to structure, shape, and clean the data using statistical Python libraries.
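
The separation and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual script. It assumes each record is a comma-separated line whose last field is the flow label followed by a trailing period (for example, normal.). The sample rows are hypothetical and shortened to a few fields, and the function name split_and_clean is made up for this example.

```python
import csv
import io

# Hypothetical sample rows, shortened to a handful of fields for illustration.
# Assumption: the last field is the flow label and ends with a period.
SAMPLE = """\
0,tcp,http,SF,181,5450,normal.
0,icmp,ecr_i,SF,1032,0,smurf.
0,tcp,private,S0,0,0,neptune.
0,udp,domain_u,SF,29,0,normal.
"""


def split_and_clean(text):
    """Split rows into normal and abnormal lists, dropping the label column."""
    normal, abnormal = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # Strip whitespace and the trailing period: "normal." -> "normal"
        label = row[-1].strip().rstrip(".")
        # Keep every field except the label
        features = [field.strip() for field in row[:-1]]
        if label == "normal":
            normal.append(features)
        else:
            abnormal.append(features)
    return normal, abnormal


normal_rows, abnormal_rows = split_and_clean(SAMPLE)
print(len(normal_rows), "normal rows;", len(abnormal_rows), "abnormal rows")
```

In practice the text would be read from the dataset file rather than from an in-memory string, and the two cleaned lists could then be written out or loaded into Pandas for further shaping.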