Overview
Extracting and cleaning data are two components of the data wrangling process, which consists of gathering, extracting, cleaning, and storing data. Extracting data means drawing out only the data relevant to the fundamental question(s) the analysis is trying to answer. Cleaning data means removing data that may distort the behavior of the true data. Typical problems include missing or deleted data, unexpected characters (commas, semicolons, numbers, etc.), outliers, unexpected values, and inconsistent formats (for example, US versus European conventions). In this lab, we will work with the kddcup.data.corrected dataset and prepare it for analysis. First, we will use Python to separate the data based on its classification (normal or abnormal). Then we will use Python to clean the data by removing the flow labels and punctuation marks that may cause problems with our model. Last, we will import the data into Pandas and explore how to structure, shape, and clean the data using statistical Python libraries.
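The separation and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual solution. It assumes each record is a comma-separated line whose last field is the class label followed by a period (for example, normal.). The sample records are made up and shortened, and the name split_by_label is hypothetical.

```python
# Made-up, shortened records in the assumed layout (hypothetical values):
# comma-separated features, then the class label with a trailing period.
SAMPLE_LINES = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "",  # blank lines are skipped
    "0,tcp,private,S0,0,0,neptune.",
]


def split_by_label(lines):
    """Separate records into normal and abnormal lists.

    The label and its trailing period are dropped, so each returned
    string holds only the feature values.
    """
    normal, abnormal = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split once at the last comma: features on the left, label on the right.
        features, _, label = line.rpartition(",")
        label = label.rstrip(".")  # remove the trailing punctuation mark
        if label == "normal":
            normal.append(features)
        else:
            abnormal.append(features)
    return normal, abnormal


normal_rows, abnormal_rows = split_by_label(SAMPLE_LINES)
print(len(normal_rows), "normal record(s)")
print(len(abnormal_rows), "abnormal record(s)")
```

With the real dataset, the same function could be fed an open file object instead of a list, since both yield one line at a time. The label-free rows can then be written to separate files and loaded into Pandas.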

