How to Use the Right Tool for the Job
By: Corey Holzer, Ph.D., CISSP
October 21, 2019
In 2003, a stock market analyst company I will call Stockhouse hired me as a freelancer to fix issues with the database at the heart of their business model and revenue streams. Every morning, their proprietary scripts imported millions of records from the Stock Exchange. The data was prepared for research and analysis by their analysts and for their customers' use when navigating the market.

When Stockhouse hired me, the database server crashed daily while importing the data from the Exchange. The crashes cost the IT staff many working hours and left the analysts idle while they waited for the data. The home-grown import tool had been developed by a freelancer who was no longer on their payroll and could not be reached.

Even when the process ran to completion, it was slow. On average, the morning's import took two to three hours to process all the data and update all the tables in the application's database. Within two weeks of studying the problem and writing new code, I had stopped the database crashes, and I found a way to reduce the processing time for those millions of records to approximately 30 minutes, roughly a six-fold improvement over the old code. Here is how I accomplished it.
- Research to Find the Cause – My search for a solution started with an analysis of their production server. While the server needed some software updates, its specifications were sufficient for the assigned tasks. Next, I looked at the system and database logs for any indication of the cause of the crashes and failed imports. The problem was immediately apparent: the PHP code in the home-grown tool created high processing overhead. It iterated through each record in the source file, making changes line by line and holding every line in memory until the final write to the database. This caused a massive spike in memory utilization and left the database hanging until the IT staff intervened physically.
- Understand the Task – I performed an extensive review of the code behind Stockhouse's web-based application. Simply put, it was a complete mess. It took a few days to understand the logic it employed: poor coding practices and almost no documentation, in the form of either a technical review or comments within the code, made the code very difficult to follow. After many hours, I concluded that the only sensible option was to scrap the existing code and start over.
- Identify the Most Efficient Tool – At the time I worked on the project, I was very comfortable with PHP. Because of that experience, I knew anything I did in PHP would potentially run into similar issues, particularly as the dataset grew. I chose instead to let the database's SQL do all the heavy lifting. I love PHP, but in this situation, using it would be like asking a carpenter to hammer in a nail with a screwdriver.
- Plan and then Execute the Plan in a Development Environment – I built a duplicate database on the company's staging server, where I could develop without pouring gasoline on the fire that was the production server. For the first few steps of the process, I used PHP to download the updates and to confirm the integrity of the downloaded file. I also added a mechanism that alerted the IT team whenever the script encountered an issue, stopping it before it made a single change to the live database. This eliminated most errors and reduced the time it took the IT team (and me) to address issues. By employing the database's SQL, with its data-focused functionality, for the remainder of the tasks, the new solution improved the integrity of the imported data and the stability of the database, and it reduced the time needed to complete the process and make the data ready for use by analysts and clients alike.
- Work Smartly and DO NOT Overcomplicate the Task – I employed the following SQL functionality to fix the remaining problems:
- I used LOAD DATA INFILE to import the data for normalization in under 5 minutes.
- I used CREATE TEMPORARY TABLE ... LIKE to copy the structure of the multiple tables needed to process the new data.
- I used subqueries to insert data into the temporary tables and to perform calculations for some of the new data fields in the temporary table.
- I ran several queries against the temporary tables to ensure all data was properly formatted.
- I moved the existing tables aside (by renaming them) to allow for a faster rollback.
- Then I copied the table’s schema and migrated the production-ready data from the temporary table, making it available to analysts and clients.
- The script then notified the lead analyst so he could review the new data. If it was satisfactory, he could drop the backup of the old tables. Otherwise, he could roll back the import and run the SQL again.
- I used extensive commenting in the SQL script to explain each step. I provided a document explaining the overall process and how to troubleshoot and resolve any issues arising from unforeseen circumstances.
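The SQL steps above follow a staged-import pattern: load raw data into a temporary table, derive and validate the production-ready rows there, then promote them with a rename-based swap. The original used MySQL features such as LOAD DATA INFILE and table renames against a schema the article does not show; as a minimal, runnable sketch of the same pattern, here is a Python/SQLite analogue in which the `quotes` table, its columns, and the sample feed are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Live table the analysts query (hypothetical schema).
cur.execute("CREATE TABLE quotes (symbol TEXT, price REAL, change_pct REAL)")
cur.execute("INSERT INTO quotes VALUES ('ACME', 10.0, 0.0)")

# 1. Stage the raw feed in a temporary table
#    (stands in for MySQL's LOAD DATA INFILE).
cur.execute("CREATE TEMP TABLE staging (symbol TEXT, price REAL, prev_price REAL)")
raw_feed = [("ACME", 12.0, 10.0), ("GLOBEX", 55.0, 50.0)]
cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", raw_feed)

# 2. Build the production-ready rows, computing derived fields in SQL.
cur.execute("""
    CREATE TEMP TABLE ready AS
    SELECT symbol,
           price,
           ROUND(100.0 * (price - prev_price) / prev_price, 2) AS change_pct
    FROM staging
""")

# 3. Validate before touching the live table; abort if anything looks wrong.
bad = cur.execute("SELECT COUNT(*) FROM ready WHERE price <= 0").fetchone()[0]
assert bad == 0, "import aborted: malformed rows in feed"

# 4. Swap: keep the old table under a backup name for fast rollback,
#    then promote the new data.
cur.execute("ALTER TABLE quotes RENAME TO quotes_backup")
cur.execute("CREATE TABLE quotes AS SELECT * FROM ready")
conn.commit()
```

Rolling back is just the reverse rename, dropping the new `quotes` and restoring `quotes_backup`, which is why the swap was nearly instantaneous compared with rewriting rows in place.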
- Test the Work Thoroughly Before Moving Code to the Production Environment – After a few nights of automated processing on the staging server, I moved the code to the production server and tested everything again to ensure there were no issues. Stockhouse retained my services for a few months, both to monitor the database and to work on new functionality. There were some issues in production, but they were beyond our control, and they were few and far between.
- Document Everything and Be Thorough About It – Thorough documentation enabled Stockhouse's internal staff to handle any future issues arising from changes in the format of the data file they received from the Exchange.
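The article does not say how the download-and-verify stage confirmed the integrity of the feed file before any database work began. One common approach, sketched here with a hypothetical file and expected checksum, is to compare a SHA-256 digest and stop the whole import if it does not match:

```python
import hashlib
import sys

def verify_download(path: str, expected_sha256: str) -> bool:
    """Return True if the file's SHA-256 digest matches the expected value."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

# Hypothetical feed file, written here only so the sketch is self-contained.
with open("feed.csv", "wb") as fh:
    fh.write(b"symbol,price\nACME,12.0\n")

expected = hashlib.sha256(b"symbol,price\nACME,12.0\n").hexdigest()

if not verify_download("feed.csv", expected):
    # In the real script this point alerted the IT team rather than just exiting.
    sys.exit("import aborted: feed file failed integrity check")
```

Aborting here, before a single change reaches the live database, is what kept a bad download from ever becoming a bad import.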