By: Nihad Hassan
February 23, 2022
Defining The Incident Management Process
In an accelerating digital world, any drop in IT service can have catastrophic consequences for the affected organization. For example, suppose a cybersecurity incident took down a major online retailer's IT systems and networks; how large would you expect the losses to be? According to a Ponemon Institute study, a DDoS attack costs an average of $22,000 for every minute of downtime it causes.
An incident response strategy has become crucial for any organization that wants to survive in today's information age and keep pace with a complex, rapidly changing IT landscape. This article introduces the term "incident management process," discusses why it is essential, and lists the critical phases of the process.
Defining Incident Management
In the context of IT, incident management refers to all the processes, best practices, and procedures an organization uses to respond to any problem that affects the regular operation of IT systems and computer networks. The incident management process should outline which procedures will be used to resolve the incident and restore normal business operations as quickly as possible.
IT has become a key enabler of modern work practices. Organizations of all types and sizes rely heavily on technology in their work processes, so any sudden service interruption can lead to severe financial and reputational loss. IT incidents have diverse causes, including the following and more:
- Internet connectivity problems on Wi-Fi or wired lines.
- Grid power outage.
- Email service disruption.
- File sharing problems that prevent sharing files/data across the network.
- Active Directory (AD) authentication errors.
- A problem caused by a hardware failure, such as a networking device crash.
- A software failure in an application, or one caused by an unsuccessful deployment or a wrong configuration in application or operating system settings.
- A cybersecurity attack that results in:
  - Denying legitimate users access to IT resources, such as a DDoS attack.
  - Denying access to data by encrypting it with ransomware.
  - Planting malware to exfiltrate data and to spy on the target's IT interactions.
  - An advanced persistent threat group gaining access to sensitive resources.
Why Incident Management Strategy Is So Important in Today's Digital Age
An incident management strategy has become an essential requirement of most data compliance regimes; knowing how to handle various IT incidents allows your organization to respond promptly and follow a predefined procedure so that daily operations continue with minimal disruption. Some advantages of having an incident management process in place include:
- Security vulnerabilities can be addressed more quickly and prevented from becoming a direct threat.
- Reduce the impact of security incidents. For instance, responding early to a ransomware attack can prevent the infection from spreading to other parts of the network.
- Avoid paying huge fines. Some cybersecurity incidents result in a data breach, which can expose your organization to penalties under regulations and standards such as GDPR, PCI DSS, and HIPAA.
- Reduce the Mean Time to Repair (MTTR), which keeps digital assets operational for a larger share of the time.
- Lower downtime, which saves a lot of money. According to Gartner, the average cost of IT downtime is $5,600 per minute.
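To make the last two points concrete, here is a minimal sketch of how MTTR and a rough downtime cost could be computed. The outage log is hypothetical, the function names are illustrative, and the per-minute figure is simply the Gartner average quoted above, so the result is an estimate rather than a forecast.

```python
from datetime import datetime, timedelta

# Average per-minute cost quoted in the article (Gartner); an assumption for any given business.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated loss for an outage lasting the given number of minutes."""
    return minutes * cost_per_minute


def mean_time_to_repair(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across all recorded outages."""
    if not outages:
        raise ValueError("need at least one outage")
    total = sum((restored - detected for detected, restored in outages), timedelta())
    return total / len(outages)


# Hypothetical outage log: (detected, restored)
outages = [
    (datetime(2022, 1, 10, 9, 0), datetime(2022, 1, 10, 9, 45)),   # 45 minutes
    (datetime(2022, 2, 3, 14, 0), datetime(2022, 2, 3, 14, 15)),   # 15 minutes
]

mttr_minutes = mean_time_to_repair(outages).total_seconds() / 60
print(mttr_minutes)                 # 30.0
print(downtime_cost(mttr_minutes))  # 168000.0
```

Under these assumptions, cutting the average repair time from 30 to 20 minutes would save roughly $56,000 per incident.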
Incident Management Processes
A typical process will be composed of the following phases:
Incident identification and logging
The first step is the identification phase. An incident can be identified manually by an end-user or an IT specialist, or surfaced by an analysis report or an automated notification service, such as a website downtime monitor that informs the organization of the interruption. After that, the incident's details need to be logged; this includes recording all facts relevant to the incident, such as:
- Type of device(s) – such as a server, workstation, laptop, or networking appliance.
- Device operating systems – such as Linux or Windows 10.
- Type of application that caused the incident, along with its version number – such as MS Office 2019/Excel.
This phase also involves categorizing the incident according to its type; this helps the incident team to:
- Group similar incidents in one table to understand their frequencies and trends.
- Prioritize incidents.
- Track incidents until they are resolved.
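The logging and categorization steps above can be sketched as a small data structure. This is only an illustration: the field names, the `Priority` levels, and the sample records are assumptions, not part of any particular ticketing product.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority scale; higher means more urgent."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Incident:
    incident_id: int
    category: str            # e.g. "email", "network"
    device_type: str         # server, workstation, laptop, networking appliance
    operating_system: str
    application: str         # application and version involved, if any
    priority: Priority
    reported_at: datetime
    resolved: bool = False


def category_counts(incidents: list[Incident]) -> Counter:
    """Group similar incidents to expose their frequency."""
    return Counter(i.category for i in incidents)


def open_by_priority(incidents: list[Incident]) -> list[Incident]:
    """Unresolved incidents, most urgent first, oldest first within a priority."""
    still_open = (i for i in incidents if not i.resolved)
    return sorted(still_open, key=lambda i: (-i.priority, i.reported_at))


# Hypothetical sample log
incidents = [
    Incident(1, "email", "server", "Windows Server 2019", "Exchange",
             Priority.HIGH, datetime(2022, 2, 1, 9, 0)),
    Incident(2, "network", "networking appliance", "vendor firmware", "n/a",
             Priority.CRITICAL, datetime(2022, 2, 1, 10, 0)),
    Incident(3, "email", "laptop", "Windows 10", "MS Office 2019/Outlook",
             Priority.LOW, datetime(2022, 2, 2, 8, 30), resolved=True),
]

print(category_counts(incidents)["email"])                      # 2
print([i.incident_id for i in open_by_priority(incidents)])     # [2, 1]
```

Keeping the `resolved` flag on each record is what allows an incident to be tracked until it is closed.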
For example, network monitoring tools are commonly leveraged to identify connectivity and other security problems in addition to auto-discovering all devices connected to the same network. An example of such a solution is the SolarWinds Network Performance Monitor.
The second phase is notification and escalation: if the first responder cannot resolve the incident on their own, they need to alert other people, such as a senior IT specialist, the response team, and other relevant employees in the company.
The escalation procedures and strategies differ widely from one organization to another. For instance, in organizations that run mission-critical work, such as IT service providers, financial and health care organizations, the notification procedure involves notifying many people, such as:
- Chief Information Security Officer, or other IT Management.
- Response team.
- Legal department.
- Public relations and marketing department.
- Government agencies and affected end-users/customers, if the incident resulted in a data breach or a criminal act and your government's laws and compliance requirements call for notification.
Notifications can be delivered by SMS, email, phone call, or in person. However, voice communication remains the best option, especially for critical incidents.
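One way to picture such an escalation procedure is as a lookup table. The sketch below is an assumption-laden example: the severity levels, the mapping of levels to recipients, and the function names are all hypothetical, and only the role names are taken from the list above.

```python
# Hypothetical escalation matrix: which roles are notified at each severity level.
ESCALATION_MATRIX = {
    "low": ["senior IT specialist"],
    "medium": ["senior IT specialist", "response team"],
    "critical": [
        "senior IT specialist",
        "response team",
        "CISO",
        "legal department",
        "public relations",
    ],
}

# Extra parties to consider when a data breach or criminal act is involved.
BREACH_EXTRA = ["government agencies", "affected customers"]


def recipients(severity: str, data_breach: bool = False) -> list[str]:
    """Return the roles to notify for an incident of the given severity."""
    if severity not in ESCALATION_MATRIX:
        raise ValueError(f"unknown severity: {severity!r}")
    people = list(ESCALATION_MATRIX[severity])  # copy so the matrix is not mutated
    if data_breach:
        people += [r for r in BREACH_EXTRA if r not in people]
    return people


def preferred_channel(severity: str) -> str:
    """Voice for critical incidents, as recommended above; email otherwise (an assumption)."""
    return "phone call" if severity == "critical" else "email"


print(recipients("medium"))
print(recipients("low", data_breach=True))
print(preferred_channel("critical"))
```

In practice each organization would define its own levels and contact lists; the point is only that the routing rules are written down before an incident occurs.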
For example, under HIPAA, any healthcare organization that suffers a data breach involving unsecured protected health information must notify:
- Affected individuals, by mail or email. If contact information is out of date for ten or more individuals, the breached entity must provide substitute notice, for example by posting a notification on its website.
- The media, if the breach has affected more than 500 individuals in the same jurisdiction.
Investigation and Diagnosis
Once all involved parties are notified, roles are assigned to each team member, and the team begins investigating the incident to determine the root cause and follow a proper procedure to resolve it and restore normal business operations.
In general, investigating involves the following steps:
- Identify root cause.
- Understand the sequence of events.
- Recognize the complete impact on normal work operations.
- Identify the event or events that triggered the incident, for example a recent update or patch.
- Search previous incidents, if you already have a database of them, and determine whether a ready resolution exists.
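The last step, searching previous incidents for a ready resolution, can be illustrated with a simple word-overlap match. This is a minimal sketch: the past-incident records, the `find_similar` helper, and the 0.3 threshold are hypothetical, and a real knowledge base would likely use a proper search engine.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(description: str, past: list[dict], threshold: float = 0.3) -> list[dict]:
    """Past incidents whose summary overlaps the new description, best match first."""
    query = tokens(description)
    scored = [(jaccard(query, tokens(p["summary"])), p) for p in past]
    matches = [(score, p) for score, p in scored if score >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in matches]


# Hypothetical database of resolved incidents
past_incidents = [
    {"summary": "email service down after exchange patch", "resolution": "roll back patch"},
    {"summary": "wifi outage in building b", "resolution": "restart access point controller"},
]

hits = find_similar("email service down after patch", past_incidents)
print([h["resolution"] for h in hits])  # ['roll back patch']
```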
For example, the incident team may need to update a specific server operating system, install new hardware devices, or request the installation of additional security solutions (firewalls, IDS/IPS) on the network gateways to solve the problem. Some incidents, such as power outages, are not caused by a cyberattack; these can be addressed by installing UPSs or providing a backup generator.
In the resolution and recovery phase, the incident's root cause is resolved and regular work operations are restored. A proper procedure should be followed to ensure that the problem is entirely resolved and will not recur. For example, after a ransomware attack, it is not enough to restore data from backup. You also need to make sure that all infected systems are cleaned and that the malware no longer exists anywhere on the network, so the infection cannot spread again.
In a typical resolution phase, the incident team may need to ensure the work conditions have been restored to meet the Service Level Agreement (SLA) between the IT provider and its clients.
After all involved parties agree that the incident has been resolved entirely, the final phase begins by documenting everything that happened, documenting why it occurred (root cause), and suggesting the best remediation steps to avoid falling into the same problem again.
The post-incident phase is crucial for developing future remediation strategies; without proper documentation, an organization may suffer from the same issues repeatedly.
The accelerated digitalization of society brings various benefits to organizations; however, it also increases the complexity of managing automated operations. As more functions become wholly dependent on technology, a sudden failure in internet connectivity, computer networks, software, hardware, or the power supply can halt most work and cause severe financial and reputational damage to the affected organization. This article defined the incident management process and described its key phases. Responding promptly to the various incidents that can affect the normal operation of IT systems and networks is essential to contain them and avoid their negative consequences.