Not all incidents are created equal. Prioritizing incidents based on the impact they have on your business improves collaboration and makes for faster incident resolution.
But how do you prioritize incidents?
Enter severity levels.
What are severity levels?
Severity levels are a measurement of the impact an incident has on your business. A commonly used severity ranking runs from SEV 1 (severity 1) to SEV 3 (severity 3), where SEV 1 is a critical incident and SEV 3 is a minor incident.
A SEV 1 incident could be a situation where a service is down for all users or customers, where there has been a major security breach, or where customer data is lost. A SEV 1 is defined as a critical incident with high impact on the business.
A SEV 2 incident could be a situation where a significant part of the core functionality is not working or where a service is unavailable for a subset of users or customers. A SEV 2 is defined as a major incident with significant impact on the business.
A SEV 3 incident could be a situation where a system issue causes a slight inconvenience to the users or customers, but doesn’t affect any major system functions. A SEV 3 is defined as a minor incident with low impact on the business.
| Severity level | Definition | Example |
| --- | --- | --- |
| SEV 1 | Critical incident with high impact | A service is down for all customers |
| SEV 2 | Major incident with significant impact | A service is down for a subset of customers |
| SEV 3 | Minor incident with low impact | A bug is creating an inconvenience to customers |
The levels can go beyond SEV 3. At larger organisations, SEV 4 and SEV 5 are often used. The number of severity levels can be determined by each organisation, but 3 levels are generally enough. More severity levels can lead to confusion and more time spent on assessing which severity level an incident is instead of actually starting work on the resolution.
Why are severity levels used?
Severity levels aren't just fancy speak for DevOps teams. SEV levels put everyone on the same page when an incident happens and can significantly improve the incident response time.
The main benefit of using severity levels is that a team can connect each level to a specific process or automation, so whenever such an incident occurs, no improvisation is necessary and pre-made workflows are started.
For example, a SEV 1 incident could be connected to an immediate status page update and to alerting C-level company executives. A SEV 3 incident, on the other hand, can be connected to a much lower-level workflow, for example a ticket being created in Jira.
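The link between a level and its pre-made workflow can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive implementation: the step names (`update_status_page`, `alert_executives`, `page_on_call`, `create_ticket`) are hypothetical placeholders for whatever your status page, paging, and ticketing tools expose, and the SEV 2 steps are an assumption, since only SEV 1 and SEV 3 workflows are described here.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical incident
    SEV2 = 2  # major incident
    SEV3 = 3  # minor incident


# Hypothetical workflow steps per level. The names are placeholders,
# not calls into any real tool, and the SEV 2 entry is an assumption.
WORKFLOWS = {
    Severity.SEV1: ["update_status_page", "alert_executives", "page_on_call"],
    Severity.SEV2: ["page_on_call", "create_ticket"],
    Severity.SEV3: ["create_ticket"],
}


def start_workflow(severity: Severity) -> list[str]:
    """Return the pre-made steps for a level, so nobody has to improvise."""
    return list(WORKFLOWS[severity])


if __name__ == "__main__":
    for level in Severity:
        print(level.name, "->", ", ".join(start_workflow(level)))
```

Keeping the mapping in one table means a change to the process is a one-line edit instead of a change in several code paths.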
Severity vs. Priority, what’s the difference?
In most cases, severity = priority.
The more severe the incident is, the more of a priority it is for the developer team. An infrastructure incident that takes down the company’s whole online presence is immediately the highest priority for the DevOps team. But in some cases, you can have a high-priority incident that is not high in severity.
For example, if a recent homepage edit leaves the h1 title tag improperly formatted, it’s certainly not very severe, as the core functionality is not affected. However, it’s a high priority because it can damage the brand image of the company and cause confusion among current or potential customers.
Similarly, you can have high-severity but low-priority incidents. For example, an incident that’s making your product unavailable for 0.01% of all your customers has a critical impact, because it makes the product unusable. But it’s low-priority because it only affects a very small subset of customers.
Because these low-severity + high-priority and high-severity + low-priority incidents exist, we need to distinguish between severity and priority:
- Severity measures the impact an incident has on the business. It answers questions about the consequences of an incident.
- Priority measures an incident’s urgency. It answers questions about what should be fixed first.
Because priority tells us what should be fixed first, it’s usually better to focus on working with priorities instead of severity levels. Let’s have a look at what the priority-levels-first approach looks like.
How to use priority levels?
Priority levels work the same as severity levels when it comes to numbering. The lower the number, the higher the priority of the incident.
The main difference is that a priority level tells us which incident needs to be solved first, instead of just stating which incident is the most severe (has the most impact).
| Priority level | Definition | Example |
| --- | --- | --- |
| P1 | Critical incident that needs to be addressed immediately. | A service is down for all customers. |
| P2 | Major incident that needs to be addressed quickly. | A service is down for a subset of customers. |
| P3 | Minor incident that can be handled within working hours. | A bug is creating an inconvenience to customers. |
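A priority-first queue can be sketched as follows. The `Incident` class and `triage` function are hypothetical names, the levels assigned to the sample incidents are illustrative, and using severity as a tie-breaker is an assumption rather than a rule from this article.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: int  # 1 = critical impact, 3 = minor impact
    priority: int  # 1 = fix first, 3 = can wait for working hours


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order incidents by priority; severity only breaks ties (an assumption)."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


if __name__ == "__main__":
    queue = triage([
        Incident("Product unavailable for 0.01% of customers", severity=1, priority=3),
        Incident("Homepage h1 title is badly formatted", severity=3, priority=1),
        Incident("Service down for a subset of customers", severity=2, priority=2),
    ])
    for incident in queue:
        print(f"P{incident.priority} / SEV {incident.severity}: {incident.title}")
```

Storing both values keeps the impact on record while the queue itself is driven only by what should be fixed first.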
Simplifying things: issues with P1 and SEV 3
Severity and priority levels are great in theory, but in practice they are often too complicated. The main reason for having a severity levels setup is to simplify incident communication within a team, not to complicate it. The goal is to say P1 or SEV 3 and get everyone on the same page immediately.
This is sadly often not the case, especially in high-stress situations like being woken up by an on-call alert in the middle of the night. Similarly, less technical people might think of SEV 3 as the highest severity level, while it’s the lowest.
To simplify this, we can switch to only using human words. This means that instead of using code words like SEV 1, we can use regular words like critical incident for all SEV 1 or P1 incidents.
Here is what an alternative naming could look like:
| Standardized code word | Alternative naming |
| --- | --- |
| P1 or SEV 1 | Critical incident |
| P2 or SEV 2 | Major incident |
| P3 or SEV 3 | Minor incident |
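If alerts or tickets still arrive with code words, the translation into plain words can be automated. The sketch below assumes code words are written as `P` or `SEV` followed by a number; the `human_name` helper is a hypothetical name, and the wording comes from the table above.

```python
import re

# Names from the table above; both code-word families map to the same words.
LEVEL_NAMES = {1: "Critical incident", 2: "Major incident", 3: "Minor incident"}

# Accepts forms like "P1", "p2", "SEV 3", or "sev3" (assumed input formats).
_CODE_WORD = re.compile(r"^\s*(?:P|SEV)\s*([0-9]+)\s*$", re.IGNORECASE)


def human_name(code_word: str) -> str:
    """Translate a code word such as 'SEV 1' or 'p2' into plain words."""
    match = _CODE_WORD.match(code_word)
    if match is None or int(match.group(1)) not in LEVEL_NAMES:
        raise ValueError(f"Unknown incident level: {code_word!r}")
    return LEVEL_NAMES[int(match.group(1))]


if __name__ == "__main__":
    for word in ("SEV 1", "p2", "P3"):
        print(word, "->", human_name(word))
```

Raising an error for unknown levels, instead of guessing, avoids silently mislabeling an incident.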
Defining incident levels with examples
Another way to make incident levels more approachable is to define them with real-life examples relevant to your specific product. For example, if you are running an Airbnb competitor, you could define incident levels as follows:
| Priority | Airbnb competitor definition (example) |
| --- | --- |
| P1 | At least 10% of users can’t book a new stay and/or at least 10% of current customers can’t sign in and manage their bookings. Privacy of confidential customer information was breached. Some customers lost data about their bookings. |
| P2 | Fewer than 10% of users can’t book a new stay and/or fewer than 10% of current customers can’t sign in and manage their bookings. Customers can’t reschedule or change their bookings. New users can’t add more people when booking a stay. |
| P3 | Some search filters are not working properly when new users pick new bookings. The site is slower when loading images in listings. |
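Definitions like the ones in the table above can also be written down as a small rule set, which forces the thresholds to be unambiguous. The sketch below is one possible reading, not a definitive implementation: the `Symptoms` fields and the `classify` function are hypothetical, both 10% thresholds are read as the share of users who can’t book or sign in, and treating exactly 10% as P1 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    blocked_share: float = 0.0         # fraction of users who can't book or sign in (0.0 to 1.0)
    privacy_breach: bool = False       # confidential customer information was exposed
    booking_data_lost: bool = False    # some customers lost data about their bookings
    core_feature_broken: bool = False  # e.g. customers can't reschedule a booking


def classify(s: Symptoms) -> str:
    """Return 'P1', 'P2', or 'P3' following the example definitions."""
    if s.blocked_share >= 0.10 or s.privacy_breach or s.booking_data_lost:
        return "P1"
    if s.blocked_share > 0 or s.core_feature_broken:
        return "P2"
    return "P3"  # everything else, e.g. broken search filters or slow images


if __name__ == "__main__":
    print(classify(Symptoms(blocked_share=0.25)))        # large share blocked
    print(classify(Symptoms(core_feature_broken=True)))  # rescheduling is broken
    print(classify(Symptoms()))                          # only minor symptoms
```

Checking the P1 conditions first means an incident that matches several rows always gets the most urgent level.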
This example is simplified, but the essence is that with this table any technical or non-technical team member has a very clear understanding of what kind of incident the company is facing.
Using severity, priority, or just alternative human-worded incident levels is a great way to step up your incident management.
But keep in mind that any incident levels are only as good as the workflows connected to them, and that real-life definitions and examples from your business are the key to the success of any incident levels implementation.
Learn more about how to improve your incident response:
- Explained: All Meanings of MTTR and Other Incident Metrics
- How to Create a Developer-Friendly On-Call Schedule in 7 steps
- 4 Copy-Pastable Incident Templates for Status Pages