Rein in Your Incidents: Incidents and Alerts Foundations

Solving incidents is hard. Depending on your current situation, you may also be losing a lot of time just figuring out which notifications constitute incidents. The cost compounds: every notification must be triaged as a potential incident before you can resolve it or disregard it as a non-incident. All this may sound very cumbersome, but the fastest way to improve is to learn and define what incidents are. And you're in luck! That's what this blog post is about. By the end, you'll be well on your way to cleaning up your processes so that incident identification and resolution get easier with each iteration.

First things first: what is an incident? I define an incident as "any unplanned disruption or degradation of service that is actively affecting customers' ability to use a service." I also separate out what we call a major incident, which is "any incident that requires a coordinated response between multiple teams." Both are different from an "event," which could be, well, anything. Writing a log message is an example of an event, but being unable to write a log message could indicate an incident (or an incoming one). Understanding the disruptive nature of an incident is key to identifying it: if you receive a message (email, chat, SMS, call) and it's information that is non-disruptive, then it is not an incident.

Making time to create and maintain well-designed alerts is key. Why? Because not all alerts imply incidents, but the goal is for all incidents to be identifiable via alerts. If you don't plan your alert messaging, a serious incident will likely be missed amid the general chatter of normally operating systems. There are a few parts to this:

  • All alerts should be actionable
  • Alerts should be as "noisy as makes sense" (more on this in a moment)
  • Alerts should be kept in sync with changes to your system / code, e.g. don't alert on a migrated endpoint

Taking a closer look at that second item: what do I mean by "as noisy as makes sense"? Some of this is identifiable by gut feel: you don't want to be woken up at twilight hours because the database is using too little memory. (What?) On the other hand, you do want to be woken up for a customer-facing impact originating as failed database connections. For the more nebulous alerts (and even some of the more common ones), you'll need to set some clear definitions so you'll know how to categorize each one: priority, urgency, and severity. Simply put:

  • Priority - how quickly and in what order an alert / incident should be addressed. A high-priority alert needs to be addressed immediately, a low-priority one needs action at some point, and informational is just an "FYI" alert. In addition to "high/low," you'll usually see breakdowns like "P1, P2" or "SEV1, SEV2" for this category.
  • Urgency - we at PagerDuty use this to define how you want to be notified. In general, this moves in lock-step with priority: a high-urgency notification (typically a call or text) corresponds to a high-priority alert, and a low-urgency notification (email, chat) to a low-priority alert.
  • Severity - how serious the issue is, typically defined using critical, error, warning, info, etc.
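To make the priority-to-urgency lock-step concrete, here's a minimal sketch in Python. The mapping, priority labels, and channel names are illustrative assumptions, not PagerDuty's API or actual routing rules.

```python
# Hypothetical priority -> urgency mapping, following the lock-step idea:
# high urgency pages you (call/SMS), low urgency notifies you (email/chat),
# and informational alerts don't notify at all.
PRIORITY_TO_URGENCY = {
    "P1": "high",
    "P2": "high",
    "P3": "low",
    "info": "none",  # "FYI alert": recorded, but nobody is interrupted
}

def notification_channels(priority: str) -> list[str]:
    urgency = PRIORITY_TO_URGENCY.get(priority, "low")  # default to low urgency
    if urgency == "high":
        return ["phone", "sms"]
    if urgency == "low":
        return ["email", "chat"]
    return []
```

The exact cut-off between "page me" and "tell me later" is a policy decision for your team; the value of writing it down as data is that the decision stops being re-made, alert by alert, at 3 AM.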

For a deeper dive into these, please take a look at our Incident Response Ops Guide, specifically Alerting Principles and Severity Levels.

While you're thinking about alerts, let's fork onto a related topic: the incident itself. When an alert meets the criteria for an incident, you now have an active incident. That's great! 😅 Incidents are typically resolved via a call: a phone call, a web conference, both, or, in the Before Times, the relevant responders could physically get together in a conference room. Whatever the case may be, now everyone needs to be able to communicate.

Some basic rules of engagement will help prevent people from talking over each other: for example, reactions in Zoom or raising hands on camera. Make sure your mic is muted when you're not the active speaker, and/or use something like Krisp to keep background noise at bay. Also, regardless of whether you're using speech or text, avoid acronyms and jargon as much as possible. The extra labor of typing or saying "Subject Matter Expert" instead of "SME" saves the time otherwise spent defining acronyms on the call. This is extremely relevant because, especially in a major incident, not everyone on the call may share the same acronym usage and jargon. There may be front-end, back-end, UI, legal, and HR representatives either on the call or receiving updates. In fact, more likely than not there will be non-engineers on the call.

Why is this? A lot goes into handling an active incident, and triage and resolution are only part of it. That part is the focus for the engineers, i.e. Subject Matter Experts; but who is communicating status updates to teams, the company, or even external audiences? Who is documenting what's being tried and when? Ideally, the engineering team working to solve the incident should not be taking on any other tasks. This means that other groups can, and should, be on call. At PagerDuty we've defined these separate roles as our incident command process. They are:

  • Incident Commander
  • Deputy
  • Scribe
  • Subject Matter Expert
  • Customer and/or Internal Communications Liaisons 

Briefly: the Incident Commander makes the final call on what steps are taken, and when. They rely on input from the relevant engineers / subject matter experts to guide the decision, but no action is taken without their go-ahead. The Deputy is second in command, there to assist the Incident Commander and take over managing the incident if necessary. The Scribe documents the incident as it happens, e.g. "restarted Kubernetes cluster at 8 AM; cluster back online." The Subject Matter Experts are the ones who understand the services and systems impacted by the incident and are actively working to resolve it. And finally, the Liaisons are responsible for communicating status updates. This is vital to have as a separate role, as internal executives and representatives will need updates that they can relay to their own stakeholders, customers, etc. For more information about how to train these roles, please take a look at our Incident Command Training page.

Whew! I know that was a lot, but now you're well on your way to streamlining your alerts and incidents. Please feel free to reach out to us on our Community Forums if you have any follow-up questions or would just like to chat. We'd love to hear from you!


Tags:
Observability
DevOps Engineers
SRE