Incident Management at Astronomer: 1 Year Later

  • Kevin Paulisse
  • Collin McNulty

As companies rapidly adopt Astro to run their mission-critical tasks, Astronomer is investing heavily in the reliability of our platform. Engineers strive to incorporate resilience, self-healing, and high availability wherever possible. Today, we run a unified incident management process built on a philosophy of “you build it, you support it” with custom-built internal tools. While no engineering system is without fault, reliability is a core pillar of our product, and we take pride in our continued investment in the tools and processes that keep our platform reliable. This article explores the history of incident management at Astronomer and shares some insights into our current processes and tooling.

History

Parts of this history are not flattering. However, taking an honest look at the past allows us to learn from our mistakes and not repeat them – just like any good incident response process!

In the early days, formal incident management processes did not seem necessary because there was a small group of engineers who understood the entire system and could troubleshoot any problem. As the company grew, engineers divided into teams specializing in their own areas of an increasingly complex product architecture. Incident management became more complicated as organizational structures, areas of responsibility, and lines of communication now had to be considered.

At Astronomer, the Research and Development (R&D) organization is responsible for product and infrastructure engineering and the Customer Reliability Engineering (CRE) organization interacts directly with customers. Each of these organizations had built its own incident management process. As a result, each incident played out in multiple Slack threads (one for R&D and one for CRE), and pages were frequently sent to the wrong engineers because CRE escalation procedures and R&D on-call practices did not align. This bred confusion, inefficiency, and distrust, especially across organizational boundaries.

Unified Incident Management Process

Looking back, much of the dysfunction in our incident management processes could be traced back to one fundamental failure: the lack of a consistent process across the company. The solution was therefore obvious: R&D and CRE needed a single, unified process. That is easier said than done for two organizations that report to different executives, but the engineers who were “in the trenches” during actual incidents insisted that R&D and CRE be equal partners in developing the new process.

In February 2023, R&D and CRE sat down to author a new, unified process. The authors discovered that the prior incident management processes were fairly consistent and had a lot of overlap, which is not surprising since incident response processes are more or less standard in this industry. They focused on keeping what was working well and discarding what wasn’t. Some of the highlights of the unified process are described below.

Even with the best process imaginable, it is not realistic to expect a human to remember everything, especially in the midst of an emergency. We knew we needed automation around steps like creating a Slack channel, inviting responders, and correlating the incident to customer tickets coming in via Zendesk.

To address this, we evaluated a number of software products and online incident management services. Some of the capabilities we explored were very impressive. However, adopting any of these tools also meant adopting that tool’s model of incident management, and both CRE and R&D strongly preferred to augment the existing process rather than adapt to a new model. Therefore, we created an internal Slack bot (named Incident Buddy) to implement our process, automating the creation of Slack channels, Zendesk tickets, and postmortem documents during an incident.
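To make this concrete, here is a minimal sketch of that kind of automation, written against Slack’s Bolt framework for Python. The channel naming scheme, responder list, and the deferred Zendesk and postmortem steps are illustrative placeholders rather than Incident Buddy’s actual implementation; only the /ib command name comes from the workflow described later in this article.

```python
# Minimal sketch of incident-creation automation using Slack's Bolt framework for
# Python. The names and IDs below are hypothetical placeholders, not Incident
# Buddy's real implementation.
import time

from slack_bolt import App

app = App()  # reads SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET from the environment

INCIDENTS_CHANNEL = "#incidents"           # company-wide announcement channel
DEFAULT_RESPONDERS = ["U_ONCALL_PRIMARY"]  # hypothetical on-call user IDs


@app.command("/ib")
def handle_ib(ack, command, client):
    """Handle `/ib create <title>` by standing up a dedicated incident channel."""
    ack()
    subcommand, _, title = command["text"].partition(" ")
    if subcommand != "create":
        return

    # 1. Create a Slack channel dedicated to this incident.
    channel_name = f"nova-{time.strftime('%Y%m%d-%H%M')}"
    channel = client.conversations_create(name=channel_name)["channel"]

    # 2. Invite the initial responders.
    client.conversations_invite(channel=channel["id"], users=DEFAULT_RESPONDERS)

    # 3. Announce the incident so others can jump in.
    client.chat_postMessage(
        channel=INCIDENTS_CHANNEL,
        text=f":rotating_light: New incident: *{title or 'untitled'}* -> <#{channel['id']}>",
    )

    # 4. (Not shown) open a correlated Zendesk ticket and start a postmortem
    #    document, then link both back into the incident channel.


if __name__ == "__main__":
    app.start(port=3000)
```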

Self Responsible Teams

Until about a year ago, every incident, alert, and escalation paged the infrastructure team (whether the problem was related to the infrastructure or not). Infrastructure engineers played “air traffic control” to triage every incoming alert or escalation and find the responsible party. This was bad for the engineers, who were unhappy about being woken up in the middle of the night for problems they had no agency to address, and the unnecessary human escalation step increased the time to resolution for incidents.

Self Responsible Teams is what we call our philosophy of “you build it, you support it.” Instead of a centralized team of engineers that is on-call for everything, we have a collection of on-call rotations divided by area of responsibility. The developers who write the code for a given service are also responsible for supporting it. Service owners are encouraged to implement metrics and alerting for the health of their service so that they are notified directly when there is a problem, before it becomes an incident.

The biggest barrier to adopting Self Responsible Teams is breaking the news to developers that they now need to be on-call. In our case, these engineers were used to getting paged on an ad-hoc basis anyway, so formalizing our processes turned out to be an overall improvement.

Generally speaking, being on-call is a burden in one’s personal life. Even if an on-caller is rarely paged, the requirement to be available cannot be overlooked. If someone must be online within minutes of receiving a page, they are effectively stuck at home since they can’t get to a computer fast enough if they are driving, at the grocery store, or at the gym.

We implemented two key policies to try to mitigate the effects of on-call on engineers’ lives. First, we stipulate that manual escalations (pages) are permitted only during an incident, eliminating concerns that pages are arbitrary or that tickets are just being “thrown over the wall.” Second, we adopted an engineering-wide policy that prevents any one person from being on-call more than one week per month and explicitly calls out that someone paged off-hours should take the necessary time to catch up on sleep or personal responsibilities.

The main challenge with Self Responsible Teams is mapping which types of problems go to which on-callers. This mapping needs to be updated continuously to account for reorganizations, new projects, and even internal movement of engineers between teams. At the time of this writing, we have only about ten on-call services, so a table in a shared document is sufficient for us. However, we know that this approach won’t work at a larger scale.
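For illustration only, that table amounts to something like the following sketch in Python; the service names, rotation names, and fallback are hypothetical, not our actual routing table.

```python
# Hypothetical sketch of an alert-to-on-call routing table. The services,
# rotations, and fallback below are invented for illustration.
ONCALL_ROUTES = {
    # alert "service" label -> on-call rotation that gets paged
    "core-api": "core-api-oncall",
    "deployments": "deployments-oncall",
    "infrastructure": "infrastructure-oncall",
}

# Hypothetical catch-all rotation; a real mapping should keep this list short.
FALLBACK_ROTATION = "platform-oncall"


def rotation_for(alert_labels: dict) -> str:
    """Return the on-call rotation responsible for an alert, based on its service label."""
    return ONCALL_ROUTES.get(alert_labels.get("service", ""), FALLBACK_ROTATION)


print(rotation_for({"service": "core-api", "severity": "critical"}))  # -> core-api-oncall
```

Encoding the mapping this way would also make it easy to check, at review time, that every alert label still has an owner after a reorganization.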

Applying this to a real incident

Let’s take a look at how this plays out in practice with a real, recent incident we’ve had at Astronomer.

It’s Monday, January 22, 2024 at 3:25am PT. Astro’s Core API - our “control plane” API that powers how customers manage their Airflow deployments - is running across several pods in our Control Plane Kubernetes cluster. Some of these pods begin crashing occasionally, triggering alerts. These alerts, though, auto-resolve as the pods get restarted and become healthy again. Self-healing is expected behavior in Kubernetes, so the engineer receiving the alerts concludes that this is a transient problem that can wait until the next day, and goes back to sleep.

At 3:55am PT, an engineer on our Customer Reliability Engineering team correlates these alerts to a customer ticket in Zendesk and determines this is, in fact, a real issue. That engineer calls a “nova” - our internal codename for an incident (we tend to like space-themed names here at Astronomer). This engineer goes to Slack, where Incident Buddy is available. He runs the command /ib create and fills out some basic info - the title (“Astro platform unavailable”), the severity (S1), the impact (multiple customer outage), and a description of what we know at that point.

Incident Buddy then creates a Slack channel and invites a handful of engineers to start. This also gets posted to an internal #incidents channel, where many of us have Slack alerts set up so we can jump in when possible.

Within a minute of the incident being called (literally, at 3:56am PT), the engineer answers the most critical question of any incident: does this affect customers’ deployed pipelines or prevent their execution? The answer, fortunately, is no. This question comes up every single time we have an incident, because our primary responsibility as a managed service provider is to ensure that when customers deploy code, we run it on time. We take this very seriously, as our customers deploy critical processes to Astro.

Within a few minutes, the CRE team updates our status page as everyone continues the triaging process. Engineers who were paged in (or who joined on their own when they saw the incident get created) continue sharing helpful information - logs, current deployed versions, etc. Initially, these live as messages in the incident Slack channel. However, as the team notices certain messages are helpful to keep as a papertrail post-incident, they add the “receipt” emoji reaction in Slack. Incident Buddy will then automatically collect those for us as a papertrail. This lets us move quickly in the moment while still collecting information for the postmortem.

Separately, as the troubleshooting continues, the team comes up with ideas to (a) make the troubleshooting process easier next time and (b) make our services more resilient. As these items are shared in Slack, engineers add the “push pin” emoji, which causes Incident Buddy to collect the messages for further discussion in the postmortem without slowing down the immediate response.
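A rough sketch of how this emoji-driven collection could work, again using Slack’s Bolt framework for Python, is shown below: a reaction_added listener looks up the reacted-to message and files it under the appropriate category. The emoji shortcodes and the storage function are assumptions for illustration, not Incident Buddy’s actual code.

```python
# Sketch of emoji-driven message collection using Slack's Bolt framework for
# Python. Emoji shortcodes and the storage step are assumptions for illustration.
from slack_bolt import App

app = App()  # reads SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET from the environment

# Assumed shortcodes: :receipt: -> postmortem papertrail, :pushpin: -> follow-up item
TRACKED_REACTIONS = {"receipt": "papertrail", "pushpin": "postmortem-followup"}


@app.event("reaction_added")
def collect_flagged_message(event, client):
    """When a responder reacts with a tracked emoji, record that message."""
    category = TRACKED_REACTIONS.get(event["reaction"])
    if category is None or event["item"].get("type") != "message":
        return

    item = event["item"]
    # Fetch the single message that was reacted to.
    history = client.conversations_history(
        channel=item["channel"], latest=item["ts"], inclusive=True, limit=1
    )
    message = history["messages"][0]
    save_for_postmortem(category, item["channel"], item["ts"], message.get("text", ""))


def save_for_postmortem(category, channel, ts, text):
    # Placeholder: persist the message so it can be pulled into the postmortem later.
    print(f"[{category}] {channel}/{ts}: {text}")


if __name__ == "__main__":
    app.start(port=3000)
```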

It is now 4:29am PT, and we’ve identified a recent pull request that introduced a breaking change and have begun our hotfix process to revert the offending change. We use a set of dashboards to identify service errors for this particular API, and by 4:38am PT, we confirm internally that the new version is rolled out and that all looks good. We monitor for a few more minutes and confirm with stakeholders that the issue is no longer occurring. At 4:43am PT we update our status page to reflect that the issue is fixed. We declare the service interruption as lasting from 3:25am PT to 4:07am PT. CRE starts following up on the related customer tickets and immediately writes a short summary to share with impacted customers.

It is now Tuesday morning. CRE has finished following up on customer tickets, the problem has not recurred, and the engineers involved have caught up on their sleep. We now advance the incident to the postmortem stage. An individual from the engineering team that owns Core API is assigned to write the detailed postmortem. That individual can draw from information recorded in Slack messages or Incident Buddy: a log of papertrail and postmortem messages, a full history of troubleshooting messages, and relevant dashboards, pull requests, etc.

A week later, the broader team reviews the internal postmortem, which includes a detailed root cause analysis and specific action items to address this incident and prevent similar incidents in the future. We recognize that disregarding the initial alert was incorrect and that doing so delayed our response – in our blameless process, we do not criticize the engineer but instead develop specific action items to improve the alerts. We also suggest improvements to the Core API configuration and examine whether changes to our deployment processes are necessary. Finally, we check our logs to quantify how many requests failed between 3:25am and 4:07am PT, and CRE uses that figure along with the internal postmortem to write the public-facing postmortem.

The current iteration of the process keeps time to resolution low by adding minimal overhead during an incident, while simplifying the follow-up by making it easy to collect information along the way.

Learnings and next steps

Our new incident response process and tooling have streamlined communication during and after incidents, which has in turn increased collaboration and trust between CRE and R&D. We recently conducted internal interviews with employees involved in the incident management process, all of whom strongly agreed that the process is working well. Keeping this momentum going is highly beneficial both internally and for our customers.

Building the Incident Buddy tooling was the right decision for us given our needs at the time. However, there are limitations and opportunity costs to rolling our own tooling instead of adopting third-party solutions that offer more functionality. Given how impressive the incident management products on the market have become, it is highly likely that we will shop around for a comprehensive solution to replace Incident Buddy within the next couple of years.

Self Responsible Teams was the right direction for our engineering organization. All too often, infrastructure teams are tasked with general on-call obligations even though building and maintaining infrastructure is a specialty in and of itself. Having all alerts go to the infrastructure team was especially unnecessary at Astronomer because we have customer reliability engineers who interact directly with customers.

Even with the best incident response process, humans are still in the critical path. At Astronomer, the entire team - be it sales, support, or engineering - cares about delivering reliability to our customers. We’ve seen this process work well: when there is a critical incident, the entire team gets online to resolve it as quickly as possible. We look forward to continued investment in automation that brings high availability, redundancy, and self-healing, and we will use our incident response metrics and postmortems to measure our progress.
