Platform and deployment alerts

APC includes two built-in alerting systems for monitoring health:

  • Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
  • Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).

Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.

Alertmanager is enabled by default as part of the APC monitoring stack (tags.monitoring: true). To disable it individually, set global.alertmanagerEnabled: false in your values.yaml. See Apply platform configuration for details.
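For example, a minimal values.yaml override that keeps the monitoring stack enabled but disables Alertmanager, using the keys described above:

```yaml
# values.yaml
tags:
  monitoring: true          # keep Prometheus and the rest of the monitoring stack
global:
  alertmanagerEnabled: false  # disable only Alertmanager
```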

Alert architecture

Anatomy of an alert

Alerts are defined in YAML using PromQL queries:

```yaml
- alert: ManyUnhealthySchedulers
  expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 5
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "{{ $value }} airflow schedulers are not heartbeating"
    description: "More than 5 Airflow schedulers have not emitted a heartbeat for over 5 minutes."
```
| Field | Description |
|---|---|
| `expr` | PromQL expression that determines when to fire |
| `for` | Duration the condition must be true (for example, `5m`, `1h`) |
| `labels.tier` | Alert level: `airflow` (Deployment) or `platform` |
| `labels.severity` | Severity: `info`, `warning`, `high`, `critical` |
| `annotations.summary` | Alert message text |
| `annotations.description` | Human-readable description |

Subscribe to alerts

Configure alert receivers

Alertmanager uses receivers to integrate with notification platforms. Define receivers in your values.yaml:

Email alerts

```yaml
alertmanager:
  receivers:
    platform:
      email_configs:
        - smarthost: smtp.example.com:587
          from: alerts@example.com
          to: ops-team@example.com
          auth_username: alerts@example.com
          auth_password: ${SMTP_PASSWORD}
          send_resolved: true
```

Slack alerts

```yaml
alertmanager:
  receivers:
    platformCritical:
      slack_configs:
        - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
          text: |-
            {{ range .Alerts }}{{ .Annotations.description }}
            {{ end }}
```

PagerDuty alerts

```yaml
alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - service_key: ${PAGERDUTY_SERVICE_KEY}
          severity: '{{ .CommonLabels.severity }}'
          description: '{{ .CommonAnnotations.summary }}'
```

OpsGenie alerts

```yaml
alertmanager:
  receivers:
    platformCritical:
      opsgenie_configs:
        - api_key: ${OPSGENIE_API_KEY}
          message: '{{ .CommonAnnotations.summary }}'
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
```

Default receiver groups

APC includes default receiver groups based on tier and severity:

| Receiver | Tier | Severity |
|---|---|---|
| `platform` | `platform` | all |
| `platformCritical` | `platform` | `critical` |
| `airflow` | `airflow` | all |

Custom routes

If you define a platform, platformCritical, or airflow receiver, you don’t need a customRoute to route to it — alerts are automatically routed based on the tier label. Use customRoutes only for non-default routing (for example, high-severity Deployment alerts):

```yaml
alertmanager:
  customRoutes:
    - receiver: deployment-high-receiver
      match_re:
        tier: airflow
        severity: high
    - receiver: deployment-warning-receiver
      match_re:
        tier: airflow
        severity: warning
```

Custom receivers

Use alertmanager.customReceiver to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside customRoutes to route alerts to those services:

```yaml
alertmanager:
  customReceiver:
    - name: sns-receiver
      sns_configs:
        - api_url: <SNS_ENDPOINT>
          topic_arn: <SNS_TOPIC_ARN>
          subject: '[Alert: {{ .GroupLabels.alertname }}]'
          sigv4:
            region: <AWS_REGION>
            role_arn: <SNS_ROLE_ARN>
  customRoutes:
    - receiver: sns-receiver
      match_re:
        tier: platform
        severity: critical
```

Apply configuration

Push receiver configuration to your installation:

```bash
helm upgrade astronomer astronomer/astronomer \
  -f values.yaml \
  --namespace astronomer
```

Create custom alerts

Add custom alerts using the Prometheus Helm chart:

Platform alert example

Alert when multiple schedulers are unhealthy:

```yaml
prometheus:
  additionalAlerts:
    platform: |
      - alert: MultipleSchedulersUnhealthy
        expr: count(rate(airflow_scheduler_heartbeat{}[1m]) <= 0) > 2
        for: 5m
        labels:
          tier: platform
          severity: critical
        annotations:
          summary: "{{ $value }} schedulers are not heartbeating"
          description: "More than 2 Airflow schedulers are unhealthy for over 5 minutes."
```
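The `expr` above fires when more than two schedulers show a non-positive heartbeat rate. The counting logic can be sketched in Python, using hypothetical per-scheduler rate values (the real values come from Prometheus's `rate()` over the heartbeat counter):

```python
def unhealthy_scheduler_count(heartbeat_rates):
    """Count schedulers whose 1m heartbeat rate is <= 0,
    mirroring count(rate(...) <= 0) in the PromQL expression."""
    return sum(1 for rate in heartbeat_rates if rate <= 0)

# Hypothetical rates for five schedulers: four have stopped heartbeating.
rates = [0.0, 0.0, 0.0, 0.2, 0.0]
fires = unhealthy_scheduler_count(rates) > 2  # threshold from the alert rule
print(fires)  # True: 4 unhealthy schedulers exceed the threshold of 2
```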

Deployment alert example

Alert on high task failure rate:

```yaml
prometheus:
  additionalAlerts:
    airflow: |
      - alert: HighTaskFailureRate
        expr: |
          (
            sum(increase(airflow_ti_failures{}[1h])) by (deployment)
            /
            sum(increase(airflow_ti_successes{}[1h]) + increase(airflow_ti_failures{}[1h])) by (deployment)
          ) > 0.1
        for: 15m
        labels:
          tier: airflow
          severity: warning
        annotations:
          summary: "High task failure rate in {{ $labels.deployment }}"
          description: "Task failure rate exceeds 10% for the past 15 minutes."
```
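The expression divides failed task instances by the total over the window. As a quick sanity check of the threshold arithmetic, with hypothetical hourly counts:

```python
def task_failure_rate(failures, successes):
    """Fraction of task instances that failed over the window,
    mirroring failures / (successes + failures) in the alert expression."""
    total = successes + failures
    return failures / total if total else 0.0

# Hypothetical counts for one Deployment over the past hour.
rate = task_failure_rate(failures=12, successes=88)
print(rate)        # 0.12
print(rate > 0.1)  # True: 12% exceeds the 10% threshold, so the alert fires
```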

Built-in deployment alerts

For a complete list of built-in alerts, see the Prometheus alerts configmap.

| Alert | Description | Action |
|---|---|---|
| `AirflowDeploymentUnhealthy` | Deployment is unhealthy or unavailable for 15+ minutes | Check pod status, review logs |
| `AirflowPodQuota` | Using more than 95% pod quota for 10+ minutes | Increase Extra Capacity or optimize Dags |
| `AirflowSchedulerUnhealthy` | Scheduler not heartbeating for 6+ minutes | Check scheduler logs, restart if needed |
| `AirflowTasksPendingIncreasing` | Tasks pending faster than clearing for 30+ minutes | Increase concurrency or worker resources |

Built-in platform alerts

| Alert | Description | Action |
|---|---|---|
| `CriticalComponentPodCrashLooping` | A core platform component pod (Houston, Commander, Grafana, Prometheus, Registry) is repeatedly restarting for 15+ minutes | Check pod logs in the APC namespace, investigate the crash cause |
| `CriticalComponentPodNotReady` | A pod in the APC platform namespace has been in a non-ready state for 15+ minutes | Check pod events and logs in the APC namespace |
| `TargetDown` | More than 10% of Prometheus scrape targets for a job are unreachable for 10+ minutes | Check the failing service’s pods and endpoints |
| `ElasticSeachUnassignedShards` | Elasticsearch cluster has unassigned shards for 10+ minutes | Check Elasticsearch cluster health and logs |
| `ElasticDiskHighWatermarkReached` | Elasticsearch node disk usage exceeds 90% for 5+ minutes | Increase Elasticsearch storage or clean up old indices |
| `ElasticDiskFloodWatermarkReached` | Elasticsearch node disk usage exceeds 95% for 5+ minutes; Elasticsearch enforces a read-only index block at this threshold | Immediately increase storage or delete old indices |
| `IngessCertificateExpiration` | A TLS certificate for a platform hostname expires in less than one week | Renew the TLS certificate |

The `ElasticSeachUnassignedShards` and `IngessCertificateExpiration` alert names contain typos in their current implementation. Use these exact names, typos included, when creating silences or custom routes.

Viewing active alerts

Alertmanager UI

Access Alertmanager to view active alerts:

https://alertmanager.<base-domain>

Prometheus UI

Query alerts in Prometheus:

https://prometheus.<base-domain>/alerts

CLI

```bash
# View firing alerts
kubectl exec -n astronomer prometheus-0 -- \
  wget -qO- localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
```

Silencing alerts

Temporarily silence alerts during maintenance:

Via Alertmanager UI

  1. Go to https://alertmanager.<base-domain>
  2. Click Silences > New Silence
  3. Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
  4. Set duration and comment
  5. Click Create

Via API

```bash
curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}],
    "startsAt": "2026-02-05T00:00:00Z",
    "endsAt": "2026-02-05T06:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'
```
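If you script silences, it helps to generate the timestamps rather than hard-code them. A minimal Python sketch that builds the same v2 silence payload relative to the current time (the function name and arguments are illustrative, not part of any API):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(alertname, hours, created_by, comment):
    """Build an Alertmanager v2 silence payload for a single alert name."""
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }

payload = build_silence("AirflowSchedulerUnhealthy", 6, "admin", "Maintenance window")
print(json.dumps(payload, indent=2))
```

POST the resulting JSON to `https://alertmanager.<base-domain>/api/v2/silences` as in the curl example above.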

Best practices

  1. Start with built-in alerts before creating custom ones
  2. Set appropriate thresholds - avoid alert fatigue
  3. Use severity levels - reserve critical for pages
  4. Include runbook links in alert descriptions
  5. Test alerts in non-production environments first
  6. Document escalation paths for each severity level