Platform and deployment alerts
APC includes two built-in categories of alerts for monitoring the health of your installation:
- Deployment-level alerts: Notify you when an Airflow Deployment is unhealthy or components are underperforming.
- Platform-level alerts: Notify you when APC platform components are unhealthy (Elasticsearch, Houston API, Registry, Commander).
Alerts fire based on metrics collected by Prometheus. When alert conditions are met, Prometheus Alertmanager sends notifications to your configured channels.
Alertmanager is enabled by default as part of the APC monitoring stack (tags.monitoring: true). To disable Alertmanager without disabling the rest of the monitoring stack, set global.alertmanagerEnabled: false in your values.yaml. See Apply platform configuration for details.
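For example, a values.yaml fragment that keeps the monitoring stack deployed but turns off Alertmanager would combine the two settings above:

```yaml
# values.yaml
tags:
  monitoring: true          # deploys the APC monitoring stack, including Alertmanager
global:
  alertmanagerEnabled: false  # disables only Alertmanager
```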
Alert architecture
Anatomy of an alert
Alerts are defined in YAML using PromQL queries:
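A rule follows the standard Prometheus alerting-rule format. The sketch below reuses the AirflowSchedulerUnhealthy alert name that appears later in this guide; the metric in the expression is illustrative:

```yaml
groups:
  - name: airflow-alerts
    rules:
      - alert: AirflowSchedulerUnhealthy
        # PromQL condition; the metric and label names here are illustrative
        expr: up{component="scheduler"} == 0
        for: 5m               # condition must hold for 5 minutes before firing
        labels:
          tier: airflow       # used by Alertmanager routing (see Custom routes)
          severity: critical
        annotations:
          summary: "Airflow scheduler is down"
```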
Subscribe to alerts
Configure alert receivers
Alertmanager uses receivers to integrate with notification platforms. Define receivers in your values.yaml:
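The overall shape, assuming the built-in receiver keys described in this section (platform, platformCritical, airflow) accept standard Alertmanager *_configs blocks:

```yaml
alertmanager:
  receivers:
    platform:            # routine platform-tier alerts
      email_configs: []  # any standard Alertmanager *_configs block goes here
    platformCritical:    # critical platform-tier alerts
      pagerduty_configs: []
    airflow:             # Deployment-level alerts
      slack_configs: []
```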
Email alerts
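A sketch using Alertmanager's standard email_configs block; the SMTP host and addresses are placeholders:

```yaml
alertmanager:
  receivers:
    platform:
      email_configs:
        - to: ops@example.com
          from: alertmanager@example.com
          smarthost: smtp.example.com:587
          auth_username: alertmanager@example.com
          auth_password: <smtp-password>
          require_tls: true
```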
Slack alerts
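A sketch using Alertmanager's standard slack_configs block with an incoming-webhook URL; the channel name is a placeholder:

```yaml
alertmanager:
  receivers:
    platform:
      slack_configs:
        - api_url: https://hooks.slack.com/services/<webhook-path>
          channel: '#platform-alerts'
          title: '{{ .CommonAnnotations.summary }}'
```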
PagerDuty alerts
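A sketch using Alertmanager's standard pagerduty_configs block with a PagerDuty Events API v2 integration key:

```yaml
alertmanager:
  receivers:
    platformCritical:
      pagerduty_configs:
        - routing_key: <integration-key>  # Events API v2 integration key
          severity: critical
```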
OpsGenie alerts
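A sketch using Alertmanager's standard opsgenie_configs block:

```yaml
alertmanager:
  receivers:
    platform:
      opsgenie_configs:
        - api_key: <opsgenie-api-key>
          priority: P2
```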
Default receiver groups
APC includes default receiver groups based on tier and severity:
Custom routes
If you define a platform, platformCritical, or airflow receiver, you don’t need a customRoute to route to it — alerts are automatically routed based on the tier label. Use customRoutes only for non-default routing (for example, high-severity Deployment alerts):
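A sketch of such a route, assuming customRoutes entries follow standard Alertmanager route syntax; the receiver name is hypothetical and must match a receiver you have defined:

```yaml
alertmanager:
  customRoutes:
    - receiver: airflow-critical   # hypothetical custom receiver name
      matchers:
        - tier = airflow
        - severity = critical
```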
Custom receivers
Use alertmanager.customReceiver to define receivers for notification services not covered by the built-in receiver keys. Custom receivers work alongside customRoutes to route alerts to those services:
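A sketch assuming customReceiver entries are standard Alertmanager receiver definitions; the receiver name and webhook URL are placeholders:

```yaml
alertmanager:
  customReceiver:
    - name: airflow-critical       # referenced by a matching customRoutes entry
      webhook_configs:
        - url: https://example.com/alert-hook
          send_resolved: true
```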
Apply configuration
Push receiver configuration to your installation:
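Assuming a Helm-based installation, an upgrade with your updated values would look like the following; the release name and namespace are placeholders:

```shell
helm upgrade <release-name> astronomer/astronomer \
  --namespace <namespace> \
  -f values.yaml
```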
Create custom alerts
Add custom alerts using the Prometheus Helm chart:
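The exact values key for custom alert rules depends on your chart version; the sketch below assumes an additionalAlerts block that accepts rule definitions as a YAML string:

```yaml
prometheus:
  additionalAlerts:   # key name assumed; verify against your chart's values schema
    platform: |
      - alert: ExamplePlatformAlert
        expr: vector(0) > 0   # placeholder expression
        for: 10m
        labels:
          tier: platform
          severity: warning
```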
Platform alert example
Alert when multiple schedulers are unhealthy:
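A sketch of such a rule; the alert name, metric, and threshold are illustrative:

```yaml
- alert: MultipleSchedulersUnhealthy
  # Counts scheduler components reporting down; metric name is illustrative
  expr: count(up{component="scheduler"} == 0) > 1
  for: 5m
  labels:
    tier: platform
    severity: critical
  annotations:
    summary: "More than one Airflow scheduler is unhealthy"
```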
Deployment alert example
Alert on high task failure rate:
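A sketch of a failure-rate rule. The ti_failures and ti_successes counters are common Airflow StatsD metrics, but verify the exact metric names your Deployments emit; the 10% threshold is illustrative:

```yaml
- alert: HighTaskFailureRate
  expr: |
    rate(airflow_ti_failures[15m])
      / (rate(airflow_ti_failures[15m]) + rate(airflow_ti_successes[15m])) > 0.1
  for: 15m
  labels:
    tier: airflow
    severity: warning
  annotations:
    summary: "Task failure rate above 10%"
```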
Built-in deployment alerts
For a complete list of built-in alerts, see the Prometheus alerts configmap.
Built-in platform alerts
The ElasticSeachUnassignedShards and IngessCertificateExpiration alert names contain typos in their current implementation. Use the exact names shown when creating silences or custom routes.
Viewing active alerts
Alertmanager UI
Open the Alertmanager UI at https://alertmanager.<base-domain> to view active alerts, their labels, and any silences currently in effect.
Prometheus UI
Query alerts in Prometheus:
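Prometheus exposes alert state through the built-in ALERTS series; for example, to list all currently firing alerts:

```promql
ALERTS{alertstate="firing"}
```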
CLI
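You can also query alerts with amtool, the CLI bundled with Alertmanager, pointed at the same address as the UI:

```shell
amtool alert query --alertmanager.url=https://alertmanager.<base-domain>
```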
Silencing alerts
Temporarily silence alerts during maintenance:
Via Alertmanager UI
- Go to https://alertmanager.<base-domain>
- Click Silences > New Silence
- Add matchers (for example, alertname=AirflowSchedulerUnhealthy)
- Set duration and comment
- Click Create
Via API
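A sketch using Alertmanager's v2 API to create a silence; the matcher mirrors the UI example above, and the timestamps and author are placeholders:

```shell
curl -X POST https://alertmanager.<base-domain>/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "AirflowSchedulerUnhealthy", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T02:00:00Z",
    "createdBy": "ops@example.com",
    "comment": "Scheduled maintenance"
  }'
```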
Best practices
- Start with built-in alerts before creating custom ones
- Set appropriate thresholds to avoid alert fatigue
- Use severity levels; reserve critical for pages
- Include runbook links in alert descriptions
- Test alerts in non-production environments first
- Document escalation paths for each severity level