Scrape metrics from Remote Execution Agents
Airflow 3
This feature is only available for Airflow 3.x Deployments.
This guide lists the metrics available when you run Astro’s Remote Execution Agents and explains how to scrape them using OpenTelemetry and Prometheus.
What you can monitor
A self-managed Prometheus instance can collect the following classes of metrics from agent components:
- Agent client metrics that each agent component exposes on its own /metrics endpoint, including a component_health gauge and heartbeat, agent proxy, and Python runtime metrics. The Dag processor also exposes parsing-pipeline metrics. See Agent client metrics for the full list.
- Airflow application metrics that Airflow emits in StatsD format from the worker, Dag processor, and triggerer. Examples include scheduler heartbeats, task instance state counts, and Dag parse times.
- Kubernetes infrastructure metrics for the agent Pods, such as CPU, memory, and Pod status. These come from kube-state-metrics, cAdvisor, and node-exporter on your cluster.
- Sentinel runtime metrics. Sentinel reports agent and integration health back to Astro, and you can scrape its Pod for local observability. Requires Sentinel to be enabled.
For metrics that Astro generates outside your cluster, such as orchestration plane scheduler activity, see Export metrics from Astro.
Prerequisites
- A Remote Execution Agent (running Agent Client version 1.7.0 and above) installed in your Kubernetes cluster. See Register and configure agents.
- A running Prometheus deployment in the same cluster, or one that can reach the agent namespace over the network. The Prometheus Operator with PodMonitor or ServiceMonitor resources is supported, and so is a standalone Prometheus that uses static scrape configs.
- Cluster-level access to deploy Helm chart updates and, if you use the Prometheus Operator, to create custom resources.
- kube-state-metrics and cAdvisor available in your cluster, if you want to collect Kubernetes infrastructure metrics.
Monitor Airflow application metrics
Airflow can expose metrics to OpenTelemetry or StatsD.
This guide covers a minimal monitoring example using OpenTelemetry. Features such as Kubernetes namespace isolation are out of scope for this guide. Code examples assume the components are deployed in a Kubernetes namespace named re.
See additional configuration here:
Step 1: Install the OpenTelemetry dependency
Add apache-airflow[otel] to requirements-client.txt and deploy the Remote Execution agents (using astro remote deploy).
Step 2: Install OpenTelemetry
This step installs an OpenTelemetry Collector on your Kubernetes cluster using a Helm chart. If you already have OpenTelemetry running, you can skip this step.
From your terminal, add the open-telemetry Helm repository and download the latest metadata:
Create a YAML file, for example otel-values.yaml, for the OpenTelemetry configuration:
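The values file from the original page is not reproduced here. As a rough sketch only, the following Python script writes a candidate otel-values.yaml with a collector pipeline that receives OTLP over HTTP and re-exposes the metrics in Prometheus format on port 8889, the port used in Step 4. The top-level config key, the receiver port 4318, and the helper names are assumptions based on common OpenTelemetry Collector setups, so check them against your chart version. The file is written as JSON, which YAML 1.2 parsers generally accept.

```python
import json

# Assumed ports: 4318 is the conventional OTLP/HTTP port; 8889 matches Step 4.
OTLP_HTTP_PORT = 4318
PROMETHEUS_PORT = 8889


def build_collector_values() -> dict:
    """Return a hypothetical Helm values dict for an OpenTelemetry Collector."""
    return {
        "config": {  # assumed chart key that holds the collector configuration
            "receivers": {
                "otlp": {
                    "protocols": {"http": {"endpoint": f"0.0.0.0:{OTLP_HTTP_PORT}"}}
                }
            },
            "exporters": {
                # Re-expose received metrics on a Prometheus scrape endpoint.
                "prometheus": {"endpoint": f"0.0.0.0:{PROMETHEUS_PORT}"}
            },
            "service": {
                "pipelines": {
                    "metrics": {"receivers": ["otlp"], "exporters": ["prometheus"]}
                }
            },
        }
    }


def render(values: dict) -> str:
    """Serialize the values as indented JSON text."""
    return json.dumps(values, indent=2) + "\n"


if __name__ == "__main__":
    with open("otel-values.yaml", "w", encoding="utf-8") as fh:
        fh.write(render(build_collector_values()))
```

A real values file usually also needs chart-level settings such as the deployment mode, image, and exposed Service ports. Those are left out here because they depend on the chart version you install.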
Install the OpenTelemetry Collector:
Step 3: Configure Airflow to ship metrics to OpenTelemetry
Configure these environment variables in your Remote Execution values.yaml under commonEnv:
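The exact variables from the original page are not reproduced here. As a sketch, the following Python snippet prints commonEnv entries built from Airflow's [metrics] OpenTelemetry options (otel_on, otel_host, otel_port, otel_interval_milliseconds). The collector Service name otel-collector, the port 4318, the 30-second interval, and the name/value list shape of commonEnv are assumptions to adjust for your installation.

```python
def airflow_env(section: str, key: str) -> str:
    """Build an Airflow config environment variable name: AIRFLOW__SECTION__KEY."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def otel_common_env(host: str, port: int, interval_ms: int = 30000) -> list[dict]:
    """Return name/value entries that turn on Airflow's OpenTelemetry metrics."""
    settings = {
        "otel_on": "True",
        "otel_host": host,
        "otel_port": str(port),
        "otel_interval_milliseconds": str(interval_ms),  # assumed export interval
    }
    return [{"name": airflow_env("metrics", k), "value": v} for k, v in settings.items()]


if __name__ == "__main__":
    # Hypothetical collector Service in the "re" namespace.
    entries = otel_common_env("otel-collector.re.svc.cluster.local", 4318)
    print("commonEnv:")
    for entry in entries:
        print(f"  - name: {entry['name']}")
        print(f"    value: \"{entry['value']}\"")
```

The printed lines can be pasted under commonEnv if your chart expects a list of name/value pairs. Values are quoted so that YAML keeps them as strings.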
This ensures Airflow services push metrics to the OTLP endpoint.
Upgrade your Remote Execution Helm chart:
Step 4: Verify metrics
At this stage, Airflow application metrics should arrive in the OpenTelemetry Collector. You can verify this by port forwarding the otel-collector:
Browse to http://localhost:8889/metrics and check if you observe any metrics. Note that you might need to run some Airflow tasks for more metrics to show. Here’s an example of metrics that you might observe:
If you see metrics similar to the above, your setup is successful.
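The same check can be scripted instead of done in a browser. The sketch below assumes the port forward from this step is active on localhost:8889 and lists the distinct metric names found in the Prometheus text output. The helper name metric_names is illustrative.

```python
import re
import urllib.request

# Matches the metric name at the start of a sample line, e.g. "name{labels} 1.0".
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition: str) -> list[str]:
    """Return sorted, distinct metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    # Requires the port forward to the collector to be running.
    with urllib.request.urlopen("http://localhost:8889/metrics", timeout=5) as resp:
        body = resp.read().decode("utf-8")
    names = metric_names(body)
    print(f"{len(names)} metric names found")
    for name in names:
        print(name)
```

An empty list usually means that no Airflow component has pushed metrics yet, so running a few tasks and retrying is a reasonable next step.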
From here, you can configure your metrics backend such as Prometheus to scrape the metrics from otel-collector:8889. A common metrics/monitoring setup is Prometheus as the metrics backend, plus Grafana for visualization.
Agent client metrics
Each agent component (worker, Dag processor, and triggerer) runs an internal HTTP server that exposes Prometheus-format metrics directly on a /metrics endpoint, on port 39091 by default (configurable through the setting http_server.port, or using the environment variable ASTRO_AGENT_CLIENT_HTTP_SERVER__PORT). These metrics describe the agent client’s own runtime, including its health, its heartbeat traffic with the Astro orchestration plane, and the Python process it runs in. Agent client metrics are independent of the Airflow application metrics described in Monitor Airflow application metrics, and you can scrape both endpoints from the same Prometheus instance.
The most important health signal is the component_health gauge. Each internal subsystem reports 1 when healthy and 0 when unhealthy, for example:
The component label values vary by agent client. For example, the worker reports its own set of subsystems, and the Dag processor reports subsystems for the parsing pipeline. Alert when any component_health series drops to 0.
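To illustrate the alert condition, the sketch below scans Prometheus text output for component_health series whose value is 0. The sample text and its component label values are made up for the example, and real label values depend on the agent client. The parsing is deliberately simple and does not handle escaped characters inside label values.

```python
import re

# Matches lines such as: component_health{component="x"} 1
_SERIES = re.compile(r'^component_health\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def unhealthy_components(exposition: str) -> list[str]:
    """Return the component label of every component_health series equal to 0."""
    unhealthy = []
    for line in exposition.splitlines():
        match = _SERIES.match(line.strip())
        if not match:
            continue  # not a labelled component_health sample
        labels = dict(_LABEL.findall(match.group("labels")))
        if float(match.group("value")) == 0:
            unhealthy.append(labels.get("component", "<unknown>"))
    return unhealthy


if __name__ == "__main__":
    # Hypothetical sample; real component names will differ.
    sample = (
        'component_health{component="example_a"} 1\n'
        'component_health{component="example_b"} 0\n'
    )
    print(unhealthy_components(sample))
```

In Prometheus itself, the equivalent alert expression is component_health == 0.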
If you use Prometheus to scrape your agent clients, you can add the following annotations through the Helm chart so that the metrics get collected:
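The annotation block from the original page is not reproduced here. A widely used convention is the prometheus.io/* annotation set, which works only when your Prometheus scrape configuration relabels on those annotations. The sketch below builds them for the default agent port 39091. Whether your Prometheus honors them, and which Helm values key attaches annotations to the agent Pods, are assumptions to verify.

```python
def scrape_annotations(port: int = 39091, path: str = "/metrics") -> dict[str, str]:
    """Build conventional prometheus.io/* Pod annotations (values must be strings)."""
    return {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": str(port),
        "prometheus.io/path": path,
    }


if __name__ == "__main__":
    # Print the annotations in a form that can be pasted into a YAML mapping.
    for key, value in scrape_annotations().items():
        print(f'{key}: "{value}"')
```

If you changed the agent port through http_server.port or ASTRO_AGENT_CLIENT_HTTP_SERVER__PORT, pass the same value as port so the annotation matches the endpoint.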
Generic metrics
All three agent clients ship a common set of metrics that cover Python runtime, process resources, heartbeat traffic with the API server, queue state, and the agent proxy.
Dag processor metrics
In addition to the generic metrics, the Dag processor exposes parsing-pipeline metrics.
Note that the heartbeat_* metrics described earlier in the Generic metrics section are also available for the Dag processor component, and can be filtered by component=dag_processor.