Scrape metrics from Remote Execution Agents

Airflow 3

This feature is only available for Airflow 3.x Deployments.

This guide lists available metrics when running Astro’s Remote Execution Agents and explains how to scrape metrics using OpenTelemetry and Prometheus.

What you can monitor

A self-managed Prometheus instance can collect the following classes of metrics from agent components:

Agent client metrics that each agent component exposes on its own /metrics endpoint, including a component_health gauge and heartbeat, agent proxy, and Python runtime metrics. The Dag processor also exposes parsing-pipeline metrics. See Agent client metrics for the full list.
Airflow application metrics that Airflow emits in StatsD format from the worker, Dag processor, and triggerer. Examples include scheduler heartbeats, task instance state counts, and Dag parse times.
Kubernetes infrastructure metrics for the agent Pods, such as CPU, memory, and Pod status. These come from kube-state-metrics, cAdvisor, and node-exporter on your cluster.
Sentinel runtime metrics. Sentinel reports agent and integration health back to Astro, and you can scrape its Pod for local observability. These metrics require Sentinel to be enabled.

For metrics that Astro generates outside your cluster, such as orchestration plane scheduler activity, see Export metrics from Astro.

Prerequisites

A Remote Execution Agent running Agent Client version 1.7.0 or later, installed in your Kubernetes cluster. See Register and configure agents.
A running Prometheus deployment in the same cluster, or one that can reach the agent namespace over the network. The Prometheus Operator with PodMonitor or ServiceMonitor resources is supported, and so is a standalone Prometheus that uses static scrape configs.
Cluster-level access to deploy Helm chart updates and, if you use the Prometheus Operator, to create custom resources.
kube-state-metrics and cAdvisor available in your cluster, if you want to collect Kubernetes infrastructure metrics.

Monitor Airflow application metrics

Airflow can expose metrics to OpenTelemetry or StatsD.

This guide covers a minimal monitoring example using OpenTelemetry. Features such as Kubernetes namespace isolation are out of scope for this guide. Code examples assume the components are deployed in a Kubernetes namespace named re.

See additional configuration here:

Step 1: Install the OpenTelemetry dependency

Add apache-airflow[otel] to requirements-client.txt, then deploy the Remote Execution Agents with astro remote deploy.

Step 2: Install the OpenTelemetry Collector

This step installs an OpenTelemetry Collector on your Kubernetes cluster using a Helm chart. If you already have OpenTelemetry running, you can skip this step.

From your terminal, add the open-telemetry Helm repository and download the latest metadata:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Create a YAML file for the OpenTelemetry Collector configuration, for example otel-values.yaml:

otel-values.yaml

mode: deployment

image:
  repository: otel/opentelemetry-collector-contrib

command:
  name: otelcol-contrib

config:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

  exporters:
    debug:
      verbosity: detailed
    prometheus:
      endpoint: 0.0.0.0:8889

  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [debug, prometheus]

ports:
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    protocol: TCP

  prometheus:
    enabled: true
    containerPort: 8889
    servicePort: 8889
    protocol: TCP

Install the OpenTelemetry Collector:

helm install otel-collector open-telemetry/opentelemetry-collector -n re -f otel-values.yaml

Step 3: Configure Airflow to ship metrics to OpenTelemetry

Configure these environment variables in your Remote Execution values.yaml under commonEnv:

values.yaml

commonEnv:
  - name: AIRFLOW__METRICS__OTEL_ON
    value: "True"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector-opentelemetry-collector.re.svc.cluster.local:4318
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf

This ensures Airflow services push metrics to the OTLP endpoint.

Upgrade your Remote Execution Helm chart:

helm upgrade astro-agent astronomer/astro-remote-execution-agent --values values.yaml

Step 4: Verify metrics

At this stage, Airflow application metrics should arrive in the OpenTelemetry Collector. You can verify this by port forwarding to the Collector's Prometheus exporter port:

kubectl port-forward -n re deploy/otel-collector-opentelemetry-collector 8889:8889

Browse to http://localhost:8889/metrics and check if you observe any metrics. Note that you might need to run some Airflow tasks for more metrics to show. Here’s an example of metrics that you might observe:

# HELP airflow_airflow_io_load_filesystems 
# TYPE airflow_airflow_io_load_filesystems gauge
airflow_airflow_io_load_filesystems{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 209.85527300035756
# HELP airflow_dag_example_astronauts_get_astronauts_duration 
# TYPE airflow_dag_example_astronauts_get_astronauts_duration gauge
airflow_dag_example_astronauts_get_astronauts_duration{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 3903.361
# HELP airflow_operator_successes_pythondecoratedoperator_total 
# TYPE airflow_operator_successes_pythondecoratedoperator_total counter
airflow_operator_successes_pythondecoratedoperator_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_operator_successes_total 
# TYPE airflow_operator_successes_total counter
airflow_operator_successes_total{dag_id="example_astronauts",job="airflow",operator="_PythonDecoratedOperator",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_serde_load_serializers 
# TYPE airflow_serde_load_serializers gauge
airflow_serde_load_serializers{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 0.8434970004600473
# HELP airflow_task_duration 
# TYPE airflow_task_duration gauge
airflow_task_duration{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 3903.361
# HELP airflow_ti_finish_example_astronauts_get_astronauts_success_total 
# TYPE airflow_ti_finish_example_astronauts_get_astronauts_success_total counter
airflow_ti_finish_example_astronauts_get_astronauts_success_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_finish_total 
# TYPE airflow_ti_finish_total counter
airflow_ti_finish_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",state="success",task_id="get_astronauts"} 1
# HELP airflow_ti_start_example_astronauts_get_astronauts_total 
# TYPE airflow_ti_start_example_astronauts_get_astronauts_total counter
airflow_ti_start_example_astronauts_get_astronauts_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_start_total 
# TYPE airflow_ti_start_total counter
airflow_ti_start_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_successes_total 
# TYPE airflow_ti_successes_total counter
airflow_ti_successes_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1

If you see metrics similar to the above, your setup is successful.

From here, you can configure a metrics backend, such as Prometheus, to scrape the metrics from otel-collector:8889. A common monitoring setup uses Prometheus as the metrics backend and Grafana for visualization.
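As a minimal sketch of that scrape step, a standalone Prometheus could read the Collector's Prometheus exporter with a static scrape config like the one below. The target reuses the Service name, namespace, and port from the earlier steps. The job name and scrape interval are illustrative assumptions, so adjust them to your environment.

```yaml
# prometheus.yml (fragment)
# job_name and scrape_interval are illustrative assumptions, not required values.
scrape_configs:
  - job_name: airflow-otel
    scrape_interval: 30s
    # metrics_path defaults to /metrics, which matches the Collector's exporter.
    static_configs:
      - targets:
          # Service created by the otel-collector Helm release in namespace re,
          # exposing the prometheus port (8889) defined in otel-values.yaml.
          - otel-collector-opentelemetry-collector.re.svc.cluster.local:8889
```

If Prometheus runs outside the cluster, the in-cluster DNS name will not resolve, and the target would need to be an address that Prometheus can reach.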

Agent client metrics

Each agent component (worker, Dag processor, and triggerer) runs an internal HTTP server that exposes Prometheus-format metrics directly on a /metrics endpoint, on port 39091 by default (configurable through the setting http_server.port, or using the environment variable ASTRO_AGENT_CLIENT_HTTP_SERVER__PORT). These metrics describe the agent client’s own runtime, including its health, its heartbeat traffic with the Astro orchestration plane, and the Python process it runs in. Agent client metrics are independent of the Airflow application metrics described in Monitor Airflow application metrics, and you can scrape both endpoints from the same Prometheus instance.

The most important health signal is the component_health gauge. Each internal subsystem reports 1 when healthy and 0 when unhealthy, for example:

component_health{component="TriggererHeartbeater"} 1.0
component_health{component="TriggererProc"} 1.0
component_health{component="Server"} 1.0

The component label values vary by agent client. For example, the worker reports its own set of subsystems, and the Dag processor reports subsystems for the parsing pipeline. Alert when any component_health series drops to 0.
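One way to express that alert is a Prometheus alerting rule along the lines of the sketch below. The group name, alert name, `for` duration, and severity label are hypothetical choices. Only the component_health metric and its component label come from the agent client.

```yaml
# agent-health.rules.yaml (fragment)
# Names, duration, and severity are assumptions; tune them for your setup.
groups:
  - name: remote-execution-agent-health
    rules:
      - alert: AgentComponentUnhealthy
        # Fires for each subsystem whose health gauge reports 0.
        expr: component_health == 0
        # Wait before firing so a brief blip does not page anyone.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent subsystem {{ $labels.component }} reports unhealthy"
```

Because the expression keeps the component label, each unhealthy subsystem produces its own alert instance.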

If you use Prometheus to scrape metrics from your agent clients, add the following annotations to your Helm chart values so that Prometheus collects the metrics:

values.yaml

# inside your values.yaml file
# make sure to update the port annotation if you changed the default value
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  prometheus.io/scheme: "http"
  prometheus.io/port: "39091"
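Prometheus does not act on prometheus.io/* annotations by itself. They are a convention that takes effect only when a scrape job maps them through relabeling. Some Prometheus installations already include such a job. If yours does not, the following sketch shows a commonly used pattern. The job name and the restriction to the re namespace are assumptions.

```yaml
# prometheus.yml (fragment)
# job_name and the namespace filter are illustrative assumptions.
scrape_configs:
  - job_name: annotated-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - re
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Use the scheme from prometheus.io/scheme when present.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        regex: '(https?)'
        target_label: __scheme__
      # Use the path from prometheus.io/path when present.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Rewrite the scrape address to use the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```

With the Prometheus Operator, a PodMonitor or ServiceMonitor resource would typically replace this job, and the annotations are not needed.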

Generic metrics

All three agent clients ship a common set of metrics that cover Python runtime, process resources, heartbeat traffic with the API server, queue state, and the agent proxy.

Metric	Type	Description
`component_health`	Gauge	Health status of agent subsystems. `1` is healthy and `0` is unhealthy.
`heartbeat_requests_total`	Counter	Total heartbeat attempts, with `component` and `outcome` labels (`success`, `timeout`, `error`).
`heartbeat_duration_seconds`	Histogram	Time from the start of a heartbeat request to the receipt of the provider response.
`heartbeat_payload_bytes`	Histogram	Size in bytes of the serialized heartbeat request body before it is sent.
`heartbeat_requests_received_total`	Counter	Total heartbeat requests received from tasks.
`heartbeat_requests_sent_total`	Counter	Total heartbeat requests sent to the API server.
`heartbeat_requests_error_total`	Counter	Total heartbeat requests that failed to be sent to the API server.
`matched_proxy_errors_total`	Counter	Total proxy failures for matched routes, with `route` and `error_type` labels.
`astro_agent_client_queue_stats`	Gauge	Number of tasks in each state in each queue.
`astro_agent_proxy_http_requests_total`	Counter	Total agent proxy requests, with `method`, `status`, and `handler` labels.
`astro_agent_proxy_http_request_duration_seconds`	Histogram	Agent proxy request latency, with `handler` and `method` labels. Use this when aggregation by handler matters.
`astro_agent_proxy_http_request_duration_highr_seconds`	Histogram	High-resolution agent proxy request latency for accurate percentile calculations.
`astro_agent_proxy_http_request_size_bytes`	Summary	Content length of incoming agent proxy requests, by `handler`.
`astro_agent_proxy_http_response_size_bytes`	Summary	Content length of outgoing agent proxy responses, by `handler`.
`python_info`	Gauge	Python platform information, including the interpreter implementation and version labels.
`python_gc_objects_collected_total`	Counter	Objects collected during garbage collection, by `generation`.
`python_gc_objects_uncollectable_total`	Counter	Uncollectable objects found during garbage collection, by `generation`.
`python_gc_collections_total`	Counter	Number of times each generation was collected.
`process_cpu_seconds_total`	Counter	Total user and system CPU time, in seconds.
`process_virtual_memory_bytes`	Gauge	Virtual memory size, in bytes.
`process_resident_memory_bytes`	Gauge	Resident memory size, in bytes.
`process_start_time_seconds`	Gauge	Start time of the process since the Unix epoch, in seconds.
`process_open_fds`	Gauge	Number of open file descriptors.
`process_max_fds`	Gauge	Maximum number of open file descriptors.
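To illustrate how these metrics might be combined, the sketch below defines two Prometheus recording rules: a heartbeat failure ratio per component and a 99th-percentile heartbeat latency. The rule names and the 5-minute window are hypothetical. The expressions assume the label names listed in the table and the standard _bucket series that Prometheus-format histograms expose.

```yaml
# agent-heartbeat.rules.yaml (fragment)
# Rule names and the 5m window are assumptions; adjust to your conventions.
groups:
  - name: remote-execution-agent-heartbeat
    rules:
      # Share of heartbeat attempts that did not succeed, per component.
      # Returns no series for a component while it has no failed attempts.
      - record: component:heartbeat_requests:failure_ratio_rate5m
        expr: |
          sum by (component) (rate(heartbeat_requests_total{outcome!="success"}[5m]))
          /
          sum by (component) (rate(heartbeat_requests_total[5m]))
      # 99th percentile of heartbeat round-trip time across all components.
      - record: job:heartbeat_duration_seconds:p99_rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(heartbeat_duration_seconds_bucket[5m]))
          )
```

Recording the ratio rather than the raw counters keeps dashboards and alerts readable, since counters only grow and need a rate() to be meaningful.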

Dag processor metrics

In addition to the generic metrics, the Dag processor exposes parsing-pipeline metrics.

Metric	Type	Description
`dag_processor_heartbeat_dags_sent_total`	Counter	Total Dags included in heartbeat requests.
`dag_processor_results_queue_depth`	Gauge	Number of parsed Dag results waiting in the coordinator queue. Use it as a back-pressure indicator.
`dag_processor_cache_hits_total`	Counter	Total Dags skipped because their `dag_hash` matched the cached value.
`dag_processor_cache_misses_total`	Counter	Total Dags processed because they were new, changed, or cold in the cache.
`dag_processor_cache_size`	Gauge	Number of entries in the Dag processor `dag_hashes` cache.

Note that the heartbeat_* metrics described in the Generic metrics section are also available for the Dag processor, and you can filter them with the label component=dag_processor.
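As a sketch of how the Dag processor metrics could be used, the rules below record a cache hit ratio and alert on sustained queue back-pressure. The rule names, the 15-minute window, and the threshold of 100 queued results are hypothetical values chosen for illustration, so derive real thresholds from your own baseline.

```yaml
# dag-processor.rules.yaml (fragment)
# Rule names, window, and threshold are assumptions for illustration only.
groups:
  - name: remote-execution-dag-processor
    rules:
      # Fraction of Dags skipped because the cached dag_hash matched.
      - record: job:dag_processor_cache:hit_ratio_rate15m
        expr: |
          sum(rate(dag_processor_cache_hits_total[15m]))
          /
          (
            sum(rate(dag_processor_cache_hits_total[15m]))
            + sum(rate(dag_processor_cache_misses_total[15m]))
          )
      # Parsed results piling up in the coordinator queue suggests back-pressure.
      - alert: DagProcessorResultsQueueBacklog
        expr: dag_processor_results_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Dag processor results queue has stayed above 100 for 10 minutes"
```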