Scrape metrics from Remote Execution Agents

Airflow 3
This feature is only available for Airflow 3.x Deployments.

This guide lists available metrics when running Astro’s Remote Execution Agents and explains how to scrape metrics using OpenTelemetry and Prometheus.

What you can monitor

A self-managed Prometheus instance can collect the following classes of metrics from agent components:

  • Agent client metrics that each agent component exposes on its own /metrics endpoint. These include a component_health gauge plus heartbeat, agent proxy, and Python runtime metrics. The Dag processor also exposes parsing-pipeline metrics. See Agent client metrics for the full list.
  • Airflow application metrics that Airflow emits in StatsD format from the worker, Dag processor, and triggerer. Examples include scheduler heartbeats, task instance state counts, and Dag parse times.
  • Kubernetes infrastructure metrics for the agent Pods, such as CPU, memory, and Pod status. These come from kube-state-metrics, cAdvisor, and node-exporter on your cluster.
  • Sentinel runtime metrics. Sentinel reports agent and integration health back to Astro, and you can scrape its Pod for local observability. This requires Sentinel to be enabled.

For metrics that Astro generates outside your cluster, such as orchestration plane scheduler activity, see Export metrics from Astro.

Prerequisites

  • A Remote Execution Agent running Agent Client version 1.7.0 or later, installed in your Kubernetes cluster. See Register and configure agents.
  • A running Prometheus deployment in the same cluster, or one that can reach the agent namespace over the network. The Prometheus Operator with PodMonitor or ServiceMonitor resources is supported, and so is a standalone Prometheus that uses static scrape configs.
  • Cluster-level access to deploy Helm chart updates and, if you use the Prometheus Operator, to create custom resources.
  • kube-state-metrics and cAdvisor available in your cluster, if you want to collect Kubernetes infrastructure metrics.

Monitor Airflow application metrics

Airflow can expose metrics to OpenTelemetry or StatsD.

This guide covers a minimal monitoring example using OpenTelemetry. Features such as Kubernetes namespace isolation are out of scope for this guide. Code examples assume the components are deployed in a Kubernetes namespace named re.

See additional configuration here:

Step 1: Install the OpenTelemetry dependency

Add apache-airflow[otel] to requirements-client.txt, then deploy the Remote Execution Agents with astro remote deploy.

Step 2: Install the OpenTelemetry Collector

This step installs an OpenTelemetry Collector on your Kubernetes cluster using a Helm chart. If you already have OpenTelemetry running, you can skip this step.

From your terminal, add the open-telemetry Helm repository and download the latest metadata:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Create a YAML file, for example otel-values.yaml, for the OpenTelemetry Collector configuration:

mode: deployment

image:
  repository: otel/opentelemetry-collector-contrib

command:
  name: otelcol-contrib

config:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

  exporters:
    debug:
      verbosity: detailed
    prometheus:
      endpoint: 0.0.0.0:8889

  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [debug, prometheus]

ports:
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    protocol: TCP

  prometheus:
    enabled: true
    containerPort: 8889
    servicePort: 8889
    protocol: TCP

Install the OpenTelemetry Collector:

helm install otel-collector open-telemetry/opentelemetry-collector -n re -f otel-values.yaml

Step 3: Configure Airflow to ship metrics to OpenTelemetry

Configure these environment variables in your Remote Execution values.yaml under commonEnv:

commonEnv:
  - name: AIRFLOW__METRICS__OTEL_ON
    value: "True"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector-opentelemetry-collector.re.svc.cluster.local:4318
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf

This ensures Airflow services push metrics to the OTLP endpoint.

Upgrade your Remote Execution Helm chart:

helm upgrade astro-agent astronomer/astro-remote-execution-agent --values values.yaml

Step 4: Verify metrics

At this stage, Airflow application metrics should arrive in the OpenTelemetry Collector. You can verify this by port-forwarding the Collector's Prometheus exporter port:

kubectl port-forward -n re deploy/otel-collector-opentelemetry-collector 8889:8889

Browse to http://localhost:8889/metrics and check whether any metrics appear. You might need to run some Airflow tasks before more metrics show up. Here's an example of metrics that you might observe:

# HELP airflow_airflow_io_load_filesystems
# TYPE airflow_airflow_io_load_filesystems gauge
airflow_airflow_io_load_filesystems{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 209.85527300035756
# HELP airflow_dag_example_astronauts_get_astronauts_duration
# TYPE airflow_dag_example_astronauts_get_astronauts_duration gauge
airflow_dag_example_astronauts_get_astronauts_duration{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 3903.361
# HELP airflow_operator_successes_pythondecoratedoperator_total
# TYPE airflow_operator_successes_pythondecoratedoperator_total counter
airflow_operator_successes_pythondecoratedoperator_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_operator_successes_total
# TYPE airflow_operator_successes_total counter
airflow_operator_successes_total{dag_id="example_astronauts",job="airflow",operator="_PythonDecoratedOperator",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_serde_load_serializers
# TYPE airflow_serde_load_serializers gauge
airflow_serde_load_serializers{job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version=""} 0.8434970004600473
# HELP airflow_task_duration
# TYPE airflow_task_duration gauge
airflow_task_duration{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 3903.361
# HELP airflow_ti_finish_example_astronauts_get_astronauts_success_total
# TYPE airflow_ti_finish_example_astronauts_get_astronauts_success_total counter
airflow_ti_finish_example_astronauts_get_astronauts_success_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_finish_total
# TYPE airflow_ti_finish_total counter
airflow_ti_finish_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",state="success",task_id="get_astronauts"} 1
# HELP airflow_ti_start_example_astronauts_get_astronauts_total
# TYPE airflow_ti_start_example_astronauts_get_astronauts_total counter
airflow_ti_start_example_astronauts_get_astronauts_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_start_total
# TYPE airflow_ti_start_total counter
airflow_ti_start_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1
# HELP airflow_ti_successes_total
# TYPE airflow_ti_successes_total counter
airflow_ti_successes_total{dag_id="example_astronauts",job="airflow",otel_scope_name="airflow.sdk._shared.observability.metrics.otel_logger",otel_scope_schema_url="",otel_scope_version="",task_id="get_astronauts"} 1

If you see metrics similar to the above, your setup is successful.

From here, you can configure your metrics backend such as Prometheus to scrape the metrics from otel-collector:8889. A common metrics/monitoring setup is Prometheus as the metrics backend, plus Grafana for visualization.
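As a minimal sketch of that last step, a standalone Prometheus running in the same cluster could scrape the Collector's Prometheus exporter with a static scrape config. The job name and scrape interval below are illustrative assumptions. The target is the Collector Service address used earlier in this guide, on the exporter port 8889.

```yaml
# prometheus.yml fragment (sketch; job_name and scrape_interval are assumptions)
scrape_configs:
  - job_name: airflow-otel-collector
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets:
          # Service created by the otel-collector Helm release in namespace re
          - otel-collector-opentelemetry-collector.re.svc.cluster.local:8889
```

The exported series already carry a job="airflow" label, as the sample output shows. With Prometheus defaults (honor_labels: false), that label is stored as exported_job and the job label takes the scrape job's name, so adjust queries or set honor_labels: true depending on which you prefer.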

Agent client metrics

Each agent component (worker, Dag processor, and triggerer) runs an internal HTTP server that exposes Prometheus-format metrics directly on a /metrics endpoint, on port 39091 by default (configurable through the setting http_server.port, or using the environment variable ASTRO_AGENT_CLIENT_HTTP_SERVER__PORT). These metrics describe the agent client’s own runtime, including its health, its heartbeat traffic with the Astro orchestration plane, and the Python process it runs in. Agent client metrics are independent of the Airflow application metrics described in Monitor Airflow application metrics, and you can scrape both endpoints from the same Prometheus instance.

The most important health signal is the component_health gauge. Each internal subsystem reports 1 when healthy and 0 when unhealthy, for example:

component_health{component="TriggererHeartbeater"} 1.0
component_health{component="TriggererProc"} 1.0
component_health{component="Server"} 1.0

The component label values vary by agent client. For example, the worker reports its own set of subsystems, and the Dag processor reports subsystems for the parsing pipeline. Alert when any component_health series drops to 0.
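One way to express that alert is a Prometheus alerting rule. The following is a sketch only: the group name, alert name, 5m hold period, and severity label are assumptions, and only the component_health metric and its component label come from this guide.

```yaml
# Prometheus rule file fragment (sketch; names and thresholds are assumptions)
groups:
  - name: remote-execution-agent-health
    rules:
      - alert: AgentComponentUnhealthy
        # Fires for every subsystem whose health gauge reports 0
        expr: component_health == 0
        # Wait before firing so that a brief restart does not page anyone
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent subsystem {{ $labels.component }} is unhealthy"
          description: "component_health has been 0 for 5 minutes on {{ $labels.instance }}."
```

This rule only fires while the series exists with value 0. If an agent Pod disappears entirely, the series goes stale instead, so you may want to pair it with a scrape-level check such as the up metric.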

If you use Prometheus to scrape metrics from your agent clients, you can add the following annotations to your Helm chart values so that the metrics are collected:

# inside your values.yaml file
# make sure to update the port annotation if you changed the default value
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  prometheus.io/scheme: "http"
  prometheus.io/port: "39091"
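Prometheus does not act on prometheus.io/* annotations by itself. A scrape job has to translate them through relabeling. The following sketch shows the commonly used pattern for a standalone Prometheus with Kubernetes service discovery. It assumes that the annotations end up on the agent Pods and that the agents run in the re namespace. The job name is illustrative.

```yaml
# prometheus.yml fragment (sketch; job_name and namespace are assumptions)
scrape_configs:
  - job_name: remote-execution-agents
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [re]
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor prometheus.io/scheme (http or https)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        regex: (https?)
        target_label: __scheme__
      # Honor prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Rewrite the scrape address to use prometheus.io/port (39091 by default)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: $1:$2
        target_label: __address__
      # Carry the Pod name through as a label for easier filtering
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
```

If you run the Prometheus Operator instead, a PodMonitor that selects the agent Pods and points at the same port and path serves the same purpose.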

Generic metrics

All three agent clients ship a common set of metrics that cover Python runtime, process resources, heartbeat traffic with the API server, queue state, and the agent proxy.

Metric | Type | Description
component_health | Gauge | Health status of agent subsystems. 1 is healthy and 0 is unhealthy.
heartbeat_requests_total | Counter | Total heartbeat attempts, with component and outcome labels (success, timeout, error).
heartbeat_duration_seconds | Histogram | Time from the start of a heartbeat request to the receipt of the provider response.
heartbeat_payload_bytes | Histogram | Size in bytes of the serialized heartbeat request body before it is sent.
heartbeat_requests_received_total | Counter | Total heartbeat requests received from tasks.
heartbeat_requests_sent_total | Counter | Total heartbeat requests sent to the API server.
heartbeat_requests_error_total | Counter | Total heartbeat requests that failed to be sent to the API server.
matched_proxy_errors_total | Counter | Total proxy failures for matched routes, with route and error_type labels.
astro_agent_client_queue_stats | Gauge | Number of tasks in each state in each queue.
astro_agent_proxy_http_requests_total | Counter | Total agent proxy requests, with method, status, and handler labels.
astro_agent_proxy_http_request_duration_seconds | Histogram | Agent proxy request latency, with handler and method labels. Use this when aggregation by handler matters.
astro_agent_proxy_http_request_duration_highr_seconds | Histogram | High-resolution agent proxy request latency for accurate percentile calculations.
astro_agent_proxy_http_request_size_bytes | Summary | Content length of incoming agent proxy requests, by handler.
astro_agent_proxy_http_response_size_bytes | Summary | Content length of outgoing agent proxy responses, by handler.
python_info | Gauge | Python platform information, including the interpreter implementation and version labels.
python_gc_objects_collected_total | Counter | Objects collected during garbage collection, by generation.
python_gc_objects_uncollectable_total | Counter | Uncollectable objects found during garbage collection, by generation.
python_gc_collections_total | Counter | Number of times each generation was collected.
process_cpu_seconds_total | Counter | Total user and system CPU time, in seconds.
process_virtual_memory_bytes | Gauge | Virtual memory size, in bytes.
process_resident_memory_bytes | Gauge | Resident memory size, in bytes.
process_start_time_seconds | Gauge | Start time of the process since the Unix epoch, in seconds.
process_open_fds | Gauge | Number of open file descriptors.
process_max_fds | Gauge | Maximum number of open file descriptors.
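To illustrate how the heartbeat metrics above can be combined, the following sketch defines two Prometheus recording rules: a heartbeat failure ratio and a 95th-percentile heartbeat latency. The rule names, group name, and 5m window are assumptions. The outcome and component labels are taken from the table, and the _bucket suffix is the standard series name for Prometheus histograms.

```yaml
# Prometheus rule file fragment (sketch; rule names and windows are assumptions)
groups:
  - name: remote-execution-agent-heartbeats
    rules:
      # Share of heartbeat attempts that ended in timeout or error
      - record: agent:heartbeat_failure_ratio:rate5m
        expr: |
          sum by (component) (rate(heartbeat_requests_total{outcome!="success"}[5m]))
          /
          sum by (component) (rate(heartbeat_requests_total[5m]))
      # 95th-percentile heartbeat round-trip time, in seconds
      - record: agent:heartbeat_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, component) (rate(heartbeat_duration_seconds_bucket[5m]))
          )
```

The failure ratio returns no data while there are no failed heartbeats, because the numerator has no matching series. Treat an empty result as healthy rather than as missing data.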

Dag processor metrics

In addition to the generic metrics, the Dag processor exposes parsing-pipeline metrics.

Metric | Type | Description
dag_processor_heartbeat_dags_sent_total | Counter | Total Dags included in heartbeat requests.
dag_processor_results_queue_depth | Gauge | Number of parsed Dag results waiting in the coordinator queue. Use it as a back-pressure indicator.
dag_processor_cache_hits_total | Counter | Total Dags skipped because their dag_hash matched the cached value.
dag_processor_cache_misses_total | Counter | Total Dags processed because they were new, changed, or cold in the cache.
dag_processor_cache_size | Gauge | Number of entries in the Dag processor dag_hashes cache.

The heartbeat_* metrics described in Generic metrics are also available for the Dag processor, and you can filter them with component=dag_processor.
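As an example of using the Dag processor metrics, the sketch below derives a cache hit ratio and raises an alert when parsed results back up in the coordinator queue. The rule names, the 15m window, and the threshold of 100 queued results are assumptions chosen for illustration. Tune them to your own parsing volume.

```yaml
# Prometheus rule file fragment (sketch; names, window, and threshold are assumptions)
groups:
  - name: remote-execution-dag-processor
    rules:
      # Fraction of Dags skipped because their dag_hash was already cached
      - record: dag_processor:cache_hit_ratio:rate15m
        expr: |
          sum(rate(dag_processor_cache_hits_total[15m]))
          /
          (
            sum(rate(dag_processor_cache_hits_total[15m]))
            + sum(rate(dag_processor_cache_misses_total[15m]))
          )
      # Sustained queue depth suggests results are produced faster than they are sent
      - alert: DagProcessorResultsBacklog
        expr: dag_processor_results_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Dag processor results queue is backing up"
```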