Enable data plane failover

This guide walks you through enabling data plane failover on an existing Astro Private Cloud (APC) installation. You configure the control plane and each participating data plane cluster separately.

For a conceptual overview of the feature and its components, see Data plane failover.

Prerequisites

  • A working APC installation with at least one control plane cluster (global.plane.mode: control) and at least two data plane clusters (global.plane.mode: data).
  • A database server hostname that is network-accessible from both the source and destination data plane clusters. APC provisions the logical databases automatically, but the server itself must be reachable from both clusters. For supported topologies, see Database requirements.
  • An external secrets store supported for APC data plane failover. APC currently supports AWS Secrets Manager and Google Cloud Secret Manager through the External Secrets Operator (ESO).
  • A ClusterSecretStore custom resource configured in each data plane cluster that points to the same external secrets store. The name of this resource is the value you provide for global.dataPlaneFailover.externalSecretManagerName.
  • A single externally managed container registry endpoint, configured on the control plane, that serves the Airflow images for your Deployments to every region where a data plane can run, with the same repository paths and tags in each region. Every data plane cluster must be able to pull from this endpoint. APC currently supports only one registry endpoint per installation. For details, see Container registry requirements. To configure the registry backend, see Use a registry backend.
  • An external sink for Airflow task logs (an external Elasticsearch instance) that is reachable from every data plane cluster, so that task logs remain accessible after a Deployment moves between data planes. For supported topologies, see Airflow log sink requirements. For details on how Astro Private Cloud collects and exports task logs to Elasticsearch, see Configure task log collection and exporting to ElasticSearch.
  • Helm 3.6 or later.
  • Cluster-level permissions (a ClusterRole bound to the identity that runs helm install or helm upgrade) on each data plane cluster. ESO installs cluster-scoped CRDs, which require cluster-level permissions.
  • Access to your APC Helm values files.
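
As a quick check before you begin, you can confirm the Helm client version and the plane mode recorded in each existing release. This is an optional sketch; the release name and namespace are placeholders for your installation:

# Confirm the Helm client version (3.6 or later is required).
helm version --short

# Inspect the plane mode of an existing release.
helm get values <release-name> --namespace <namespace> --all | grep -A 1 "plane:"

The second command should return mode: control on the control plane cluster and mode: data on each data plane cluster.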

Configure the ClusterSecretStore

Before applying the Helm values, create a ClusterSecretStore and its backing credentials secret in each data plane cluster. APC uses the ClusterSecretStore to push and pull Airflow secrets between data planes during failover.

The following example uses AWS Secrets Manager. Substitute the values for your environment and secrets store provider.

  1. Create a Kubernetes secret that holds your AWS credentials in each data plane cluster (an imperative kubectl alternative follows this procedure):

     apiVersion: v1
     kind: Secret
     metadata:
       name: secrets-backend-credentials
       namespace: astronomer
     type: Opaque
     data:
       access-key: <base64-encoded-aws-access-key-id>
       secret-access-key: <base64-encoded-aws-secret-access-key>

  2. Create the ClusterSecretStore in each data plane cluster:

     apiVersion: external-secrets.io/v1
     kind: ClusterSecretStore
     metadata:
       name: astronomer-cluster-secret-store
     spec:
       provider:
         aws:
           service: SecretsManager
           region: <aws-region>
           auth:
             secretRef:
               accessKeyIDSecretRef:
                 namespace: astronomer
                 name: secrets-backend-credentials
                 key: access-key
               secretAccessKeySecretRef:
                 namespace: astronomer
                 name: secrets-backend-credentials
                 key: secret-access-key

The value you use for metadata.name is the value you provide for global.dataPlaneFailover.externalSecretManagerName in your Helm values.
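
If you prefer not to base64-encode the credentials by hand, you can create an equivalent secret imperatively instead of applying the manifest in step 1. This sketch assumes the same secret name, namespace, and keys:

# Creates the same Secret as the manifest above; kubectl handles the base64 encoding.
kubectl create secret generic secrets-backend-credentials \
  --namespace astronomer \
  --from-literal=access-key=<aws-access-key-id> \
  --from-literal=secret-access-key=<aws-secret-access-key>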

Step 1: Configure the control plane

Add the following values to your control plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Navigator, DP-Link, and the APC API dispatcher when global.plane.mode is control.

global:
  dataPlaneFailover:
    enabled: true
    externalSecretManagerName: <your-cluster-secret-store-name>
external-secrets:
  enabled: false

Replace <your-cluster-secret-store-name> with the name of the ClusterSecretStore custom resource in your data plane clusters.

ESO isn’t required on the control plane. Don’t set external-secrets.enabled: true in your control plane values.

Step 2: Configure each data plane

Add the following values to each data plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Pilot and the Flightdeck database bootstrap when global.plane.mode is data.

global:
  dataPlaneFailover:
    enabled: true
    externalSecretManagerName: <your-cluster-secret-store-name>
external-secrets:
  enabled: true

Use the same value for externalSecretManagerName as on the control plane. Both clusters must reference the same ClusterSecretStore.

The external-secrets key enables the bundled ESO subchart, which installs cluster-scoped CRDs. The identity running helm upgrade must have a ClusterRole on the data plane cluster. If you already run ESO separately, set external-secrets.enabled: false and ensure your existing ESO installation recognizes the ClusterSecretStore that APC expects.
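
If you run ESO separately, you can confirm that the store APC expects is present before upgrading. This check assumes the ESO CRDs are already installed in the data plane cluster:

# The store name must match the value of externalSecretManagerName.
kubectl get clustersecretstore <your-cluster-secret-store-name>

ESO reports the store's status in the command output; it should show as ready before you proceed.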

The deployment orchestrator bootstraps the Flightdeck database as an init container during startup. If the bootstrap fails, the deployment orchestrator Pod doesn’t start. Check the flightdeck-bootstrapper and flightdeck-db-migrations init container logs if the deployment orchestrator fails to come up after enabling this feature.
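
For example, you can locate the deployment orchestrator Pod and read each init container's logs directly; the Pod name below is a placeholder:

# Find the deployment orchestrator (commander) Pod.
kubectl get pods -n <namespace> | grep commander

# Read the logs of each bootstrap init container.
kubectl logs -n <namespace> <commander-pod-name> -c flightdeck-bootstrapper
kubectl logs -n <namespace> <commander-pod-name> -c flightdeck-db-migrations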

Step 3: Apply the changes

Apply the updated values to each cluster using helm upgrade. Upgrade the control plane first.

helm upgrade <release-name> astronomer/astronomer \
  --namespace <namespace> \
  --values values.yaml \
  --version <chart-version>

Run the same command for each data plane cluster, substituting the appropriate release name, namespace, and values file.
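
If each data plane cluster is available as a separate kubeconfig context, a short loop can apply the upgrade to each one. This is a sketch that assumes the data planes share a release name and a common values file; the context names are placeholders:

# Upgrade each data plane release in turn using its kubeconfig context.
for ctx in <data-plane-context-1> <data-plane-context-2>; do
  helm upgrade <release-name> astronomer/astronomer \
    --kube-context "$ctx" \
    --namespace <namespace> \
    --values dataplane-values.yaml \
    --version <chart-version>
done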

Step 4: Verify the deployment

After the upgrade completes, confirm that the new components are running on each cluster.

On the control plane, verify that the following Pods are running:

$kubectl get pods -n <namespace> | grep -E "navigator|dp-link|houston"

On each data plane, verify that the deployment orchestrator started successfully and Pilot is running:

$kubectl get pods -n <namespace> | grep -E "commander|pilot"

Check deployment orchestrator logs to confirm Flightdeck initialized correctly:

kubectl logs -n <namespace> deployment/<release-name>-commander \
  -c flightdeck-bootstrapper

Advanced configuration

Changing any of the values in this section can meaningfully affect resource usage on your Kubernetes clusters and may adversely affect failover functionality. Change and test these values in a non-production environment before applying them to production.

Tune Pilot behavior

Pilot’s claim, retry, and circuit breaker behavior is configurable via environment variables. Set these under astronomer.pilot.env in your data plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| PILOT_MAX_INFLIGHT_PER_WORKER | 5 | Maximum number of flights Pilot executes concurrently per worker. |
| PILOT_CLAIM_POLL_INTERVAL_MS | 5000 | How often Pilot polls for new flights, in milliseconds. |
| PILOT_LEASE_TTL_SECONDS | 60 | How long a claimed flight lease is valid before expiring. |
| PILOT_MAX_ATTEMPTS_PER_FLIGHT | 15 | Maximum number of execution attempts before a flight is marked as failed. |
| PILOT_RETRY_BASE_INTERVAL_MS | 250 | Base delay between retry attempts, in milliseconds. |
| PILOT_RETRY_MAX_INTERVAL_MS | 5000 | Maximum delay between retry attempts, in milliseconds. |
| PILOT_CB_FAILURE_THRESHOLD | 10 | Number of consecutive failures before the circuit breaker opens. |
| PILOT_CB_COOLOFF_SECONDS | 30 | How long the circuit breaker remains open before allowing probe attempts, in seconds. |

Consider raising PILOT_MAX_INFLIGHT_PER_WORKER above the default of 5 for data planes that run a large number of Airflow Deployments (roughly 50 or more), or for cross-region failovers where each Deployment takes longer to come up because the deployment orchestrator must pull container images from a remote-region registry or fetch secrets from a remote-region secrets backend. A higher value lets Pilot bring more Deployments up on the destination cluster in parallel, which reduces overall failover time and amortizes cross-region latency. However, each in-flight flight performs additional work on the data plane cluster (secret syncs, Helm installs, and database operations) and consumes additional bandwidth to the registry and secrets store. Raise this value only if the data plane cluster has spare CPU, memory, and API server headroom, and your registry and secrets backends can absorb the extra concurrent traffic. Validate the new value in a non-production environment first.
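
For example, to raise the concurrency limit and poll more frequently on a busy data plane, you might add the following to that cluster's values.yaml. This sketch assumes astronomer.pilot.env accepts a list of name/value pairs, following the usual Kubernetes environment variable convention, with values quoted as strings:

astronomer:
  pilot:
    env:
      # Allow Pilot to work on more flights in parallel per worker.
      - name: PILOT_MAX_INFLIGHT_PER_WORKER
        value: "10"
      # Poll for new flights more often during a failover.
      - name: PILOT_CLAIM_POLL_INTERVAL_MS
        value: "2500"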

Tune Navigator behavior

Navigator’s reconcile loop timing is configurable via environment variables. Set these under astronomer.navigator.env in your control plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| FAILOVER_REQUEST_RECONCILER_INTERVAL_SECONDS | 10 | How often Navigator checks for new or in-progress failover requests, in seconds. |
| MISSION_CLAIM_MIN_BATCH_SIZE | 10 | Minimum number of missions Navigator claims per reconcile cycle. |
| MISSION_PLAN_BATCH_SIZE | 5 | Number of missions Navigator plans concurrently. |
| MISSION_RECONCILE_BATCH_SIZE | 5 | Number of missions Navigator reconciles concurrently. |
| CLAIM_INTERVAL_SECONDS | 10 | Interval between mission claim cycles, in seconds. |

DP-Link determines cluster health based on heartbeat age. Adjust these thresholds under astronomer.dpLink.env in your control plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| UNREACHABLE_THRESHOLD_SECONDS | 90 | Age of the last heartbeat, in seconds, after which a cluster is marked UNREACHABLE. |
| DEGRADED_THRESHOLD_SECONDS | 30 | Age of the last heartbeat, in seconds, after which a cluster is marked DEGRADED. |
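
For example, to plan more missions per cycle and tolerate slower heartbeats before marking a cluster UNREACHABLE, you might set the following in your control plane values.yaml. As with Pilot, this sketch assumes both env keys accept a list of name/value pairs:

astronomer:
  navigator:
    env:
      # Plan more missions concurrently per reconcile cycle.
      - name: MISSION_PLAN_BATCH_SIZE
        value: "10"
  dpLink:
    env:
      # Wait longer before marking a data plane UNREACHABLE.
      - name: UNREACHABLE_THRESHOLD_SECONDS
        value: "120"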

Tune the APC API dispatcher behavior

The APC API dispatcher dispatches flights from the control plane to the deployment orchestrator on each data plane. Its loop timing, concurrency, retry, and circuit breaker behavior are configurable through environment variables. Set these under astronomer.houston.env in your control plane values.yaml.

Dispatcher loop

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| DISPATCH_LEASE_TTL_SECONDS | 30 | How long a dispatcher lease on a flight is valid before expiring, in seconds. |
| DISPATCH_BATCH_SIZE | 50 | Maximum number of flights the dispatcher claims per poll cycle. |
| DISPATCH_MAX_INFLIGHT | 50 | Maximum number of flights the dispatcher executes concurrently across all data planes. |
| DISPATCH_MAX_INFLIGHT_PER_DP | 5 | Maximum number of flights the dispatcher executes concurrently against a single data plane. |
| DISPATCH_POLL_SECONDS | 10 | How often the dispatcher polls for new flights, in seconds. |
| DISPATCH_MAX_ATTEMPTS_PER_LEASE | 5 | Maximum number of in-process retries for a flight while the dispatcher holds its lease. |
| DISPATCH_MAX_ATTEMPTS_PER_FLIGHT | 25 | Maximum durable retries for a flight across all dispatcher processes before it is marked as failed. |
| DISPATCH_RETRY_COOLOFF_PERIOD | 60 | Cool-off, in seconds, applied after a flight crosses DISPATCH_MAX_ATTEMPTS_PER_FLIGHT. |
| IN_REGION_STARTFLIGHT_RPC_TIMEOUT | 5000 | Timeout for StartFlight RPCs to a data plane in the same region as the control plane, in milliseconds. |
| CROSS_REGION_STARTFLIGHT_RPC_TIMEOUT | 12000 | Timeout for StartFlight RPCs to a data plane in a different region from the control plane, in milliseconds. |

Circuit breaker

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| CB_FAILURE_THRESHOLD | 10 | Number of consecutive StartFlight failures to a data plane before the circuit breaker opens. |
| CB_COOLOFF_SECONDS | 30 | How long the circuit breaker remains open before allowing probe attempts, in seconds. |
| CB_PROBE_MAX_INFLIGHT | 1 | Maximum number of probe StartFlight calls allowed while the circuit breaker is half-open. |
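
For example, to allow more concurrent dispatches to a single data plane and give cross-region StartFlight calls more time, you might set the following in your control plane values.yaml. This sketch assumes astronomer.houston.env accepts a list of name/value pairs, as in the examples above:

astronomer:
  houston:
    env:
      # Dispatch more flights to one data plane concurrently.
      - name: DISPATCH_MAX_INFLIGHT_PER_DP
        value: "10"
      # Allow slower cross-region StartFlight RPCs before timing out.
      - name: CROSS_REGION_STARTFLIGHT_RPC_TIMEOUT
        value: "20000"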