Enable data plane failover

This guide walks you through enabling data plane failover on an existing Astro Private Cloud (APC) installation. You configure the control plane and each participating data plane cluster separately.

For a conceptual overview of the feature and its components, see Data plane failover.

Prerequisites

  • A working APC installation with at least one control plane cluster (global.plane.mode: control) and at least two data plane clusters (global.plane.mode: data).
  • A database server hostname that is network-accessible from both the source and destination data plane clusters. APC provisions the logical databases automatically, but the server itself must be reachable from both clusters. For supported topologies, see Database requirements.
  • An external secrets store supported for APC data plane failover. APC currently supports AWS Secrets Manager and Google Cloud Secret Manager through the External Secrets Operator (ESO).
  • A ClusterSecretStore custom resource configured in each data plane cluster that points to the same external secrets store. The name of this resource is the value you provide for global.dataPlaneFailover.externalSecretManagerName.
  • A single externally managed container registry endpoint, configured on the control plane, that serves the Airflow images for your Deployments to every region where a data plane can run, with the same repository paths and tags in each region. Every data plane cluster must be able to pull from this endpoint. APC currently supports only one registry endpoint per installation. For details, see Container registry requirements. To configure the registry backend, see Use a registry backend.
  • An external sink for Airflow task logs (an external Elasticsearch instance) that is reachable from every data plane cluster, so that task logs remain accessible after a Deployment moves between data planes. For supported topologies, see Airflow log sink requirements. For details on how Astro Private Cloud collects and exports task logs to Elasticsearch, see Configure task log collection and exporting to ElasticSearch.
  • Helm 3.6 or later.
  • Cluster-level permissions (a ClusterRole bound to the identity that runs helm install or helm upgrade) on each data plane cluster. ESO installs cluster-scoped CRDs, which require cluster-level permissions.
  • Access to your APC Helm values files.
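
As a quick check before you begin, you can confirm the Helm client version and the plane mode recorded in each existing release. This is an optional sketch; the release name and namespace are placeholders for your installation:

# Confirm the Helm client version (3.6 or later is required).
helm version --short

# Inspect the plane mode of an existing release.
helm get values <release-name> --namespace <namespace> --all | grep -A 1 "plane:"

The second command should return mode: control on the control plane cluster and mode: data on each data plane cluster.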

Configure the ClusterSecretStore

Before applying the Helm values, create a ClusterSecretStore and its backing credentials secret in each data plane cluster. APC uses the ClusterSecretStore to push and pull Airflow secrets between data planes during failover.

The following example uses AWS Secrets Manager. Substitute the values for your environment and secrets store provider.

  1. Create a Kubernetes secret that holds your AWS credentials in each data plane cluster (an imperative kubectl alternative follows this procedure):

     apiVersion: v1
     kind: Secret
     metadata:
       name: secrets-backend-credentials
       namespace: astronomer
     type: Opaque
     data:
       access-key: <base64-encoded-aws-access-key-id>
       secret-access-key: <base64-encoded-aws-secret-access-key>

  2. Create the ClusterSecretStore in each data plane cluster:

     apiVersion: external-secrets.io/v1
     kind: ClusterSecretStore
     metadata:
       name: astronomer-cluster-secret-store
     spec:
       provider:
         aws:
           service: SecretsManager
           region: <aws-region>
           auth:
             secretRef:
               accessKeyIDSecretRef:
                 namespace: astronomer
                 name: secrets-backend-credentials
                 key: access-key
               secretAccessKeySecretRef:
                 namespace: astronomer
                 name: secrets-backend-credentials
                 key: secret-access-key

The value you use for metadata.name is the value you provide for global.dataPlaneFailover.externalSecretManagerName in your Helm values.
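
If you prefer not to base64-encode the credentials by hand, you can create an equivalent secret imperatively instead of applying the manifest in step 1. This sketch assumes the same secret name, namespace, and keys:

# Creates the same Secret as the manifest above; kubectl handles the base64 encoding.
kubectl create secret generic secrets-backend-credentials \
  --namespace astronomer \
  --from-literal=access-key=<aws-access-key-id> \
  --from-literal=secret-access-key=<aws-secret-access-key>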

Step 1: Configure the control plane

Add the following values to your control plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Navigator, DP-Link, and the APC API dispatcher when global.plane.mode is control.

global:
  dataPlaneFailover:
    enabled: true
    externalSecretManagerName: <your-cluster-secret-store-name>
external-secrets:
  enabled: false

Replace <your-cluster-secret-store-name> with the name of the ClusterSecretStore custom resource in your data plane clusters.

ESO isn’t required on the control plane. Don’t set external-secrets.enabled: true in your control plane values.

Step 2: Configure each data plane

Add the following values to each data plane values.yaml. Setting global.dataPlaneFailover.enabled: true activates Pilot and the Flightdeck database bootstrap when global.plane.mode is data.

global:
  dataPlaneFailover:
    enabled: true
    externalSecretManagerName: <your-cluster-secret-store-name>
external-secrets:
  enabled: true

Use the same value for externalSecretManagerName as on the control plane. Both clusters must reference the same ClusterSecretStore.

The external-secrets key enables the bundled ESO subchart, which installs cluster-scoped CRDs. The identity running helm upgrade must have a ClusterRole on the data plane cluster. If you already run ESO separately, set external-secrets.enabled: false and ensure your existing ESO installation recognizes the ClusterSecretStore that APC expects.
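
If you run ESO separately, you can confirm that the store APC expects is present before upgrading. This check assumes the ESO CRDs are already installed in the data plane cluster:

# The store name must match the value of externalSecretManagerName.
kubectl get clustersecretstore <your-cluster-secret-store-name>

ESO reports the store's status in the command output; it should show as ready before you proceed.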

The deployment orchestrator bootstraps the Flightdeck database as an init container during startup. If the bootstrap fails, the deployment orchestrator Pod doesn’t start. Check the flightdeck-bootstrapper and flightdeck-db-migrations init container logs if the deployment orchestrator fails to come up after enabling this feature.
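
For example, you can locate the deployment orchestrator Pod and read each init container's logs directly; the Pod name below is a placeholder:

# Find the deployment orchestrator (commander) Pod.
kubectl get pods -n <namespace> | grep commander

# Read the logs of each bootstrap init container.
kubectl logs -n <namespace> <commander-pod-name> -c flightdeck-bootstrapper
kubectl logs -n <namespace> <commander-pod-name> -c flightdeck-db-migrations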

Step 3: Apply the changes

Apply the updated values to each cluster using helm upgrade. Upgrade the control plane first.

helm upgrade <release-name> astronomer/astronomer \
  --namespace <namespace> \
  --values values.yaml \
  --version <chart-version>

Run the same command for each data plane cluster, substituting the appropriate release name, namespace, and values file.
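
If each data plane cluster is available as a separate kubeconfig context, a short loop can apply the upgrade to each one. This is a sketch that assumes the data planes share a release name and a common values file; the context names are placeholders:

# Upgrade each data plane release in turn using its kubeconfig context.
for ctx in <data-plane-context-1> <data-plane-context-2>; do
  helm upgrade <release-name> astronomer/astronomer \
    --kube-context "$ctx" \
    --namespace <namespace> \
    --values dataplane-values.yaml \
    --version <chart-version>
done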

Step 4: Verify the deployment

After the upgrade completes, confirm that the new components are running on each cluster.

On the control plane, verify that the following Pods are running:

$kubectl get pods -n <namespace> | grep -E "navigator|dp-link|houston"

On each data plane, verify that the deployment orchestrator started successfully and Pilot is running:

$kubectl get pods -n <namespace> | grep -E "commander|pilot"

Check deployment orchestrator logs to confirm Flightdeck initialized correctly:

kubectl logs -n <namespace> deployment/<release-name>-commander \
  -c flightdeck-bootstrapper

Advanced configuration

Changing any of the values in this section can meaningfully affect resource usage on your Kubernetes clusters and may adversely affect failover functionality. Change and test these values in a non-production environment before applying them to production.

Tune Pilot behavior

Pilot’s claim, retry, and circuit breaker behavior is configurable via environment variables. Set these under astronomer.pilot.env in your data plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| PILOT_MAX_INFLIGHT_PER_WORKER | 5 | Maximum number of flights Pilot executes concurrently per worker. |
| PILOT_CLAIM_POLL_INTERVAL_MS | 5000 | How often Pilot polls for new flights, in milliseconds. |
| PILOT_LEASE_TTL_SECONDS | 60 | How long a claimed flight lease is valid before expiring. |
| PILOT_MAX_ATTEMPTS_PER_FLIGHT | 15 | Maximum number of execution attempts before a flight is marked as failed. |
| PILOT_RETRY_BASE_INTERVAL_MS | 250 | Base delay between retry attempts, in milliseconds. |
| PILOT_RETRY_MAX_INTERVAL_MS | 5000 | Maximum delay between retry attempts, in milliseconds. |
| PILOT_CB_FAILURE_THRESHOLD | 10 | Number of consecutive failures before the circuit breaker opens. |
| PILOT_CB_COOLOFF_SECONDS | 30 | How long the circuit breaker remains open before allowing probe attempts, in seconds. |

Consider raising PILOT_MAX_INFLIGHT_PER_WORKER above the default of 5 for data planes that run a large number of Airflow Deployments (roughly 50 or more), or for cross-region failovers where each Deployment takes longer to come up because the deployment orchestrator must pull container images from a remote-region registry or fetch secrets from a remote-region secrets backend. A higher value lets Pilot bring more Deployments up on the destination cluster in parallel, which reduces overall failover time and amortizes cross-region latency. However, each in-flight flight performs additional work on the data plane cluster (secret syncs, Helm installs, and database operations) and consumes additional bandwidth to the registry and secrets store. Raise this value only if the data plane cluster has spare CPU, memory, and API server headroom, and your registry and secrets backends can absorb the extra concurrent traffic. Validate the new value in a non-production environment first.
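
For example, to raise the concurrency limit and poll more frequently on a busy data plane, you might add the following to that cluster's values.yaml. This sketch assumes astronomer.pilot.env accepts a list of name/value pairs, following the usual Kubernetes environment variable convention, with values quoted as strings:

astronomer:
  pilot:
    env:
      # Allow Pilot to work on more flights in parallel per worker.
      - name: PILOT_MAX_INFLIGHT_PER_WORKER
        value: "10"
      # Poll for new flights more often during a failover.
      - name: PILOT_CLAIM_POLL_INTERVAL_MS
        value: "2500"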

Tune Navigator behavior

Navigator’s reconcile loop timing is configurable via environment variables. Set these under astronomer.navigator.env in your control plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| FAILOVER_REQUEST_RECONCILER_INTERVAL_SECONDS | 10 | How often Navigator checks for new or in-progress failover requests, in seconds. |
| MISSION_CLAIM_MIN_BATCH_SIZE | 10 | Minimum number of missions Navigator claims per reconcile cycle. |
| MISSION_PLAN_BATCH_SIZE | 5 | Number of missions Navigator plans concurrently. |
| MISSION_RECONCILE_BATCH_SIZE | 5 | Number of missions Navigator reconciles concurrently. |
| CLAIM_INTERVAL_SECONDS | 10 | Interval between mission claim cycles, in seconds. |

DP-Link determines cluster health based on heartbeat age. Adjust these thresholds under astronomer.dpLink.env in your control plane values.yaml.

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| UNREACHABLE_THRESHOLD_SECONDS | 90 | Age of the last heartbeat, in seconds, after which a cluster is marked UNREACHABLE. |
| DEGRADED_THRESHOLD_SECONDS | 30 | Age of the last heartbeat, in seconds, after which a cluster is marked DEGRADED. |
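
For example, to plan more missions per cycle and tolerate slower heartbeats before marking a cluster UNREACHABLE, you might set the following in your control plane values.yaml. As with Pilot, this sketch assumes both env keys accept a list of name/value pairs:

astronomer:
  navigator:
    env:
      # Plan more missions concurrently per reconcile cycle.
      - name: MISSION_PLAN_BATCH_SIZE
        value: "10"
  dpLink:
    env:
      # Wait longer before marking a data plane UNREACHABLE.
      - name: UNREACHABLE_THRESHOLD_SECONDS
        value: "120"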

Tune the APC API dispatcher behavior

The APC API dispatcher dispatches flights from the control plane to the deployment orchestrator on each data plane. Its loop timing, concurrency, retry, and circuit breaker behavior are configurable through environment variables. Set these under astronomer.houston.env in your control plane values.yaml.

Dispatcher loop

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| DISPATCH_LEASE_TTL_SECONDS | 30 | How long a dispatcher lease on a flight is valid before expiring, in seconds. |
| DISPATCH_BATCH_SIZE | 50 | Maximum number of flights the dispatcher claims per poll cycle. |
| DISPATCH_MAX_INFLIGHT | 50 | Maximum number of flights the dispatcher executes concurrently across all data planes. |
| DISPATCH_MAX_INFLIGHT_PER_DP | 5 | Maximum number of flights the dispatcher executes concurrently against a single data plane. |
| DISPATCH_POLL_SECONDS | 10 | How often the dispatcher polls for new flights, in seconds. |
| DISPATCH_MAX_ATTEMPTS_PER_LEASE | 5 | Maximum number of in-process retries for a flight while the dispatcher holds its lease. |
| DISPATCH_MAX_ATTEMPTS_PER_FLIGHT | 25 | Maximum durable retries for a flight across all dispatcher processes before it is marked as failed. |
| DISPATCH_RETRY_COOLOFF_PERIOD | 60 | Cool-off, in seconds, applied after a flight crosses DISPATCH_MAX_ATTEMPTS_PER_FLIGHT. |
| IN_REGION_STARTFLIGHT_RPC_TIMEOUT | 5000 | Timeout for StartFlight RPCs to a data plane in the same region as the control plane, in milliseconds. |
| CROSS_REGION_STARTFLIGHT_RPC_TIMEOUT | 12000 | Timeout for StartFlight RPCs to a data plane in a different region from the control plane, in milliseconds. |

Circuit breaker

| Environment variable | Default | Description |
|----------------------|---------|-------------|
| CB_FAILURE_THRESHOLD | 10 | Number of consecutive StartFlight failures to a data plane before the circuit breaker opens. |
| CB_COOLOFF_SECONDS | 30 | How long the circuit breaker remains open before allowing probe attempts, in seconds. |
| CB_PROBE_MAX_INFLIGHT | 1 | Maximum number of probe StartFlight calls allowed while the circuit breaker is half-open. |
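
For example, to allow more concurrent dispatches to a single data plane and give cross-region StartFlight calls more time, you might set the following in your control plane values.yaml. This sketch assumes astronomer.houston.env accepts a list of name/value pairs, as in the examples above:

astronomer:
  houston:
    env:
      # Dispatch more flights to one data plane concurrently.
      - name: DISPATCH_MAX_INFLIGHT_PER_DP
        value: "10"
      # Allow slower cross-region StartFlight RPCs before timing out.
      - name: CROSS_REGION_STARTFLIGHT_RPC_TIMEOUT
        value: "20000"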