Data plane failover

Data plane failover is a resiliency feature that moves all Apache Airflow Deployments from a source data plane cluster to a destination data plane cluster. When you trigger a failover, Astro Private Cloud (APC) applies each Deployment’s configuration and secrets to the destination cluster, brings the Deployment up there, and cleans up the source cluster with minimal manual intervention.

Failover is a full-cluster operation — every Deployment on the source cluster is included. It is asynchronous: after you submit a request, the platform drives execution through a state machine until every Deployment is running on the destination cluster or has failed with an error that requires operator attention.

How it works

A failover request moves through the following stages:

  1. You submit a failover request from the APC UI, specifying a source cluster, a destination cluster, and a failover mode.
  2. APC creates a FailoverRequest record and transitions it to IN_PROGRESS.
  3. Navigator (the control plane component that orchestrates the failover) creates one mission per Deployment included in the failover request, along with a pair of flights for each mission — one targeting the source cluster and one targeting the destination cluster.
  4. Dispatcher workers dispatch each flight to its target cluster, where Pilot (the data plane execution agent) picks it up and runs it.
  5. Pilot executes each flight plan — the series of steps that make up a flight — to either bring the Deployment up on the destination cluster or drain and delete the Deployment on the source cluster.
  6. After all missions complete, Navigator marks the FailoverRequest as SUCCEEDED or FAILED.
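The mission-and-flights model above can be sketched in a few lines of Python. This is a simplification under stated assumptions: all class, field, and status names here are hypothetical, since APC's internal records aren't exposed; only the one-mission-per-Deployment and two-flights-per-mission shape comes from the text.

```python
from dataclasses import dataclass, field

# Hypothetical data model; APC's real FailoverRequest/Mission/Flight records
# are internal and may differ.
@dataclass
class Flight:
    target: str              # "source" or "destination"
    status: str = "PENDING"

@dataclass
class Mission:
    deployment: str
    flights: list = field(default_factory=list)

def plan_failover(deployments):
    """One mission per Deployment, each with a destination and a source flight."""
    return [
        Mission(d, [Flight("destination"), Flight("source")])
        for d in deployments
    ]

def resolve(missions):
    """The request succeeds only if every flight in every mission succeeded."""
    ok = all(f.status == "SUCCEEDED" for m in missions for f in m.flights)
    return "SUCCEEDED" if ok else "FAILED"

missions = plan_failover(["etl-prod", "reporting"])
for m in missions:
    for f in m.flights:
        f.status = "SUCCEEDED"   # Pilot ran the flight plan on its cluster

result = resolve(missions)       # "SUCCEEDED"
```

A request with any pending or failed flight resolves to `FAILED`, which matches the terminal states Navigator records on the FailoverRequest.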

Failover modes

| Mode | Behavior |
| --- | --- |
| Controlled | Drains Airflow Deployment components on the source cluster and waits for in-flight tasks to finish, up to a configured timeout, before promoting the destination. Use for planned maintenance or migrations where task loss isn’t acceptable. |
| Forced | Promotes the destination cluster immediately without waiting for source Deployments to drain. Use when the source cluster is unreachable or when speed is the priority. |

Cluster eligibility

The Trigger Failover button in the UI is active only when all of the following are true:

  • failoverEnabled is true on the source cluster.
  • external-secrets.enabled is true on both the source and destination clusters.
  • global.dataPlaneFailover.externalSecretManagerName is set.
  • A valid, authenticated ClusterSecretStore exists.
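Taken together, these conditions suggest a values file along the following lines. This is a sketch only: the key names come from the text above, but the file layout, the placement of failoverEnabled, and the example secret manager name are assumptions.

```yaml
# Hypothetical values layout; only the key names are taken from the text.
global:
  dataPlaneFailover:
    externalSecretManagerName: aws-secrets-manager   # example value, assumed
external-secrets:
  enabled: true
failoverEnabled: true   # placement of this key is an assumption
```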

The destination cluster dropdown shows only clusters that APC considers schedulable targets for the selected source. A cluster appears as a valid target when it is registered, healthy, and has no pending failover operations targeting it as a destination.

APC doesn’t compare APC versions between the source and destination data planes before a failover. You are responsible for keeping the source and destination clusters on compatible APC versions.

Components

Data plane failover adds several components that aren’t deployed in a standard APC installation. Each component runs on either the control plane or the data plane, as described in the following table.

| Component | Plane | Description |
| --- | --- | --- |
| Navigator | Control | Decides when and where each Airflow Deployment moves by creating and managing FailoverRequest and Mission records and producing the flights that carry out each mission. |
| DP-Link | Control | Maintains persistent gRPC streams to each registered data plane. Monitors heartbeats and updates cluster health status (HEALTHY, DEGRADED, UNREACHABLE). |
| Dispatcher | Control | APC Worker process that issues StartFlight remote procedure calls to the deployment orchestrator on the data plane. |
| Pilot | Data | Claims flights from the Flightdeck queue and executes flight plans: namespace creation, secret application, Deployment upsert, database fencing, drain Deployment, delete Deployment. |
| Flightdeck | Data | PostgreSQL- or MySQL-backed queue table (dp_flights) shared by the deployment orchestrator and Pilot. The deployment orchestrator writes flights; Pilot claims and executes them. |
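The Flightdeck claim pattern can be sketched against an in-memory SQLite database. This is a simplification: the real dp_flights table lives in PostgreSQL or MySQL, where a concurrent-safe claim would typically use something like SELECT ... FOR UPDATE SKIP LOCKED, and the schema, column names, and status values here are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the real dp_flights columns aren't documented here.
conn.execute("""
    CREATE TABLE dp_flights (
        id INTEGER PRIMARY KEY,
        deployment TEXT,
        target TEXT,                  -- 'source' or 'destination'
        status TEXT DEFAULT 'PENDING'
    )
""")
conn.executemany(
    "INSERT INTO dp_flights (deployment, target) VALUES (?, ?)",
    [("etl-prod", "destination"), ("etl-prod", "source")],
)

def claim_next_flight(conn):
    # On Postgres/MySQL a worker would lock the row while claiming it
    # (e.g. FOR UPDATE SKIP LOCKED); SQLite serializes writers anyway.
    row = conn.execute(
        "SELECT id FROM dp_flights WHERE status = 'PENDING' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE dp_flights SET status = 'CLAIMED' WHERE id = ?", (row[0],)
    )
    return row[0]

first = claim_next_flight(conn)    # claims flight 1
second = claim_next_flight(conn)   # claims flight 2
```

The shared-table design means the deployment orchestrator and Pilot need no direct network path to each other; the database is the hand-off point.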

Secret replication

APC uses the External Secrets Operator (ESO) to replicate Airflow secrets between data planes. When failover is enabled:

  • The deployment orchestrator on the source cluster creates PushSecret custom resources that write Airflow secrets (fernet key, environment variables, and database credentials) into an external secrets store through a ClusterSecretStore.
  • The deployment orchestrator on the destination cluster creates ExternalSecret custom resources that pull those secrets from the same store into the destination namespace.

Both the source and destination data plane clusters must be able to reach the same external secrets store.
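As a sketch of the two resources involved, the manifests might look like the following. The apiVersions and field names follow the External Secrets Operator APIs, but the secret names, store name, and remote key paths are illustrative, and APC's generated manifests may differ.

```yaml
# Source cluster: push the Deployment's secret into the external store.
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: etl-prod-fernet-key          # hypothetical name
spec:
  secretStoreRefs:
    - name: failover-store           # hypothetical ClusterSecretStore name
      kind: ClusterSecretStore
  selector:
    secret:
      name: airflow-fernet-key
  data:
    - match:
        secretKey: fernet-key
        remoteRef:
          remoteKey: deployments/etl-prod/fernet-key
---
# Destination cluster: pull the same secret into the Deployment namespace.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: etl-prod-fernet-key          # hypothetical name
spec:
  secretStoreRef:
    name: failover-store
    kind: ClusterSecretStore
  target:
    name: airflow-fernet-key
  data:
    - secretKey: fernet-key
      remoteRef:
        key: deployments/etl-prod/fernet-key
```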

Database requirements

APC provisions logical databases and database users automatically when you create a Deployment, but it doesn’t provision database servers. You must provide a database server hostname that is network-accessible from both the source and destination data plane clusters.

Supported topologies

Two database server topologies are supported:

  • Shared server: both clusters connect to the same database server hostname. This is typically a cloud-managed database server (for example, AWS RDS) reachable from both cluster networks.
  • Synchronized servers: a primary database server is kept in sync with one or more replicas through a customer-managed replication mechanism. APC must connect through a single stable hostname or endpoint that always resolves to the active primary — the source and destination clusters don’t point at their own per-cluster endpoints. During failover, you are responsible for promoting the replica and updating that endpoint to point at the new primary before initiating the APC data plane failover.

In either topology, you are responsible for setting up network access from your data plane clusters to the database server, and for managing replication, primary promotion, and endpoint cutover if you use the synchronized topology.

Expected failover order

When performing regional failover with synchronized database servers, the expected sequence is:

  1. Database replica promotion.
  2. Endpoint updated to point to the new primary.
  3. APC data plane failover initiated.

How APC uses the database server

When a Deployment is created with failover enabled, APC provisions two sets of logical databases and credentials on the database server: an active set used by the source cluster and an inactive set used by the destination cluster. Immediately after provisioning, APC blocks the inactive credentials from connecting.

During failover, the deployment orchestrator performs database fencing for each Deployment: it revokes connect access from the active credentials and grants it to the inactive credentials. This ensures only one cluster writes to a given Deployment’s logical databases at a time. Fencing is per Deployment, so individual Deployments can be migrated at different times during a single failover request.
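On PostgreSQL, the fencing step might look something like the following. This is illustrative only: the database, role, and user names are made up, and the exact statements APC runs aren't documented here.

```sql
-- Fence the previously active (source-side) role out of this Deployment's
-- metadata database and drop any connections it still holds.
REVOKE CONNECT ON DATABASE etl_prod_airflow FROM etl_prod_source;
SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE datname = 'etl_prod_airflow'
   AND usename = 'etl_prod_source';

-- Admit the previously inactive (destination-side) role.
GRANT CONNECT ON DATABASE etl_prod_airflow TO etl_prod_destination;
```

Because the revoke happens before the grant, there is no window in which both clusters can write to the same logical database.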

Per-Deployment database users

To prevent split-brain writes during failover, APC fences each Airflow Deployment at the database level using two database users per Deployment. APC supports two ownership models:

  • APC manages users: You grant APC permission to create and manage Airflow database users (for example, ALTER ROLE), and APC handles the rest. You don’t take any further action.
  • Customer manages users: You create the two login roles per Airflow Deployment yourself (one per data plane cluster). APC still owns each Deployment’s Airflow metadata database and fences those login roles during failover, so you grant the deployment orchestrator database user an owner role and a connection-terminator role for every Deployment. For the exact SQL setup, see Customer-created database users.
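For the customer-managed model on PostgreSQL, the setup might look roughly like the following. All names here are hypothetical, and the authoritative statements and grants are defined in Customer-created database users; this sketch only shows the shape of the arrangement.

```sql
-- Two login roles per Deployment, one per data plane cluster.
CREATE ROLE etl_prod_source LOGIN PASSWORD 'change-me';
CREATE ROLE etl_prod_destination LOGIN PASSWORD 'change-me';

-- The deployment orchestrator user needs enough privilege to fence these
-- roles during failover, for example via membership in them.
GRANT etl_prod_source TO orchestrator;
GRANT etl_prod_destination TO orchestrator;
```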

Container registry requirements

APC doesn’t replicate container images between regions. You configure a single container registry endpoint per APC installation on the control plane, and every data plane in the installation pulls Airflow Deployment images from that endpoint. APC currently doesn’t support different registry endpoints per region.

Required capabilities

  • You configure one externally managed container registry endpoint on the control plane, and every data plane uses it.
  • The registry serves the same repository paths and tags from every region where a data plane may run, whether through a globally routed endpoint, customer-managed cross-region replication behind a single hostname, or another mechanism. An image reference like registry.example.com/my-org/airflow:1.2.3 must resolve to the same image regardless of which data plane pulls it.
  • Every data plane cluster has network access and credentials to pull from that endpoint.

Replication and failover eligibility

If you back the registry endpoint with cross-region replication, replication latency determines when a Deployment is eligible to run on a destination data plane. If a Deployment’s image hasn’t yet replicated to the region serving the destination data plane, APC can’t start that Deployment there — Pilot’s upsert step fails until the image becomes available. Size your registry replication SLA to be faster than your expected failover window for the Deployments that must be able to fail over.

Airflow log sink requirements

APC ships Airflow task logs from each data plane to an external Elasticsearch sink. After a Deployment moves between data planes, you still need to be able to read its task logs through the same UI, so the log sink topology matters for failover.

For details on configuring Vector and the Elasticsearch sink itself, see Configure task log collection and exporting to Elasticsearch.

Supported topologies

APC supports two Elasticsearch topologies for failover:

  • Single shared Elasticsearch endpoint: Every data plane in the APC installation ships logs to the same Elasticsearch endpoint. Logs from the source and destination clusters land in the same backend, so post-failover log lookups continue to work without any further configuration. This is typically a multi-region or globally routed Elasticsearch service that all data planes can reach over the network.
  • Active-active Elasticsearch per region: You run an Elasticsearch cluster in each region with bidirectional replication between them. You configure each regional data plane with its own regional Elasticsearch endpoint, and you manage the cross-region replication so that logs written by either side are visible from both. Each data plane writes to the closest endpoint, and queries from the control plane resolve against any region.

In either topology, you are responsible for sizing, securing, and operating the Elasticsearch infrastructure, and (in the active-active topology) for managing replication between regions.
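For the single shared endpoint topology, every data plane's log shipper points at the same hostname. A Vector-style sink fragment might look like the following sketch; the sink name, input name, and endpoint are placeholders, and the authoritative settings are in the linked Vector configuration documentation.

```yaml
# Illustrative Vector sink fragment; names and endpoint are placeholders.
sinks:
  airflow_task_logs_es:
    type: elasticsearch
    inputs: ["airflow_task_logs"]
    endpoints: ["https://es-global.example.com:9200"]
    mode: bulk
```

In the active-active topology, the same fragment would instead name the regional endpoint on each data plane, with replication keeping the two backends in sync.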

Prerequisites

Before enabling data plane failover, confirm the following:

  • You have an APC installation with separate control and data plane clusters (global.plane.mode: control on the control plane, global.plane.mode: data on each data plane). Failover isn’t supported in unified mode.
  • You have an external secrets store supported for APC data plane failover. APC currently supports AWS Secrets Manager and Google Cloud Secret Manager through ESO.
  • You have configured a ClusterSecretStore custom resource in your data plane clusters that points to that secrets store.
  • Your source and destination data plane clusters have network access to the external secrets store.
  • You have an external sink for Airflow logs — an external Elasticsearch instance — that is reachable from every data plane cluster. Shipping logs to an external sink is required so that Airflow task logs remain accessible after a Deployment moves between data planes. For supported topologies, see Airflow log sink requirements.
  • You have a destination cluster that is registered with the APC control plane and is healthy.
  • You have a single externally managed container registry endpoint, configured on the control plane, that every data plane in the APC installation can pull Airflow Deployment images from. For requirements, see Container registry requirements. To configure the registry backend, see Use a registry backend.

Limitations

The initial APC 2.0 release of data plane failover has the following limitations:

  • Image-based Deployments only: Failover is supported only for image-based Airflow Deployments. Deployments that use git-sync or DAG-only deploy mechanisms aren’t supported and can’t be failed over.
  • New clusters and Deployments only: Failover can’t be enabled on a data plane cluster that already has Airflow Deployments on it. The feature is supported only on newly registered failover-enabled data plane clusters and on Deployments created on those clusters after failover has been enabled. Existing Deployments on a pre-existing cluster can’t be retroactively brought under failover management.