Trigger a data plane failover
Use this guide to trigger a data plane failover from the Astro Private Cloud (APC) UI. A failover moves all Airflow Deployments from a source cluster to a destination cluster. The process runs asynchronously — after you submit the request, APC handles execution without further input.
For a conceptual overview of how failover works, see Data plane failover. To enable the feature before triggering it, see Enable data plane failover.
Prerequisites
- Data plane failover is enabled on your APC installation. See Enable data plane failover.
- The source cluster has failoverEnabled set to true. If the Trigger Failover button appears dimmed with the message “Failover isn’t enabled for this cluster”, one of the feature configuration requirements hasn’t been satisfied. Double-check the configuration instructions, running pods, and pod logs.
- At least one healthy destination cluster is registered with the same APC control plane as the source cluster.
- You have permission to update clusters in the APC UI.
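These prerequisites can be mirrored as a simple eligibility check. The field names below (failoverEnabled, healthy) are illustrative assumptions, not APC’s actual schema:

```python
def can_trigger_failover(
    source: dict, destinations: list[dict], can_update_clusters: bool
) -> bool:
    """Illustrative eligibility check mirroring the prerequisites above."""
    return (
        source.get("failoverEnabled") is True            # feature enabled on the source cluster
        and any(d.get("healthy") for d in destinations)  # at least one healthy destination
        and can_update_clusters                          # permission to update clusters
    )
```

For example, `can_trigger_failover({"failoverEnabled": True}, [{"healthy": True}], True)` returns True, while an empty destination list makes it return False.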
Trigger a failover
Open cluster details
In the APC UI, go to the cluster list and click the source cluster — the cluster you want to fail over from.
Click Trigger Failover
In the top-right corner of the cluster details page, click Trigger Failover.
If the button appears dimmed, the source cluster isn’t eligible for failover. See the prerequisites above.
Select a destination cluster
In the Destination Cluster dropdown, select the cluster you want to fail over to. The dropdown lists only clusters that APC considers valid targets for the source cluster.
If no destinations appear, verify that at least one other cluster is registered, healthy, and reachable from the control plane.
Select a failover mode
In the Mode dropdown, select one of the following:
- Controlled: Drains Airflow Deployment components on the source cluster and waits for in-flight tasks to finish, up to a configured timeout, before promoting the destination. Use for planned maintenance or migrations where task loss isn’t acceptable.
- Forced: Promotes the destination cluster immediately without waiting for source Deployments to drain. Use this mode when the source cluster is unreachable or when speed takes priority over task completion.
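The key behavioral difference between the modes is the timeout-bounded drain in Controlled mode. The sketch below is a hypothetical simplification of that wait, not APC’s implementation:

```python
import time

def drain_with_timeout(tasks_remaining, timeout_s: float, poll_s: float = 0.0) -> bool:
    """Controlled mode sketch: wait for in-flight tasks to finish, up to timeout_s.

    tasks_remaining is a callable returning the number of running tasks.
    Returns True if the source cluster drained before the deadline.
    Forced mode skips this wait entirely and promotes immediately.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if tasks_remaining() == 0:
            return True
        time.sleep(poll_s)
    return tasks_remaining() == 0
```

With a zero timeout the function behaves like Forced mode’s decision point: it checks once and moves on regardless of in-flight tasks.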
Monitor failover progress
After you submit a failover request, APC transitions each Deployment on the source cluster through its own migration state machine. You can monitor progress by checking cluster health status in the APC UI.
The source cluster status transitions to FAILING_OVER while the request is in progress. If any Deployment migration fails, the cluster status transitions to FAILOVER_FAILED. This status is visible only in the Postgres database, not in the APC UI.
Failover execution continues in the background even if you close the browser or navigate away from the cluster details page.
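Because the request runs asynchronously, monitoring amounts to polling the cluster status until it leaves FAILING_OVER. The helper below is an illustrative sketch; the status values match this guide, but the callable stands in for whatever status source you have:

```python
import time

# Statuses at which polling can stop (illustrative set based on this guide).
TERMINAL_STATUSES = {"HEALTHY", "FAILOVER_FAILED"}

def wait_for_failover(get_status, interval_s: float = 0.0) -> str:
    """Poll a status callable until the cluster leaves FAILING_OVER."""
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
```

For example, feeding it a sequence of statuses ending in HEALTHY returns "HEALTHY" once the failover completes.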
What happens during a failover
For each Deployment on the source cluster, APC:
- Replicates Airflow secrets — including the fernet key, environment variables, and database credentials — to the destination cluster using the External Secrets Operator.
- Creates the Airflow namespace on the destination cluster.
- In Controlled mode, drains and deletes Deployments on the source cluster before continuing. In Forced mode, this step is skipped.
- Fences the source database connection to prevent split-brain writes.
- Creates the Airflow Deployment on the destination cluster with the same configuration, name, namespace, and ID as the source Deployment.
After all Deployments migrate successfully, the failover request transitions to SUCCEEDED.
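The per-Deployment sequence above can be sketched as an ordered pipeline. All function behavior and step names here are illustrative, derived only from the list above:

```python
def migrate_deployment(deployment: str, mode: str) -> list[str]:
    """Sketch of the per-Deployment migration steps APC performs."""
    steps = [
        f"replicate secrets for {deployment}",   # fernet key, env vars, DB credentials
        f"create namespace for {deployment}",
    ]
    if mode == "controlled":
        # Forced mode skips the drain-and-delete step.
        steps.append(f"drain and delete {deployment} on source")
    steps += [
        f"fence source database for {deployment}",  # prevent split-brain writes
        f"create {deployment} on destination",      # same config, name, namespace, ID
    ]
    return steps
```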
If you want a Dag run to resume after a failover, make sure the Dag sets retries greater than 0 and that the tasks within it are idempotent.
Tasks that were running on the source cluster when the failover started are rescheduled on the destination cluster, but Airflow retries them only if the task’s retry count allows it.
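Idempotency matters here because a rescheduled task may partially repeat work it already did on the source cluster. A minimal, Airflow-free sketch of an idempotent task body (keyed upsert, so a retry produces the same state as a single run; in a real Dag you would also set retries greater than 0):

```python
# Stand-in for a target table keyed by run, so repeated writes don't duplicate rows.
state: dict[str, str] = {}

def load_partition(run_id: str) -> None:
    # Idempotent: an upsert by key, so re-running after a failover
    # leaves exactly the same state as running once.
    state[run_id] = f"loaded:{run_id}"

load_partition("2024-01-01")
load_partition("2024-01-01")  # retry after failover: no duplicate data
```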