Disaster recovery
The Astro Data Plane is designed to withstand in-region Availability Zone (AZ) degradations and outages as described in Resilience. For full region outages on AWS dedicated clusters, Astro supports self-service cross-region disaster recovery (DR).
Cross-region disaster recovery (AWS dedicated clusters)
Public Preview
This feature is in Public Preview. Self-service cross-region disaster recovery requires the Enterprise Business Critical tier and is currently available for AWS dedicated clusters only. Support for GCP and Azure is planned for later this year.
Cross-region DR lets you configure a pair of dedicated clusters — a primary and a secondary — in different AWS regions. The secondary cluster stays continuously synchronized with the primary so you can fail over with minimal downtime and data loss. After failover, Astro automatically enables synchronization in the reverse direction, keeping the original primary ready for failback. When the primary region recovers, you can fail back with a single click.
How AWS disaster recovery works
- The primary cluster runs all Deployments in Region A.
- A multi-region database replicates Deployment metadata to the secondary cluster in Region B.
- Multi-region object storage copies task logs to the secondary cluster.
- User-deployed images are replicated to the secondary cluster.
- On failover, the secondary cluster is promoted to active. All Deployments, configuration, environment variables, connections, and Airflow variables transfer automatically.
- Clusters and Deployments retain their IDs, names, namespaces, and system-managed configuration after failover. All hostnames, including the Airflow UI, Airflow API, and Remote Execution API URLs, remain the same and are updated to resolve to the secondary cluster.
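The role swap described above can be pictured with a small toy model. This is only an illustrative sketch, not an Astro API: the `DrPair` class, the `fail_over` function, the region names, and the hostname are all hypothetical. It shows three properties from the list: the standby is promoted on failover, replication then runs in the reverse direction, and the hostname does not change.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DrPair:
    """Toy model of a DR cluster pair (hypothetical, not an Astro object)."""

    active_region: str
    standby_region: str
    hostname: str  # unchanged by failover in this model

    @property
    def replication(self) -> tuple[str, str]:
        # Replication always flows from the active region to the standby.
        return (self.active_region, self.standby_region)


def fail_over(pair: DrPair) -> DrPair:
    """Promote the standby region; replication direction reverses as a result."""
    return replace(
        pair,
        active_region=pair.standby_region,
        standby_region=pair.active_region,
    )


pair = DrPair("region-a", "region-b", "deployment.example.invalid")
after_failover = fail_over(pair)
after_failback = fail_over(after_failover)
```

In this model, failback is the same operation applied a second time, which returns the pair to its original state.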
RTO and RPO
The following table defines the recovery time objective (RTO) and recovery point objective (RPO) for DR clusters. Targets are benchmarked with 80+ Deployments and 1,250+ concurrent task runs.
See Task Logs Replication SLA for details on the RPO guarantee.
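As a rough illustration of what an RPO means in practice, the sketch below computes the data-loss window for a hypothetical outage and compares it with a target. The function names and timestamps are invented for the example. The 15-minute target is the task log RPO this page quotes for the Task Logs Replication SLA; substitute the target that applies to your own data.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_replicated_at: datetime, outage_started_at: datetime) -> timedelta:
    """Time span of writes that never reached the secondary before the outage."""
    return max(outage_started_at - last_replicated_at, timedelta(0))


def meets_rpo(window: timedelta, rpo_target: timedelta) -> bool:
    """True when the measured data-loss window is within the RPO target."""
    return window <= rpo_target


# Hypothetical timestamps for illustration only.
last_replicated_at = datetime(2025, 1, 1, 11, 52, tzinfo=timezone.utc)
outage_started_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)

window = data_loss_window(last_replicated_at, outage_started_at)
within_target = meets_rpo(window, rpo_target=timedelta(minutes=15))
```

Here the last successful replication happened 8 minutes before the outage, so the window falls inside a 15-minute target.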
What gets failed over
The following items transfer to the secondary cluster automatically during failover:
- Deployments and data pipelines
- Dag run history, task instance metadata, and XComs
- Deployment configuration
- Environment variables, connections, Airflow variables, and metrics exports — whether configured via Environment Manager or directly on the Deployment
- Task logs. Enable Task Logs Replication SLA for a guaranteed 15-minute RPO.
The following items do not transfer automatically and require manual steps after configuring the secondary cluster:
- Networking and DNS configuration. Configure these with self-service features such as VPC peering or Customer Managed Egress, or work with Astronomer support.
- imagePullSecrets for Kubernetes Pod Operators (KPOs)
- Customer-managed workload identities. You must configure the OIDC issuer and IAM trust policies for the secondary cluster separately. See Workload identity.
- Customer-managed Transit Gateway routing on the secondary cluster
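For the workload identity item, the following sketch builds an AWS IAM trust policy that trusts an OIDC issuer for each cluster in the pair. It assumes standard AWS web identity federation with an EKS-style subject claim. Whether your setup uses this exact subject and audience format is an assumption, so check Workload identity before applying anything like it. The account ID, issuer hosts, namespace, and service account name are all placeholders.

```python
import json


def web_identity_statement(account_id: str, issuer_host: str,
                           namespace: str, service_account: str) -> dict:
    """One trust statement for a single OIDC issuer (EKS-style claims assumed)."""
    return {
        "Effect": "Allow",
        "Principal": {
            "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host}"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                f"{issuer_host}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{issuer_host}:aud": "sts.amazonaws.com",
            }
        },
    }


def trust_policy(account_id: str, issuer_hosts: list[str],
                 namespace: str, service_account: str) -> dict:
    """Trust policy with one statement per cluster issuer (primary and secondary)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            web_identity_statement(account_id, host, namespace, service_account)
            for host in issuer_hosts
        ],
    }


# Placeholder values; replace with the issuers of your own primary and secondary clusters.
policy = trust_policy(
    account_id="111122223333",
    issuer_hosts=["oidc.primary.example.invalid", "oidc.secondary.example.invalid"],
    namespace="example-namespace",
    service_account="example-service-account",
)
policy_json = json.dumps(policy, indent=2)
```

Because Deployments keep their namespaces after failover, the subject claim in this sketch is identical in both statements and only the issuer differs.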