Improving Reliability with a More Resilient Auth Proxy Architecture
Reliability is table stakes for running Airflow at scale. Over the past year, we’ve seen firsthand how even brief disruptions in authentication can cascade into failed DAG runs, blocked API access, and degraded UI experiences, all while the underlying dataplanes remained healthy.
Until now, every request was authenticated through a centralized Auth Proxy. While that service was highly scaled for throughput, it was a single-region dependency for the entire control plane.
Today, we’re rolling out a major architectural improvement to Astro’s authentication layer that materially improves platform resilience, availability, and long-term scalability.
What’s Changing
Dataplane-Based Forward Authentication
Each dataplane now runs its own forward-auth service with URI-aware authentication logic, secure encrypted cookies, and Auth0-to-Airflow token exchange. This eliminates the need for a centralized Auth Proxy to mediate every request. The Astro UI routes traffic through new, more resilient paths that do not depend on a single region, while programmatic API access remains backward compatible with no breaking changes.
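The forward-auth pattern can be sketched in Python. This is a minimal illustration, not Astro's implementation. The header name `X-Forwarded-Uri` (borrowed from common forward-auth proxies), the `/health` and `/api/` path rules, the `session` cookie name, and the `forward_auth` function are all hypothetical, and the Auth0-to-Airflow token exchange is left out. The post describes encrypted cookies. To stay within the standard library, this sketch swaps in HMAC-signed cookies, which protect integrity but not confidentiality.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping, Optional

# Hypothetical signing key; a real service would load this from a secret store.
COOKIE_SECRET = b"example-secret-do-not-use"


@dataclass
class AuthDecision:
    status: int  # 200 tells the proxy to forward the request, 401 to reject it
    headers: Dict[str, str] = field(default_factory=dict)


def _sign(value: str) -> str:
    return hmac.new(COOKIE_SECRET, value.encode(), hashlib.sha256).hexdigest()


def sign_session(user: str) -> str:
    """Build a 'user.signature' cookie value (signed, NOT encrypted)."""
    return f"{user}.{_sign(user)}"


def verify_session(cookie: str) -> Optional[str]:
    """Return the user if the cookie signature checks out, else None."""
    user, _, sig = cookie.rpartition(".")
    if not user:
        return None
    if hmac.compare_digest(sig.encode(), _sign(user).encode()):
        return user
    return None


def forward_auth(
    headers: Mapping[str, str],
    cookies: Mapping[str, str],
    validate_token: Callable[[str], bool],
) -> AuthDecision:
    # The reverse proxy passes the original URI in a header so the
    # auth service can apply different rules per path (URI-aware auth).
    uri = headers.get("X-Forwarded-Uri", "/")
    if uri.startswith("/health"):
        return AuthDecision(200)
    if uri.startswith("/api/"):
        # Programmatic access: expect a bearer token and validate it locally.
        auth = headers.get("Authorization", "")
        if auth.startswith("Bearer ") and validate_token(auth[len("Bearer "):]):
            return AuthDecision(200)
        return AuthDecision(401)
    # UI access: expect a valid signed session cookie.
    user = verify_session(cookies.get("session", ""))
    if user is None:
        return AuthDecision(401)
    return AuthDecision(200, {"X-Auth-User": user})
```

Keeping API and UI rules in one URI-aware function means API clients never need a cookie and browsers never need to handle raw tokens.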
Controlled Rollout with Feature Flags
We’ve added backend controls and UI feature flags that let us gradually route deployments to the new authentication paths, monitor behavior and performance in real time, and roll out changes incrementally rather than switching every deployment at once.
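One common way to implement a gradual rollout is deterministic, hash-based bucketing. Each deployment maps to a stable bucket, so raising the rollout percentage only ever adds deployments to the new path. The sketch below assumes that technique. Astro's actual flag system may work differently, and the flag name `dataplane-forward-auth` and the function names are hypothetical.

```python
import hashlib

# Hypothetical flag name used only for this illustration.
DEFAULT_FLAG = "dataplane-forward-auth"


def rollout_bucket(deployment_id: str, flag: str = DEFAULT_FLAG) -> int:
    """Map a deployment to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{deployment_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_auth_path(deployment_id: str, rollout_percent: int, flag: str = DEFAULT_FLAG) -> bool:
    """True if the deployment falls inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(deployment_id, flag) < rollout_percent
```

Because buckets are stable, a deployment routed to the new path at 10% stays there at 50%, and setting the percentage back to 0 acts as a kill switch.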
Deployment-Scoped API Tokens
To further decouple workloads from the control plane, we’re introducing deployment-scoped Airflow JWTs and token-based API access paths that continue working during control plane outages. Deployment tokens can be generated and managed from both the UI and the API. This dramatically reduces the risk of auth-related DAG failures.
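A deployment-scoped token can be checked with key material the dataplane already holds, so validation needs no round trip to the control plane. A minimal sketch of that local check follows. It assumes an HS256-signed JWT whose standard `aud` claim carries the deployment ID. The signing algorithm, the claim layout, and the `make_token`/`verify_deployment_token` helpers are hypothetical, not Airflow's or Astro's actual token format, and production code should use a maintained JWT library rather than hand-rolled parsing.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def make_token(claims: dict, key: bytes) -> str:
    """Issue an HS256 JWT (for illustration and testing only)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_deployment_token(token: str, key: bytes, deployment_id: str, now: Optional[float] = None) -> dict:
    """Verify signature, expiry, and deployment scope without any network call."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected signing algorithm")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise ValueError("token expired")
    if claims.get("aud") != deployment_id:
        raise ValueError("token is not scoped to this deployment")
    return claims
```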
Security, Validation, and Testing
This change improves our overall security posture by removing a single, centralized target shared by every Airflow deployment and distributing authentication traffic across dataplanes.
We’ve invested heavily in validation and testing:
- End-to-end monitoring validates both the old and new paths.
- E2E tests cover the new URIs and proxy flows.
- Simulated control plane outages verify failover behavior.
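A control plane outage can be simulated in a unit test by replacing the control plane call with a stub that always fails. The sketch below is hypothetical. `authorize`, `ControlPlaneUnavailable`, and the local-then-remote lookup order are assumptions chosen to show the shape of such a test, not Astro's test suite.

```python
import unittest
from typing import Callable


class ControlPlaneUnavailable(Exception):
    """Raised by the stubbed control plane to simulate an outage."""


def control_plane_down(token: str) -> bool:
    # Stub standing in for a network call to the control plane.
    raise ControlPlaneUnavailable("control plane unreachable")


def authorize(
    token: str,
    local_check: Callable[[str], bool],
    remote_check: Callable[[str], bool],
) -> bool:
    """Accept tokens the dataplane can verify itself; otherwise ask the control plane."""
    if local_check(token):
        return True
    try:
        return remote_check(token)
    except ControlPlaneUnavailable:
        # Fail closed: a token that cannot be verified is rejected during an outage.
        return False


class OutageSimulationTest(unittest.TestCase):
    def test_deployment_token_survives_outage(self) -> None:
        self.assertTrue(authorize("dep-token", lambda t: t == "dep-token", control_plane_down))

    def test_unknown_token_fails_closed(self) -> None:
        self.assertFalse(authorize("unknown", lambda t: False, control_plane_down))
```

Run it with `python -m unittest` against the module that contains it. Failing closed keeps an outage from silently widening access.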
Direct Access Tokens
To further decouple your workloads from control plane dependencies, we’re introducing Direct Access Tokens: a new API token type that provides continued access to your Airflow instances during control plane outages. These tokens are now available at three scopes: organization-level (manage resources across your entire Astro organization), workspace-level (control deployments within a workspace), and deployment-level (direct access to individual Airflow instances).
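The three scopes can be modeled as a simple containment check. An organization token covers every deployment in the organization, a workspace token covers deployments in that workspace, and a deployment token covers exactly one deployment. This is a hypothetical sketch of the scoping rules described above. The `Scope`, `Deployment`, `AccessToken`, and `token_covers` names are illustrative and not part of any Astro API, and the sketch ignores role permissions within a scope.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    ORGANIZATION = "organization"
    WORKSPACE = "workspace"
    DEPLOYMENT = "deployment"


@dataclass(frozen=True)
class Deployment:
    org_id: str
    workspace_id: str
    deployment_id: str


@dataclass(frozen=True)
class AccessToken:
    scope: Scope
    resource_id: str  # org, workspace, or deployment ID, depending on scope


def token_covers(token: AccessToken, dep: Deployment) -> bool:
    """True if the token's scope includes the given deployment."""
    if token.scope is Scope.ORGANIZATION:
        return token.resource_id == dep.org_id
    if token.scope is Scope.WORKSPACE:
        return token.resource_id == dep.workspace_id
    return token.resource_id == dep.deployment_id
```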
Direct Access Tokens use deployment-scoped Airflow JWTs that continue working when the control plane is unavailable. Combined with the new authentication architecture, this dramatically reduces the risk of auth-related DAG failures during incidents. Organization Owners can create Direct Access Tokens through the Astro UI; each token has the same permissions as the Workspace Operator role for Deployment-level operations.
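A token like this is typically sent as a bearer token on requests to the Airflow REST API. The sketch below lists DAG IDs using only the standard library. The deployment URL is a placeholder, and the `/api/v1/dags` path assumes the Airflow 2 stable REST API (Airflow 3 serves its public API under `/api/v2`). Check the Deployment API tokens documentation for the exact URL and header your Deployment expects.

```python
import json
import urllib.request
from typing import List


def build_list_dags_request(deployment_url: str, token: str, api_version: str = "v1") -> urllib.request.Request:
    """Build a GET request for the Airflow REST API 'list DAGs' endpoint."""
    url = f"{deployment_url.rstrip('/')}/api/{api_version}/dags"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def list_dag_ids(deployment_url: str, token: str, api_version: str = "v1", timeout: float = 10.0) -> List[str]:
    """Call the endpoint and return the dag_id of each DAG on the first page of results."""
    req = build_list_dags_request(deployment_url, token, api_version)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return [dag["dag_id"] for dag in body.get("dags", [])]
```

Usage: call `list_dag_ids("<your-deployment-url>", "<direct-access-token>")` with your own values substituted for both placeholders.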
Why This Matters
For teams running mission-critical Airflow workloads where uptime and resilience are non-negotiable, these improvements deliver:
- Reduced blast radius – Control plane incidents no longer block access to healthy dataplanes
- Operational continuity – DAGs that depend on the Airflow API can continue running during outages
- Improved recovery times – Lower RTO/RPO and reduced data pipeline risk during incidents
This work directly addresses reliability issues experienced by our customers and removes a root cause behind several platform incidents this year. It also allows us to operate Astro more efficiently by reducing centralized compute and egress costs, helping us reinvest in performance and reliability improvements across the platform.
Most importantly, it ensures that your Airflow workloads keep running—even when parts of the control plane don’t. This is what we mean when we talk about building the Astro Engine: a foundation designed for enterprise reliability, where your data pipelines remain operational regardless of infrastructure challenges.
What Customers Need to Do
For most customers, no action is required.
A small number of customers with strict network allowlists or custom networking configurations may need to update filters to allow new IP ranges. If this applies to you, we’ll reach out directly with clear guidance.
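If you maintain a strict allowlist, Python's standard `ipaddress` module can check whether the ranges you are sent are already covered. The CIDRs in the usage example come from reserved documentation blocks (RFC 5737 for IPv4, RFC 3849 for IPv6) and are placeholders, not Astro's actual ranges. `is_allowed` and `missing_ranges` are hypothetical helpers.

```python
import ipaddress
from typing import Iterable, List


def is_allowed(ip: str, allowlist: Iterable[str]) -> bool:
    """True if the address falls inside any CIDR in the allowlist."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)


def missing_ranges(current: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return required CIDRs not fully contained in any existing allowlist entry."""
    existing = [ipaddress.ip_network(cidr) for cidr in current]
    missing = []
    for cidr in required:
        net = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IPv4/IPv6, so compare versions first.
        if not any(net.version == e.version and net.subnet_of(e) for e in existing):
            missing.append(cidr)
    return missing
```

For example, `missing_ranges(["203.0.113.0/24"], ["203.0.113.128/25", "198.51.100.0/24"])` returns `["198.51.100.0/24"]`, the one range that would still need to be added.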
To start using Direct Access Tokens, visit the Astro UI to generate tokens at the organization, workspace, or deployment level. See the Deployment API tokens documentation for implementation details and best practices.