Airflow system components
A Deployment in Astro Private Cloud (APC) consists of multiple components that work together to orchestrate and execute your data pipelines. Each component has a specific role and configuration options.
Core components
Scheduler
The scheduler is the heart of Airflow. It monitors all Dags and tasks, triggers task instances once their dependencies are complete, and submits tasks to the executor for execution.
Default configuration:
Key responsibilities:
- Schedule tasks based on dependencies and triggers.
- Monitor task states and handle retries.
- Manage pools and task queues.
- Parse Dag files and create Dag runs. When the Dag processor is enabled, Dag parsing moves to the Dag processor and the scheduler handles only scheduling.
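The scheduling behavior above is tunable through Airflow configuration. As a sketch, a few scheduler-related settings in airflow.cfg (setting names as in Airflow 2; the values shown are illustrative, not the platform defaults):

```ini
[scheduler]
# How often (in seconds) the scheduler heartbeats to signal liveness.
scheduler_heartbeat_sec = 5
# Number of processes used to parse Dag files; not used when a
# standalone Dag processor handles parsing.
parsing_processes = 2
# Minimum interval (in seconds) between re-parses of the same Dag file.
min_file_process_interval = 30
```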
Webserver
The webserver provides the Airflow UI for monitoring Dags, viewing logs, triggering runs, and managing configurations.
Default configuration:
Key features:
- Dag visualization and monitoring.
- Task log viewing.
- Variable and connection management.
- User authentication and authorization.
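For reference, the webserver's listening port and process count map to settings like these in airflow.cfg (Airflow 2 setting names; values illustrative):

```ini
[webserver]
# Port the Airflow UI listens on inside the webserver pod.
web_server_port = 8080
# Number of Gunicorn worker processes serving the UI.
workers = 4
```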
Workers
Workers execute tasks. They run as a persistent deployment only when you use the Celery Executor; with the Kubernetes Executor, Airflow launches ephemeral task pods instead.
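The executor choice that determines whether persistent workers exist is an Airflow configuration setting. A sketch in airflow.cfg terms (values illustrative):

```ini
[core]
# CeleryExecutor runs tasks on the persistent worker deployment;
# KubernetesExecutor launches a pod per task instead.
executor = CeleryExecutor

[celery]
# Number of task slots each Celery worker process handles concurrently.
worker_concurrency = 16
```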
Default configuration (Celery Executor):
Triggerer
The triggerer handles deferrable operators, allowing tasks to release worker slots while waiting for external events.
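The triggerer's mechanism can be illustrated outside Airflow: instead of each wait blocking its own worker slot, one asyncio event loop multiplexes many waits. A minimal plain-Python sketch (not Airflow code; the function names are hypothetical):

```python
import asyncio

async def wait_for_event(name: str, delay: float) -> str:
    # Stand-in for a deferrable trigger: awaiting frees the event loop
    # to service other waits, the way a deferred task frees its worker slot.
    await asyncio.sleep(delay)
    return f"{name} fired"

async def triggerer_loop() -> list:
    # One process multiplexes many concurrent waits, like the triggerer
    # running many triggers on a single event loop.
    waits = [wait_for_event(f"trigger-{i}", 0.01) for i in range(100)]
    return await asyncio.gather(*waits)

results = asyncio.run(triggerer_loop())
print(len(results))
```

All 100 waits complete on a single loop in roughly the time of one wait, which is why a single triggerer replica can supervise many deferred tasks.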
Default configuration:
Dag processor
The Dag processor parses Dag files and updates the metadata database with Dag definitions. It is available as a standalone component in Airflow 2.3+ and mandatory in Airflow 3+.
Default configuration:
In Airflow 3, the Dag processor is automatically enabled and required for Dag discovery.
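In Airflow 2.3+, the standalone Dag processor is opt-in via a scheduler setting; in Airflow 3 parsing is always separate, so no flag is needed. A sketch of the Airflow 2 setting:

```ini
[scheduler]
# Airflow 2.3+: move Dag parsing out of the scheduler into a
# standalone Dag processor component.
standalone_dag_processor = True
```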
API server (Airflow 3+)
The API server is a new component in Airflow 3 that provides the REST API, separated from the webserver for better scalability.
Default configuration:
The platform manages the number of API server replicas.
Supporting components
Redis
The message broker for the Celery Executor. It handles task queue communication between the scheduler and workers.
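Celery finds the broker through Airflow configuration; a sketch of the relevant setting (the hostname and database number are placeholders, not your Deployment's actual values):

```ini
[celery]
# Point Celery at the Redis broker. Host and db number are illustrative.
broker_url = redis://airflow-redis:6379/0
```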
StatsD exporter
Collects and exports Airflow metrics for monitoring systems like Prometheus.
PgBouncer (optional)
Connection pooler that sits between Airflow components and the metadata database. It reduces the number of direct database connections opened by the scheduler, webserver, and workers.
PgBouncer is enabled only when the cluster uses PostgreSQL and pgbouncer.enabled is set to true in your platform configuration. It is disabled when the cluster uses MySQL.
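A sketch of the platform-configuration fragment this refers to, based on the pgbouncer.enabled key named above (the surrounding structure of your values file may differ):

```yaml
# Enable PgBouncer in front of the PostgreSQL metadata database.
pgbouncer:
  enabled: true
```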
Flower
Web UI for monitoring Celery workers. Only active when using Celery Executor.
Airflow 2 vs Airflow 3 components
Resource recommendations
Small workloads (< 50 Dags)
Medium workloads (50-200 Dags)
Large workloads (200+ Dags)
Scaling components
Horizontal scaling
The following components support multiple replicas. Default limits apply unless your platform administrator overrides them in the platform configuration.
- Scheduler: Up to 4 replicas by default.
- API server: Up to 4 replicas by default.
- Dag processor: Up to 3 replicas by default.
- Workers: Up to 10 replicas by default.
- Triggerer: Up to 2 replicas by default.
Vertical scaling
Increase resources for:
- Scheduler: Complex dependencies or high task volume. When the Dag processor is enabled, the scheduler focuses on scheduling only.
- Dag processor: Large number of Dag files or complex parsing requirements.
- Workers: Memory-intensive tasks.
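Vertical scaling ultimately resolves to Kubernetes resource requests and limits on the component's pods. A generic sketch of the shape (the values are illustrative, and the exact keys and their location in your platform configuration may differ):

```yaml
# Illustrative Kubernetes-style resource block for a scheduler container.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```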