When working with Airflow, it is important to understand the components of its underlying infrastructure. Even if you mostly interact with Airflow as a DAG author, knowing which components are "under the hood" and why they are needed can be helpful for developing your DAGs, debugging, and running Airflow successfully.
In this guide, we’ll describe Airflow’s core components and touch on managing Airflow infrastructure and high availability. Note that this guide is focused on the components and features of Airflow 2.0+. Some of the components and features mentioned here are not available in earlier versions of Airflow.
Apache Airflow has four core components that are running at all times:
- Webserver: A Flask server running with Gunicorn that serves the Airflow UI.
- Scheduler: A daemon responsible for scheduling jobs. This is a multi-threaded Python process that determines which tasks need to be run, when they need to be run, and where they are run.
- Database: A database where all DAG and task metadata are stored. This is typically a Postgres database, but MySQL, MSSQL, and SQLite are also supported.
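For example, pointing Airflow at a Postgres metadata database is done through a SQLAlchemy connection string in airflow.cfg (the connection details below are placeholders):

```ini
[core]
# Placeholder credentials; note that in Airflow 2.3+ this option lives
# under a [database] section instead of [core].
sql_alchemy_conn = postgresql+psycopg2://<user>:<password>@<host>:5432/airflow
```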
- Executor: The mechanism for running tasks. An executor runs within the Scheduler whenever Airflow is up. In the section below, we walk through the different executors available and how to choose between them.
If you run Airflow locally using the Astro CLI, you'll notice that when you start Airflow using astrocloud dev start, it spins up three containers: one each for the webserver, the Scheduler, and the database. (The executor runs within the Scheduler, so it does not get its own container.)
In addition to these core components, there are a few situational components that are used only to run tasks or make use of certain features:
- Worker: The process that executes tasks, as defined by the executor. Depending on which executor you choose, you may or may not have workers as part of your Airflow infrastructure.
- Triggerer: A process that supports deferrable operators. This component is optional and must be run separately; it is needed only if you plan to use deferrable (or "asynchronous") operators.
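To build intuition for why the Triggerer exists, here is a toy sketch (plain asyncio, not actual Airflow code) of the idea behind deferrable operators: many waiting "triggers" share a single event loop, instead of each one occupying a worker slot while it waits.

```python
import asyncio

# Toy model only: in real Airflow, a deferrable operator hands a trigger
# to the Triggerer process and frees its worker slot until the trigger fires.

async def date_time_trigger(delay: float) -> str:
    """Stand-in for a trigger: waits asynchronously, then fires an event."""
    await asyncio.sleep(delay)
    return "trigger_fired"

async def triggerer(triggers):
    """Stand-in for the Triggerer: runs many triggers on one event loop."""
    return await asyncio.gather(*triggers)

# Three "waits" share one loop rather than blocking three workers.
results = asyncio.run(triggerer([date_time_trigger(0.01) for _ in range(3)]))
print(results)
```

The point of the sketch is the resource model: waiting is cheap on an event loop, so one Triggerer process can supervise many deferred tasks at once.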
In the following diagram taken from the Airflow documentation, you can see how all of these components work together:
Airflow users can choose from multiple available executors or write a custom one. Each executor excels in specific situations:
SequentialExecutor: Executes tasks sequentially inside the Scheduler process, with no parallelism or concurrency. This executor is rarely used in practice, but it is the default in Airflow’s configuration.
LocalExecutor: Executes tasks locally inside the Scheduler process, but supports parallelism and hyperthreading. This executor is a good fit for testing Airflow on a local machine or on a single node.
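Switching from the default executor to the LocalExecutor is a one-line change in airflow.cfg (the setting name is standard; the surrounding file is abbreviated here):

```ini
[core]
# The default is SequentialExecutor; LocalExecutor allows tasks to run
# in parallel on the same machine as the Scheduler.
executor = LocalExecutor
```

The same setting can also be supplied as the environment variable AIRFLOW__CORE__EXECUTOR, which is often more convenient in containerized deployments.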
CeleryExecutor: Uses a Celery backend (such as Redis, RabbitMQ, or another message queue system) to coordinate tasks between preconfigured workers. This executor is ideal if you have a high volume of shorter-running tasks or a more consistent task load.
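A minimal CeleryExecutor configuration might look like the following (hostnames and credentials are placeholders, and any Celery-supported broker can stand in for Redis):

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder endpoints: the broker passes task messages to workers,
# and the result backend stores task state.
broker_url = redis://<redis-host>:6379/0
result_backend = db+postgresql://<user>:<password>@<host>:5432/airflow
```

Workers are then started separately (for example, with the airflow celery worker command) on each machine that should pick up tasks.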
KubernetesExecutor: Calls the Kubernetes API to create a separate pod for each task to run, enabling users to pass in custom configurations for each of their tasks and use resources efficiently. This executor is great in a few different contexts:
- You have long-running tasks that you don't want to be interrupted by code deploys or Airflow updates.
- Your tasks require very specific resource configurations.
- Your tasks run infrequently, and you don’t want to incur worker resource costs when they aren’t running.
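As a hypothetical illustration of per-task configuration with the KubernetesExecutor: real Airflow code passes kubernetes client objects (such as a V1Pod) under the pod_override key of an operator's executor_config argument; plain dictionaries are used here only to show the shape of such an override.

```python
# Sketch only: in real code, "pod_override" takes a kubernetes.client.V1Pod
# rather than a plain dict. This shows a per-task resource request.
executor_config = {
    "pod_override": {
        "spec": {
            "containers": [
                {
                    "name": "base",
                    "resources": {
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                        "limits": {"cpu": "1", "memory": "1Gi"},
                    },
                }
            ]
        }
    }
}

# An operator would receive this via its executor_config argument, so each
# task can request exactly the resources it needs.
requests = executor_config["pod_override"]["spec"]["containers"][0]["resources"]["requests"]
print(requests)
```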
Note that there are also a couple of other executors that we don't cover here, including the CeleryKubernetesExecutor and the DaskExecutor. These are considered more experimental and are not as widely adopted as the other executors covered here.
Managing Airflow Infrastructure
All of the components discussed above should be run on supporting infrastructure appropriate for your scale and use case. Running Airflow on a local computer (e.g. using the Astro CLI) can be great for testing and DAG development, but is likely not sufficient to support DAGs running in production.
There are many resources available to help with managing Airflow's components.
Scalability is also important to consider when setting up your production Airflow. For more on this, check out our Scaling Out Airflow guide.
Airflow can be made highly available, which makes it suitable for large organizations with critical production workloads. Airflow 2 introduced a highly available Scheduler, meaning that you can run multiple Scheduler replicas in an active-active model. This makes the Scheduler more performant and resilient, eliminating a single point of failure within your Airflow environment.
Note that running multiple Schedulers does come with some extra requirements for the database. For more on how to make use of the HA Scheduler, check out the Apache Airflow documentation.