Scaling Out Airflow Webinar

Note: To learn more about how to scale Airflow, check out our guide, Scaling Airflow to optimize performance.

1. Key Points About Scaling Airflow

Virtually unlimited scaling potential
Use CeleryExecutor and KubernetesExecutor
Tune parameters to fit your needs
Easy to scale to more capacity
Aggregate logging is important

2. High-Level Steps to Scale Airflow

scaling-out-airflow-image3

3. Why Scale Apache Airflow®?

Because the workload outgrows your initial infrastructure
- More DAGs, more tasks
- More intensive individual tasks
Because your core Airflow components need more durability
To prepare for more DAGRuns and a higher compute load
To benefit from elasticity: save money by scaling only as needed

4. Symptoms That Mean You’re Ready to Scale

Many tasks stuck in Queued or Scheduled state
Unacceptable latency between tasks
Missing SLAs
High resource usage on Scheduler or Webserver
Out of Memory (OOM) errors on Tasks

Principles of Scaling Systems

5. Basics of Scaling Systems

Vertical Scaling
- Increase the size of the instance
  - (RAM, CPU, etc.)

scaling-out-airflow-image6

Adds more power to an existing worker
Gives individual tasks more horsepower
Gets very expensive, very quickly
If vertical scaling reaches a threshold, think about delegating work to a dedicated distributed processing engine, e.g., Spark, Dask, or Ray

Horizontal Scaling
- Increase the number of instances
  - Adds more nodes to the cluster
  - Increases the maximum number of tasks and DAGRuns that the system can handle
  - Fits Airflow’s orchestration model
  - The Celery and Kubernetes executors are well designed for horizontal scaling

Scaling Airflow as a Distributed Platform

6. Scaling with CeleryExecutor

https://airflow.apache.org/docs/apache-airflow/stable/executor/celery.html

Allows for easy horizontal scaling
Runs worker processes that execute TaskInstances
To scale, add a new worker process
- Can be on a new node or an existing node
- Workers connect to the rest of Airflow through the Celery broker and the metadata database
- $AIRFLOW_HOME must look identical on every worker node
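
Since every worker needs the same configuration, one way to keep nodes aligned is to set options through environment variables, which Airflow reads in the form `AIRFLOW__{SECTION}__{KEY}`. The sketch below builds such a mapping for a Celery worker. The helper name `airflow_env` and all hostnames and credentials are placeholders made up for illustration, not values from the webinar.

```python
def airflow_env(settings):
    """Turn {(section, key): value} into AIRFLOW__SECTION__KEY variables."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for (section, key), value in settings.items()
    }


# Placeholder endpoints: replace with your own broker and result backend.
celery_worker_settings = {
    ("core", "executor"): "CeleryExecutor",
    ("celery", "broker_url"): "redis://redis.example.internal:6379/0",
    ("celery", "result_backend"): "db+postgresql://airflow:***@db.example.internal/airflow",
    ("celery", "worker_concurrency"): 16,
}

worker_env = airflow_env(celery_worker_settings)
for name, value in sorted(worker_env.items()):
    print(f"{name}={value}")
```

Exporting the same variables on each node before starting a worker (in Airflow 2, with `airflow celery worker`) is one way to keep new workers consistent with existing ones.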

7. Scaling with KubernetesExecutor

https://airflow.apache.org/docs/apache-airflow/stable/executor/kubernetes.html

TaskInstances run on K8s pods

TaskInstance pods are ephemeral
Each task gets its own pod
No worker processes

scaling-out-airflow-image8

Parameter Tuning when Scaling Airflow

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html

parallelism

scaling-out-airflow-image4

max_active_runs_per_dag

scaling-out-airflow-image9

max_active_tasks_per_dag

scaling-out-airflow-image10

worker_concurrency

scaling-out-airflow-image2

Pool size

Each of these parameters caps how many tasks can run at the same time, and the most restrictive one determines the actual limit.
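
As a rough illustration of how these settings interact, the sketch below treats each one as an independent ceiling on the number of tasks a single DAG can run at once and reports which ceiling is the tightest. It is a simplified model, not Airflow's scheduling logic: it assumes a CeleryExecutor deployment, ignores `max_active_runs_per_dag` (which caps concurrent DAG runs rather than tasks), and uses illustrative numbers.

```python
def max_concurrent_tasks(parallelism, num_workers, worker_concurrency,
                         max_active_tasks_per_dag, pool_slots):
    """Return (limit, bottleneck) for one DAG under a tightest-limit-wins model."""
    limits = {
        "parallelism": parallelism,
        "worker capacity": num_workers * worker_concurrency,
        "max_active_tasks_per_dag": max_active_tasks_per_dag,
        "pool size": pool_slots,
    }
    # The smallest ceiling is the one that actually constrains the DAG.
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


limit, bottleneck = max_concurrent_tasks(
    parallelism=32,
    num_workers=2,
    worker_concurrency=16,
    max_active_tasks_per_dag=16,
    pool_slots=128,
)
print(f"At most {limit} tasks at once, limited by {bottleneck}")
```

In this example, adding workers would not help, because the per-DAG limit is reached first. Identifying the tightest limit before adding capacity avoids paying for nodes that stay idle.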

8. Sizing Pools

Pools are another way to limit the number of tasks that can run at once
Can be used to keep heavy DAGs with many tasks from hogging executor slots
Limit the number of running and queued tasks (active tasks that are under the control of the executor)
Group tasks together to limit the number of active instances per group
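
The slot accounting behind pools can be sketched as a toy model. This is not Airflow's scheduler code: the function `admit_tasks`, the pool name `heavy_etl`, and the pool sizes are hypothetical and exist only to show how a small pool keeps a heavy DAG from taking every slot.

```python
from collections import Counter


def admit_tasks(candidates, pool_sizes, occupied=None):
    """Return the task ids that can start without exceeding any pool's slots.

    candidates: list of (task_id, pool_name, slots_needed), in priority order.
    pool_sizes: {pool_name: total_slots}.
    occupied:   {pool_name: slots already held by running or queued tasks}.
    """
    used = Counter(occupied or {})
    admitted = []
    for task_id, pool, slots in candidates:
        if used[pool] + slots <= pool_sizes[pool]:
            used[pool] += slots
            admitted.append(task_id)
    return admitted


pool_sizes = {"default_pool": 4, "heavy_etl": 2}
candidates = [
    ("heavy_1", "heavy_etl", 1),
    ("heavy_2", "heavy_etl", 1),
    ("heavy_3", "heavy_etl", 1),  # must wait: heavy_etl has only 2 slots
    ("light_1", "default_pool", 1),
]
print(admit_tasks(candidates, pool_sizes))
```

In a real DAG, a task is assigned to a pool through the operator's `pool` argument, and `pool_slots` sets how many slots it occupies.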

High Availability Airflow Components

Other Airflow Components can Scale!

Scheduler

Airflow allows for multiple schedulers
Increases the number of tasks that can be scheduled
Gives the scheduling platform more stability

Web Server

Multiple web servers increase the load capacity of the Web UI

Logging on Distributed Airflow

9. Traits of a Great Logging System

Aggregated
Historical
Indexed
Searchable

scaling-out-airflow-image7

10. Importance of Good Logging Practices

85% of problems have clues for a solution in the scheduler logs
If there are multiple schedulers, it is important to collect all of their logs in one place, since those components work together in the same deployment
A searchable history of the logs is needed to diagnose problems and work out solutions
The first step in debugging is to correlate the timestamps of problems in the task logs with the scheduler logs from the same time
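
The timestamp correlation step can be sketched as a small filter over aggregated scheduler logs. The log lines below are invented sample data, and the leading `YYYY-MM-DD HH:MM:SS` timestamp format is an assumption. Adjust `TS_FORMAT` and the slice length to match what your logging system actually emits.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of the leading timestamp


def scheduler_lines_near(scheduler_lines, event_time, window_seconds=60):
    """Return scheduler log lines stamped within window_seconds of event_time."""
    window = timedelta(seconds=window_seconds)
    nearby = []
    for line in scheduler_lines:
        try:
            stamp = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            continue  # skip lines without a leading timestamp, e.g. tracebacks
        if abs(stamp - event_time) <= window:
            nearby.append(line)
    return nearby


# Made-up sample lines, not real Airflow output.
scheduler_log = [
    "2024-01-01 11:58:00 INFO scheduler heartbeat",
    "2024-01-01 12:00:30 WARNING executor reports task failed",
    "    continuation line without timestamp",
    "2024-01-01 12:05:00 INFO scheduler heartbeat",
]
task_failed_at = datetime(2024, 1, 1, 12, 0, 0)
for line in scheduler_lines_near(scheduler_log, task_failed_at):
    print(line)
```

With several schedulers, the same filter would run over the merged logs of all of them, which is why aggregation comes first.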

11. Debugging Distributed Airflow

90% of problems have clues for a solution in the scheduler logs
8% of problems are resource consumption issues
- Out of memory
- Limited CPU cycles for tasks
The other 2% is the hard part

How to Scale your Deployment - Demo
