Leveraging Apache Airflow and Kubernetes for Data Processing

  • Constance Martineau

In the fast-paced world of data-driven decision-making, efficient data pipelines are essential for modern businesses. Apache Airflow has emerged as a popular workflow orchestration tool that enables the seamless management of data pipelines.

As an orchestrator, Airflow acts as a central control hub, empowering users to define, schedule, and monitor complex workflows. Although Airflow was designed as an orchestration tool, combining it with Kubernetes, a popular container orchestration system, lets Airflow reach its full potential, providing a robust and scalable task execution framework for data processing. If you want to use Airflow to run a Python function that uses pandas or polars to load and transform data in-memory, you now can!

In this post, we’ll explore how to use Apache Airflow with the Kubernetes Executor as a data processing tool, and share some best practices and tips for scaling your workloads.

Introduction to Apache Airflow and Kubernetes Executor

Apache Airflow allows you to define complex computational workflows as code, making them more maintainable, versionable, testable, and collaborative. The Kubernetes Executor is an Airflow feature that allows you to execute multiple Airflow tasks in parallel, each within its own isolated environment, using Kubernetes Pods.
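To make that concrete, here is a minimal sketch of such a workflow: a DAG whose single task loads and transforms data in-memory with pandas. It assumes Airflow 2.4+ (for the schedule argument), pandas installed in the worker image, and a hypothetical input file:

```python
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def in_memory_transform():
    @task
    def summarize_events():
        # With the Kubernetes Executor, this function runs in its own pod,
        # so the memory pandas needs can be requested for this task alone.
        df = pd.read_csv("/data/events.csv")  # hypothetical input file
        print(df.groupby("event_type").size())

    summarize_events()


in_memory_transform()
```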

The Kubernetes Executor offers several advantages, including:

1. Resource isolation and task-level configurations:

Each task runs in an isolated environment, reducing interference between tasks and allowing resources to be specified at the individual task level. This isolation and flexibility enhance security and stability, which are vital for complex workflows.
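As a sketch of what task-level configuration looks like, the executor_config below overrides the pod for a single task, pinning it to a custom container image. The image name is hypothetical, and "base" is the name Airflow gives the task container:

```python
from airflow.decorators import task
from kubernetes.client import models as k8s

# Override the pod for this task only; the rest of the DAG keeps the
# default image. The image name is illustrative.
custom_image_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",
                    image="my-registry/airflow-with-ml-deps:latest",
                )
            ]
        )
    )
}


@task(executor_config=custom_image_config)
def train():
    ...
```

Because each task gets its own pod, one task can run on a heavy image with ML dependencies while its neighbors stay lightweight.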

2. Cost and resource efficiency:

With the Kubernetes Executor, your web server and scheduler costs remain fixed. Compared to the Celery executor, the dynamic scaling of task instance pods allows you to shed the fixed cost of keeping a Celery worker up 24 hours a day.

On Astro, we scale down Celery workers to zero when they are not actively running tasks, which helps keep costs down. However, the overhead of running Celery worker sub-processes, combined with how scale-down works, makes the Kubernetes Executor the more cost-efficient choice. We’ll explore that more in another blog post!

3. No interruption to running tasks:

Some tasks, like machine learning tasks, are expensive to restart. Because task instance pods are not managed by a controller, the Kubernetes Cluster Autoscaler considers the nodes running them unsafe to evict, meaning you can reliably run workloads of 24 hours or longer.
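If you want to make that protection explicit, or your autoscaler is configured to evict bare pods, you can annotate a long-running task’s pod yourself. A minimal sketch, assuming the standard Cluster Autoscaler annotation and a hypothetical training task:

```python
from airflow.decorators import task
from kubernetes.client import models as k8s

# Ask the Cluster Autoscaler not to evict the node this task's pod
# runs on. Task instance pods are already unmanaged ("bare") pods,
# so this simply makes the intent explicit.
keep_node_config = {
    "pod_override": k8s.V1Pod(
        metadata=k8s.V1ObjectMeta(
            annotations={"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}
        )
    )
}


@task(executor_config=keep_node_config)
def long_training_job():
    ...
```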

4. Cluster Orchestration:

With the Kubernetes Executor, Airflow can distribute tasks across a Kubernetes cluster, enabling parallel execution and faster completion of data pipelines.

While the Kubernetes Executor offers numerous benefits, there are important factors to consider before adopting it for scaling workflows, especially when self-hosting:

1. Task execution time:

Each task incurs a few seconds of overhead while its pod spins up, which can noticeably slow down tasks that are meant to finish particularly quickly.

2. Kubernetes familiarity and infrastructure readiness:

To leverage the Kubernetes Executor effectively, especially when self-hosting, you need a solid understanding of Kubernetes and a cluster that is ready to handle Airflow tasks efficiently. Kubernetes-based Airflow deployments may introduce additional operational complexity compared to simpler executors, so evaluate your available resources and assess whether they align with your workflow demands.

On Astro, we manage Kubernetes on your behalf and handle infrastructure scaling, so this isn’t much of a concern. We even highlight important details such as the maximum CPU or memory you can set for your task instance pods!

Best Practices and Implementation Tips

To ensure a successful implementation of the Kubernetes Executor for scaling your data pipelines, consider the following best practices:

1. Optimize the Base Airflow Image:

Remove unnecessary dependencies from your Docker images to speed up container deployment and reduce resource usage.

2. Set CPU and Memory Requirements for Every Airflow Task:

To prevent resource contention among Airflow tasks and ensure smooth task execution, set appropriate resource requests and limits for every Airflow task, as sketched below. On Astro, we help by letting you set a default pod size in the UI, so you only need to worry about setting exceptions within the DAG.
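A minimal sketch of setting requests and limits for one task through executor_config; the values are illustrative and should be sized to what the task actually needs:

```python
from airflow.decorators import task
from kubernetes.client import models as k8s

resource_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the task container in the worker pod
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "1Gi"},
                        limits={"cpu": "1", "memory": "2Gi"},
                    ),
                )
            ]
        )
    )
}


@task(executor_config=resource_config)
def transform():
    ...
```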

3. Tune Your Worker Pods Creation Batch Size:

The worker_pods_creation_batch_size setting (environment variable: AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE) determines how many pods can be created per scheduler loop. The default is 1 in open-source Airflow, but you’ll want to increase this number for better performance, especially if you run many tasks concurrently. The maximum workable value is determined by how much pod-creation load your Kubernetes cluster can tolerate. On Astro, we set this to 16.

4. Set Resource Quotas and Limit Ranges:

If you’re running more than one Airflow deployment, or are running other applications in your cluster, you need to make sure that Airflow deployments don’t consume all cluster resources. Setting resource quotas and limit ranges on the namespaces that contain Airflow deployments ensures that Airflow tasks don’t consume more resources than they are allowed. On Astro, we let you configure quotas in the UI! Quotas can be updated over time as your workload requirements change.
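As a sketch, the quota below caps what all pods in an Airflow namespace can request in aggregate. It uses the Python kubernetes client rather than a YAML manifest; the namespace, quota name, and numbers are all illustrative:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run in-cluster

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="airflow-task-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "8",
            "requests.memory": "32Gi",
            "limits.cpu": "16",
            "limits.memory": "64Gi",
        }
    ),
)
client.CoreV1Api().create_namespaced_resource_quota(namespace="airflow", body=quota)
```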

5. Use Cluster Autoscaler to Scale Nodes:

Use a scaling utility like Kubernetes Cluster Autoscaler to dynamically adjust the cluster size based on task demands, and get the benefit of Kubernetes’ elasticity. Some managed Kubernetes services, such as Google Kubernetes Engine (GKE), handle this on your behalf.

6. Set Up Monitoring and Logging:

Implement comprehensive monitoring and logging solutions to keep a close eye on your Kubernetes cluster’s resources. If tasks remain in a queued state but are not running, it could indicate that the number of tasks being scheduled is beyond the capacity of your Airflow infrastructure. In this case, check that there are available resources in the namespace and see if worker_pods_creation_batch_size can be increased.

Conclusion

Scaling data pipelines with Airflow’s Kubernetes Executor opens up exciting possibilities for managing large and complex workflows effectively.

By following best practices and implementing the provided tips, you can ensure a smooth and efficient scaling process. Remember to consider your infrastructure readiness and assess the trade-offs between complexity and benefits before making the leap. With a well-planned approach, Airflow’s Kubernetes Executor can be a game-changer, empowering you to leverage Airflow as an orchestrator AND as a data processing tool.

Astro, Astronomer’s developer-first, fully-managed Airflow service, supports the Kubernetes Executor. You can try out Apache Airflow with the Kubernetes Executor by signing up for a free trial of Astro.
