July 13, 2023

Best Practices for Building an Airflow Service (Part 1)

Viraj Parekh, Co-Founder, VP of Sales Engineering, Astronomer

Apache Airflow® has been the standard for open source data engineering for a while. Since its start in 2014, Airflow has accumulated over 30K stars and 2.5K contributors across nine years. That longevity, combined with the explosion of responsibilities data teams have taken on over the same period, has caused Airflow’s footprint within an org to grow. Companies like Lyft, Coinbase, Airbnb, BMG, Adyen, and many others have built it into the core of their business. The need to deliver data on time, every time, across multiple systems naturally spans business units, and as a result many companies that started running Airflow for a handful of specific use cases now need to provide Airflow as a service across multiple teams. This has coincided with the growth of data teams and of the tools they use to be productive.

Oftentimes, this has resulted in:

Teams running a monolithic Airflow environment
Teams running multiple Airflows across business units

From what I’ve seen, the second approach is much more scalable than the first. So, if you’re thinking about the right Airflow architecture for several teams, or you’ve found yourself in an org where there’s “Airflow sprawl,” here are some things to consider when architecting your “Airflow as a service.”

Location Still Matters

“Here is my source code, run it on the cloud for me. I don’t care how” – Onsi Fakhouri.

As great as it would be if this were the case, it’s only a reality for a narrow set of use cases. The “where” of running code matters, and often determines how it is written. Even though it’s been said that “it’s best to use Airflow as just an orchestrator, and not do any computation directly in the Airflow worker,” that’s just not the way a lot of users want to think, as evidenced by the adoption of the TaskFlow API. Moreover, it doesn’t always make sense for a given workload.

For orgs that need to support multiple teams using Airflow, there are often use cases that have very different compute needs between teams:

Team A may be charged with processing large files from an on-prem source into the cloud
Team B might only use dbt and Snowflake to model data for business critical dashboards
Team C might be training models before serving them for real time consumption
Team D might be doing experimentation that pulls data from different sources before any regular model training or serving

There are a lot of nuances around how workloads for each of these use cases will behave. For example:

Team A above might have to deal with drivers for HDFS/Spark given they’re processing data on prem. Balancing these requirements with whatever OS-level dependencies are also needed across multiple teams is not a fun use of time.
Team D will need nodes that have GPU support to run their experiments. These nodes would be overkill for Team B. There’s always the option of spinning up containers for every task and assigning them to specific node groups, but the extra spin-up time means Team B might not find it worth it to spin up a whole container to run a simple query.

Service providers have to think about all of this and more, such as passing (and cleaning up) data between tasks and balancing Python packages, to name a few. This leads to decisions around ephemeral storage, persistent volumes, and other infrastructure-level abstractions that data practitioners are generally not familiar with.

Airflow’s longevity has resulted in interfaces for all of these, ones that are often expressed implicitly and explicitly in the DAG code:

You can pass executor_config with the Kubernetes Executor for full Kubernetes API access
Users on the Celery executor will often make assumptions about where files land on workers and how much disk space is available
DAGs will be written assuming a scalable XCom backend in a particular service to handle that intermediary data
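One way to make the disk-space assumption from the list above explicit is to check it in the task code before writing anything. The following is a minimal sketch using only the Python standard library; the helper names and the 500 MiB threshold are hypothetical, and a real task would pick a scratch directory that matches how its workers are provisioned.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical threshold; tune it to the files your tasks actually write.
MIN_FREE_BYTES = 500 * 1024 * 1024  # 500 MiB


def require_free_space(directory: str, min_free_bytes: int = MIN_FREE_BYTES) -> int:
    """Return the free bytes in `directory`, or raise if below the threshold."""
    free = shutil.disk_usage(directory).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"only {free} bytes free in {directory}, need {min_free_bytes}"
        )
    return free


def stage_file(payload: bytes, min_free_bytes: int = MIN_FREE_BYTES) -> Path:
    """Write `payload` to the worker's temp directory after checking disk space."""
    scratch = tempfile.gettempdir()
    require_free_space(scratch, min_free_bytes)
    # delete=False keeps the file around for a downstream step to pick up.
    with tempfile.NamedTemporaryFile(dir=scratch, suffix=".part", delete=False) as fh:
        fh.write(payload)
    return Path(fh.name)
```

Failing fast like this turns a silent assumption about the worker into an error message that the platform team can act on.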

This is both implicit and explicit because although some of this exists in the ways the DAGs themselves are written, most of it is through interfaces that are already available in Airflow:

Airflow executors: certain executors are a better fit for some use cases vs others
Providers: Community-hardened interfaces to tools like Snowflake, SageMaker, etc.
Kubernetes Interfaces: Both with the KubernetesExecutor and the KubernetesPodOperator
Python Virtual Envs: A more lightweight way to avoid being slowed down by Python dependency issues

“Airflow as a service” providers should focus on giving users access and control around these things in a way that best fits their use case. There’s no one size fits all use case, but

Consider Horizontal and Vertical Scale

Scaling an Airflow service usually involves two axes: horizontal scale, in terms of the number of teams and use cases supported, and vertical scale, in terms of the number of DAGs, workers, and Airflow components.

Teams often start by scaling a monolithic environment that’s used by one team. Airflow can scale vertically relatively easily by adding schedulers, workers, and a larger database. This can be a good way to start to bring on additional teams, but there’s a limited ceiling of efficacy. As soon as additional teams come on board, a monolithic Airflow deployment needs to constantly balance the needs of several different sets of users and use cases. To ensure security, the data infrastructure team will often become a bottleneck when dealing with things like Connections, Variables, and other environment level details. Currently, there’s also limited support for controlling the scope of which Connections, Variables, and such a given DAG can access.

Teams have gotten around a lot of these limitations in the past by abstracting Airflow away entirely, including the UI. This can be successful, as it lowers the learning curve for the data practitioner writing DAGs, but it’ll also limit the Airflow features that are exposed. Moreover, it’ll introduce handoff processes for the Day 2 concerns of a given DAG.

Lastly, monolithic environments will make it harder to upgrade to later versions of Airflow, which slows productivity of teams who have a use for newer features. Maintenance windows need to be coordinated between business units, and CI/CD systems will need to test for edge cases across a much larger set of use cases.

A much better approach is to provide a different Airflow environment per team, with corresponding dev/stage/prod environments. This lets each team operate in a way that fits their own needs for everything from Python environments and packages to upgrade schedules (and infrastructure needs; see the section above). Additionally, this often involves less work for core data platform teams, as it gives users access to Airflow’s full gamut of features.
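To make the “environment per team, per stage” layout concrete, here is a small sketch of how a platform team might enumerate the deployments it has to manage. Everything in it is an assumption for illustration: the team names, the Airflow versions, the executor choices, and the `Deployment` class itself are hypothetical and not part of Airflow.

```python
from dataclasses import dataclass
from itertools import product

STAGES = ("dev", "stage", "prod")

# Hypothetical per-team settings; versions and executors are placeholders.
TEAM_SETTINGS = {
    "team_a": {"airflow_version": "2.6.3", "executor": "CeleryExecutor"},
    "team_b": {"airflow_version": "2.5.3", "executor": "KubernetesExecutor"},
}


@dataclass(frozen=True)
class Deployment:
    """One Airflow environment owned by a single team."""

    team: str
    stage: str
    airflow_version: str
    executor: str

    @property
    def name(self) -> str:
        return f"{self.team}-{self.stage}"


def build_deployments(settings: dict, stages: tuple = STAGES) -> list[Deployment]:
    """Expand per-team settings into one deployment per (team, stage) pair."""
    return [
        Deployment(team=team, stage=stage, **settings[team])
        for team, stage in product(settings, stages)
    ]
```

Because each team carries its own settings, one team can move to a newer Airflow version without waiting on a maintenance window shared with everyone else.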

SDLC Matters

At the end of the day, Airflow is a “pipelines as code” tool. As such, data pipelines need a developer experience and a path to production the same way any other piece of code does. Like the infrastructure section above, there’s no universally “correct” way to do CI/CD with data pipelines. However, there are universal principles:

As with any other software, data practitioners need to be able to develop on their laptops
All promotion between dev, stage, and prod environments should be handled from source control
Credentials for connecting to data sources should exist in the environment, not in the DAG file. Secrets managers and cloud-native identity solutions are almost a must-have.
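As an illustration of keeping credentials in the environment, Airflow can read a connection from an environment variable named `AIRFLOW_CONN_` followed by the upper-cased connection ID, with the value given as a URI. The sketch below uses only the standard library to show that naming convention and what such a URI carries. The connection ID `warehouse_db`, the URI, and both helper functions are hypothetical; in a real DAG, hooks and operators resolve the connection through Airflow itself, so the DAG file only ever mentions the connection ID.

```python
import os
from urllib.parse import unquote, urlsplit


def conn_env_var(conn_id: str) -> str:
    """Name of the environment variable Airflow checks for `conn_id`."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


def describe_connection(conn_id: str) -> dict:
    """Parse the URI stored in the environment, leaving the password out."""
    var_name = conn_env_var(conn_id)
    uri = os.environ.get(var_name)
    if uri is None:
        raise KeyError(f"{var_name} is not set in this environment")
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "login": unquote(parts.username) if parts.username else None,
        "schema": parts.path.lstrip("/") or None,
    }


if __name__ == "__main__":
    # Normally the deployment (or a secrets backend) sets this, not the code.
    os.environ.setdefault(
        "AIRFLOW_CONN_WAREHOUSE_DB",
        "postgres://etl_user:example-password@db.internal:5432/analytics",
    )
    print(describe_connection("warehouse_db"))
```

The same DAG file can then be promoted from dev to prod unchanged, since each environment supplies its own value for the variable.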

The more complex part of this is testing DAGs. The new dag.test() command has made this considerably easier, but it’s not a complete solution. There’s still work to be done on finding the best way to test a DAG in terms of the underlying data it is moving. There’s a lot to say here, but the folks at Etsy are going to be presenting a great solution at Airflow Summit on this topic!

Focus on Providing Interfaces (This includes new Airflow features!!)

In addition to core reliability and scalability, data platform teams should focus on providing interfaces that make users more productive. Some argue these interfaces should be most of what data engineers as a whole focus on. When Airflow is first brought into a team, there are usually users who are comfortable with Python and can use Airflow’s operators and providers to write DAGs. As the set of users and use cases grows, it’s important to add interfaces that meet each of these users where they are.

Once again, there’s a lot inside of Airflow OSS that helps with this, particularly the TaskFlow API and decorators. However, simply adding a more Pythonic interface isn’t always enough. Many teams need declarative approaches that abstract away knowledge of Airflow, and sometimes even Python. Or perhaps they need something specifically for data science/machine learning workloads.
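To give a feel for what a declarative approach can look like, here is a minimal sketch in which a team describes a pipeline as plain data and the platform validates it and works out a run order. The spec format, the pipeline and task names, and the `plan_tasks` helper are all hypothetical and are not taken from any existing framework. A real framework would go on to map each entry to an Airflow operator, which is omitted here.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical declarative spec, e.g. loaded from a YAML file that a team owns.
PIPELINE_SPEC = {
    "pipeline_id": "daily_orders",
    "tasks": {
        "extract": {"depends_on": []},
        "transform": {"depends_on": ["extract"]},
        "load": {"depends_on": ["transform"]},
        "report": {"depends_on": ["load"]},
    },
}


def plan_tasks(spec: dict) -> list[str]:
    """Validate the spec and return task names in a valid execution order."""
    tasks = spec["tasks"]
    # Reject references to tasks that the spec never defines.
    for name, cfg in tasks.items():
        for upstream in cfg.get("depends_on", []):
            if upstream not in tasks:
                raise ValueError(f"{name} depends on unknown task {upstream}")
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(cfg.get("depends_on", [])) for name, cfg in tasks.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"cyclic dependency: {exc.args[1]}") from exc
```

The person writing the spec never imports Airflow, while the platform team keeps a single place to enforce conventions and to adopt new Airflow features.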

To keep developers productive, data platform teams should look to serve users opinionated DAG writing frameworks. Although none exists in core Airflow, there are several in the community that can be adopted.

These libraries are great not only because they can be mixed and matched (it’s not an all or nothing approach), but also because they often take advantage of new features within core Apache Airflow®.

Cosmos automatically infers the datasets used in your dbt models
The AstroDatabricks Provider relies on TaskGroups to let users take advantage of cheaper compute clusters on Databricks
AstroSDK takes full advantage of the TaskFlow API, core providers, and dataset objects
Tools like Metaflow and LineaPy allow users to directly take data science workloads and export them as Airflow DAGs

More can be found on the Airflow Ecosystem page. This is especially powerful when combined with the “multiple Airflows” approach, as each team can pick and choose the interface it needs.

TLDR

If you’re a data infrastructure team tasked with providing an Airflow service, there are a lot of things you’ll have to consider. As you work through the specifics of your use case, some best practices to take into account boil down to:

Focus on providing access to all of Airflow’s features – teams will need them.
Keep Airflow up to date for performance and interfaces.
Don’t run a monolithic environment. Let teams operate independently of one another.
Focus on providing a strong SDLC for DAGs.
Provide interfaces that expand the personas that have access, but try to do so through a pre-existing project. Otherwise, be mindful of what you’ll need to maintain.

Stay tuned for Part 2, where we’ll talk more about testing, dependencies between teams, and deploying DAGs.