Best Practices for Building an Airflow Service (Part 1)

  • Viraj Parekh

Apache Airflow has been the open-source standard for data engineering for a while now. Since the project started in 2014, it has accumulated over 30K GitHub stars and 2.5K contributors. Airflow's longevity, combined with the explosion of responsibilities data teams have taken on over that same period, has caused Airflow's footprint within organizations to grow. Companies like Lyft, Coinbase, Airbnb, BMG, Adyen, and many others have built it into the core of their business. The need to deliver data on time, every time, across multiple systems naturally spans business units, and as a result many companies that started running Airflow for a handful of specific use cases have grown into needing to provide Airflow as a service across multiple teams. This has coincided with the growth of data teams and the tools they use to be productive.

Oftentimes, this has resulted in one of two approaches:

• Airflow sprawl: each team stands up and maintains its own Airflow deployment, with little shared tooling or standards.
• Airflow as a service: a central data platform team runs and manages Airflow environments on behalf of the rest of the organization.

From what I've seen, the second approach is much more scalable than the first. So, if you're thinking about the right architecture for Airflow across several teams, or you've found yourself in an org where there's "Airflow sprawl," here are some things to consider when architecting your "Airflow as a service."

Location Still Matters

“Here is my source code, run it on the cloud for me. I don’t care how” – Onsi Fakhouri.

As great as it would be if this were the case, it's only a reality for a narrow set of use cases. The "where" of running code matters, and often determines how the code is written. Even though it's been said that "it's best to use Airflow as just an orchestrator, and not do any computation directly in the Airflow worker," that's just not the way a lot of users want to think, as evidenced by the adoption of the TaskFlow API. Moreover, keeping computation out of the worker doesn't always make sense for a given workload.
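As an illustration, here's a minimal sketch (assuming Airflow 2.4+ and the TaskFlow API) of the pattern many users reach for: doing the computation directly in the task, on the Airflow worker, rather than pushing it down to an external system. The DAG and task names are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task


# Illustrative DAG: the "transform" step runs on the Airflow worker itself.
@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def in_worker_compute():
    @task
    def extract():
        # In practice this might pull rows from an API or a warehouse.
        return [1, 2, 3, 4, 5]

    @task
    def transform(rows):
        # The aggregation happens directly in the worker process.
        return sum(rows)

    transform(extract())


in_worker_compute()
```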

For orgs that need to support multiple teams using Airflow, there are often use cases whose compute needs differ drastically from team to team:

There are a lot of nuances around how workloads for each of these use cases will behave. For example:

Service providers have to think about all of this and more: passing (and cleaning up) data between tasks and juggling conflicting Python packages, to name a few things. These concerns lead to decisions around ephemeral storage, persistent volumes, and other infrastructure-level abstractions that data practitioners are generally not familiar with.

Airflow's longevity means it already has interfaces for all of these, ones that are expressed both implicitly and explicitly in DAG code.

This is both implicit and explicit because, although some of it comes through in the way the DAGs themselves are written, most of it is exposed through interfaces that are already available in Airflow:

• Executors (Celery, Kubernetes, and others) that determine where and how tasks actually run
• Queues and pools for routing tasks to the right workers and capping concurrency against shared resources
• Per-task settings like executor_config for requesting specific CPU, memory, or storage

"Airflow as a service" providers should focus on giving users access and control over these things in a way that best fits their use case. There's no one-size-fits-all configuration, but exposing these interfaces lets each team shape Airflow around its own workloads, as the sketch below shows.
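Here's a minimal sketch of how those needs can surface explicitly in DAG code, assuming a reasonably recent Airflow 2.x with a Kubernetes-based executor and the Kubernetes client installed; the pool name and resource values are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def transform():
    ...  # CPU-heavy work that runs in the task's pod itself


with DAG(
    "resource_aware_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=transform,
        pool="heavy_compute",  # caps concurrency against a shared resource
        executor_config={
            # Per-task resource requests, honored by the Kubernetes executor.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "4Gi"}
                            ),
                        )
                    ]
                )
            )
        },
    )
```

With the Celery executor, the analogous knob is the task-level queue parameter, which routes work to a specific set of workers.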

Consider Horizontal and Vertical Scale

Scaling an Airflow service usually involves two axes: horizontal scale, in terms of the number of teams and use cases supported, and vertical scale, in terms of the number of DAGs, workers, and Airflow components.

Teams often start by scaling a monolithic environment that's used by one team. Airflow can scale vertically relatively easily by adding schedulers and workers and using a larger database. This can be a good way to start bringing on additional teams, but the ceiling is limited. As soon as additional teams come on board, a monolithic Airflow deployment has to constantly balance the needs of several different sets of users and use cases. To ensure security, the data infrastructure team often becomes a bottleneck when dealing with things like Connections, Variables, and other environment-level details. Currently, there's also limited support for scoping which Connections, Variables, and other resources a given DAG can access.
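A minimal sketch of why this becomes a bottleneck: Connections and Variables are environment-level, so every DAG in a monolithic deployment can read them. The connection ID and Variable name below are illustrative, and the example assumes the Postgres provider is installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def export_report():
    # Any DAG in this Airflow environment can read the same Connection...
    hook = PostgresHook(postgres_conn_id="warehouse_conn")
    rows = hook.get_records("SELECT count(*) FROM orders")
    # ...and the same Variable; there is no per-team scoping out of the box.
    bucket = Variable.get("reporting_bucket")
    print(f"would write {rows[0][0]} rows to {bucket}")


with DAG(
    "shared_resources_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="export_report", python_callable=export_report)
```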

Teams have gotten around many of these limitations in the past by abstracting Airflow away entirely, including the UI. This can be successful, as it lowers the learning curve for the data practitioners writing DAGs, but it also limits the Airflow features that get exposed and introduces handoff processes for the Day 2 concerns of a given DAG.

Lastly, monolithic environments make it harder to upgrade to later versions of Airflow, which slows down teams that have a use for newer features. Maintenance windows need to be coordinated between business units, and CI/CD systems need to test for edge cases across a much larger set of use cases.

A much better approach is to provide a separate Airflow environment per team, with corresponding dev/stage/prod environments. This lets each team operate in a way that fits its own needs for everything from Python environments and packages to upgrade schedules (and infrastructure needs; see the section above). It also often means less work for the core data platform team, since users get access to Airflow's full gamut of features.

SDLC Matters

At the end of the day, Airflow is a "pipelines as code" tool. As such, data pipelines need a developer experience and a path to production the same way any other piece of code does. As with the infrastructure discussion above, there's no universally "correct" way to do CI/CD with data pipelines. However, there are universal principles:

• Pipelines live in version control, like any other code.
• Changes are peer reviewed before they reach production.
• DAGs are tested, at a minimum for import errors, before they're deployed.
• Deployment to each environment is automated rather than done by hand.
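For the testing principle, a common starting point is a DAG integrity test that runs in CI. Here's a minimal sketch using pytest, assuming DAG files live under a dags/ folder; the owner check is just an example of a team-specific policy.

```python
from airflow.models import DagBag


def test_dags_import_cleanly():
    # Parses every DAG file and fails if any raise an error on import.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_every_dag_has_an_owner():
    # Example policy check: each DAG declares an owner in its default_args.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} is missing an owner"
```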

The more complex part of this is testing DAGs. The dag.test() method, added in Airflow 2.5, has made this considerably easier, but it's not a complete solution. There's still work to be done on finding the best way to test a DAG in terms of the underlying data it moves. There's a lot to say here, but the folks at Etsy will be presenting a great approach to this topic at Airflow Summit!
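Here's a minimal sketch of local testing with dag.test(), assuming Airflow 2.5 or later; the DAG itself is illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def my_pipeline():
    @task
    def say_hello():
        print("hello from a locally tested task")

    say_hello()


pipeline = my_pipeline()

if __name__ == "__main__":
    # Runs every task in a single Python process, with no scheduler or
    # webserver, so breakpoints and print statements behave like any script.
    pipeline.test()
```

This makes it easy to iterate on a DAG from an IDE or run it as a smoke test in CI, but it says nothing about whether the data the DAG produces is correct.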

Focus on Providing Interfaces (This includes new Airflow features!)

In addition to core reliability and scalability, data platform teams should focus on providing interfaces that make users more productive. Some argue these interfaces should be most of what data engineers as a whole focus on. When Airflow is first brought into a team, there are usually users who are comfortable with Python and can use Airflow's operators and providers to write DAGs. As the set of users and use cases grows, it's important to add interfaces that meet each of these users where they are.

Once again, there's a lot inside Airflow OSS that helps with this, particularly the TaskFlow API and decorators. However, simply adding a more Pythonic interface isn't always enough. Many teams need declarative approaches that abstract knowledge of Airflow, and sometimes even Python, away. Or perhaps they need something built specifically for data science and machine learning workloads.

To keep developers productive, data platform teams should look to serve users opinionated DAG writing frameworks. Although none exists in core Airflow, there are several in the community that can be adopted.

These libraries are great not only because they can be mixed and matched (it's not an all-or-nothing approach), but also because they often take advantage of new features within core Apache Airflow.

More can be found on the Airflow Ecosystem page. This is especially powerful when combined with the "multiple Airflows" approach, as it lets each team pick and choose the interface they need.
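To make the idea concrete, here's a hypothetical sketch of the kind of opinionated, declarative interface such a framework might expose; build_elt_dag and its config schema are invented for illustration, not the API of any real library.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def build_elt_dag(config):
    """Turn a small declarative config into a DAG, hiding Airflow details."""
    with DAG(
        dag_id=config["name"],
        schedule=config.get("schedule", "@daily"),
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        previous = None
        for step in config["steps"]:
            current = BashOperator(task_id=step["id"], bash_command=step["command"])
            if previous is not None:
                previous >> current  # chain steps in the order they're declared
            previous = current
    return dag


# A data practitioner only touches this config, never Airflow primitives.
dag = build_elt_dag(
    {
        "name": "daily_orders",
        "steps": [
            {"id": "extract", "command": "python extract_orders.py"},
            {"id": "load", "command": "python load_orders.py"},
        ],
    }
)
```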

TLDR

If you're a data infrastructure team tasked with providing an Airflow service, there's a lot you'll have to consider. As you work through the specifics of your use case, the best practices above boil down to:

• Location still matters: give teams access and control over where and how their tasks run.
• Consider horizontal and vertical scale: separate environments per team scale better than one monolith for everyone.
• SDLC matters: data pipelines need version control, review, testing, and an automated path to production like any other code.
• Focus on providing interfaces: meet users where they are, from full Python to declarative frameworks.

Stay tuned for Part II, where we'll talk more about testing, dependencies between teams, and deploying DAGs.
