May 24, 2022

To Build or to Buy? DIY Orchestration with Airflow vs. A Fully Managed Service

Steve Swoyer, Senior Content Writer, Astronomer

Apache Airflow® has been a boon for organizations that need a scalable framework for scheduling and running the workflows that feed data to decision-makers, practitioners, and automated services across their business. With hundreds of built-in features and capabilities that make it easy to schedule operations across distributed databases, compute engines, and other cloud services, Airflow is in a class by itself as a workflow-scheduling platform. And with its rapidly increasing popularity — from 200,000 monthly downloads in 2019 to more than 12 million in 2022 — there is a growing community of developers equipped to work with it. So it’s not surprising that so many organizations are drawn to Airflow — and, often, to deploying and customizing their own Airflow infrastructures.

But workflow management isn’t the same thing as orchestrating your data stack from top to bottom. Data orchestration is a much more complex and ambitious undertaking that makes it easier for an organization to:

observe and understand its data pipelines;
ensure seamless, secure interoperability between — and access to — resources across the business;
use as few resources in the cloud as possible;
incur a minimum of technical debt;
attract and retain top engineers, and keep them focused on their core competencies.

Anyone who has tried to put Airflow to use as a full-fledged data orchestration solution knows that, as great as it is, Airflow on its own can be tricky to scale, difficult to manage, and hard to maintain, and that it lacks a number of features essential to effective orchestration.

As we see it, building and maintaining your own Airflow infrastructure — or relying on an infrastructure-only cloud service as a basis for orchestration — is harder and more costly than most people realize. And yet plenty of organizations choose these approaches. At Astronomer, we talk to hundreds of organizations every month that are using or looking at using Airflow, so we hear a lot of the issues people consider when thinking about building rather than buying.

Below are some of the things we hear — along with our own takes on each.

Myths and misconceptions about DIY orchestration:

Free OSS Airflow is enough for orchestration.

1. Open source Airflow is free, and organizations can get what they need from it.

Successful OSS projects produce high-quality software at no upfront cost, and are complemented by dedicated forums, mailing lists, and other public resources. On this basis, you might think OSS Apache Airflow® alone could be enough to address all of your needs.

Our take: As noted above, an Airflow infrastructure is a lot to manage. In addition to Airflow itself, the most common Airflow deployment pattern consists of a web server, a Postgres RDBMS, and a Docker registry. However, because Airflow’s shared-everything architecture complicates the task of supporting large and/or diverse deployments, the best way to scale it in production is by creating containerized images that either run locally (in Airflow itself) or are deployed via external executors, such as Celery and Kubernetes (K8s). These executors introduce additional dependencies, too: the Celery executor, for example, needs a message broker such as Redis or RabbitMQ.

And despite that complexity, this stack still doesn’t give you all the pieces you need to design a scalable and sustainable data orchestration service.

For example, basic Apache Airflow® lacks built-in integration with version-control and CI/CD software, and doesn’t offer a consistent and reproducible local development environment — meaning it doesn’t offer a secure, controlled path that DAG authors can use to push their code from dev to prod.
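As one illustration of what a controlled path from dev to prod involves, the sketch below shows a pre-merge check that a CI job might run over a folder of DAG files. It is a minimal sketch using only the Python standard library, not a feature of Airflow or Astro: the function name `find_broken_dag_files` is hypothetical, and the check only verifies that each file is syntactically valid Python rather than loading it through Airflow.

```python
import ast
from pathlib import Path


def find_broken_dag_files(dag_folder):
    """Return {file path: error message} for every .py file that fails to parse.

    Hypothetical helper for illustration; it checks Python syntax only and
    does not import Airflow or validate DAG structure.
    """
    broken = {}
    for path in sorted(Path(dag_folder).rglob("*.py")):
        try:
            # ast.parse compiles the source to a syntax tree without running it.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            broken[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return broken
```

A CI job could call this function on the repository's DAG folder and fail the build whenever the returned dictionary is non-empty, so that a file with a syntax error never reaches production.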

Another big gap is that out-of-the-box Apache Airflow® doesn’t offer the equivalent of a central control plane with a single pane of glass from which an organization can manage multiple Airflow deployments. Not only does this complicate scaling — basically, it’s easier to scale Airflow horizontally (across multiple monolithic deployments) than vertically (in a single “macro” deployment) — it makes it difficult to support decentralized deployment scenarios: deployments in which an organization custom-tailors different kinds of pre-built Airflow environments to suit the needs of its diverse internal constituencies.

Such a control plane could also collect and analyze the lineage metadata that Airflow's DAGs and tasks emit each time they run. This would make it easier for data engineers, data scientists, and other expert practitioners to observe the behavior and performance of their data pipelines and to improve the quality of their data. It would also serve as a starting point for diagnosing problems with dataflows, or refactoring them in response to emergent events.
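To make the idea of analyzing lineage metadata concrete, here is a minimal sketch in plain Python. The record shape (`task`, `inputs`, `outputs`), the dataset names, and the `downstream_datasets` helper are all hypothetical; they are not the lineage format that Airflow or Astro actually uses. The sketch shows only the basic technique: treat each task run as edges from the datasets it reads to the datasets it writes, then walk those edges to find everything affected by a problem upstream.

```python
from collections import defaultdict, deque

# Hypothetical lineage records: one per task run, listing datasets read and written.
LINEAGE_EVENTS = [
    {"task": "load_orders", "inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    {"task": "load_customers", "inputs": ["raw.customers"], "outputs": ["staging.customers"]},
    {"task": "build_sales", "inputs": ["staging.orders", "staging.customers"], "outputs": ["marts.sales"]},
    {"task": "build_report", "inputs": ["marts.sales"], "outputs": ["reports.weekly_sales"]},
]


def downstream_datasets(events, source):
    """Return every dataset that directly or indirectly depends on `source`."""
    # Build an adjacency map: input dataset -> datasets written from it.
    edges = defaultdict(set)
    for event in events:
        for input_name in event["inputs"]:
            edges[input_name].update(event["outputs"])

    # Breadth-first walk from the source dataset.
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for dependent in edges[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(downstream_datasets(LINEAGE_EVENTS, "raw.orders"))
# → ['marts.sales', 'reports.weekly_sales', 'staging.orders']
```

With records like these collected in one place, a practitioner diagnosing a bad load of `raw.orders` can see at a glance which downstream tables and reports need to be checked or rebuilt.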

Airflow on its own offers none of that. On the other hand, when you buy Astro, the fully managed orchestration service powered by Airflow, all those “missing pieces” are built in.

And while DIY users tend to run one or more Airflow versions behind (contending, as they are, with managing, securing, and maintaining a complex infrastructure), Astro keeps you current on new features, important bug fixes, and critical security patches.

A turnkey Airflow service is overkill for smaller organizations.

2. A fully managed Airflow service is overkill for an organization that isn’t using Airflow at a large scale.

Understandably, organizations that see their needs as modest sometimes associate Airflow-as-a-service with unnecessary cost and complexity.

Our take: Astro gives you capabilities, essential to data orchestration at any scale, that were not previously available to most smaller organizations — capabilities that increase developer productivity, enable observation of both the provenance and the quality of data, and make it possible to quickly and effectively respond to data outages.

A few years ago, a large enterprise might have invested millions of dollars to engineer similar capabilities, but these days, fully managed cloud services like Astro are making formerly out-of-reach features available to everyone. Salesforce did this with sales force automation. Snowflake did it with the massively parallel processing data warehouse. And Astro does it with seamless, reliable, Airflow-powered orchestration.

Also, Astro was designed to scale down as well as up; whether your existing Airflow deployment runs 1,000 DAGs or just five or ten, Astro provides a turnkey Airflow infrastructure that you do not need to manage or maintain.

Another thing to consider is that even in relatively small organizations, Airflow deployments tend to become larger and more complex as the organization identifies new use cases. Typically, an organization starts small with Airflow, with one or two internal constituencies adopting it to address a specific set of needs. But as a team or business unit has success with Airflow, other internal constituencies start clamoring to use it, too.

This tendency for deployments to grow means that an Airflow infrastructure that’s manageable with just a few DAGs can easily become unmanageable, and more difficult to secure and maintain, as usage increases. The deployment patterns that organizations find convenient for standing up Airflow often turn out not to scale, and development practices they used early on to build software for Airflow start to break down as the scale and complexity of development increase. And organizations tend to take on technical debt at a fast clip as they attempt to scale out this infrastructure.

For small organizations as well as large, Astro enables orchestration that just works — orchestration that is predictable and reliable up, down, and across the entire data stack.

A fully managed service risks vendor lock-in.

3. Working with a vendor can result in being locked into their proprietary solution, as well as the software development tools and practices for which their solution is designed.

Many organizations look to DIY solutions because they’re worried about the risk of vendor lock-in. They reason that if they build on top of pure open-source software, they can take their custom code with them whatever they do.

Our take: Vendor lock-in isn't really an issue when a commercial product or service is based on open-source software. Commercial software vendors who build businesses around OSS projects — and even choose to play leading roles in OSS communities like Apache Airflow® — benefit from and depend on upstream OSS development and are incentivized to maintain strict compatibility with these projects.

So long as your DAGs, tasks, and code are compliant with OSS Airflow, they will be portable to open-source Apache Airflow® running in any infrastructure context: the infrastructure-as-a-service (IaaS) cloud, your on-premises data center, a cloud Airflow infrastructure service like MWAA, or a fully managed Airflow service like Astro. Similarly, you can easily transition the people who built your DAGs and data pipeline logic to your new Airflow infrastructure, too. Your people and your code are portable because — in each of these scenarios — your Airflow infrastructure is 100% open source.

Particularly when an OSS project is successful and has a large and thriving community of committers, like Apache Airflow®, vendor lock-in just doesn’t happen. (Another example is Apache Kafka, the open-source stream-processing framework for which several cloud providers market infrastructure services, and Confluent — a vendor closely involved in developing and steering the project’s development — which offers fully managed services. So long as the connectors you build for Kafka maintain compatibility with Apache Kafka, they are portable across implementations.)

Engineering-first organizations should do it for themselves.

4. Engineering-first organizations have the talent and know-how to stand up and customize their own Airflow infrastructures.

It makes sense that some engineers believe they can achieve a better, more tailored result, at a lower cost, by customizing Airflow themselves rather than turning to a vendor. They reason that OSS Apache Airflow® gives them free, high-quality code they can use as a foundation to build their own software with the features — CI/CD integration, lineage, etc. — that their organizations need and basic Airflow lacks.

Our take: Of course they’re smart enough to do it. The question is, does the organization really want to invest the resources required? Does it want its skilled experts building and maintaining software that does not provide competitive differentiation? Put differently, is there another way to get the same tailored results without having to pay these opportunity costs?

Doing it yourself may make sense if the software you’re building is a saleable product, but what if it’s something for your own use that reinvents features and functions you could otherwise buy? One proven way to manage the risks inherent in all software development (going over budget, running over time, or failing outright) is to narrow the focus of your software projects: build the software you can’t buy, buy the software you can’t justify building.

A related issue is that in today’s job market, even organizations that pride themselves on engineering excellence struggle to recruit and retain top talent. If you want to build, customize, and maintain software for Airflow, you need top-flight Python skills. True, because Python is so popular, there are a lot of people out there who know how to use it. But for the same reason, there is also serious competition for the most skilled among them. And since Apache Airflow® is just one component of Airflow’s stack, rolling your own Airflow also means recruiting and retaining engineers who can deploy, scale, and maintain K8s and other pieces of the infrastructure. This is a particular problem for smaller organizations, and many not-so-small ones, that depend on just one or two outstanding engineers to maintain and improve their business-critical infrastructure software. But whatever the size of the talent pool, the biggest issue for organizations is how their skilled engineers spend their time and energy. Most businesses want these people adding value — not distracted by keeping the lights on.

If you’re already with MWAA or GCC, why switch?

5. Organizations that use Amazon Managed Workflows for Apache Airflow (MWAA) or Google Cloud Composer (GCC) are already paying for a managed Airflow service, so why would they switch?

It’s not surprising that some organizations believe that a managed Airflow infrastructure service from a company like Amazon or Google will give them most of what they need, and are skeptical that a “fully managed” service will provide useful benefits on top of this.

Our take: It’s true that a cloud Airflow infrastructure service like MWAA or GCC takes care of deploying and managing your Apache Airflow® infrastructure — including the Postgres RDBMS that Airflow uses to store metadata, a webserver, an external executor (Celery), and other infrastructure — and gives you the features you need to schedule and manage your data pipelines.

But neither service has a track record of staying current with new releases of Apache Airflow®; in fact, both typically run months behind the latest stable Airflow release. MWAA, for example, not only doesn’t support deferrable tasks (available since Airflow 2.2 in autumn 2021), but still doesn’t natively support Airflow 2.0’s REST API.

Another difference is that, unlike Astro, neither GCC nor MWAA offers a simple, turnkey upgrade experience. If users want to upgrade from one version of Airflow to another in MWAA, they need to create new clusters and migrate their existing installations over; with GCC, users must upgrade both their Airflow environments and Cloud Composer. Software development in both GCC and MWAA also tends to be much more complicated than on a platform like Astro, because neither offers customers a consistent, reproducible development environment, simplified integration with version-control and CI/CD platforms, or a secure, controlled path from dev to prod.

Scaling Airflow is difficult with both services, too: GCC requires a cluster per Airflow environment; MWAA does not natively support the KubernetesPodOperator, which constrains its ability to scale for more intensive tasks. Another big drawback is that neither MWAA nor GCC offers that central control plane (the single pane of glass) that customers can use to manage all of their Airflow deployments. And because neither MWAA nor GCC gives customers a way to extract and analyze lineage metadata, neither can provide observability across all Airflow deployments and data pipelines, to say nothing of the dataflows (orchestrated pipelines) that crisscross customers’ organizations.

The support experience you get with Astro is substantively different from what you get with GCC and MWAA. The specialists who support Astro don’t just know the Airflow codebase; in many cases they helped build it, as committers. For example, the bulk of the committers responsible for designing and maintaining Airflow 2.x’s more scalable, available scheduler work for Astronomer, so a customer with a scheduler problem can get help from the people who built it. With GCC and MWAA, by contrast, customers get front-line support from general-purpose support technicians who may not be familiar with the Airflow codebase and almost certainly have not committed to it.

In comparison to rolling your own Airflow, a managed Airflow infrastructure service can save you a lot of work if you’re trying to take Airflow for a test drive. But you also sign up for a subset of the same problems that you get when you opt to deploy and maintain your own Apache Airflow® infrastructure.

If you’ve already built it, why buy a service?

6. Organizations that have already built and customized their own Airflow infrastructures don’t want to buy the same thing all over again.

When you’ve already invested hundreds of person-hours standing up your own customized Airflow implementation — which works and runs well — why would you pay for yet another Airflow solution?

Our take: Because the work is just beginning. Now you’ve got to manage and maintain your Airflow infrastructure.

Successfully deploying and customizing an Airflow infrastructure is something like adopting a puppy: Once you’ve brought it home, you have to start caring for and cleaning up after it on a daily basis. You have to train it — “integrating” it into your home — and you have to puppy-proof, securing and hardening your home against mishaps. Expect to have to nurse your puppy when it gets sick and schedule preventative maintenance to keep it healthy.

If you’re someone who helped build your organization’s customized Airflow infrastructure, you might want to ask yourself: Do I really want to support this software for the rest of my time here? Because until your custom-built Airflow infrastructure goes away, it’s your puppy.

And if you’re an executive who helped drive a project to implement and customize Apache Airflow®, you probably recognize that a lot has changed since then, and continues to, not only in Airflow itself, but in the modern data stack — and that sunk costs are rarely a sound basis for decision making. Especially when the features and benefits that you couldn’t and still can’t get with OSS Airflow are now available in a fully managed service: a service to and from which you can transfer all your DAGs, tasks, and custom code, and which you can customize to suit even highly specialized requirements (see item 7).

Security needs always rule out a cloud service.

7. All organizations have demanding security requirements, and some implement their own bespoke security infrastructure solutions; a cloud Airflow service will not work with these solutions.

An organization may sometimes think it has to build Airflow itself because its policies and/or specialized requirements preclude the use of cloud services. Or it may believe that adapting a cloud service to suit its needs would be difficult or impossible, given its proprietary security infrastructure services and/or SLA requirements.

Our take: Running Airflow in an IaaS instance in the public cloud is roughly similar to running it in IaaS in an organization’s on-prem data center, and it is usually possible to configure Airflow-in-IaaS to integrate with an organization’s security infrastructure. This is true even if the organization uses a proprietary SSO service and/or proprietary security infrastructure technologies.

If an organization requires its data/workloads to be hosted in a single-tenant cloud environment, it can deploy Airflow-in-IaaS in a virtual private cloud (VPC). It can also configure other security mechanisms (such as VPC peering) to isolate its data and workloads from public traffic. The problem with taking this approach is that the organization then assumes responsibility for deploying, managing, securing, and maintaining the Airflow infrastructure, with all the costs that entails.

A middle-path option is to deploy Airflow in a fully managed service that runs in a single-tenant IaaS instance that the organization owns in the public cloud. Astro, the fully managed service, provides a secure, pre-built runtime environment and offers built-in services that organizations can use to integrate Airflow’s secrets backend with their existing security infrastructures. This usually makes it practicable to integrate Airflow with the organization’s SSO and security infrastructure services. (Astro supports integration with popular SSO services such as Azure Active Directory, Google Auth, and Okta.) In addition to built-in SSO integration, Astro supports essential security features like role-based access control, VPC peering, and IP allowlisting. Its support for PrivateLink (AWS) or Private Service Connect (GCP) connections also permits Astro to securely connect to sensitive resources that organizations deploy in separate VPC environments.

The context in which Astro runs might be (for example) an EC2 instance in AWS, a Google Compute Engine instance in GCP, or an Azure Virtual Machines instance in Azure. If you are able to integrate single-tenant IaaS Airflow with your proprietary SSO and security infrastructure technologies, you can integrate a fully managed Airflow service with these mechanisms.

In addition to the costs of a DIY approach to deploying your own Airflow infrastructure (in person-hours, brainpower, and other resources), there is also the question of performance. When you take into account the complexity inherent in managing, maintaining, updating, and securing an Airflow infrastructure, along with its constituent parts, a vendor whose sole competency is in managing Airflow infrastructure is going to do a much better job with this task than an organization whose competencies (and priorities) lie elsewhere.

The case for a fully managed Airflow service

Building on top of open-source Apache Airflow® is not the same as building on top of a scalable, manageable, sustainable data orchestration platform. There are two ways you can get this kind of platform: by building your own software to provide the features basic Apache Airflow® lacks, or by buying them with a fully managed Airflow service.
A fully managed Airflow service will in most cases be a better fit for an organization than an on-premises or IaaS deployment, regardless of the size and complexity of its workloads. The organization will save time and money with a fully managed service whether it deploys just a few DAGs or hundreds.
In an era of cloud services that maintain strict compatibility with open source projects, vendor lock-in isn’t the danger it once was. You can always move from one commercial Airflow service to another, or to an Airflow infrastructure you deploy on your own — and take both your code and your people with you.
For an organization that prides itself on engineering excellence, it generally makes more sense to focus its talent on custom-engineering business-critical software than on maintaining infrastructure. This is even more important in a climate of intense competition for talented developers, data engineers, and other skilled technicians.
Managed infrastructure services like MWAA and GCC automate much of the work involved in deploying and maintaining Airflow and its infrastructure dependencies. But these kinds of Airflow infrastructure services do not give you a central control plane you can use to observe and control your Airflow deployments — or any of the other pieces you will need to achieve sustainable data orchestration at scale.
It may make sense to custom-build your own Airflow infrastructure if your needs are highly specialized, or if you believe you can realize some useful, if temporary, business benefits. But ultimately, a fully managed Airflow service is the most pragmatic and sustainable choice for business growth, providing SSO integration, version control and CI/CD integration, data lineage collection, and other key features that are time-consuming and costly to build yourself.
The fact that a fully managed Airflow service running in the IaaS cloud is, in general, as configurable as Airflow infrastructure running in the on-prem data center undercuts the claim that it is difficult or impossible to configure a cloud service to integrate with a security-conscious organization’s specialized infrastructure.

As an alternative to telling you about the benefits of a fully managed Airflow service, why not give us an opportunity to show you? Get started by requesting a demo of Astro today!

* As of May 2022, Astro is available for Amazon Web Services (AWS) and Google Cloud, with Microsoft Azure support slated for this summer.