10 Best Practices for Modern Data Orchestration with Airflow

  • Steve Swoyer

This article identifies a core set of best practices conducive to standing up, scaling, and growing a sustainable enterprise data integration ecosystem founded on Apache Airflow — an ecosystem that supports a wide range of operational applications and use cases. We’ve worked with hundreds of customers that use Airflow, and these are the practices we see most associated with success in accelerating the flow of trusted data across organizations. We think of them as required components of modern data orchestration.

1. Standardize your Airflow prod environment

The best way to scale your Airflow production environment is by taking advantage of containerization and running Airflow on Kubernetes.

This container-based deployment pattern gives you built-in reusability — your Airflow tasks will run in K8s pods, and DAG code is managed centrally via a Docker image registry — and also simplifies versioning and maintenance. It is a proven means of deploying, maintaining, and scaling Airflow in production. It ensures your DAGs run on resilient infrastructure and behave and perform consistently every time they run.
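
If you go the Kubernetes route, one common pattern is to give each task its own pod and container image via the KubernetesPodOperator. The sketch below is illustrative only: it assumes the apache-airflow-providers-cncf-kubernetes package is installed, and the namespace, image, and registry names are hypothetical placeholders.

```python
# A minimal sketch (not a drop-in config) of a task that runs in its own
# Kubernetes pod. Assumes the apache-airflow-providers-cncf-kubernetes
# package is installed; the namespace and image below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="k8s_pod_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform-pod",
        namespace="airflow",                  # hypothetical namespace
        image="my-registry/transform:1.0.0",  # hypothetical image tag
        cmds=["python", "-m", "transform"],
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod when the task finishes
    )
```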

Granted, this is not exactly a lightweight deployment pattern, infrastructure-wise. In addition to standard Airflow components like the scheduler, web server, and metadata database, you will need to install, configure, and maintain a K8s cluster and potentially other supporting infrastructure to scale your Airflow instance. You will have to keep up with any changes to this stack, not only with respect to version updates, but bug fixes and patches for security vulnerabilities, too.

This is why you might consider using an Airflow infrastructure service. With Airflow-in-the-cloud, the service provider handles all of these problems for you, so that you can focus on your pipelines.

2. Standardize your Airflow dev environment, too

A container-based pattern is the best way to scale and maintain pipeline development on Airflow, too.

For one thing, running your DAGs in a containerized Airflow environment simplifies software dependency management, which is one of the most challenging aspects of using Python at scale. Broken dependencies force data engineers to design their DAGs to least-common-denominator templates, resulting in slow, inefficient task execution and — in some cases — brittle orchestration.

For example, many Python packages rely on extensions to boost performance. Sometimes these are written in C, which means they must be compiled for a specific instruction set architecture (e.g., x86) and operating system ABI (e.g., Linux). If your DAGs expect to use C extensions with NumPy, scikit-learn, or pandas, those extensions have to be available in the Python environment where your tasks run. If you deploy your Airflow DAGs in containers, however, you can build all of the software you need into your runtime images.

At Astronomer, we’ve made this even easier. Our open-source Astro CLI standardizes this process, integrating with your CI/CD toolchain and giving users a local dev environment — complete with web server, Airflow scheduler, Postgres database, and local executor — in which they can build, test, and deploy their Airflow DAG images. Users start by creating an Astro Project, which scaffolds dedicated folders for their DAG files, plugins, and dependencies. Astro Projects integrate with Git, which enables users to clone project repositories, check out specific branches of code (e.g., DAGs, tasks, operators) and, if necessary, change or customize them. This makes it easier to find and reuse trusted code in the images they build and test locally. Once they’re satisfied with the features and reproducibility of their images, they can use their CI/CD tool to deploy that code to their production Airflow environment. If you’re using our Astro platform, users can also use the Astro CLI to deploy directly to their Astro Airflow environments.

The Astro CLI is available under the Apache 2.0 license. As a general rule, the DAGs, tasks, open-source Airflow operators, and data pipeline code you build into your containers will be portable across multiple environments, from our own Astro managed service to other Airflow cloud infrastructure services, the infrastructure-as-a-service (IaaS) cloud, or K8s running in your own on-prem data center.

3. Get current-ish and keep current-ish

Not staying up-to-date with current releases may cost you money in the long run. Your Airflow environment may become increasingly difficult to scale, comparatively difficult to develop for, and increasingly expensive to maintain.

The first and most obvious reason for this is the inevitable need for bug fixes and, especially, security patches. There have been over 3,100 commits since Airflow 2.0, the vast majority of which are bug and security fixes. Keeping current avoids having to troubleshoot issues that someone else has already tripped over and that the Airflow community has already resolved. Keeping current makes it easier to respond to emergent issues — for example, a security vulnerability in a library or a package Airflow depends on — that need to be patched ASAP.

Staying current also makes it easier to keep up with feature enhancements in Airflow. As a locus of continuous innovation, Airflow continues to mature rapidly. Sixteen months ago, Airflow 2.0 introduced a new high-availability scheduler that eliminated a preexisting single point of failure and improved performance — especially for short-running tasks. Version 2.0 also introduced the new Taskflow API, which makes it easier to move data between tasks, as in an ETL or ELT workflow. Before that, the Airflow 1.10 series introduced improved support for managing cross-DAG dependencies via the ExternalTaskSensor and ExternalTaskMarker sensors. This makes it much easier to break up monolithic DAGs to spin out concurrent tasks and to better take advantage of Airflow’s built-in parallelism. And just six months ago, Airflow 2.2 introduced support for deferrable operators, making it easier for data engineers to design tasks that run asynchronously.
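
To illustrate the cross-DAG dependency pattern mentioned above, here is a minimal sketch of a downstream DAG that waits on a task in an upstream DAG before doing its own work. The DAG and task IDs are hypothetical, and both DAGs are assumed to run on the same schedule.

```python
# A minimal sketch of a cross-DAG dependency using ExternalTaskSensor.
# The upstream DAG and task IDs are hypothetical; both DAGs are assumed to
# run on the same schedule, so the sensor matches logical dates one-to-one.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.sensors.external_task import ExternalTaskSensor


@dag(start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def reporting_dag():
    wait_for_load = ExternalTaskSensor(
        task_id="wait_for_load",
        external_dag_id="ingestion_dag",       # hypothetical upstream DAG
        external_task_id="load_to_warehouse",  # hypothetical upstream task
        timeout=60 * 60,
    )

    @task
    def build_report():
        print("building report from freshly loaded data")

    wait_for_load >> build_report()


reporting_dag()
```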

Airflow 2.3 brings with it support for a vitally important new feature: dynamic task mapping. And there’s a great deal more in the feature pipeline, too, including support for dynamic task groups.
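
To give a flavor of dynamic task mapping, here is a minimal sketch (assuming Airflow 2.3+) in which the number of mapped task instances is decided at runtime from the output of an upstream task; the batch-listing logic is a hypothetical stand-in.

```python
# A minimal sketch of dynamic task mapping (Airflow 2.3+): one mapped task
# instance is created at runtime for each element returned by list_batches().
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def dynamic_mapping_example():
    @task
    def list_batches():
        # Hypothetical stand-in: in practice this might list files in S3,
        # rows in a control table, etc.
        return ["batch_a", "batch_b", "batch_c"]

    @task
    def process(batch):
        print(f"processing {batch}")

    # expand() creates one parallel task instance per batch at runtime.
    process.expand(batch=list_batches())


dynamic_mapping_example()
```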

In other words: it’s a good idea to get current and stay current on Airflow, and to make sure that the way you design your orchestration platform makes it as simple as possible to do this.

4. Design your DAGs to take advantage of Airflow’s built-in parallel processing…

Prior to version 2, it was not especially easy to parallelize DAGs in Airflow: it was usually easier, and sometimes necessary, to combine all tasks into large, monolithic DAGs. This made it difficult to break up tasks that were not dependent on one another in order to schedule them to run at the same time — a difficulty that, in practice, usually meant some tasks had to wait on others to finish before firing off.

Airflow 2.0 introduced a host of changes that helped improve performance, starting with a revamped Airflow scheduler. In earlier versions, the Airflow scheduler had trouble with short-running tasks, resulting in higher-than-desirable scheduling latency; today, micro-batch processing isn’t just possible in Airflow, it’s an established pattern. More recently, Airflow 2.2 introduced support for deferrable operators, which simplifies how Airflow schedules and manages long-running tasks. In May 2022, Airflow 2.3 introduced support for dynamic task mapping, which gives Airflow’s scheduler improved capabilities for initiating and dynamically scheduling tasks, as well as for maximizing Airflow’s built-in parallelism. With each new release, Airflow has been making it easier for you to design your DAGs so that non-dependent tasks run concurrently.

These and other improvements make it even easier for you to design data pipelines that mirror your business workflows.
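
As a simple illustration of this kind of fan-out, the sketch below declares three extract tasks with no dependencies on one another, so the scheduler is free to run them concurrently (subject to your parallelism and pool settings); the task commands and source names are hypothetical placeholders.

```python
# A minimal sketch of non-dependent tasks declared so the scheduler can run
# them in parallel; the sources and bash commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="parallel_fan_out_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")

    # These three tasks have no dependencies on one another, so Airflow can
    # schedule them concurrently (subject to parallelism and pool limits).
    extracts = [
        BashOperator(task_id=f"extract_{name}", bash_command=f"echo extract {name}")
        for name in ("orders", "customers", "payments")
    ]

    load = BashOperator(task_id="load", bash_command="echo load")

    start >> extracts >> load
```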

5. …And to push workload processing “out” closer to where your data lives

Most of us know data engineers who design their DAGs to extract and bulk-move large volumes of data. To an extent, this is consistent with the adoption of extract, load, transform (ELT) as the preferred pattern for acquiring and using data with Airflow. However, this preference for ELT should be balanced against other factors, not the least of which is cost. In the on-premises-only data center, moving large volumes of data was merely wasteful; in the cloud, moving large volumes of data between cloud regions (or outside of a provider’s cloud environment) is both wasteful and costly.

Reducing costs and improving efficiency are the two main reasons you want to “push” the actual processing of your data engineering workload “out” closer to where the data you’re working on lives. Airflow has dedicated providers for Databricks, dbt, Google BigQuery, Fivetran, MySQL, PostgreSQL, Oracle, Redshift, Snowflake, and SQL Server, as well as hooks for AWS S3, Docker, HDFS, Hive, MySQL, PostgreSQL, Presto, Samba, Slack, SQLite, and others.

These and similar providers make it easier to schedule and run tasks in upstream compute engines. Across these providers, Airflow offers dozens of operators that make it easier to move or copy data between cloud object storage services (such as AWS S3 and Google Cloud Storage) and local cloud engines. To cite a few examples: You can use Airflow’s S3CopyObjectOperator to create a copy of an object (e.g., a Parquet file) already stored in S3, or the S3ToSnowflakeOperator to load data from S3 into a Snowflake table. Similarly, Airflow’s S3ToRedshiftOperator makes it easier to copy data from S3 into AWS Redshift. The availability of these tools (and of dozens of others like them) makes it simple to use Airflow to schedule and orchestrate these push-out operations.
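
For instance, here is a minimal sketch of pushing a load down to Snowflake with the S3-to-Snowflake transfer operator, so that Snowflake performs the COPY itself rather than the data passing through Airflow workers. It assumes the Snowflake provider is installed, and the connection ID, stage, schema, table, and S3 keys are hypothetical placeholders.

```python
# A minimal sketch of "pushing processing out": Snowflake runs the COPY from
# S3 itself, so the data never passes through the Airflow workers. Assumes
# apache-airflow-providers-snowflake is installed; the connection ID, stage,
# schema, table, and S3 keys are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.transfers.s3_to_snowflake import (
    S3ToSnowflakeOperator,
)

with DAG(
    dag_id="s3_to_snowflake_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = S3ToSnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",
        s3_keys=["orders/{{ ds }}/orders.csv"],
        stage="MY_S3_STAGE",
        schema="RAW",
        table="ORDERS",
        file_format="(type = 'CSV', field_delimiter = ',', skip_header = 1)",
    )
```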

6. Design your Airflow environments for “micro-orchestration”

At Astronomer, we frequently see customers starting with a “macro-orchestration” mindset, opting to create one big Airflow environment and expecting that it will serve their needs. Over time, they usually end up creating separate team- or use case-specific Airflow environments to complement the big one.

A monolithic orchestration environment isn’t supple enough to accommodate the diverse requirements of all users, especially self-service users. This is why we believe “micro-orchestration” is the way to go.

For example, you might create multiple, distributed Airflow environments to support individual regions, business units, and/or teams. This logic of distribution has things in common with both established microservices and the emerging data mesh architecture. As with building microservices, the idea is to customize each Airflow environment to address a set of usually discrete or function-specific requirements: e.g., the requirements of a specific business functional area, or of separate data engineering, data science, ML engineering, etc. teams. At Astronomer, we work with customers who have hundreds (and are well on their way to having thousands) of Airflow environments distributed across their organizations.

Creating multiple, use case- or function-specific Airflow environments is also roughly consistent with data mesh architecture and its emphasis on decentralizing responsibility for the creation, maintenance, and ownership of different kinds of data products — the parallel here being the DAGs, tasks, custom operators, and custom code each team uses in its data pipelines. In a decentralized scheme, each team has at least some freedom to develop its own local knowledge and to pursue local priorities. Its primary responsibility is to ensure its data products conform to the standards formalized by the organization.

If you embrace micro-orchestration, you’re going to need to strictly follow Best Practices Nos. 7 & 8. Codifying standards for common or recurring DAGs, tasks, custom operators, and other code ensures consistency and promotes reuse, as does integrating Airflow development with your CI/CD processes.

Remember, too, that spinning up Airflow on internal infrastructure isn’t a turnkey task — and that maintaining and securing multiple, distributed Airflow environments is challenging. So, too, is the problem of obtaining a single view of your data pipelines and the data flows they support. Each of these issues underscores the benefit of deploying Airflow in the context of a fully managed service.

7. Maximize reuse and reproducibility

Formalizing standards for reuse makes it easier for your decentralized teams — and self-service users, in particular — to use Airflow collaboratively. You might standardize common Airflow custom operators to ensure that they behave the same way every time they run. Or you might identify and productionize different kinds of recurring data integrations — e.g., recurring data movement and/or transformation tasks, feature extractions, etc. — so they can be accessed and reused by self-service users. It’s worth noting that a managed cloud infrastructure service like Astro can help simplify this by imposing a structure (as well as exposing built-in controls and enforcing standards) that provides some of the supporting conditions you need to achieve reuse and reproducibility.

So, a managed service can give you an assist, but it is still incumbent upon you to be disciplined about promoting a culture that prioritizes reuse and reproducibility. For example, at Astronomer, our in-house data team has a mandate to identify and formalize useful custom operators, tasks, scripts, and other code so we can make them available to all potential users. One obvious benefit is that everybody uses the same custom operators, and these operators behave the same way every time they run. Another is that we don’t have to worry about maintaining hundreds of custom operators — or fixing them when they break because libraries, APIs, or other dependencies change.

By formalizing and standardizing reusable operators, tasks, and custom data pipeline code, and by instantiating them in a managed context of some kind — your version control system, a dedicated feature store, etc. — you likewise promote reuse, reproducibility, and trust.

8. Integrate Airflow with your CI/CD tools and processes

Internally at Astronomer, we version and maintain Airflow DAGs in our enterprise GitHub repository. The Astro CLI combines these with the Astro Runtime into a single image. We publish new Airflow images — basically Airflow + data engineering code + dependencies — to our image registry.

However, we also try to identify reusable code and add this to the Astro projects that are maintained in our GitHub version control system and managed by CI/CD. Reusable code can include not only DAGs, tasks, and custom operators, but any custom code, such as Python scripts or SQL scripts, that gets called by these operators.

In addition to promoting reuse, this enables us to version our tasks, data pipelines, and custom code, and simplifies maintenance. Data engineers have a single place to go — our enterprise GitHub repository — for the reusable tasks and data pipeline code they use in their DAGs. This data pipeline logic is always up to date. Once a data engineer pulls the latest Astro Project from our enterprise GitHub repository and updates her DAGs or tasks, she can use the Astro CLI to build and publish a new image to our image registry in Astro. From there, the new image gets picked up and moves through our CI/CD process.

Chances are, your data engineers, data scientists, and other experts are continuously reinventing the wheel, e.g., maintaining different kinds of custom templates to create their tasks and pipeline code (or, worse still, rewriting this code from scratch). In this case, do as we say and as we do: integrate your Airflow development with your CI/CD and look to reuse as much of your Airflow-related code as possible.

9. Use Airflow’s Taskflow API to move data between tasks

In practice, DAG authors commonly use XCom as a mechanism to move data between tasks.

Starting with Airflow 2.0, the new Taskflow API provides an abstracted, programmatic means to pass data between tasks within a DAG — for example, as part of an ETL- or ELT-type workflow. Behind the scenes, Taskflow still uses XCom; its API just abstracts this dependency.
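
Here is a minimal TaskFlow-style ETL sketch (Airflow 2.0+) in which a small extract result is passed from task to task; the extract, transform, and load bodies are hypothetical placeholders.

```python
# A minimal sketch of passing a small payload between tasks with the Taskflow
# API (Airflow 2.0+). The data still travels via XCom behind the scenes, so
# keep payloads small; the extract/transform/load bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_etl_example():
    @task
    def extract():
        return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 7.5}]

    @task
    def transform(orders):
        return {"total": sum(o["amount"] for o in orders)}

    @task
    def load(summary):
        print(f"daily total: {summary['total']}")

    load(transform(extract()))


taskflow_etl_example()
```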

Taskflow is useful for moving moderate amounts of data; in general, however, you don’t want to use it to move large volumes of data. For that, try to take advantage of Airflow operators that are optimized for cloud infrastructure services (e.g., CloudDataTransferServiceCreateJobOperator for Google Cloud, or S3CopyObjectOperator for AWS) or for specific cloud SaaS platforms (e.g., the S3ToSnowflakeOperator simplifies loading data from S3 into Snowflake).

Another important Airflow change — as of v1.10.12 — is that XComs are no longer tied to Airflow’s metadata database: now you can create custom XCom backends to support larger data sets. For example, you can use S3 as a backend to serialize data as it is passed (via XComs) between tasks. For this use case, an object storage layer such as S3 scales better than the metadata database. The upshot is that this gives you more control over where your data gets stored between tasks. Keep in mind, too, that data dependency management is a focus of active and ongoing innovation in Apache Airflow.
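
As a rough sketch of what a custom XCom backend can look like, the class below serializes larger payloads to S3 and stores only a reference string in the metadata database. It assumes the Amazon provider is installed and an AWS connection is configured; the bucket name is a hypothetical placeholder, and you would point Airflow at the class via the xcom_backend configuration setting.

```python
# A rough sketch of a custom XCom backend that offloads values to S3 and keeps
# only a reference string in the metadata database. Assumes
# apache-airflow-providers-amazon is installed and an AWS connection is
# configured; the bucket name below is a hypothetical placeholder.
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3XComBackend(BaseXCom):
    PREFIX = "xcom_s3://"
    BUCKET_NAME = "my-xcom-bucket"  # hypothetical bucket

    @staticmethod
    def serialize_value(value, **kwargs):
        # Offload JSON-serializable containers to S3; store a reference string.
        if isinstance(value, (dict, list)):
            key = f"xcom/{uuid.uuid4()}.json"
            S3Hook().load_string(
                json.dumps(value),
                key=key,
                bucket_name=S3XComBackend.BUCKET_NAME,
                replace=True,
            )
            value = S3XComBackend.PREFIX + key
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        # If the stored value is a reference, fetch the real payload from S3.
        if isinstance(value, str) and value.startswith(S3XComBackend.PREFIX):
            key = value[len(S3XComBackend.PREFIX):]
            value = json.loads(
                S3Hook().read_key(key=key, bucket_name=S3XComBackend.BUCKET_NAME)
            )
        return value
```

With a backend like this in place, your DAG code does not change: tasks keep returning and receiving values as usual, and only the storage location of the payload differs.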

The takeaway: Use Airflow 2.0’s Taskflow API when you want to move moderate amounts of data between tasks. It provides a better, more intuitive, and more functional way to use XComs to transfer data. If you are moving very large volumes of data — and/or if you are designing scalable, idempotent ETL/ELT data flows — design your DAGs to use Airflow’s cloud-specific operators to persist data to intermediary (i.e., local) cloud storage between tasks.

10. Eyes on the prize: observability and modern data orchestration

Modern data orchestration does two things to help you see and make sense of your data flows.

First, it gives you low-level observability into your data pipelines — i.e., into the point-to-point connections that comprise the data-delivery plumbing of your business. It allows the people who need to work at this level of detail to quickly and easily identify problems at their sources and devise remediations. This requires lineage and understanding at the data-set level.

Second, it gives you the ability to observe, understand, and abstract your data pipelines as data flows, meaning the networks of data pipelines that knit together and deliver data across the distinct regions, functional areas, decentralized teams, and practices that comprise your organization. This requires lineage and understanding at the process level.

Both are important to understanding the health of your data ecosystem, and orchestration provides the pivotal vantage point. For example, your data engineers and ops people need to be able to see into the nitty-gritty of your data plumbing in order to do their work, but other groups of experts expect to think and work with higher-level abstractions. By making it possible for you to map your data flows to your business workflows, modern data orchestration enables these experts to observe, understand, and improve your existing data products, as well as optimize the delivery of the time-sensitive information supporting your operational decision-making. It gives experts a conceptual starting point for transforming and optimizing not just data flows, but, if applicable, the business workflows they are intertwined with.

In the same way, modern data orchestration makes it easier for your experts to design and deliver completely new data flows to support novel customer-facing products and services, or to enable expansion into new markets and regions.

All of this begins with the ability to capture and analyze the lineage metadata that gets generated each time your Airflow DAGs run. In other words, get serious about observability: look to capture and analyze lineage metadata, with the goal of developing this capacity to observe and conceptualize your data pipelines as different types of business-oriented abstractions. To cite one example, Astronomer’s Astro platform leverages OpenLineage, an open-source standard for data lineage and other types of metadata. Astro uses OpenLineage to automatically extract data lineage events as your DAGs run.

Conclusion: The challenge, and the opportunity, of modern data orchestration

Viewed up close, the five or 50 or hundreds or thousands of Airflow-powered pipelines that currently pump data through your organization might look like a confused tangle of wiring. But modern data orchestration replaces this tangled, pipeline-centric view of scheduling and dependency management by abstracting your data pipelines as data flows. And it allows you to “orchestrate” these data flows much as a conductor interprets and directs the flow of music, shaping its phrasing, managing the entry and exit of players and soloists, and so on.

Think of each of the best practices described above as building on and reinforcing the others in this regard. For example, if you run your production DAGs in containers, it becomes practicable to build into them (as well as version) the instrumentation required to generate data lineage. And if you manage your Airflow runtime instances from a central control plane, it’s much easier to capture and analyze this metadata. (In fact, our Astro cloud service does this automatically for you.) Similarly, if you put your reusable code — e.g., tasks, operators, and custom data pipeline logic — into your Git repository, you not only simplify maintenance but also make it easier to share that code with internal practitioners for reuse. If you design your DAGs to take advantage of Airflow’s built-in parallelism, your tasks will not only run better and faster, but the dependencies between them will be less brittle. And if you keep as current as possible on Airflow, you’re always in a position to benefit from improvements in new releases.

This is easier said than done, of course. But it’s exactly what we do with our own data pipelines at Astronomer, and it’s how we’ve designed — and plan to keep evolving — our Astro platform. The beauty of Astro is that, as a fully managed service, it makes this capacity to observe, understand, and orchestrate data flows available to organizations of all sizes. It doesn’t matter whether you use Airflow to schedule just a few dozen DAGs or many hundreds: Astro enables modern data orchestration either way.

If you’d like to learn more about how to optimize and scale Airflow — get in touch with our experts today.
