Apache Airflow was created by Airbnb’s Maxime Beauchemin as an open-source project in late 2014. It was brought into the Apache Software Foundation’s Incubator Program in March 2016 and saw growing success in the wake of Maxime’s well-known “The Rise of the Data Engineer” blog post. In January of 2019, the Foundation announced Airflow as a Top-Level Apache Project, and it is now widely considered the industry’s leading workflow orchestration solution.
Airflow’s strength as a tool for dataflow automation has grown for a few reasons:
1. Proven core functionality for data pipelining. Airflow competitively delivers in scheduling, scalable task execution and UI-based task management and monitoring.
2. An extensible framework. Airflow was designed to make data integration between systems easy. Today it supports over 55 providers, including AWS, GCP, Microsoft Azure, Salesforce, Slack and Snowflake. Its ability to meet the needs of simple and complex use cases alike makes it easy both to adopt and to scale.
3. A large, vibrant community. Airflow boasts thousands of users and over 1,600 contributors who regularly submit features, plugins, content and bug fixes to ensure continuous momentum and improvement. In 2020, Airflow reached 10,000 commits and 18,000 GitHub stars.
As Apache Airflow grows in adoption, there’s no question that a major release to expand on the project’s core strengths has been long overdue. As users and members of the community, we at Astronomer are delighted to announce that Airflow 2.0 is in the alpha testing stage and is scheduled to be generally available in December of 2020.
Over the last year, various organizations and leaders within the Airflow Community have been in close collaboration refining the scope of Airflow 2.0 and actively working towards enhancing existing functionality and introducing changes to make Airflow faster, more reliable and more performant at scale.
In preparation for the highly anticipated release, we’ve put together an overview of major Airflow 2.0 features below. We’ll publish a series of followup posts over the next few weeks that dive deeper into some of those changes.
Major Features in Airflow 2.0
Airflow 2.0 includes hundreds of features and bug fixes both large and small. Many of the significant improvements were influenced and inspired by feedback from Airflow's 2019 Community Survey, which garnered over 300 responses.
A New Scheduler: Low-Latency + High-Availability
The Airflow Scheduler as a core component has been key to the growth and success of the project following its creation in 2014. As Airflow matures and the number of users running hundreds of thousands of tasks grows, however, we at Astronomer saw great opportunity in driving a dedicated effort to improve upon Scheduler functionality and push Airflow to a new level of scalability.
In fact, "Scheduler Performance" was the most requested improvement in the Community Survey. Airflow users have found that while the Celery and Kubernetes Executors allow for task execution at scale, the Scheduler often limits the speed at which tasks are scheduled and queued for execution. While effects vary across use cases, it's not unusual for users to grapple with induced downtime and a long recovery after a failure, and to experience high latency between short-running tasks.
It is for that reason that we’re beyond ecstatic to introduce a new, refactored Scheduler with the Airflow 2.0 release. The most impactful Airflow 2.0 change in this area is support for running multiple schedulers concurrently in an active/active model. Coupled with DAG Serialization, Airflow’s refactored Scheduler is now highly available, significantly faster and horizontally scalable. Here's a quick overview of new functionality:
1. Horizontal Scalability. If task load on 1 Scheduler increases, a user can now launch additional "replicas" of the Scheduler to increase the throughput of their Airflow Deployment.
2. Lowered Task Latency. In Airflow 2.0, even a single scheduler has been shown to schedule tasks much faster with the same level of CPU and memory.
3. Zero Recovery Time. Users running 2+ Schedulers will see zero downtime and no recovery time in the case of a failure.
4. Easier Maintenance. The Airflow 2.0 model allows users to make changes to individual schedulers without impacting the rest and inducing downtime.
The Scheduler's now-zero recovery time and readiness for scale eliminate it as a single point of failure within Apache Airflow. Given the importance of this change, we'll be putting out a series of followup blog posts that dive deeper into the story behind these improvements alongside an architecture overview and benchmark metrics.
Full REST API
Data engineers have been using Airflow’s “Experimental API” for years, most often for triggering DAG runs programmatically. With that said, the API has historically remained narrow in scope and lacked critical elements of functionality, including a robust authorization and permissions framework.
Airflow 2.0 introduces a new, comprehensive REST API that sets a strong foundation for a new Airflow UI and CLI in the future. Additionally, the new API:
- Makes for easy access by third-parties
- Is based on the Swagger/OpenAPI Spec
- Implements CRUD (Create, Read, Update, Delete) operations on all Airflow resources
- Includes authorization capabilities (parallel to those of the Airflow UI)
These capabilities enable a variety of use cases and create new opportunities for automation. For example, users now have the ability to programmatically set Connections and Variables, show import errors, create Pools, and monitor the status of the Metadata Database and Scheduler.
For more information, refer to the Airflow REST API documentation.
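As a rough illustration of the programmatic access described above, the sketch below builds (but does not send) a request that triggers a DAG run through the new API's `/api/v1/dags/{dag_id}/dagRuns` endpoint. The host, DAG id and credentials are placeholders, the helper name `build_trigger_request` is hypothetical, and the example assumes the deployment has been configured to accept basic authentication, which may not match your setup.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build (but do not send) a POST request that asks Airflow to start a DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # The request body carries optional run-level configuration for the DAG run.
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    # Assumption: the API is configured for HTTP basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values for a local test deployment.
request = build_trigger_request(
    "http://localhost:8080", "example_dag", "admin", "admin", conf={"run_type": "manual"}
)

# To send the request against a running webserver:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes the call easy to inspect and test without a live Airflow instance.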
Smart Sensors
In the context of dependency management in Airflow, it’s been common for data engineers to design data pipelines that employ Sensors. Sensors are a special kind of Airflow Operator whose purpose is to wait on a particular trigger, such as a file landing at an expected location or an external task completing successfully. Although Sensors are idle for most of their execution time, they nonetheless hold a “worker slot” that can cost significant CPU and memory.
The “Smart Sensor” introduced in Airflow 2.0 is an “early access” (subject to change) foundational feature that:
- Executes as a single, “long running task”
- Checks the status of a batch of Sensor tasks
- Stores sensor status information in Airflow’s Metadata DB
This feature was proposed and contributed by Airbnb based on their experience running an impressively large Airflow Deployment with tens of thousands of DAGs. For them, Smart Sensors reduced the number of occupied worker slots by over 50% for concurrent loads in peak traffic.
To learn more, refer to the Airflow docs on Smart Sensors.
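To make the batching idea concrete, here is a toy model in plain Python. It is not Airflow's implementation: the `SensorWork` class, the `run_smart_sensor_pass` function, the status strings and the dictionary standing in for the Metadata DB are all assumptions made for illustration. It only shows how one loop can check many waiting conditions and record their status, instead of each sensor holding its own worker slot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SensorWork:
    """One registered sensor: a task id plus the condition it is waiting on."""
    task_id: str
    poke: Callable[[], bool]


def run_smart_sensor_pass(batch: List[SensorWork], status_store: Dict[str, str]) -> None:
    """Check every unfinished sensor in the batch once and record its status."""
    for work in batch:
        if status_store.get(work.task_id) == "success":
            continue  # already satisfied, nothing left to check
        status_store[work.task_id] = "success" if work.poke() else "sensing"


# Hypothetical conditions: each sensor waits for a file path to appear in a set.
landed_files = {"/data/a.csv"}
batch = [
    SensorWork("wait_for_a", lambda: "/data/a.csv" in landed_files),
    SensorWork("wait_for_b", lambda: "/data/b.csv" in landed_files),
]

status_store: Dict[str, str] = {}  # stands in for the Metadata DB
run_smart_sensor_pass(batch, status_store)
first_pass = dict(status_store)

landed_files.add("/data/b.csv")  # the second file arrives later
run_smart_sensor_pass(batch, status_store)
```

In a real deployment the pass would repeat on an interval inside the single long-running task; here it is called twice by hand to show the status changing.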
TaskFlow API
While Airflow has long excelled at scheduling and running idempotent tasks, it has historically lacked a simple way to pass information between tasks. Let's say you are writing a DAG to train some set of Machine Learning models. A first set of tasks in that DAG generates an identifier for each model and a second set of tasks outputs the results generated by each of those models. In this scenario, what's the best way to pass output from the first set of tasks to the second?
Historically, XComs have been the standard way to pass information between tasks and would be the most appropriate method to tackle the use case above. As most users know, however, XComs are often cumbersome to use and require redundant boilerplate code to set return variables at the end of a task and retrieve them in downstream tasks.
With Airflow 2.0, we're excited to introduce the TaskFlow API and Task Decorator to address this challenge. The TaskFlow API implemented in 2.0 makes DAGs significantly easier to write by abstracting the task and dependency management layer from users. Here's a breakdown of incoming functionality:
1. A framework that automatically creates PythonOperator tasks from Python functions and handles variable passing. Now, variables such as Python Dictionaries can simply be passed between tasks as return and input variables for cleaner and more efficient code.
2. Task dependencies are abstracted and inferred as a result of the Python function invocation. This again makes for much cleaner and simpler DAG writing for all users.
3. Support for Custom XCom Backends. Airflow 2.0 includes support for a new xcom_backend parameter that will allow users to pass even more objects between tasks. Out-of-the-box support for S3, HDFS and other tools is coming soon.
It's worth noting that the underlying mechanism here is still XCom and data is still stored in Airflow’s Metadata Database, but the XCom operation itself is hidden inside the PythonOperator and is completely abstracted from the DAG developer. Now, Airflow users can pass information and manage dependencies between tasks in a standardized Pythonic manner for cleaner and more efficient code.
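The idea of inferring dependencies from function invocation can be sketched with a toy decorator in plain Python. This is not Airflow's implementation (in Airflow 2.0 the real decorator is imported from `airflow.decorators`); the `ToyDag` and `TaskOutput` classes and the task names below are hypothetical and exist only to show the mechanism. Calling a decorated function does not run it. Instead, the call registers a task and returns a placeholder, and passing that placeholder to another task records an upstream/downstream edge.

```python
import functools
from typing import List, Set, Tuple


class TaskOutput:
    """Placeholder for a task's future return value (stands in for an XCom reference)."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id


class ToyDag:
    """Collects the tasks and dependency edges discovered while wiring the pipeline."""

    def __init__(self) -> None:
        self.tasks: List[str] = []
        self.edges: Set[Tuple[str, str]] = set()


def task(dag: ToyDag):
    """Turn a plain function into a 'task' whose calls wire up the DAG."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            task_id = func.__name__
            dag.tasks.append(task_id)
            # Any argument that is another task's output implies a dependency.
            for value in list(args) + list(kwargs.values()):
                if isinstance(value, TaskOutput):
                    dag.edges.add((value.task_id, task_id))
            return TaskOutput(task_id)

        return wrapper

    return decorator


dag = ToyDag()


@task(dag)
def generate_model_id():
    return {"model_id": "model_a"}


@task(dag)
def train_model(model_info):
    return {"model_id": model_info["model_id"], "status": "trained"}


@task(dag)
def report_results(results):
    print(results)


# Wiring the pipeline reads like ordinary Python; no explicit XCom calls
# or dependency operators are needed.
model_info = generate_model_id()
results = train_model(model_info)
report_results(results)
```

After the three calls, `dag.edges` holds the two dependencies implied by the data flow, which is the behaviour that point 2 above describes.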
Task Groups
Airflow SubDAGs have long been limited in their ability to provide users with an easy way to manage a large number of tasks. The lack of parallelism coupled with confusion around the fact that SubDAG tasks can only be executed by the Sequential Executor, regardless of which Executor is employed for all other tasks, made for a challenging and unreliable user experience.
Airflow 2.0 introduces Task Groups as a UI construct that doesn’t affect task execution behaviour but fulfills the primary purpose of SubDAGs. Task Groups give a DAG author the management benefits of “grouping” a logical set of tasks with one another without having to look at or process those tasks any differently.
While Airflow 2.0 will continue to support the SubDAG Operator, Task Groups are intended to replace it in the long-term.
Independent Providers
One of Airflow’s signature strengths is its sizable collection of community-built Operators, Hooks, and Sensors - all of which enable users to integrate with external systems like AWS, GCP, Microsoft Azure, Snowflake, Slack and many more.
Providers have historically been bundled into the core Airflow distribution and versioned alongside every Apache Airflow release. As of Airflow 2.0, they are split into their own airflow/providers directory such that they can be released and versioned independently from the core Apache Airflow distribution. Cloud service release schedules often don’t align with the Airflow release schedule, which either results in incompatibility errors or prevents users from running the latest versions of certain providers. The separation in Airflow 2.0 allows the most up-to-date versions of Provider packages to be made generally available and removes their dependency on core Airflow releases.
It’s worth noting that some operators, including the Bash and Python Operators, remain in the core distribution given their widespread usage.
Simplified Kubernetes Executor
Airflow 2.0 includes a re-architecture of the Kubernetes Executor and KubernetesPodOperator, both of which allow users to dynamically launch tasks as individual Kubernetes Pods to optimize overall resource consumption.
Given the known complexity users previously had to overcome to successfully leverage the Executor and Operator, we drove a concerted effort towards simplification that ultimately involved removing over 3,000 lines of code. The changes incorporated in Airflow 2.0 make the Executor and Operator easier to understand and faster to execute, and offer far more flexibility in configuration.
Data Engineers will now have access to the full Kubernetes API to create a yaml ‘pod_template_file’ instead of being restricted to a partial set of configurations through parameters defined in the airflow.cfg file. We’ve also replaced the legacy settings in the executor_config dictionary with the pod_override parameter, which takes a Kubernetes V1Pod object for a clear 1:1 override setting.
UI/UX Improvements
Perhaps one of the most welcomed sets of changes brought by Airflow 2.0 will be the visual refresh of the Airflow UI.
In an effort to give users a more sophisticated and intuitive front-end experience, we’ve made over 30 UX improvements over the past few months, including a new “auto-refresh” toggle in the “Graph” view that enables users to follow task execution in real-time without having to manually refresh the page.
Other highlights include:
- A refreshed set of icons, colors, typography and top-level navigation
- Improved accessibility and legibility throughout
- Separation of actions + links in DAG navigation
- A button to reset the DAGs view (home) after performing a search
- Refactored loading of DAGs view (e.g. removal of “spinning wheels”)
Many more Airflow UI changes are expected beyond Airflow 2.0, but we’re certainly excited to have gotten a head start.
We’re thrilled to finally be sharing the Airflow 2.0 alpha release with the community. The scope of the features outlined above sets an incredibly exciting foundation on top of which developers all over the world will undoubtedly build.
The Airflow 2.0 beta release is imminent and will be followed by an official release candidate for the community to test and vote on. If you're interested in following the release cycle more closely, we encourage you to track the Airflow 2.0 Planning page or sign up for the Dev Mailing List. As always, you can also join the Apache Airflow Community in Slack or follow Astronomer and ApacheAirflow on Twitter.
If you're interested in testing an Airflow 2.0 pre-release, make sure to follow at least one of the avenues above. We'll be sharing migration guidelines, testing instructions and more over the coming weeks.
Finally, please join us in sincerely thanking the many Airflow contributors who worked tirelessly to reach this milestone. In no particular order, a huge thank you goes out to: Ash Berlin-Taylor, Kaxil Naik, Jarek Potiuk, Daniel Imberman, Tomek Urbaszek, Kamil Breguła, Gerard Casas Saez, Kevin Yang, James Timmins, Yingbo Wang, Qian Yu, Ryan Hamilton and the hundreds of others who have put their time and effort into making Airflow what it is today.
We're excited for what's next.