Airflow 2.5, which drops today, brings valuable new features — including improvements to Airflow’s dynamic task mapping and data-dependent scheduling — along with noteworthy performance gains and a long list of bug fixes.
A faster release cadence is a win-win for Apache Airflow users and maintainers alike. Users get rapid access to valuable new features, along with stability-enhancing bug fixes and security patches, with everything consolidated into a single deliverable. And Apache Airflow contributors can focus on building and finalizing a small number of incremental changes in each significant release. Short cycles keep these releases lean and tight; long cycles tend to lead to large releases, introducing instability and feature bloat.
Airflow 2.5: What’s New and Improved
Airflow 2.5 is proof positive of the benefits of shorter release cycles. One of its headline improvements is a redesigned airflow dags test feature that is an order of magnitude faster than the earlier implementation, and much more useful.
Previously, airflow dags test would trigger a cold start of a local instance of the Airflow scheduler — a time-consuming activity that delayed DAG test runs. On top of this, the task logs generated by the airflow dags test command were buried along with other files inside a confusing folder hierarchy, forcing authors to use tools like find or grep to search for them.
With the revamped airflow dags test command, Airflow developers went back to first principles, in effect creating a local Airflow for loop that checks for DAG tests and automatically runs them, looping continuously and printing errors to the Airflow CLI’s console. “By doing that, we saw a massive speed up, because you don’t have to worry about things like SLAs, timetables, or sharing resources with other DAGs,” says Daniel Imberman, Strategy Engineer with Astronomer, who led the effort to redesign airflow dags test. The command’s output is a lot more useful, too, meaning authors no longer have to hunt for the information they need.
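To give a sense of the workflow, here is a minimal sketch of invoking the revamped command from a terminal with Airflow 2.5 installed; the DAG id and date below are placeholders, not names from a real project:

```shell
# Run a single fast, local test of a DAG (Airflow 2.5+).
# "example_dag" and the logical date are placeholders for your own values.
airflow dags test example_dag 2022-12-02
```

Airflow 2.5 also exposes the same fast test loop in Python: adding an `if __name__ == "__main__": dag.test()` block to a DAG file lets you run it directly with a plain `python my_dag.py` and debug it like any other script.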
Another new feature in the 2.5 release lets users annotate task instances, enabling ops personnel to append useful notes to task failures or other anomalies, documenting the steps they took to deal with them. For example, if a daily data preparation DAG run fails, an ops engineer could rerun the task manually, appending a note that says “Scheduled a manual run because of a previous failure.” This is useful for organizations that implement operational controls for policy or compliance purposes.
Airflow’s data-dependent scheduling feature, which debuted less than three months ago, gets a slew of enhancements in Airflow 2.5. It’s not only much easier to search for datasets, but the Airflow UI now displays useful, in-context information about them — for example, about when a dataset was last updated, or how many times it’s been updated. In addition, the dataset dependency view in the Airflow UI is less cluttered: previously, this view would display upstream and downstream dependencies for all datasets. Now, clicking on a specific dataset shows only the datasets that are directly upstream or downstream from it.
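For context on the feature these UI enhancements build on, here is a rough sketch of what a dataset-driven pipeline looks like; the dataset URI, DAG ids, and task bodies are illustrative assumptions, not examples from the release itself:

```python
# Illustrative DAG-definition sketch only; requires Apache Airflow 2.4+.
# The dataset URI, DAG ids, and task callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

daily_report = Dataset("s3://example-bucket/daily_report.csv")  # hypothetical URI

# Producer DAG: listing the dataset in "outlets" marks it as updated
# whenever this task succeeds.
with DAG("produce_report", start_date=datetime(2022, 12, 1), schedule="@daily") as producer:
    PythonOperator(
        task_id="write_report",
        python_callable=lambda: print("writing report"),
        outlets=[daily_report],
    )

# Consumer DAG: instead of a cron schedule, it runs whenever the
# dataset above is updated — this is data-dependent scheduling.
with DAG("consume_report", start_date=datetime(2022, 12, 1), schedule=[daily_report]) as consumer:
    PythonOperator(
        task_id="read_report",
        python_callable=lambda: print("reading report"),
    )
```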
Airflow 2.5 also introduces dozens of improvements to Airflow’s dynamic task mapping feature and a new task log auto tailing feature, which automatically refreshes the task log view in Airflow’s UI as new log entries get added. (Previously, users had to manually refresh if they wanted to see the updated task log.) And a subtle UI improvement now allows users to adjust the size of the Airflow grid view, such that their adjustments “stick” between sessions.
Why You Need to Stay Up to Date with Airflow
Short, fast release cycles are growing more popular in software engineering. Kubernetes (K8s) and Apache Kafka ship a minimum of three new releases each year, and other projects, like Apache Spark (which releases twice annually), have also retooled to ship at an accelerated cadence.
A less obvious benefit of a fast release cadence is that it gives Apache Airflow maintainers a pattern they can use to continuously build, introduce, and iteratively improve features. The proof of this is in Airflow’s dynamic task mapping capability, which was a focus of developer time and effort in versions 2.4 and 2.5. Compared to the stripped-down debut of this feature in Airflow 2.3, dynamic task mapping in Airflow 2.5 supports more and varied input types, and boasts better interoperability with XCom.
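To make the idea concrete, here is a minimal, library-free sketch of what dynamic task mapping does conceptually — the number of mapped task instances is decided at runtime from an upstream task’s output rather than fixed when the DAG is written. (In real Airflow this is the task `.expand()` API; the function names below are hypothetical.)

```python
# Library-free sketch of the idea behind dynamic task mapping.
# Function names are hypothetical, for illustration only.

def fetch_filenames():
    # Upstream task: its output is only known at run time.
    return ["a.csv", "b.csv", "c.csv"]

def process(filename):
    # Mapped task: one instance is created per upstream output item.
    return f"processed {filename}"

def run_mapped(upstream, mapped_task):
    # The scheduler-like step: fan out over whatever upstream produced,
    # however many items that turns out to be.
    return [mapped_task(item) for item in upstream()]

results = run_mapped(fetch_filenames, process)
print(results)  # one result per dynamically created task instance
```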
From simple beginnings, and in a surprisingly short time, basic Airflow capabilities tend to evolve to become more complex, useful, and powerful.
Given these facts, it’s important to stay as close to current-stable Airflow as possible. Falling behind means missing out on valuable features, important stability and performance improvements, and critical security patches. Organizations that depend on Airflow, K8s, Kafka, and other rapidly evolving open-source projects are learning to develop the capabilities they need to deploy new releases at a regular cadence.
We’d be remiss if we didn’t point out that Astro, the fully managed, cloud-native service powered by Airflow, makes managing Airflow upgrades easy. Today, almost two-thirds of Astro customers are running Airflow 2.4, which just dropped in September. In Astro, Airflow upgrades can be performed in-place, and the upgrading process is as simple as updating a single line in a Dockerfile. Astro provides same-day access to new Airflow capabilities and an easy way to start using them. Astro Runtime 7.0 (based on Airflow 2.5) is available in Astro today.
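As an illustration of that single-line upgrade, an Astro project Dockerfile pins the Runtime image, and upgrading Airflow means bumping its tag; the tag shown is an assumption based on Astro Runtime 7.0 corresponding to Airflow 2.5:

```dockerfile
# Astro project Dockerfile: upgrading Airflow means changing this one line.
# Tag shown is illustrative; Astro Runtime 7.0 is based on Airflow 2.5.
FROM quay.io/astronomer/astro-runtime:7.0.0
```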