- Everything you need to know about Airflow 2.3 article
- Airflow 2.3 release notes
- Dynamic tasks guide
- Dynamic task mapping Airflow docs
- “Dynamic Dags — The New Horizon” talk by Ash Berlin-Taylor at the Airflow Summit 2022
1. What is Apache Airflow?
Apache Airflow is one of the world’s most popular data orchestration tools — an open-source platform that lets you programmatically author, schedule, and monitor your data pipelines.
Apache Airflow was created by Maxime Beauchemin in late 2014, and brought into the Apache Software Foundation’s Incubator Program two years later. In 2019, Airflow was announced as a Top-Level Apache Project, and it is now considered the industry’s leading workflow orchestration solution.
Key benefits of Airflow:
- Proven core functionality for data pipelining
- An extensible framework
- A large, vibrant community
2. Airflow Updates Timeline
3. What’s new in Airflow 2.3?
- 2 Major New Features
- Over 30 Minor New Features
- Over 200 Merged PRs
4. Major New Features
Dynamic Task Mapping
Dynamic task mapping allows DAGs to take a set of inputs at run time and create a separate task for each one.
This is helpful for use cases like:
- ETL for processing files or tables, without knowing their final number in advance.
- Running a particular task per customer, when the list of customers changes frequently.
- ML applications like hyperparameter tuning, or training a different set of models for each input dataset.
Benefits of Dynamic Task Mapping:
- Implement tasks that change at run time based on the state of previous tasks
- Maintain task atomicity
- Reduce top-level code in your DAG files
- Maintain history of tasks even when that particular input no longer exists
- Avoid the pitfalls of dynamic DAGs at a high scale
Dynamic Task Mapping is simple to implement with two new functions:
- expand(): This function passes the parameter or parameters that you want to map on. A separate parallel task will be created for each input.
- partial(): This function passes any parameters that remain constant across all mapped tasks generated by expand().
The following task code example uses both of these functions to dynamically generate 3 task runs:

```python
@task
def add(x: int, y: int):
    return x + y

added_values = add.partial(y=10).expand(x=[1, 2, 3])
```
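The semantics of these two functions can be mimicked in plain Python. This sketch uses functools.partial and a list comprehension as an analogy only; it is not the Airflow API:

```python
from functools import partial


def add(x: int, y: int) -> int:
    return x + y


# analogous to add.partial(y=10): fix the argument that stays constant
add_ten = partial(add, y=10)

# analogous to .expand(x=[1, 2, 3]): one "mapped task" per input
added_values = [add_ten(x=x) for x in [1, 2, 3]]
# added_values == [11, 12, 13], one result per mapped task instance
```

In Airflow itself, each element of the expanded list becomes its own task instance with its own logs and history, rather than an entry in a local list.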
Grid View
The grid view is a UI update that replaces the tree view, with a focus on task history. It shows DAG runs and task statuses, leaves dependency lines to the graph view, and handles task groups better than the tree view did.
Example of an ELT task grid view:
Benefits of the grid view:
- Focus on task history
- More granularity and control than the tree view
- Dependencies and structure kept in the graph view
- DAG run durations visible, to quickly spot performance changes
- Summaries for task groups
- Play icon to better indicate manually triggered DAG runs
5. Demo
Begins at 15:15 in the webinar video.
Demo example: A task called “make_list” builds the list to map over, returning between two and four random integers. That list is then passed to the downstream task, which is mapped over it accordingly.
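The demo’s upstream task can be sketched in plain Python. Here make_list is a hypothetical stand-in for the demo’s task function, shown without the Airflow @task decorator:

```python
import random


def make_list() -> list:
    # Return between two and four random integers, so the number of
    # mapped downstream tasks varies from one DAG run to the next.
    n = random.randint(2, 4)
    return [random.randint(0, 100) for _ in range(n)]


values = make_list()
```

In the demo, the downstream task calls .expand() on this task’s output, so each integer in the returned list produces one mapped task instance.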
The graph view of two tasks:
More examples discussed in the Demo and Q&A.
6. Minor New Features
- BranchPythonOperator Decorator added
- ShortCircuitOperator configurability updated to respect downstream trigger rules
- DummyOperator replaced with EmptyOperator
- New show_return_value_in_logs parameter for PythonOperator
- SmoothOperator added
- The requirement that custom connection UI fields be prefixed removed
- JSON serialization for connections enabled
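With JSON serialization enabled, a connection can be supplied as a JSON document rather than an Airflow URI. A sketch; the environment-variable name and field values here are illustrative:

```shell
# Define a connection via environment variable using JSON instead of a URI.
export AIRFLOW_CONN_MY_POSTGRES='{
    "conn_type": "postgres",
    "host": "localhost",
    "login": "user",
    "password": "secret",
    "port": 5432
}'
```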
Other great additions
- New ALL_SKIPPED trigger rule
- New Events custom timetable
- New airflow db downgrade CLI command
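A sketch of the new command, for rolling the metadata database schema back before downgrading Airflow itself (verify flag names against your installed version):

```shell
# Downgrade the metadata database schema to match an older Airflow version.
airflow db downgrade --to-version 2.2.5
```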
- dag_id_pattern parameter added to the /dags endpoint
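A sketch of the new dag_id_pattern parameter on the stable REST API; the host, port, and credentials below are placeholders:

```shell
# List only DAGs whose dag_id contains "etl".
curl -u admin:admin "http://localhost:8080/api/v1/dags?dag_id_pattern=etl"
```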