Version: Airflow 3.x

DAG Versioning and DAG Bundles

DAG versioning, the feature most frequently requested by the Airflow community, is available as of Airflow 3.0. It lets you track changes to your DAGs over time in the Airflow UI, so you can see the complete history of your DAG runs. DAG versioning is automatic and requires no setup. Additionally, versioned DAG bundles prevent version collisions during code pushes and let you rerun previous DAG runs using their original code.

This guide gives an introduction to DAG versioning and DAG bundles, including how to set up a versioned GitDagBundle.

Assumed knowledge

To get the most out of this guide, you should have an existing knowledge of:

Importance of DAG Versioning

In Airflow 2, both the Airflow UI and DAG execution always used the latest DAG code. This led to two major constraints:

  • No observability of previous DAG versions: If you changed a DAG, for example by removing a task, all history for previous runs of that task disappeared from the grid and graph views of the Airflow UI.
  • Version collisions during code pushes: If the code of a DAG changed while a DAG was still running, some tasks of the same run might have been executed using the older version while others used the newer version. This situation carried a significant risk of unintended consequences. For example:
    • The older DAG version might use task A to retrieve the name of table X in a relational database for data insertion and task B to insert data into that table.
    • If the DAG was updated to change the table from X to Y in the middle of a run, task A (from the old version) might pass the table name X while the updated task B inserted data intended for table Y into table X.

DAG bundles and DAG versioning were introduced in Airflow 3 to address these issues.

DAG Versioning vs DAG Bundles

Airflow 3 introduces two new concepts:

  • DAG versioning: Airflow now keeps track of changes to your DAGs. This is automatic and happens no matter which DAG bundle is used.
    • A new DAG version is created every time a DAG run is created for a DAG that has undergone a structural change since the last run. A structural change is any change that affects the serialized DAG (serdag). This includes changes to DAG or task parameters, task dependencies, and task IDs, as well as adding or removing tasks.
    • Each DAG run is associated with a DAG version that is visible in the Airflow UI.
    • Whenever a new DAG run is initiated, the scheduler uses the latest version of the DAG to create a run.
  • DAG bundle: A collection of files containing DAG code and supporting files. DAG bundles are named after the backend they use to store the DAG code. For example, the LocalDagBundle uses the local file system to store DAG code, while the GitDagBundle uses a Git repository.
    • Some DAG bundles are versioned, such as the GitDagBundle. A version of a DAG bundle is created by versioning the underlying backend. For example, a new version of the GitDagBundle is created by every Git commit, whether or not any DAGs change.
    • The default LocalDagBundle is not versioned.

DAG versioning is automatic in Airflow 3 and does not require any setup. Using a DAG bundle other than LocalDagBundle requires changes to your Airflow configuration.
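
To make the idea of a structural change more concrete, the following minimal sketch fingerprints a simplified description of a DAG. It is an illustration only, not Airflow's implementation: the dictionary layout and the `structure_fingerprint` helper are hypothetical stand-ins for the serialized DAG and for whatever comparison Airflow performs internally.

```python
import hashlib
import json


def structure_fingerprint(dag_structure: dict) -> str:
    """Hash a simplified, hypothetical description of a DAG's structure."""
    # Sort keys so that logically identical structures hash identically.
    canonical = json.dumps(dag_structure, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical stand-in for a serialized DAG: task IDs mapped to downstream task IDs.
v1 = {"dag_id": "example", "tasks": {"extract": ["load"], "load": []}}

# Same structure with keys written in a different order: not a structural change.
v1_reordered = {"tasks": {"load": [], "extract": ["load"]}, "dag_id": "example"}

# A task was added between extract and load: a structural change.
v2 = {
    "dag_id": "example",
    "tasks": {"extract": ["transform"], "transform": ["load"], "load": []},
}

print(structure_fingerprint(v1) == structure_fingerprint(v1_reordered))  # True
print(structure_fingerprint(v1) == structure_fingerprint(v2))  # False
```

In this toy model, only the second comparison would lead to a new DAG version the next time a run is created.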

DAG versioning

You can view DAG versions in several places in the Airflow UI. In the Options menu of the DAG graph, you can select which version of the DAG graph you want to display. The DAG details page also shows the latest available version of the DAG, which is used to create new DAG runs.

DAG versioning in the Airflow UI graph.

The DAG grid now retains the history for all tasks, even if they were removed in the latest version of the DAG. You can also select which version of the DAG code you want to display in the code tab.

DAG versioning in the Airflow UI grid and code tab.

DAG bundles

DAG bundles contain DAG code and supporting files. There are versioned and unversioned DAG bundles; the default DAG bundle (LocalDagBundle) is not versioned, while the GitDagBundle is versioned. Support for other DAG bundle backends is planned for future releases.

Versioned and unversioned DAG bundles behave differently in the following situations:

  • Clearing and rerunning a previous DAG run:
    • Unversioned DAG bundle: Airflow uses the current DAG code, i.e., the latest version of the DAG.
    • Versioned DAG bundle: The scheduler uses the DAG version that existed at the time of the DAG run to determine which task instances to create. The workers use the code contained in the DAG bundle version that existed at the time of the original DAG run to execute their tasks.
  • Rerunning individual tasks of a previous DAG run:
    • Unversioned DAG bundle: Airflow uses the latest version of the DAG for tasks that are rerun.
    • Versioned DAG bundle: Airflow uses the code of the task contained in the DAG bundle version at the time of the original DAG run.
  • Changing code while a DAG is running:
    • Unversioned DAG bundle: The DAG always uses the current DAG code at the time it starts a task, as in Airflow 2.
    • Versioned DAG bundle: The DAG run finishes using the bundle version it started with.
  • Making code changes:
    • Unversioned DAG bundle: Every structural change to the DAG creates a new DAG version.
    • Versioned DAG bundle: Every committed or saved structural change to a DAG creates a new version of that DAG. This means with every new bundle version, all DAGs that have had structural changes will also have a new DAG version.

Note

For DAGs running on Astro, Astronomer’s managed Airflow service, a specialized versioned DAG bundle is configured automatically, without any need for additional setup. See the Astronomer documentation for more information.

See the Airflow documentation on DAG bundles for more information, including how to create a custom DAG bundle.
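
The rerun behavior described in this section can be summarized in a small toy model. The sketch below does not use any Airflow API: `ToyDagRun` and `code_version_for_rerun` are hypothetical names, and the version strings are made-up identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDagRun:
    run_id: str
    # Bundle version recorded when the run was created.
    # None models an unversioned bundle such as the LocalDagBundle.
    bundle_version: Optional[str]


def code_version_for_rerun(run: ToyDagRun, current_code: str) -> str:
    """Return which code a cleared run would execute in this toy model."""
    if run.bundle_version is None:
        # Unversioned bundle: only the current code is available.
        return current_code
    # Versioned bundle: the run stays pinned to the version it started with.
    return run.bundle_version


local_run = ToyDagRun("run_1", bundle_version=None)
git_run = ToyDagRun("run_1", bundle_version="abc123")

print(code_version_for_rerun(local_run, current_code="def456"))  # def456
print(code_version_for_rerun(git_run, current_code="def456"))  # abc123
```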

Set up a GitDagBundle

To directly fetch your DAG code from a GitHub repository, you can use the GitDagBundle. This bundle is versioned. To configure a GitDagBundle for an Astro CLI project, follow these steps:

  1. Push your DAG code to a GitHub repository.

  2. Install the git package in your Astro project by adding it to your packages.txt file.

  3. Install the Airflow Git provider by adding the following to your requirements.txt file. Replace <version> with the latest version of the provider package.

    apache-airflow-providers-git==<version>
  4. Define a Git connection using an environment variable in your .env file. Replace <account> and <repo> with the name of your GitHub account and repository, respectively. Replace github_pat_<your-token> with your GitHub personal access token. Note that the token needs to have read and write access to the code of the repository.

    AIRFLOW_CONN_MY_GIT_CONN='{
        "conn_type": "git",
        "host": "https://github.com/<account>/<repo>.git",
        "password": "github_pat_<your-token>"
    }'
  5. Change the [dag_processor].dag_bundle_config_list configuration to use a GitDagBundle by setting the associated environment variable in your .env file. Replace your-bundle-name with the name you want to give your DAG bundle. The subdir should point to the directory in your GitHub repository where your DAG code is stored. The tracking_ref should point to the branch you want to use.

    AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST='[
        {
            "name": "your-bundle-name",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {
                "git_conn_id": "my_git_conn",
                "subdir": "dags",
                "tracking_ref": "main"
            }
        }
    ]'
  6. Restart your project using astro dev restart to apply the changes.
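
Hand-written JSON inside an environment variable is easy to get wrong (a trailing comma or a mismatched connection ID, for example). As an optional aid, the following sketch generates the two .env lines from Python dictionaries so that the JSON is always valid. The helper `env_line` is hypothetical, and the values are the placeholders from the steps above. The sketch assumes that none of the values contain a single quote.

```python
import json

# Connection ID referenced by the bundle; Airflow reads it from AIRFLOW_CONN_<CONN_ID in upper case>.
git_conn_id = "my_git_conn"

# Placeholders from the steps above; replace them with your own values.
git_connection = {
    "conn_type": "git",
    "host": "https://github.com/<account>/<repo>.git",
    "password": "github_pat_<your-token>",
}

bundle_config = [
    {
        "name": "your-bundle-name",
        "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
        "kwargs": {
            "git_conn_id": git_conn_id,
            "subdir": "dags",
            "tracking_ref": "main",
        },
    }
]


def env_line(name: str, value) -> str:
    """Render NAME='<JSON>' on a single line for a .env file."""
    return f"{name}='{json.dumps(value)}'"


conn_line = env_line(f"AIRFLOW_CONN_{git_conn_id.upper()}", git_connection)
bundle_line = env_line("AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST", bundle_config)

print(conn_line)
print(bundle_line)
```

Deriving the environment variable name from `git_conn_id` keeps the connection and the bundle configuration in sync if you rename the connection.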

Programmatic DAGs and DAG bundles

If you create your DAGs programmatically, that is, you use Python code to generate your DAG code, and you want to use a versioned DAG bundle, make sure that the DAG structure never changes without a corresponding DAG bundle change.

The reason is that, when a DAG run is cleared, the scheduler uses the DAG version that existed at the time of the DAG run to determine which task instances to create, while the workers use the code contained in the DAG bundle version that existed at the time of the original DAG run to execute their tasks. In the rare case where programmatic DAG creation leads to a change in DAG structure, and therefore in DAG version, without a DAG bundle change, the scheduler and the workers use different DAG versions to create and execute the tasks. This can lead to unexpected behavior.

Examples of programmatic DAG creation that are safe to use with a versioned DAG bundle include using dag-factory or creating tasks in a loop that only changes when the code changes:


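from airflow.sdk import task

# this list only changes when the code changes
my_tables = ["TABLE_A", "TABLE_B", "TABLE_C"]

# place this loop inside your DAG definition
for my_table in my_tables:

    @task(
        # use the loop variable so that every task gets a unique ID
        task_id=f"modify_{my_table}",
    )
    def modify_table(my_table):
        # do something with the table
        pass

    modify_table(my_table=my_table)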

If you are using top-level code that connects to an external system (a practice that we caution against, see Avoid top-level code in your DAG file), you might have a change in DAG structure without a change in the DAG bundle. An example would be if the list my_tables from the above example is created by querying a database.
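
One possible way to keep such a list tied to the bundle version is to store it in a file that is committed together with the DAG code, so that the list can only change through a commit. The following is a minimal sketch under that assumption. The file name `tables.json` and the helper `load_tables` are hypothetical, and the temporary directory only stands in for the DAG folder of your repository.

```python
import json
import tempfile
from pathlib import Path


def load_tables(config_path: Path) -> list[str]:
    """Read table names from a JSON file that is committed alongside the DAG code."""
    tables = json.loads(config_path.read_text())
    if not isinstance(tables, list) or not all(isinstance(t, str) for t in tables):
        raise ValueError("tables.json must contain a list of strings")
    return tables


# Demonstration only: write a hypothetical tables.json to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "tables.json"
    config_path.write_text(json.dumps(["TABLE_A", "TABLE_B", "TABLE_C"]))
    my_tables = load_tables(config_path)

# In this setup, the task IDs can only change through a commit that edits tables.json.
task_ids = [f"modify_{table}" for table in my_tables]
print(task_ids)  # ['modify_TABLE_A', 'modify_TABLE_B', 'modify_TABLE_C']
```

Because the file is part of the repository, every change to the table list also produces a new bundle version, which keeps the DAG version and the bundle version aligned.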
