Datasets and data-aware scheduling in Airflow

With Datasets, DAGs that access the same data can have explicit, visible relationships, and DAGs can be scheduled based on updates to these datasets. This feature helps make Airflow data-aware and expands Airflow scheduling capabilities beyond time-based methods such as cron.

Datasets can help resolve common issues. For example, consider a data engineering team with a DAG that creates a dataset and a machine learning team with a DAG that trains a model on the dataset. Using datasets, the machine learning team’s DAG runs only when the data engineering team’s DAG has produced an update to the dataset.

In this guide, you’ll learn about datasets in Airflow and how to use them to implement triggering of DAGs based on dataset updates.

Datasets are a separate feature from object storage, which allows you to interact with files in cloud and local object storage systems. To learn more about using Airflow to interact with files, see Use Airflow object storage to interact with cloud storage in an ML pipeline.

Other ways to learn

There are multiple resources for learning about this topic. See also:

Astronomer Academy: Airflow: Datasets module.
Webinar: Datasets and data-aware scheduling in Airflow.
Use case: Orchestrate machine learning pipelines with Airflow datasets.

Assumed knowledge

To get the most out of this guide, you should have an existing knowledge of:

Airflow scheduling concepts. See Schedule DAGs in Airflow.

Why use Airflow datasets?

Datasets allow you to define explicit dependencies between DAGs and updates to your data. This helps you to:

Standardize communication between teams. Datasets can function like an API to communicate when data in a specific location has been updated and is ready for use.
Reduce the amount of code necessary to implement cross-DAG dependencies. Even if your DAGs don’t depend on data updates, you can create a dependency that triggers a DAG after a task in another DAG updates a dataset.
Get better visibility into how your DAGs are connected and how they depend on data. The Datasets tab in the Airflow UI shows a graph of all dependencies between DAGs and datasets in your Airflow environment.
Reduce costs, because datasets don’t use a worker slot in contrast to sensors or other implementations of cross-DAG dependencies.
Create cross-deployment dependencies using the Airflow REST API. Astro customers can use the Cross-deployment dependencies best practices documentation for guidance.
(Airflow 2.9+) Create complex data-driven schedules using Conditional Dataset Scheduling and Combined Dataset and Time-based Scheduling.

Dataset concepts

You can define datasets in your DAG code and use them to create cross-DAG or even cross-Deployment dependencies. This section covers definitions for dataset terminology, as well as general information on how to use them.

Dataset terminology

Airflow uses the following terms related to the datasets feature:

Dataset: an object that is defined by a unique URI. Airflow parses the URI for validity and there are some constraints on how you can define it. If you want to avoid validity parsing, prefix your dataset name with x- for Airflow to treat it as a string. See What is a valid URI? for detailed information.
Dataset event: an event that is attached to a dataset and created whenever a producer task updates that particular dataset. A dataset event is defined by being attached to a specific dataset plus the timestamp of when a producer task updated the dataset. Optionally, a dataset event can contain an extra dictionary with additional information about the dataset or dataset event.
Dataset schedule: the schedule of a DAG that is triggered as soon as dataset events for one or more datasets are created. All datasets a DAG is scheduled on are shown in the DAG graph in the Airflow UI, as well as reflected in the dependency graph of the Datasets tab.
Producer task: a task that produces updates to one or more datasets provided to its outlets parameter, creating dataset events when it completes successfully.
Dataset expression: (Airflow 2.9+) a logical expression using AND (&) and OR (|) operators to define the schedule of a DAG scheduled on updates to several datasets.
Queued dataset event: It is common to have DAGs scheduled to run as soon as a set of datasets have received at least one update each. While there are still dataset events missing to trigger the DAG, all dataset events for other datasets the DAG is scheduled on are queued dataset events. A queued dataset event is defined by its dataset, timestamp and the DAG it is queuing for. One dataset event can create a queued dataset event for several DAGs. As of Airflow 2.9, you can access queued Dataset events for a specific DAG or a specific dataset programmatically, using the Airflow REST API.
DatasetAlias (Airflow 2.10+): an object that can be associated with one or more datasets and used to create schedules based on datasets created at runtime. A dataset alias is defined by a unique name. See Use dataset aliases.
Metadata (Airflow 2.10+): a class to attach extra information to a dataset from within the producer task. This functionality can be used to pass dataset-related metadata between tasks. See Attaching information to a dataset event.

Two parameters relating to Airflow datasets exist in all Airflow operators and decorators:

Outlets: a task parameter that contains the list of datasets a specific task produces updates to, as soon as it completes successfully. All outlets of a task are shown in the DAG graph in the Airflow UI, as well as reflected in the dependency graph of the Datasets tab as soon as the DAG code is parsed, that is, independently of whether or not any dataset events have occurred. Note that Airflow is not yet aware of the underlying data. It is up to you to determine which tasks should be considered producer tasks for a dataset. As long as a task has an outlet dataset, Airflow considers it a producer task even if that task doesn’t operate on the referenced dataset.
Inlets: a task parameter that contains the list of datasets a specific task has access to, typically to access extra information from related dataset events. Defining inlets for a task does not affect the schedule of the DAG containing the task and the relationship is not reflected in the Airflow UI.

To summarize, tasks produce updates to datasets given to their outlets parameter, and this action creates dataset events. DAGs can be scheduled based on dataset events created for one or more datasets, and tasks can be given access to all events attached to a dataset by defining the dataset as one of their inlets. A dataset is defined as an object in the Airflow metadata database as soon as it is referenced in either the outlets parameter of a task or the schedule of a DAG.

Using datasets

When you work with datasets, keep the following considerations in mind:

Dataset events are only registered by DAGs or listeners in the same Airflow environment. If you want to create cross-Deployment dependencies with datasets, you need to use the Airflow REST API to create a dataset event in the Airflow environment where your downstream DAG is located. See Cross-deployment dependencies for an example implementation on Astro.
Airflow monitors datasets only within the context of DAGs and tasks. It doesn’t monitor updates to datasets that occur outside of Airflow. That is, Airflow won’t notice if you manually add a file to an S3 bucket referenced by a dataset. To create Airflow dependencies based on outside events, use Airflow sensors.
The Datasets tab in the Airflow UI provides an overview of recent dataset events and existing datasets, as well as a graph showing all dependencies between DAGs containing producing tasks, datasets, and consuming DAGs. See Datasets tab for more information.

Listening for dataset changes

As of Airflow 2.8, you can use listeners to enable Airflow to run any code when certain dataset events occur anywhere in your Airflow instance. There are two listener hooks for the following events:

on_dataset_created
on_dataset_changed

For examples, refer to our Create Airflow listeners tutorial. Dataset Events listeners are an experimental feature.

Dataset definition

A dataset is defined as an object in the Airflow metadata database as soon as it is referenced in either the outlets parameter of a task or the schedule of a DAG. Airflow 2.10 added the ability to create dataset aliases, see Use Dataset Aliases.

Basic Dataset definition

The simplest dataset schedule is one DAG scheduled based on updates to one dataset that is updated by one task. In this example, the my_producer_task task in the my_producer_dag DAG produces updates to the s3://my-bucket/my-key/ dataset, creating attached dataset events, and the my_consumer_dag DAG is scheduled to run once for every dataset event created.

First, provide the dataset to the outlets parameter of the producer task.

from airflow.decorators import dag, task
from airflow.datasets import Dataset

@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def my_producer_dag():

    @task(outlets=[Dataset("s3://my-bucket/my-key/")])
    def my_producer_task():
        pass

    my_producer_task()

my_producer_dag()

Traditional

from airflow.models.dag import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="my_producer_dag",
    start_date=None,
    schedule=None,
    catchup=False,
):

    def my_function():
        pass

    my_task = PythonOperator(
        task_id="my_producer_task",
        python_callable=my_function,
        outlets=[Dataset("s3://my-bucket/my-key/")]
    )

You can see the relationship between the DAG containing the producing task (my_producer_dag) and the dataset in the Dependency Graph located in the Datasets tab of the Airflow UI. Note that this screenshot is using Airflow 2.10 and the UI might look different in previous versions.

Screenshot of the Dependency Graph of the Datasets tab showing my_producer_dag connected to the s3://my-bucket/my-key/ dataset.

In Airflow 2.9+ the graph view of the my_producer_dag shows the dataset as well.

Screenshot of a DAG Graph showing my_producer_task connected to the s3://my-bucket/my-key/ dataset.

Next, schedule the my_consumer_dag to run as soon as a new dataset event is produced to the s3://my-bucket/my-key/ dataset.

from airflow.decorators import dag
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator
from pendulum import datetime

@dag(
    start_date=datetime(2024, 8, 1),
    schedule=[Dataset("s3://my-bucket/my-key/")],
    catchup=False,
)
def my_consumer_dag():

    EmptyOperator(task_id="empty_task")

my_consumer_dag()

Traditional

from airflow.models.dag import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator
from pendulum import datetime

with DAG(
    dag_id="my_consumer_dag",
    start_date=datetime(2024, 8, 1),
    schedule=[Dataset("s3://my-bucket/my-key/")],
    catchup=False,
):

    EmptyOperator(task_id="empty_task")

You can see the relationship between the DAG containing the producing task (my_producer_dag), the consuming DAG my_consumer_dag and the dataset in the Dependency Graph located in the Datasets tab of the Airflow UI. Note that this screenshot is using Airflow 2.10 and the UI might look different in previous versions.

Screenshot of the Dependency Graph of the Datasets tab showing my_producer_dag connected to the s3://my-bucket/my-key/ dataset which is connected to my_consumer_dag

In Airflow 2.9+ the graph view of the my_consumer_dag shows the dataset as well.

Screenshot of a DAG Graph showing my_producer_task connected to the s3://my-bucket/my-key/ dataset.

After unpausing the my_consumer_dag, every successful completion of the my_producer_task task triggers a run of the my_consumer_dag.

Screenshot of the DAGs page with one run each of the my_producer_dag and my_consumer_dag as well as the dataset schedule displayed

In Airflow 2.10+ the producing task will list the Dataset Events it caused in its details page, including links to the Triggered Dag Runs.

Screenshot of the Details tab of the my_producer_task showing one Dataset event of the s3://my-bucket/my-key/ with one Triggered Dag Run

The triggered DAG run of the my_consumer_dag also lists the dataset event, including a link to the source dag from within which the dataset event was created.

Screenshot of the Details tab of the DAG run of the my_consumer_dag showing one Dataset event of the s3://my-bucket/my-key/

Use dataset aliases

In Airflow 2.10+ you have the option to create dataset aliases to schedule DAGs based on datasets with URIs generated at runtime. A dataset alias is defined by a unique name string and can be used in place of a regular dataset in outlets and schedules. Any number of dataset events updating different datasets can be attached to a dataset alias.

There are two ways to add a dataset event to a dataset alias:

Using the Metadata class.
Using outlet_events pulled from the Airflow context.

See the code below for examples. Note how the URI of the dataset is determined at runtime inside the producing task.

Metadata

# from airflow.decorators import task
# from airflow.datasets import Dataset, DatasetAlias
# from airflow.datasets.metadata import Metadata

my_alias_name = "my_alias"

@task(outlets=[DatasetAlias(my_alias_name)])
def attach_event_to_alias_metadata():
    bucket_name = "my-bucket"  # determined at runtime, for example based on upstream input
    yield Metadata(
        Dataset(f"s3://{bucket_name}/my-task"),
        extra={"k": "v"},  # extra has to be provided, can be {}
        alias=my_alias_name,
    )

attach_event_to_alias_metadata()

Outlet events

# from airflow.decorators import task
# from airflow.datasets import Dataset, DatasetAlias

my_alias_name = "my_alias"

@task(outlets=[DatasetAlias(my_alias_name)])
def attach_event_to_alias_context(**context):
    bucket_name = "my-other-bucket"  # determined at runtime, for example based on upstream input
    outlet_events = context["outlet_events"]
    outlet_events[my_alias_name].add(
        Dataset(f"s3://{bucket_name}/my-task"), extra={"k": "v"}
    )  # extra is optional

attach_event_to_alias_context()

In the consuming DAG you can use a dataset alias in place of a regular dataset.

from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from airflow.datasets import DatasetAlias  # the schedule below uses DatasetAlias, not Dataset
from pendulum import datetime

my_alias_name = "my_alias"

@dag(
    start_date=datetime(2024, 8, 1),
    schedule=[DatasetAlias(my_alias_name)],
    catchup=False,
)
def my_consumer_dag():

    EmptyOperator(task_id="empty_task")

my_consumer_dag()

Since the dataset event is generated at runtime with a dynamic URI, Airflow doesn’t know in advance which dataset will trigger the run of the my_consumer_dag. The Airflow UI displays Unresolved DatasetAlias as the DAG schedule for DAGs that are only scheduled on aliases that have never had a dataset event attached to them.

Screenshot of the DAGs view showing an Unresolved DatasetAlias schedule on my_consumer_dag.

Once the my_producer_dag containing the attach_event_to_alias_metadata task completes successfully, reparsing of all DAGs scheduled on the dataset alias my_alias is automatically triggered. This reparsing step attaches the s3://my-bucket/my-task dataset to the my_alias dataset alias and the schedule resolves, triggering one run of the my_consumer_dag.

Screenshot of the DAGs view showing the resolved dataset schedule and one successful run each for the my_producer_dag and my_consumer_dag.

Any further dataset event for the s3://my-bucket/my-task dataset will now trigger the my_consumer_dag. If you attach dataset events for several datasets to the same dataset alias, a DAG scheduled on that dataset alias will run as soon as any of the datasets that were ever attached to the dataset alias receive an update.
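The resolution behavior described above can be pictured with a small plain-Python model. This is only an illustrative sketch and not how Airflow implements aliases: the AliasModel class and its method names are hypothetical, and the model tracks nothing except which dataset URIs have ever been attached to an alias.

```python
class AliasModel:
    """Toy model of a dataset alias (hypothetical, not an Airflow class)."""

    def __init__(self, name):
        self.name = name
        self.attached_uris = set()  # every dataset URI ever attached to the alias

    def attach(self, uri):
        """Record that a dataset event for `uri` was attached to this alias."""
        self.attached_uris.add(uri)

    def is_resolved(self):
        """An alias without any attached dataset counts as unresolved."""
        return bool(self.attached_uris)

    def triggers_consumer(self, updated_uri):
        """A consumer scheduled on the alias runs when any attached dataset updates."""
        return updated_uri in self.attached_uris


alias = AliasModel("my_alias")
print(alias.is_resolved())  # False: no dataset event attached yet

alias.attach("s3://my-bucket/my-task")
alias.attach("s3://my-other-bucket/my-task")
print(alias.triggers_consumer("s3://my-bucket/my-task"))  # True
print(alias.triggers_consumer("s3://unrelated/file"))  # False
```

In this model, attaching a second dataset never removes the first one, which mirrors the statement that any dataset ever attached to the alias can trigger the consuming DAG.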

See Dynamic data events emitting and dataset creation through DatasetAlias for more information and examples of using dataset aliases.

To use Dataset Aliases with traditional operators, you need to attach the dataset event to the alias inside the operator logic. If you are using operators besides the PythonOperator, you can either do so in a custom operator’s .execute method or by passing a post_execute callable to existing operators (experimental). Use outlet_events when attaching dataset events to aliases in traditional or custom operators. Note that for deferrable operators, attaching a dataset event to an alias is only supported in the execute_complete or post_execute method.

# from airflow.operators.bash import BashOperator
# from airflow.datasets import Dataset, DatasetAlias

my_alias_name = "my_alias"

def _attach_event_to_alias(context, result):  # result = the return value of the execute method
    # use any logic to determine the URI
    uri = "s3://my-bucket/my_file.txt"
    context["outlet_events"][my_alias_name].add(Dataset(uri))

BashOperator(
    task_id="t2",
    bash_command="echo hi",
    outlets=[DatasetAlias(my_alias_name)],
    post_execute=_attach_event_to_alias,  # using the post_execute parameter is experimental
)

Click to view an example of a custom operator attaching a dataset event to a dataset alias.

"""
### Dataset Alias in a custom operator
"""

from airflow.decorators import dag
from airflow.datasets import Dataset, DatasetAlias
from pendulum import datetime

# import the operator to inherit from
from airflow.models.baseoperator import BaseOperator

my_alias_name = "my-alias"


# custom operator producing to a dataset alias
class MyOperator(BaseOperator):
    """
    Simple example operator that attaches a dataset event to a dataset alias.
    :param my_bucket_name: (str) The name of the bucket to use in the dataset URI.
    :param my_alias_name: (str) The name of the dataset alias to attach the event to.
    """

    # define the .__init__() method that runs when the DAG is parsed
    def __init__(self, my_bucket_name, my_alias_name, *args, **kwargs):
        # initialize the parent operator
        super().__init__(*args, **kwargs)
        # assign instance attributes
        self.my_bucket_name = my_bucket_name
        self.my_alias_name = my_alias_name

    def execute(self, context):

        # add your custom operator logic here

        # use any logic to derive the dataset URI
        my_uri = f"s3://{self.my_bucket_name}/my_file.txt"
        context["outlet_events"][self.my_alias_name].add(Dataset(my_uri))

        return "hi :)"

    # define the .post_execute() method that runs after the execute method (optional)
    # result is the return value of the execute method
    def post_execute(self, context, result=None):
        # write to Airflow task logs
        self.log.info("Post-execution step")

        # It is also possible to add events to the alias in the post_execute method


@dag(
    start_date=datetime(2024, 8, 1),
    schedule=None,
    catchup=False,
    doc_md=__doc__,
)
def dataset_alias_custom_operator():

    MyOperator(
        task_id="t1",
        my_bucket_name="my-bucket",
        my_alias_name=my_alias_name,
        outlets=[DatasetAlias(my_alias_name)],
    )


dataset_alias_custom_operator()

Updating a dataset

As of Airflow 2.9, there are three ways to update a dataset:

A task with an outlets parameter that references the dataset completes successfully.
A POST request to the datasets endpoint of the Airflow REST API.
A manual update in the Airflow UI.
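As a rough sketch of the REST API option, the snippet below builds a POST request that would create a dataset event, without sending it. It assumes the datasets/events endpoint of the Airflow 2.9+ stable REST API with a JSON body containing dataset_uri and extra; the host name, the token, and the bearer-token authentication are placeholders, so check the API reference and authentication setup of your own environment before relying on it.

```python
import json
import urllib.request

# Assumptions: replace with the base URL and credentials of your own environment.
AIRFLOW_API_BASE = "https://airflow.example.com/api/v1"  # hypothetical host
API_TOKEN = "YOUR_TOKEN"  # placeholder, the auth scheme depends on your setup


def build_dataset_event_request(dataset_uri, extra=None):
    """Build (but do not send) a POST request that creates a dataset event."""
    payload = {"dataset_uri": dataset_uri, "extra": extra or {}}
    return urllib.request.Request(
        url=f"{AIRFLOW_API_BASE}/datasets/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


request = build_dataset_event_request(
    "s3://my-bucket/my-key/", extra={"source": "manual"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload first, which is useful because a dataset event created this way triggers every DAG scheduled on that dataset.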

Attaching information to a dataset event

When updating a dataset in the Airflow UI or making a POST request to the Airflow REST API, you can attach extra information to the dataset event by providing an extra JSON payload. Airflow 2.10 added the possibility to add extra information from within the producing task using either the Metadata class or accessing outlet_events from the Airflow context. You can attach any information to the extra that was computed within the task, for example information about the dataset you are working with.

To use the Metadata class to attach information to a dataset, follow the example in the code snippet below. Make sure that the dataset used in the metadata class is also defined as an outlet in the producer task.

# from airflow.decorators import task
# from airflow.datasets import Dataset
# from airflow.datasets.metadata import Metadata

my_dataset_1 = Dataset("x-dataset1")

@task(outlets=[my_dataset_1])
def attach_extra_using_metadata():
    num = 23
    yield Metadata(my_dataset_1, {"myNum": num})

    return "hello :)"

attach_extra_using_metadata()

Traditional

# from airflow.operators.python import PythonOperator
# from airflow.datasets import Dataset
# from airflow.datasets.metadata import Metadata

my_dataset_1 = Dataset("x-dataset1")

def attach_extra_using_metadata_func():
    num = 23
    yield Metadata(my_dataset_1, {"myNum": num})

    return "hello :)"

attach_extra_using_metadata = PythonOperator(
    task_id="attach_extra_using_metadata",
    python_callable=attach_extra_using_metadata_func,  # must reference the function defined above
    outlets=[my_dataset_1]
)

You can also access the outlet_events from the Airflow context directly to add an extra dictionary to a dataset event.

# from airflow.decorators import task
# from airflow.datasets import Dataset

my_dataset_2 = Dataset("x-dataset2")

@task(outlets=[my_dataset_2])
def use_outlet_events(**context):
    num = 19
    context["outlet_events"][my_dataset_2].extra = {"my_num": num}

    return "hello :)"

use_outlet_events()

Traditional

# from airflow.operators.python import PythonOperator
# from airflow.datasets import Dataset

my_dataset_2 = Dataset("x-dataset2")

# the callable needs **context to receive the Airflow context
def use_outlet_events_func(**context):
    num = 19
    context["outlet_events"][my_dataset_2].extra = {"my_num": num}

    return "hello :)"

use_outlet_events = PythonOperator(
    task_id="use_outlet_events",
    python_callable=use_outlet_events_func,
    outlets=[my_dataset_2]
)

Dataset extras can be viewed in the Airflow UI in the Dataset Events list on the producing task, consuming DAG run, as well as in the Datasets tab.

Screenshot of the Dataset Events list under the Datasets tab in the Airflow UI showing two datasets with one extra each

Retrieving dataset information in a downstream task

Extras can be programmatically retrieved from within Airflow tasks. Any Airflow task instance in a DAG run has access to the dataset events that were involved in triggering that specific DAG run (triggering_dataset_events). Additionally, you can give any Airflow task access to all dataset events of a specific dataset by providing the dataset to the task’s inlets parameter. Defining inlets does not affect the schedule of the DAG.

To access all dataset events that were involved in triggering a DAG run within a TaskFlow API task, pull them from the Airflow context. In a traditional operator, you can use Jinja templating in any templateable field of the operator to pull information from the Airflow context.

# from airflow.decorators import task

@task
def get_extra_triggering_run(**context):
    # all events that triggered this specific DAG run
    triggering_dataset_events = context["triggering_dataset_events"]
    # the loop below won't run if the DAG is manually triggered
    for dataset, dataset_list in triggering_dataset_events.items():
        print(dataset, dataset_list)
        print(dataset_list[0].extra)
        # you can also fetch the run_id and other information about the upstream DAGs,
        # note that this will error if the Dataset was updated via the API!
        print(dataset_list[0].source_dag_run.run_id)

Traditional

# from airflow.operators.bash import BashOperator

get_extra_triggering_run_bash = BashOperator(
    task_id="get_extra_triggering_run_bash",
    # This statement errors when there are no triggering events, for example in a manual run!
    bash_command="echo {{ (triggering_dataset_events.values() | first | first).extra['myNum'] }} ",

    # The below version returns an empty string if there are no triggering dataset events or the extra is not present
    # bash_command="echo {{ (triggering_dataset_events.values() | default([]) | first | default({}) | first | default({})).extra.get('myNum', '') if (triggering_dataset_events.values() | default([]) | first | default({}) | first | default({})).extra is defined else '' }}"
)
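To make the shape of triggering_dataset_events easier to picture, here is a plain-Python sketch that collects the extras per dataset. It is an illustration only: the collect_extras helper is hypothetical, and the SimpleNamespace objects are stand-ins that mimic just the extra attribute of a real dataset event.

```python
from types import SimpleNamespace


def collect_extras(triggering_dataset_events):
    """Return {dataset_uri: [extra of each event]} for a mapping of event lists."""
    return {
        str(uri): [event.extra for event in events]
        for uri, events in triggering_dataset_events.items()
    }


# Stand-in for the Airflow context of a dataset-triggered DAG run.
fake_context = {
    "triggering_dataset_events": {
        "x-dataset1": [SimpleNamespace(extra={"myNum": 23})],
        "x-dataset2": [
            SimpleNamespace(extra={"my_num": 18}),
            SimpleNamespace(extra={"my_num": 19}),
        ],
    }
}

print(collect_extras(fake_context["triggering_dataset_events"]))
print(collect_extras({}))  # an empty mapping, as in a manually triggered run
```

Because the helper only iterates over the mapping, it returns an empty dictionary instead of raising an error when there are no triggering events.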

If you want to access dataset extras independently of which dataset events triggered a DAG run, you can provide a dataset directly to a task as an inlet. In a TaskFlow API task, you can fetch the inlet_events from the Airflow context. In a traditional operator, you can use Jinja templating to access them.

# from airflow.decorators import task
# from airflow.datasets import Dataset

my_dataset_2 = Dataset("x-dataset2")

# note that my_dataset_2 does not need to be part of the DAG's schedule
# you can provide as many inlets as you wish
@task(inlets=[my_dataset_2])
def get_extra_inlet(**context):
    # inlet_events are listed earliest to latest by timestamp
    dataset_events = context["inlet_events"][my_dataset_2]
    # protect against the dataset not having any events yet
    if len(dataset_events) == 0:
        print(f"No dataset_events for {my_dataset_2.uri}")
    else:
        # accessing the latest dataset event for this dataset
        # if the extra does not exist, return None
        # "my_num" is the key that the use_outlet_events example sets for x-dataset2
        my_num = dataset_events[-1].extra.get("my_num", None)
        print(my_num)

get_extra_inlet()

Traditional

# from airflow.operators.bash import BashOperator
# from airflow.datasets import Dataset

my_dataset_2 = Dataset("x-dataset2")

get_extra_inlet_bash = BashOperator(
    task_id="get_extra_inlet_bash",
    inlets=[my_dataset_2],
    # This statement will error if the x-dataset2 dataset has no previous dataset events
    # "my_num" is the key that the use_outlet_events example sets for x-dataset2
    bash_command="echo {{ inlet_events['x-dataset2'][-1].extra['my_num'] }} ",

    # The below version returns an empty string if there are no dataset events or the extra is not present, it errors when the dataset does not exist at all.
    # bash_command="echo {{ (inlet_events['x-dataset2'] | default([]) | last | default({})).extra.get('my_num', '') if (inlet_events['x-dataset2'] | default([]) | last | default({})).extra is defined else '' }}",
)

Note that you can programmatically retrieve information from dataset aliases as well, see Fetching information from previously emitted dataset events through resolved dataset aliases for more information.

Dataset schedules

Any number of datasets can be provided to the schedule parameter. There are three types of dataset schedules:

schedule=[Dataset("a"), Dataset("b")]: Providing one or more Datasets as a list. The DAG is scheduled to run after all Datasets in the list have received at least one update.
schedule=(Dataset("a") | Dataset("b")): (Airflow 2.9+) Using AND (&) and OR (|) operators to create a conditional dataset expression. Note that dataset expressions are enclosed in parentheses ().
DatasetOrTimeSchedule: (Airflow 2.9+) Combining time based scheduling with dataset expressions, see combined dataset and time-based scheduling.

When scheduling DAGs based on datasets, keep the following in mind:

Consumer DAGs that are scheduled on a dataset are triggered every time a task that updates that dataset completes successfully. For example, if task1 and task2 both produce dataset_a, a consumer DAG of dataset_a runs twice - first when task1 completes, and again when task2 completes.
Consumer DAGs scheduled on a dataset are triggered as soon as the first task with that dataset as an outlet finishes, even if there are downstream producer tasks that also operate on the dataset.
Consumer DAGs scheduled on multiple datasets run as soon as their expression is fulfilled by at least one dataset event per dataset in the expression. This means that it doesn’t matter to the consuming DAG whether a dataset received additional updates in the meantime. It consumes all queued events for one dataset as one input. See Multiple Datasets for more information.
As of Airflow 2.10, a consumer DAG that is paused ignores all updates to datasets that occurred while it was paused, which means it starts with a blank slate upon being unpaused. In previous Airflow versions, a consumer DAG scheduled on one dataset that had received an update while the DAG was paused would run immediately when being unpaused.
DAGs that are triggered by datasets don’t have the concept of a data interval. If you need information about the triggering event in your downstream DAG, you can use triggering_dataset_events from the Airflow context. It provides all the triggering dataset events, each with the attributes timestamp, source_dag_id, source_task_id, source_run_id, and source_map_index. See Retrieving dataset information in a downstream task for an example.
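The queuing behavior for a DAG scheduled on a list of datasets can be approximated with a short plain-Python model. This is a simplified sketch and not Airflow's scheduler logic: the ListScheduleModel class is hypothetical and it ignores pausing, timestamps, and dataset expressions.

```python
class ListScheduleModel:
    """Toy model of a DAG scheduled on a list of datasets (hypothetical)."""

    def __init__(self, dataset_uris):
        self.required = set(dataset_uris)
        self.queued = {}  # uri -> number of queued dataset events
        self.runs = 0

    def on_dataset_event(self, uri):
        """Queue an event and trigger one run once every dataset has an event."""
        if uri not in self.required:
            return False
        self.queued[uri] = self.queued.get(uri, 0) + 1
        if self.required.issubset(self.queued):
            self.runs += 1
            self.queued.clear()  # all queued events are consumed by this run
            return True
        return False


consumer = ListScheduleModel(["a", "b"])
print(consumer.on_dataset_event("a"))  # False: still waiting for "b"
print(consumer.on_dataset_event("a"))  # False: a second update to "a" is only queued
print(consumer.on_dataset_event("b"))  # True: both datasets have an event, one run starts
print(consumer.runs)  # 1
```

The two updates to "a" lead to a single run, which matches the point that all queued events for one dataset are consumed as one input.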

Conditional dataset scheduling

In Airflow 2.9 and later, you can use logical operators to combine any number of datasets provided to the schedule parameter. The logical operators supported are | for OR and & for AND.

For example, to schedule a DAG on an update to either dataset1, dataset2, dataset3, or dataset4, you can use the following syntax. Note that the full statement is wrapped in ().

from airflow.decorators import dag
from airflow.datasets import Dataset
from pendulum import datetime

@dag(
    start_date=datetime(2024, 3, 1),
    schedule=(
        Dataset("dataset1")
        | Dataset("dataset2")
        | Dataset("dataset3")
        | Dataset("dataset4")
    ),  # Use () instead of [] to be able to use conditional dataset scheduling!
    catchup=False,
)
def downstream1_on_any():

    pass  # your tasks here

downstream1_on_any()

Traditional

from airflow.models import DAG
from airflow.datasets import Dataset
from pendulum import datetime

with DAG(
    dag_id="downstream1_on_any",
    start_date=datetime(2024, 3, 1),
    schedule=(
        Dataset("dataset1")
        | Dataset("dataset2")
        | Dataset("dataset3")
        | Dataset("dataset4")
    ),  # Use () instead of [] to be able to use conditional dataset scheduling!
    catchup=False,
):

    pass  # your tasks here

The downstream1_on_any DAG is triggered whenever any of the datasets dataset1, dataset2, dataset3, or dataset4 are updated. When clicking on x of 4 Datasets updated in the DAGs view, you can see the dataset expression that defines the schedule.

Screenshot of the Airflow UI with a pop up showing the dataset expression for the downstream1_on_any DAG listing the 4 datasets under "any"

You can also combine the logical operators to create more complex expressions. For example, to schedule a DAG on an update to either dataset1 or dataset2 and either dataset3 or dataset4, you can use the following syntax:

from airflow.decorators import dag
from airflow.datasets import Dataset
from pendulum import datetime

@dag(
    start_date=datetime(2024, 3, 1),
    schedule=(
        (Dataset("dataset1") | Dataset("dataset2"))
        & (Dataset("dataset3") | Dataset("dataset4"))
    ),  # Use () instead of [] to be able to use conditional dataset scheduling!
    catchup=False
)
def downstream2_one_in_each_group():

    pass  # your tasks here

downstream2_one_in_each_group()

Traditional

from airflow.datasets import Dataset
from airflow.models import DAG
from pendulum import datetime

with DAG(
    dag_id="downstream2_one_in_each_group",
    start_date=datetime(2024, 3, 1),
    schedule=(
        (Dataset("dataset1") | Dataset("dataset2"))
        & (Dataset("dataset3") | Dataset("dataset4"))
    ),  # Use () instead of [] to be able to use conditional dataset scheduling!
    catchup=False,
):
    pass  # your tasks here

The dataset expression this schedule creates is:

{
  "all": [
    {
      "any": [
        "dataset1",
        "dataset2"
      ]
    },
    {
      "any": [
        "dataset3",
        "dataset4"
      ]
    }
  ]
}
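
To make the nested all/any structure concrete, here is a minimal pure-Python sketch that evaluates such an expression against a set of dataset names. It is an illustrative model only, not Airflow's scheduler code: the function name expression_met and the idea of passing in "the datasets updated since the last DAG run" as a plain set are assumptions made for this example.

```python
# Illustrative model only: this is NOT how Airflow implements scheduling.
# `expression_met` and the `updated` set are hypothetical names for this sketch.
def expression_met(expression, updated):
    """Return True if `expression` is satisfied by the set of updated dataset names."""
    if isinstance(expression, str):
        # A leaf is a dataset name; it is satisfied once that dataset was updated.
        return expression in updated
    if "all" in expression:
        # Corresponds to the & operator: every child must be satisfied.
        return all(expression_met(child, updated) for child in expression["all"])
    if "any" in expression:
        # Corresponds to the | operator: one satisfied child is enough.
        return any(expression_met(child, updated) for child in expression["any"])
    raise ValueError(f"Unsupported expression: {expression!r}")


EXPRESSION = {
    "all": [
        {"any": ["dataset1", "dataset2"]},
        {"any": ["dataset3", "dataset4"]},
    ]
}

print(expression_met(EXPRESSION, {"dataset1"}))  # False
print(expression_met(EXPRESSION, {"dataset1", "dataset2"}))  # False
print(expression_met(EXPRESSION, {"dataset2", "dataset3"}))  # True
```

Updating both datasets in the same group is not enough, because the outer "all" requires one update from each "any" group.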

Combined dataset and time-based scheduling

In Airflow 2.9 and later, you can combine dataset-based scheduling with time-based scheduling with the DatasetOrTimeSchedule timetable. A DAG scheduled with this timetable will run either when its timetable condition is met or when its dataset condition is met.

The following DAG runs on a time-based schedule defined by the 0 0 * * * cron expression, which is every day at midnight. The DAG also runs when either dataset3 or dataset4 is updated.

from airflow.datasets import Dataset
from airflow.decorators import dag
from airflow.timetables.datasets import DatasetOrTimeSchedule
from airflow.timetables.trigger import CronTriggerTimetable
from pendulum import datetime

@dag(
    start_date=datetime(2024, 3, 1),
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 0 * * *", timezone="UTC"),
        datasets=(Dataset("dataset3") | Dataset("dataset4")),
        # Use () instead of [] to be able to use conditional dataset scheduling!
    ),
    catchup=False,
)
def toy_downstream3_dataset_and_time_schedule():
    pass  # your tasks here

toy_downstream3_dataset_and_time_schedule()

Traditional

from airflow.datasets import Dataset
from airflow.models import DAG
from airflow.timetables.datasets import DatasetOrTimeSchedule
from airflow.timetables.trigger import CronTriggerTimetable
from pendulum import datetime

with DAG(
    dag_id="toy_downstream3_dataset_and_time_schedule",
    start_date=datetime(2024, 3, 1),
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 0 * * *", timezone="UTC"),
        datasets=(Dataset("dataset3") | Dataset("dataset4")),
        # Use () instead of [] to be able to use conditional dataset scheduling!
    ),
    catchup=False,
):
    pass  # your tasks here
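
To illustrate which events would start a run under this combined schedule, the following standard-library sketch merges the daily midnight ticks of 0 0 * * * with a list of dataset updates. It is a toy timeline under stated assumptions, not Airflow code: the helper names midnights_between and run_triggers, the sample timestamps, and the simplification that every qualifying update starts its own run are all made up for this example.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
WATCHED = {"dataset3", "dataset4"}  # the datasets in the | expression


def midnights_between(start, end):
    """Yield every 00:00 UTC instant t with start < t <= end (the 0 0 * * * cron)."""
    current = start.replace(hour=0, minute=0, second=0, microsecond=0)
    current += timedelta(days=1)
    while current <= end:
        yield current
        current += timedelta(days=1)


def run_triggers(start, end, dataset_updates):
    """Merge cron ticks and watched dataset updates into a sorted list of (time, reason)."""
    triggers = [(t, "cron") for t in midnights_between(start, end)]
    triggers += [
        (t, f"dataset:{name}")
        for name, t in dataset_updates
        if name in WATCHED and start < t <= end
    ]
    return sorted(triggers)


# Hypothetical sample data for the sketch.
updates = [
    ("dataset3", datetime(2024, 3, 1, 9, 30, tzinfo=UTC)),
    ("dataset1", datetime(2024, 3, 1, 12, 0, tzinfo=UTC)),  # not in the schedule
    ("dataset4", datetime(2024, 3, 2, 18, 0, tzinfo=UTC)),
]
timeline = run_triggers(
    datetime(2024, 3, 1, tzinfo=UTC), datetime(2024, 3, 3, tzinfo=UTC), updates
)
for when, reason in timeline:
    print(when.isoformat(), reason)
```

In this timeline the DAG would run four times: twice because of dataset updates and twice at midnight, while the update to dataset1 is ignored.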

Example implementation

In the following example, write_instructions_to_file and write_info_to_file are both producer tasks because they define outlets.

from pendulum import datetime
from airflow.datasets import Dataset
from airflow.decorators import dag, task

API = "https://www.thecocktaildb.com/api/json/v1/1/random.php"
INSTRUCTIONS = Dataset("file://localhost/airflow/include/cocktail_instructions.txt")
INFO = Dataset("file://localhost/airflow/include/cocktail_info.txt")


@dag(
    start_date=datetime(2022, 10, 1),
    schedule=None,
    catchup=False,
)
def datasets_producer_dag():
    @task
    def get_cocktail(api):
        import requests

        r = requests.get(api)
        return r.json()

    # Producer task: a successful run updates the INSTRUCTIONS dataset
    @task(outlets=[INSTRUCTIONS])
    def write_instructions_to_file(response):
        cocktail_name = response["drinks"][0]["strDrink"]
        cocktail_instructions = response["drinks"][0]["strInstructions"]
        msg = f"See how to prepare {cocktail_name}: {cocktail_instructions}"

        with open("include/cocktail_instructions.txt", "a") as f:
            f.write(msg)

    # Producer task: a successful run updates the INFO dataset
    @task(outlets=[INFO])
    def write_info_to_file(response):
        import time

        time.sleep(30)
        cocktail_name = response["drinks"][0]["strDrink"]
        cocktail_category = response["drinks"][0]["strCategory"]
        alcohol = response["drinks"][0]["strAlcoholic"]
        msg = f"{cocktail_name} is a(n) {alcohol} cocktail from category {cocktail_category}."
        with open("include/cocktail_info.txt", "a") as f:
            f.write(msg)

    cocktail = get_cocktail(api=API)

    write_instructions_to_file(cocktail)
    write_info_to_file(cocktail)


datasets_producer_dag()

Traditional

from pendulum import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

API = "https://www.thecocktaildb.com/api/json/v1/1/random.php"
INSTRUCTIONS = Dataset("file://localhost/airflow/include/cocktail_instructions.txt")
INFO = Dataset("file://localhost/airflow/include/cocktail_info.txt")


def get_cocktail_func(api):
    import requests

    r = requests.get(api)
    return r.json()


def write_instructions_to_file_func(response):
    cocktail_name = response["drinks"][0]["strDrink"]
    cocktail_instructions = response["drinks"][0]["strInstructions"]
    msg = f"See how to prepare {cocktail_name}: {cocktail_instructions}"

    with open("include/cocktail_instructions.txt", "a") as f:
        f.write(msg)


def write_info_to_file_func(response):
    import time

    time.sleep(30)
    cocktail_name = response["drinks"][0]["strDrink"]
    cocktail_category = response["drinks"][0]["strCategory"]
    alcohol = response["drinks"][0]["strAlcoholic"]
    msg = (
        f"{cocktail_name} is a(n) {alcohol} cocktail from category {cocktail_category}."
    )
    with open("include/cocktail_info.txt", "a") as f:
        f.write(msg)


with DAG(
    dag_id="datasets_producer_dag",
    start_date=datetime(2022, 10, 1),
    schedule=None,
    catchup=False,
    # Render the templated XCom value as a dict instead of a string
    render_template_as_native_obj=True,
):
    get_cocktail = PythonOperator(
        task_id="get_cocktail",
        python_callable=get_cocktail_func,
        op_kwargs={"api": API},
    )

    # Producer task: a successful run updates the INSTRUCTIONS dataset
    write_instructions_to_file = PythonOperator(
        task_id="write_instructions_to_file",
        python_callable=write_instructions_to_file_func,
        op_kwargs={"response": "{{ ti.xcom_pull(task_ids='get_cocktail') }}"},
        outlets=[INSTRUCTIONS],
    )

    # Producer task: a successful run updates the INFO dataset
    write_info_to_file = PythonOperator(
        task_id="write_info_to_file",
        python_callable=write_info_to_file_func,
        op_kwargs={"response": "{{ ti.xcom_pull(task_ids='get_cocktail') }}"},
        outlets=[INFO],
    )

    get_cocktail >> write_instructions_to_file >> write_info_to_file
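
Note that the producer tasks write to the relative path include/cocktail_instructions.txt, while the dataset itself is identified by a file:// URI. The short standard-library sketch below shows how that URI decomposes. It assumes, in line with the rest of this guide, that the URI serves as an identifier for the dataset and that keeping it in sync with the file a task actually writes is up to the DAG author. The variable names are chosen for this example only.

```python
from urllib.parse import urlsplit

# Same URI string as the INSTRUCTIONS dataset in the producer DAG above.
INSTRUCTIONS_URI = "file://localhost/airflow/include/cocktail_instructions.txt"
# Relative path the producer task actually opens (copied from the task code).
WRITTEN_PATH = "include/cocktail_instructions.txt"

parts = urlsplit(INSTRUCTIONS_URI)
print(parts.scheme)  # file
print(parts.netloc)  # localhost
print(parts.path)    # /airflow/include/cocktail_instructions.txt

# The URI and the written path only match by convention chosen by the DAG author.
print(parts.path.endswith(WRITTEN_PATH))  # True
```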

A consumer DAG runs whenever the datasets it is scheduled on are updated by producer tasks, rather than running on a time-based schedule. For example, if you have a DAG that should run when both the INSTRUCTIONS and INFO datasets have been updated, you pass those two Dataset objects to the DAG’s schedule parameter.

Any DAG that is scheduled with a dataset is considered a consumer DAG even if that DAG doesn’t actually access the referenced dataset. In other words, it’s up to you as the DAG author to correctly reference and use datasets.

from pendulum import datetime
from airflow.datasets import Dataset
from airflow.decorators import dag, task

INSTRUCTIONS = Dataset("file://localhost/airflow/include/cocktail_instructions.txt")
INFO = Dataset("file://localhost/airflow/include/cocktail_info.txt")


@dag(
    dag_id="datasets_consumer_dag",
    start_date=datetime(2022, 10, 1),
    schedule=[INSTRUCTIONS, INFO],  # Scheduled on both Datasets
    catchup=False,
)
def datasets_consumer_dag():
    @task
    def read_about_cocktail():
        cocktail = []
        for filename in ("info", "instructions"):
            with open(f"include/cocktail_{filename}.txt", "r") as f:
                contents = f.readlines()
                cocktail.append(contents)

        return [item for sublist in cocktail for item in sublist]

    read_about_cocktail()


datasets_consumer_dag()

Traditional

from pendulum import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

INSTRUCTIONS = Dataset("file://localhost/airflow/include/cocktail_instructions.txt")
INFO = Dataset("file://localhost/airflow/include/cocktail_info.txt")


def read_about_cocktail_func():
    cocktail = []
    for filename in ("info", "instructions"):
        with open(f"include/cocktail_{filename}.txt", "r") as f:
            contents = f.readlines()
            cocktail.append(contents)

    return [item for sublist in cocktail for item in sublist]


with DAG(
    dag_id="datasets_consumer_dag",
    start_date=datetime(2022, 10, 1),
    schedule=[INSTRUCTIONS, INFO],  # Scheduled on both Datasets
    catchup=False,
):
    PythonOperator(
        task_id="read_about_cocktail",
        python_callable=read_about_cocktail_func,
    )
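
Because the consumer DAG is scheduled on a list of two datasets, it waits until both have been updated at least once since its last run. The following pure-Python sketch simulates that behavior for a sequence of update events. It is a simplified model, not Airflow's implementation: the function simulate_consumer, the string dataset names, and the sample event order are assumptions made for this illustration.

```python
# Simplified model of a consumer DAG scheduled on a list of datasets.
# `simulate_consumer` is a hypothetical helper, not an Airflow API.
def simulate_consumer(schedule, events):
    """Return the indexes of the events after which a consumer run would start.

    schedule: dataset names the consumer DAG is scheduled on (list syntax means all).
    events:   dataset names in the order in which producer tasks updated them.
    """
    required = set(schedule)
    pending = set()  # datasets updated since the last consumer run
    run_after = []
    for index, name in enumerate(events):
        if name in required:
            pending.add(name)
        if pending == required:
            run_after.append(index)
            pending.clear()  # a new run starts waiting for fresh updates
    return run_after


# Hypothetical order of updates produced by repeated producer DAG runs.
events = ["INSTRUCTIONS", "INSTRUCTIONS", "INFO", "INFO", "INSTRUCTIONS"]
print(simulate_consumer(["INSTRUCTIONS", "INFO"], events))  # [2, 4]
```

In the producer DAG of this example, write_info_to_file sleeps for 30 seconds, so the INFO update typically arrives last and is the one that completes the set and starts the consumer run.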