Data-driven scheduling, a new feature in Airflow 2.4, equips Airflow with the built-in logic it needs to run and manage different kinds of data-driven dependencies, through a new object type called
Basically, a Dataset is any output from one or more upstream (“producer”) tasks that is used as a source of data for one or more downstream DAGs, called “consumers.” Airflow now “knows” to run consumer DAGs automatically once the Datasets they depend on have been updated, making it much easier to design for common types of data engineering patterns.
It’s not a stretch to say that data-driven scheduling is going to dramatically improve the lives of DAG authors. “The thing that’s going to make the biggest difference for people building data pipelines is that you now have this way of telling Airflow, ‘I am an upstream thing and I am done running’ — and at that point, everything downstream just starts running,” says Astronomer Senior Data Architect Chris Cardillo.
At Astronomer, we stand to benefit from data-driven scheduling as much as any user of Airflow. Astro, Astronomer’s fully managed, cloud-native Airflow service, relies on Airflow’s built-in scheduling and dependency management logic to power its orchestration capabilities. And Astronomer relies on Astro to orchestrate the data pipelines that power day-to-day business operations.
Cardillo and his team are already brainstorming how to implement data-driven scheduling for Astronomer’s data transformation layer. At the moment, he says, the ingestion-level DAGs that populate Astronomer’s data warehouse are still running on a daily, hourly, or even faster schedule; they don’t yet have explicitly defined upstream data dependencies of their own.
But Cardillo says the data team has already refactored its first Dataset DAG: a
model_cloud_costs DAG that Astronomer uses to create a table of unified billing data from AWS, Azure, and Google Cloud Platform (GCP), so that it can monitor its costs. The team was able to update this DAG without making any changes to its existing tasks: the only thing it changed was the schedule, which is now just a list of Datasets.
Next, the data team plans to refactor Astronomer’s “ecosystem” DAGs to take advantage of data-driven scheduling. These are the DAGs Astronomer uses to acquire, cleanse, transform, and model data that’s derived from PyPi, GitHub, DockerHub, and other sources.
Scheduling DAGs with Datasets
Currently, Astronomer’s downstream ecosystem DAGs consist of:
- A modeling DAG, which cleans and models ecosystem data.
- A metrics DAG, which computes different types of measurements over time for analysis.
Right now, Astronomer’s data team is looking at how to refactor the modeling and metric DAGs to function as consumer DAGs. To do this, Cardillo and his team need to make a few changes:
- Use Airflow’s new
Datasetclass to define and name each Dataset object in upstream DAGs (the ingestion DAGs that pull data from sources like PyPi downloads and GitHub).
- Explicitly reference each
Datasetobject in all producer tasks in upstream DAGs.
- Define each
Datasetobject in the consumer DAGs, and use that definition as the
In practice, this will enable the data team to design to a pattern in which, once Astronomer’s upstream ingestion (producer) tasks complete successfully, Airflow “knows” to trigger the downstream modeling DAG (“consumer” of this data) to run. Moreover, once the tasks in the consumer DAG complete successfully, Airflow will “know” to trigger DAGs even further downstream, like the metrics DAG, which only depends on the modeling DAG.
All of this will happen automatically, as part of a sequence in which producer tasks and consumer DAGs function much like event-driven links in an extended causal chain.
At the DAG level, data-driven scheduling allows the Astronomer team to remove the external task sensors that “listen” for changes in data sources. The sensors can be replaced with objects — i.e., “Datasets” — that can be observed and managed with the Airflow UI.
Taking advantage of this feature should be relatively painless. In upstream DAGs, Astronomer’s data team will use Airflow’s
Dataset class to define and name Dataset objects, using its
outlets parameter to associate them with specific producer tasks.
For example, say you have a pattern that contains a producer task, and that task’s job is to update a database. Once the producer task successfully finishes running, you want this event to automatically trigger a downstream consumer DAG. The data team can create a Dataset just for this:
mydbms_dataset = Dataset(‘mydbms’)]
The team can then associate that Dataset by providing it as a parameter to the producer task:
Datasets can be defined as needed for each individual use case. For example, you can now trigger a run by a downstream consumer DAG whenever an upstream producer task updates a table, or even a specific column, in a database. (To be clear, right now Airflow doesn’t “know” what a database is, or that databases have tables and columns. Airflow “knows” that an upstream event — e.g., a successful run by the producer task that updates a column in a database table — took place, and that, as a result, this triggers one or more consumer DAGs.)
The code for doing that would look like this:
In Astronomer’s downstream consumer DAGs, the team will need to define the same Dataset objects (using the
Dataset class), making sure to give them the same names it gave them in upstream producer tasks. In practice, most consumer DAGs will have multiple upstream Dataset dependencies, meaning that each will likely reference multiple
The final step is to
This tells the scheduler to run the consumer DAG whenever the upstream producer tasks that it depends on run and complete successfully. So, every time an upstream Dataset gets refreshed, corresponding consumer DAGs will run automatically. This gives Astronomer a much more reliable way of provisioning the data that feeds its reports, dashboards, and alerts, as well as its reverse ETL layer.
A drastically simplified DAG authoring experience
Astronomer’s data team leaders estimate that refactoring its modeling and metrics ecosystem pipelines as consumer DAGs will enable the company to eliminate a massive amount of complexity — which will radically simplify the jobs of DAG authors and the DevOps and Site Reliability Engineering (SRE) personnel who keep Astronomer’s data platform running. For example, DAG authors will “no longer have to write and maintain their own external task sensors, and can let Airflow manage all of that work for them,” Cardillo points out.
He notes that Airflow 2.4’s new data-driven scheduling feature also frees DAG authors from having to design elaborate logic to work around timing problems with external task sensors, tasks, and their dependencies. In particular, Cardillo says, because Airflow now “knows” not to start downstream DAGs until the upstream Datasets they depend on are available (or refreshed), the time dimension is much less of a concern in writing and maintaining DAGs.
In prior versions of Airflow, the time dimension was always front of mind for the user: When a DAG author wanted to use an external task sensor to automate one or more downstream tasks, they had to take into account the timings between those tasks and their upstream dependencies. If an ETL processing task took less than 50 minutes, on average, to complete, you’d set a timeout on your sensor for, say, 55-60 minutes. You did this to avoid squandering Airflow resources: having your worker slot continuously polling the upstream resource to ask, “Is the new data there yet?”
But what if your ETL processing took longer than usual? What if your downstream tasks didn’t run because the external task sensor that was supposed to trigger them timed out? You would need to create extra logic to accommodate this edge case — along with any others you discovered along the way.
“We’ll still have to use custom sensors in a few places,” Cardillo says, “but Datasets goes a long way toward making us think less about time in general. With Airflow becoming data-aware, we can remove a lot of the logic we’ve had to create to handle timing from our transformation layer.”
As the person who has to worry about the performance and reliability of the data team’s DAGs, and as a DAG author himself, Cardillo says he’s excited about data-dependent scheduling as a way to avoid having “to instrument or kludge dependencies between DAGs to deal with timing-related problems.”
Learn more about Datasets and data-driven scheduling in our latest guide.