Note: Astronomer highly recommends avoiding SubDAGs if the intended use is simply to group tasks within a DAG's Graph View. Airflow 2.0 introduces Task Groups, a UI grouping concept that serves this purpose without the performance and functional issues of SubDAGs. While the SubDagOperator will continue to be supported, Task Groups are intended to replace it long-term.
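
For reference, the same visual grouping can be achieved with a Task Group in Airflow 2.0+. A minimal sketch - the DAG id, task ids, and start date below are made up for illustration:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="taskgroup_example",
    schedule_interval="@daily",
    start_date=days_ago(1),
) as dag:

    start = DummyOperator(task_id="start")

    # Tasks inside the TaskGroup are grouped in the Graph View,
    # but remain ordinary tasks of the parent DAG - there is no separate DAG run.
    with TaskGroup(group_id="load_tasks") as load_tasks:
        for i in range(5):
            DummyOperator(task_id="load_{0}".format(i))

    start >> load_tasks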

Most DAGs consist of patterns that repeat themselves. ETL DAGs written to best practice usually share the same pattern: grab data from a source, load it to an intermediary file store or staging table, and then push it into production tables.

Depending on your setup, using a SubDagOperator could make this kind of DAG cleaner.

Suppose the DAG looks like:

(Graph View: the DAG without SubDAGs)

The repeated extract-and-load pattern is clear. The same workflow can be generated through SubDAGs:

(Graph View: the same DAG using SubDAGs)

Each of the SubDAGs can be zoomed in on:

(Zooming into one of the SubDAGs)

The zoomed view shows the individual tasks inside the SubDAG:

(Tasks inside the SubDAG)

SubDAGs should be generated through a "DAG factory" - a function, often kept in an external file, that returns a DAG object.

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


def load_subdag(parent_dag_name, child_dag_name, args):
    # The SubDAG's dag_id must follow the parent.child naming convention.
    dag_subdag = DAG(
        dag_id='{0}.{1}'.format(parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    with dag_subdag:
        # Stand-in tasks; replace these with the real load logic.
        for i in range(5):
            DummyOperator(
                task_id='load_subdag_{0}'.format(i),
                default_args=args,
            )

    return dag_subdag

This factory function is then called when instantiating the SubDagOperator:

from airflow.operators.subdag_operator import SubDagOperator

# The factory returns a DAG object, which the SubDagOperator runs as a nested DAG.
load_tasks = SubDagOperator(
    task_id="load_tasks",
    subdag=load_subdag(
        parent_dag_name="example_subdag_operator",
        child_dag_name="load_tasks",
        args=default_args
    ),
    default_args=default_args,
    dag=dag,
)

  • The SubDAG must be named in a parent.child style - the child DAG's dag_id has to be the parent's dag_id, a dot, and the SubDagOperator's task_id - or Airflow will throw an error.
  • The state of the SubDagOperator and the states of its tasks are independent - a SubDagOperator marked as success (or failed) will not affect the underlying tasks. This can be dangerous.
  • SubDAGs should be scheduled the same as their parent DAGs or unexpected behavior might occur; see the sketch after this list for one way to keep them in sync.
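
One way to keep the schedules in sync is to pass the parent's schedule into the factory rather than hardcoding it. A minimal sketch reusing the factory above (the extra schedule parameter is an addition for illustration, not part of the original example):

def load_subdag(parent_dag_name, child_dag_name, args, schedule):
    # Same factory as above, except the schedule is inherited from the parent DAG.
    return DAG(
        dag_id='{0}.{1}'.format(parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=schedule,
    )

load_tasks = SubDagOperator(
    task_id="load_tasks",
    subdag=load_subdag(
        parent_dag_name=dag.dag_id,
        child_dag_name="load_tasks",
        args=default_args,
        schedule=dag.schedule_interval,
    ),
    default_args=default_args,
    dag=dag,
)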

Avoiding Deadlock

Greedy SubDAGs

SubDAGs are not currently first-class citizens in Airflow. Although fixing this is on the community's roadmap, many organizations using Airflow have banned them outright because of how they are executed.

Airflow 1.10 changed the default SubDAG execution method to the SequentialExecutor to work around deadlocks caused by SubDAGs.
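
In Airflow 1.10.x the SubDagOperator accepts an executor argument, and the SequentialExecutor is its default. The sketch below simply makes that default explicit - overriding it with a parallel executor reintroduces the deadlock risk:

from airflow.executors.sequential_executor import SequentialExecutor
from airflow.operators.subdag_operator import SubDagOperator

# Reuses load_subdag, default_args, and dag from the examples above.
load_tasks = SubDagOperator(
    task_id="load_tasks",
    subdag=load_subdag("example_subdag_operator", "load_tasks", default_args),
    default_args=default_args,
    executor=SequentialExecutor(),  # the 1.10 default: child tasks run one at a time
    dag=dag,
)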

Slots on the worker pool

The SubDagOperator kicks off an entire DAG when it is put on a worker slot. Each task in the child DAG takes up a slot until the entire SubDAG has completed, and the parent operator itself holds a worker slot until every child task has completed. This can delay the processing of other tasks.

In mathematical terms, each SubDAG behaves like a vertex (a single point in a graph) instead of a graph.

Depending on the scale and infrastructure, a specialized queue can be added just for SubDAGs (assuming a CeleryExecutor), but a cleaner workaround is to avoid SubDAGs entirely.
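
If SubDAGs are kept, the queue approach looks roughly like the sketch below. The queue name is arbitrary, and note that this only pins the SubDagOperator itself - the child tasks inside the SubDAG still run wherever their own queue settings send them:

# Route the SubDagOperator to a dedicated queue so it does not hold a default worker slot.
load_tasks = SubDagOperator(
    task_id="load_tasks",
    subdag=load_subdag("example_subdag_operator", "load_tasks", default_args),
    default_args=default_args,
    queue="subdag_queue",  # hypothetical queue name
    dag=dag,
)

A dedicated Celery worker then consumes that queue, for example airflow worker -q subdag_queue on Airflow 1.10 (airflow celery worker -q subdag_queue on Airflow 2).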
