Using SubDAGs in Airflow

Subdags

Aesthetically appealing, but fragile.

Most DAGs consist of patterns that often repeat themselves. ETL DAGs that are written to best practice usually all share the pattern of grabbing data from a source, loading it to an intermediary file store or staging table, and then pushing it into production data.

Depending on your setup, using a SubDagOperator could make your DAG cleaner.

Suppose the DAG looks like:

[Image: the DAG without SubDAGs]
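As a rough sketch, a flat DAG like this might be declared with every extract/load pair written out inline, which makes the repetition visible in the file. The dag_id and task names below are illustrative, not taken from the original example:

```python
from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {'start_date': datetime(2019, 1, 1)}

dag = DAG(
    dag_id='example_no_subdag',  # hypothetical name
    default_args=default_args,
    schedule_interval='@daily',
)

with dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')
    # The same extract -> load pair is repeated five times.
    for i in range(5):
        extract = DummyOperator(task_id='extract_{0}'.format(i))
        load = DummyOperator(task_id='load_{0}'.format(i))
        start >> extract >> load >> end
```

Every repetition adds two more tasks to the top-level graph, which is exactly the clutter SubDAGs are meant to hide.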

The pattern between extracting and loading the data is clear. The same workflow can be generated through subdags:

[Image: the same DAG using SubDagOperators]

Each of the subdags can be zoomed in on:

[Image: zooming into a SubDAG]

The zoomed view reveals the tasks at a granular level:

[Image: the tasks inside the SubDAG]

SubDags should be generated through a "DAG factory" - a function, typically kept in an external file, that returns DAG objects.

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


def load_subdag(parent_dag_name, child_dag_name, args):
    # The dag_id must follow the 'parent.child' convention so the
    # SubDagOperator can find it.
    dag_subdag = DAG(
        dag_id='{0}.{1}'.format(parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    with dag_subdag:
        for i in range(5):
            # Tasks created inside the context manager are attached to
            # dag_subdag automatically.
            DummyOperator(
                task_id='load_subdag_{0}'.format(i),
                default_args=args,
            )

    return dag_subdag

This factory function should then be called when instantiating the SubDagOperator:

load_tasks = SubDagOperator(
    task_id='load_tasks',
    subdag=load_subdag('example_subdag_operator',
                       'load_tasks', default_args),
    default_args=default_args,
    dag=dag,
)
  • The subdag must be named with a parent.child style, or Airflow will throw an error.
  • The state of the SubDagOperator and of its underlying tasks are independent - marking a SubDagOperator as success (or failed) will not affect the underlying tasks. This can be dangerous.
  • SubDags should use the same schedule as their parent DAGs, or unexpected behavior might occur.
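Putting those rules together, a parent DAG file might look like the sketch below. The module path for the factory is an assumption for the example; only the naming and schedule constraints come from the points above:

```python
from datetime import datetime

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator

# Hypothetical import of the load_subdag factory shown earlier.
from subdags.load_subdag import load_subdag

PARENT_DAG_NAME = 'example_subdag_operator'

default_args = {'start_date': datetime(2019, 1, 1)}

dag = DAG(
    dag_id=PARENT_DAG_NAME,
    default_args=default_args,
    schedule_interval='@daily',  # the factory uses the same schedule
)

load_tasks = SubDagOperator(
    task_id='load_tasks',
    # Inside the factory this produces the required
    # 'example_subdag_operator.load_tasks' dag_id.
    subdag=load_subdag(PARENT_DAG_NAME, 'load_tasks', default_args),
    default_args=default_args,
    dag=dag,
)
```

Passing the parent dag_id and the task_id into the factory keeps the parent.child naming rule satisfied by construction.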

Avoiding Deadlock

Greedy subdags

SubDags are not currently first-class citizens in Airflow. Although fixing this is on the community's roadmap, many organizations using Airflow have outright banned them because of how they are executed.

Slots on the worker pool

The SubDagOperator kicks off an entire DAG when it is put on a worker slot. Each task in the child DAG takes up a slot until the entire SubDag has completed, and the parent operator itself holds a worker slot until every child task has finished. This can delay the processing of other tasks.

In mathematical terms, each SubDag behaves like a vertex (a single point in a graph) instead of a graph.

Depending on the scale and infrastructure, a dedicated queue can be added just for SubDags (assuming a CeleryExecutor), but a cleaner workaround is to avoid SubDags entirely.
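If a dedicated queue is used, it can be set via the operator's queue argument, which every Airflow operator inherits from BaseOperator. The queue name below is an assumption; it works only if a Celery worker is listening on that queue (e.g. started with airflow worker -q subdag_queue), and it reuses the dag and factory from the earlier example:

```python
load_tasks = SubDagOperator(
    task_id='load_tasks',
    subdag=load_subdag('example_subdag_operator',
                       'load_tasks', default_args),
    default_args=default_args,
    # Route this operator to a hypothetical dedicated Celery queue so
    # long-running SubDags do not starve the default worker pool.
    queue='subdag_queue',
    dag=dag,
)
```

Note that this only isolates the parent operator's slot; the child DAG's tasks still consume slots wherever their own queues point.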

Astronomer highly recommends staying away from SubDags.

