OpenLineage and Airflow

Overview

Data lineage is the concept of tracking and visualizing data from its origin to wherever it flows and is consumed downstream. Its prominence in the data space is growing rapidly as companies have increasingly complex data ecosystems alongside increasing reliance on accurate data for making business decisions. Data lineage can help with everything from understanding your data sources, to troubleshooting job failures, to managing PII, to ensuring compliance with data regulations.

It follows that data lineage is a natural integration with Apache Airflow. Airflow is often used as a one-stop-shop orchestrator for an organization’s data pipelines, and is an ideal place for integrating data lineage to understand the movement and interactions of your data.

In this guide, we’ll define data lineage, review the OpenLineage standard and core lineage concepts, describe why you would want data lineage with Airflow, and show an example local integration of Airflow and OpenLineage for tracking data movement in Postgres.

Note: This guide focuses on using OpenLineage with Airflow 2. Some topics may differ slightly if you are using it with earlier versions of Airflow.

What is Data Lineage

Data lineage is a way of tracing the complex set of relationships that exist among datasets within an ecosystem. The concept of data lineage encompasses:

  • Lineage metadata that describes your datasets (e.g. a table in Snowflake) and jobs (e.g. tasks in your DAG).
  • A lineage backend that stores and processes lineage metadata.
  • A lineage frontend that allows you to view and interact with your lineage metadata, including a graph that visualizes your jobs and datasets and shows how they are connected.

If you want to read more on the concept of data lineage and why it’s important, check out this Astronomer blog post.

Visually, your data lineage graph might look something like this:

Lineage Graph

If you are using data lineage, you will likely have a lineage tool that collects lineage metadata, as well as a front end for visualizing the lineage graph. There are paid tools (including Astro) that provide these services, but in this guide we will focus on the open source options that can be integrated with Airflow: namely OpenLineage (the lineage tool) and Marquez (the lineage front end).

OpenLineage

OpenLineage is the open source industry standard framework for data lineage. Launched by Datakin, it standardizes the definition of data lineage, the metadata that makes up lineage data, and the approach for collecting lineage data from external systems. In other words, it defines a formalized specification for all of the core concepts (see below) related to data lineage.

The purpose of a standard like OpenLineage is to create a more cohesive lineage experience across the industry and reduce duplicated work for stakeholders. It allows for a simpler, more consistent experience when integrating lineage with many different tools, similar to how Airflow providers reduce the work of DAG authoring by providing standardized modules for integrating Airflow with other tools.

If you are working with lineage data from Airflow, the integration is provided by the openlineage-airflow package. You can read more about that integration in the OpenLineage documentation.

Core Concepts

The following terms are used frequently when discussing data lineage and OpenLineage in particular. We define them here specifically in the context of using OpenLineage with Airflow.

  • Integration: A means of gathering lineage data from a source system (e.g. a scheduler or data platform). For example, the OpenLineage Airflow integration allows lineage data to be collected from Airflow DAGs. Existing integrations automatically gather lineage data from the source system every time a job runs, preparing and transmitting OpenLineage events to a lineage backend. A full list of OpenLineage integrations can be found here.
  • Extractor: An extractor is a module that gathers lineage metadata from a specific hook or operator. For example, in the openlineage-airflow package, extractors exist for the PostgresOperator and SnowflakeOperator, meaning that if openlineage-airflow is installed and running in your Airflow environment, lineage data will be generated automatically from those operators when your DAG runs. An extractor must exist for a specific operator to get lineage data from it.
  • Job: A process which consumes or produces datasets. Jobs can be viewed on your lineage graph. In the context of the Airflow integration, an OpenLineage job corresponds to a task in your DAG. Note that only tasks that come from operators with extractors will have input and output metadata; other tasks in your DAG will show as orphaned on the lineage graph.
  • Dataset: A representation of a set of data in your lineage data and graph. For example, it might correspond to a table in your database or a set of data you run a Great Expectations check on. Typically a dataset is registered as part of your lineage data when a job writing to the dataset is completed (e.g. data is inserted into a table).
  • Run: An instance of a job where lineage data is generated. In the context of the Airflow integration, an OpenLineage run will be generated with each DAG run.
  • Facet: A piece of lineage metadata about a job, dataset, or run (e.g. you might hear “job facet”).
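
To make these concepts more concrete, here is a rough sketch of a single OpenLineage run event for the combine task used later in this guide, written as a Python dictionary. The top-level fields (eventType, eventTime, producer, run, job, inputs, outputs, facets) follow the OpenLineage specification, but the specific facets, names, and namespaces emitted depend on your integration and version, so treat the values below as illustrative.

# Illustrative OpenLineage run event, roughly what the Airflow integration emits
# when a task (job) completes. The field names follow the OpenLineage spec; the
# job name, dataset namespaces, and values below are hypothetical.
example_run_event = {
    "eventType": "COMPLETE",                           # the state of this run
    "eventTime": "2022-06-01T00:00:00.000Z",
    "producer": "https://github.com/OpenLineage/OpenLineage",  # the integration that produced the event
    "run": {
        "runId": "3f5e83fa-3480-44ff-99c5-ff943904e5e8",  # one run of the job
        "facets": {},                                  # run-level facets
    },
    "job": {
        "namespace": "example",                        # matches OPENLINEAGE_NAMESPACE
        "name": "lineage-combine-postgres.combine",    # DAG ID + task ID
        "facets": {},                                  # job-level facets (e.g. the SQL that ran)
    },
    "inputs": [                                        # datasets the job read from
        {"namespace": "postgres://localhost:5435", "name": "public.adoption_center_1"},
        {"namespace": "postgres://localhost:5435", "name": "public.adoption_center_2"},
    ],
    "outputs": [                                       # datasets the job wrote to
        {"namespace": "postgres://localhost:5435", "name": "public.animal_adoptions_combined"},
    ],
}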

Why OpenLineage with Airflow?

The sections above describe what data lineage is, but why would you want to use it with Airflow? Using OpenLineage with Airflow gives you more insight into complex data ecosystems and can lead to better data governance. Airflow is a natural place to integrate data lineage because it is often used as a one-stop-shop orchestrator that touches data across many parts of an organization.

More specifically, OpenLineage with Airflow provides the following capabilities:

  • Quickly find the root cause of task failures by identifying issues in upstream datasets (e.g. if an upstream job outside of Airflow failed to populate a key dataset).
  • Easily see the blast radius of any job failures or changes to data by visualizing the relationship between jobs and datasets.
  • Identify where key data is used in jobs across an organization.

These capabilities translate into real world benefits by:

  • Making recovery from complex failures quicker. The faster you can identify the problem and the blast radius, the easier it is to solve and prevent any erroneous decisions being made from bad data.
  • Making it easier for teams to work together across an organization. Visualizing the full scope of where a dataset is used reduces “sleuthing” time.
  • Helping ensure compliance with data regulations by fully understanding where data is used.

Example OSS Integration

In this example, we’ll show how to run OpenLineage with Airflow locally using Marquez as a lineage front end. We’ll then show how to use and interpret the lineage data generated by a simple DAG that processes data in Postgres. All of the tools used in this example are open source.

Running Marquez and Airflow Locally

For this example, we’ll run Airflow with OpenLineage and Marquez locally. You will need to install Docker and the Astro CLI before starting.

  1. Run Marquez locally using the quickstart in the Marquez README.

  2. Start an Astro project using the CLI by creating a new directory and running astrocloud dev init.

  3. Add openlineage-airflow to your requirements.txt file. Note that if you are using Astro Runtime 4.2.1 or greater, this package is already included.

  4. Add the environment variables below to your .env file. These will allow Airflow to connect with the OpenLineage API and send your lineage data to Marquez.

    OPENLINEAGE_URL=http://host.docker.internal:5000
    OPENLINEAGE_NAMESPACE=example
    AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend

    By default, Marquez uses port 5000 when you run it using Docker. If you are using a different OpenLineage front end instead of Marquez, or you are running Marquez remotely, you can modify the OPENLINEAGE_URL as needed.

  5. Modify your config.yaml in the .astrocloud/ directory to choose a different port for Postgres. Marquez also uses Postgres, so you will need to have Airflow use a different port than the default 5432, which will already be allocated.

    project:
      name: openlineage
    postgres:
      port: 5435

  6. Run Airflow locally using astrocloud dev start.

  7. Confirm that Airflow is running by going to http://localhost:8080 and that Marquez is running by going to http://localhost:3000.
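
Optionally, you can also confirm that the Marquez API itself is reachable. The short Python sketch below calls the Marquez namespaces endpoint (GET /api/v1/namespaces on port 5000, the default when running Marquez with Docker); adjust the host and port if your setup differs, and note that the exact response shape may vary by Marquez version.

import requests

# Quick sanity check that the Marquez API is up before triggering any DAGs.
# Assumes Marquez is listening on localhost:5000 (the Docker quickstart default).
resp = requests.get("http://localhost:5000/api/v1/namespaces", timeout=10)
resp.raise_for_status()
print([ns["name"] for ns in resp.json()["namespaces"]])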

Generating and Viewing Lineage Data

To show the lineage data that can result from Airflow DAG runs, we’ll use an example of two DAGs that process data in Postgres. The first DAG creates and populates a table (animal_adoptions_combined) with data aggregated from two source tables (adoption_center_1 and adoption_center_2).

Note that to run this DAG in your own environment, you will first need to create and populate the two source tables. You can also update the table names and schemas in the DAG to reference existing tables in your own environment. Here are some sample queries to get you started:

CREATE TABLE IF NOT EXISTS adoption_center_1
(date DATE, type VARCHAR, name VARCHAR, age INTEGER);

CREATE TABLE IF NOT EXISTS adoption_center_2
(date DATE, type VARCHAR, name VARCHAR, age INTEGER);

INSERT INTO
    adoption_center_1 (date, type, name, age)
VALUES
    ('2022-01-01', 'Dog', 'Bingo', 4),
    ('2022-02-02', 'Cat', 'Bob', 7),
    ('2022-03-04', 'Fish', 'Bubbles', 2);

INSERT INTO
    adoption_center_2 (date, type, name, age)
VALUES
    ('2022-06-10', 'Horse', 'Seabiscuit', 4),
    ('2022-07-15', 'Snake', 'Stripes', 8),
    ('2022-08-07', 'Rabbit', 'Hops', 3);

Here is the code for the first DAG:

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

from datetime import datetime, timedelta

create_table_query = '''
CREATE TABLE IF NOT EXISTS animal_adoptions_combined (date DATE, type VARCHAR, name VARCHAR, age INTEGER);
'''

combine_data_query = '''
INSERT INTO animal_adoptions_combined (date, type, name, age) 
SELECT * 
FROM adoption_center_1
UNION 
SELECT *
FROM adoption_center_2;
'''


with DAG('lineage-combine-postgres',
         start_date=datetime(2020, 6, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args = {
            'retries': 1,
            'retry_delay': timedelta(minutes=1)
        },
         catchup=False
         ) as dag:

    create_table = PostgresOperator(
        task_id='create_table',
        postgres_conn_id='postgres_default',
        sql=create_table_query
    ) 

    insert_data = PostgresOperator(
        task_id='combine',
        postgres_conn_id='postgres_default',
        sql=combine_data_query
    ) 

    create_table >> insert_data

The second DAG creates and populates a reporting table (adoption_reporting_long) using data from the aggregated table (animal_adoptions_combined) created in our first DAG:

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

from datetime import datetime, timedelta

aggregate_reporting_query = '''
INSERT INTO adoption_reporting_long (date, type, number)
SELECT c.date, c.type, COUNT(c.type)
FROM animal_adoptions_combined c
GROUP BY date, type;
'''

with DAG('lineage-reporting-postgres',
         start_date=datetime(2020, 6, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args={
            'retries': 1,
            'retry_delay': timedelta(minutes=1)
        },
         catchup=False
         ) as dag:

    create_table = PostgresOperator(
        task_id='create_reporting_table',
        postgres_conn_id='postgres_default',
        sql='CREATE TABLE IF NOT EXISTS adoption_reporting_long (date DATE, type VARCHAR, number INTEGER);',
    ) 

    insert_data = PostgresOperator(
        task_id='reporting',
        postgres_conn_id='postgres_default',
        sql=aggregate_reporting_query
    ) 

    create_table >> insert_data

If we run these DAGs in Airflow and then go to Marquez, we will see a list of our jobs, including the four tasks from the DAGs above.

Marquez Jobs

Then, if we click on one of the jobs from our DAGs, we see the full lineage graph.

Marquez Graph

The lineage graph shows:

  • Two origin datasets that are used to populate the combined data table
  • The two jobs (tasks) from our DAGs that result in new datasets: combine and reporting
  • Two new datasets that are created by those jobs

The lineage graph shows us how these two DAGs are connected and how data flows through the entire pipeline, giving us insight we wouldn’t have if we were to view these DAGs in the Airflow UI alone.
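
If you want to inspect this graph programmatically rather than in the Marquez UI, Marquez also exposes a lineage API. The sketch below queries the graph around the combine job on a local Marquez instance; the nodeId format (job:<namespace>:<job name>) and the response fields shown are based on the Marquez lineage endpoint but can vary between Marquez versions, so treat this as a starting point rather than a definitive client.

import requests

# Fetch the lineage graph around the "combine" job from the local Marquez API.
# The nodeId below assumes the namespace and DAG/task names used in this guide;
# adjust it to match what you see in the Marquez UI.
node_id = "job:example:lineage-combine-postgres.combine"
resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": node_id},
    timeout=10,
)
resp.raise_for_status()
for node in resp.json()["graph"]:
    print(node["type"], node["id"])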

Lineage on Astro

For Astronomer customers using Astro, OpenLineage is integrated and available out of the box. The Lineage tab in the Astronomer UI provides multiple pages that can help you troubleshoot issues with your data pipelines and understand the movement of data across your Organization. For more on lineage capabilities with Astro, check out our documentation or reach out to us.

Limitations

OpenLineage is rapidly evolving, and new functionality and integrations are being added all the time. At the time of writing, the following are limitations when using OpenLineage with Airflow:

  • The OpenLineage integration with Airflow 2 is currently experimental.

  • You must be running Airflow 2.3.0+ with OpenLineage 0.8.1+ to get lineage data for failed task runs.

  • Only the following operators have bundled extractors (needed to collect lineage data out of the box):

    • PostgresOperator
    • SnowflakeOperator
    • BigQueryOperator

    To get lineage data from other operators, you can create your own custom extractor (see the sketch after this list).

  • To get lineage from an external system connected to Airflow, such as Apache Spark, you will need to configure an OpenLineage integration with that system in addition to Airflow.
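
Below is a minimal sketch of what a custom extractor can look like, assuming the BaseExtractor interface from the openlineage-airflow package. The import paths and metadata classes can differ between OpenLineage versions, and the operator and table names here are purely illustrative.

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyOperatorExtractor(BaseExtractor):
    """Hypothetical extractor for a custom MyOperator."""

    @classmethod
    def get_operator_classnames(cls):
        # The operator class names this extractor should handle.
        return ["MyOperator"]

    def extract(self) -> TaskMetadata:
        # self.operator is the operator instance for the running task; pull
        # whatever information you need from it to describe inputs and outputs.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="postgres://localhost:5435", name="public.my_source_table")],
            outputs=[Dataset(namespace="postgres://localhost:5435", name="public.my_target_table")],
        )

Once written, a custom extractor typically needs to be registered with the integration (for example, with an environment variable like OPENLINEAGE_EXTRACTOR_MyOperator pointing to the extractor's import path); check the OpenLineage documentation for the exact mechanism in your version.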
