March 30, 2022

What Is Data Lineage and Why Does It Matter?

Ross Turk, Senior Director of Community, Astronomer

Running a data pipeline used to be a mystical and unapproachable task, one very few of us were prepared to do. It required the coordination of significant financial and operational resources, and most organizations were able to conjure enough appetite to run exactly one.

Now, every part of the modern data stack can be deployed in the cloud. Data engineering is never easy, but these days basic transformations and visualizations can be built from nothing in mere minutes. The cost is low enough to fit into an expense report, if not free. This is why there’s now so much data, in so many different shapes and sizes, being created by an increasingly large group of people from different backgrounds, many of whom aren’t on the data team.

The means of producing datasets have been democratized, to the point where any team can create and share data, and that data can quickly become business-critical. In short, data has become a driver of community, and of all the diversity, conflict, and creative ferment that word can imply.

But this can lead to fragmentation. As if the enormity of data lakes isn’t brain-bending enough, now we have ten thousand data puddles. There’s more to data movement than any individual can comprehend, much less keep track of, without being able to see it. To operate in today’s distributed data ecosystems, you need a complete and up-to-date picture at all times.

And you can’t have one without data lineage.

What Is Data Lineage?

Data lineage is a way of tracing the complex set of relationships that exist among datasets within an ecosystem.

As different systems in an organization produce and consume data, they establish implicit parent/child relationships among datasets. For example, a table of quarterly sales results might depend on a table containing orders, which might depend on a table containing products. As the organization grows and changes, these relationships can become extremely intricate. A data lineage system records those relationships.
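To make the parent/child idea concrete, here is a minimal sketch of how those relationships might be recorded and walked. The dataset names come from the example above; the `PARENTS` mapping and the `ancestors` helper are hypothetical and stand in for whatever a real lineage system stores.

```python
# Hypothetical lineage record: each dataset maps to the datasets it reads from.
PARENTS = {
    "quarterly_sales": ["orders"],
    "orders": ["products"],
    "products": [],
}


def ancestors(dataset, parents=PARENTS):
    """Return every upstream dataset, nearest first (breadth-first walk)."""
    seen = []
    queue = list(parents.get(dataset, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents.get(current, []))
    return seen


print(ancestors("quarterly_sales"))  # ['orders', 'products']
```

Even in this tiny example, the answer to "where does quarterly_sales come from?" requires following more than one hop, which is the part that gets hard to do by hand as the graph grows.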

Why Is Data Lineage Important?

If someone asked you to list all of the ways that a city map is useful, you might not know where to start.

You can survive without one, certainly. You might not always choose the best route to your destination, or know how long it will take to get there. When a friend tells you the name of the street they live on, you might not know where it is. If you’re informed of a major storm in the western half of the city, you might not immediately understand whether you’re in danger if you haven’t already spent time with a map. A map helps you comprehend a complicated system.

There is inherent value in broadening our understanding of something: it helps us act, communicate, and collaborate more effectively.

Let’s do a quick exercise. Think of the last time you made an important decision based on data, and ask yourself whether you understand where it came from. You probably know the most proximate source — a particular warehouse, team, or system — but where does that source get its data from? Investigate by asking questions (or reading code, if that’s your thing) until you get all the way to the bottom of it, until there are no more questions to answer. Congratulations! You’ve just completed your first ad hoc data lineage investigation. (If you’d actually done this, you’d know how much work it can be.)

Now imagine doing all of this while you’re trying to fix something, while people are upset. If you just felt your heart rate go up, that’s normal.

Fortunately, it doesn’t have to be a manual process.

As data moves through a pipeline, jobs are observed and metadata is captured.

Where Data Lineage Meets Data Classification

While understanding the journey of data is crucial, categorizing data based on its sensitivity and importance—known as data classification—complements data lineage by enhancing data security strategies and compliance efforts. Data lineage provides a detailed map of data’s origins, transformations, and destinations, which is invaluable for troubleshooting, decision-making, and compliance.

Data classification further divides this data into categories, making it easier to manage according to its confidentiality level and compliance requirements. Together, these processes ensure a strong framework for managing data across its lifecycle, safeguarding against breaches, and maintaining data integrity.

Data Lineage Tools

OpenLineage

OpenLineage is a broad open-source effort that focuses on establishing a common language for working with lineage. It provides a standard framework for automatically tracing data lineage in real time, giving you a map that you can use to find your way around a complicated and fragmented pipeline. In other words, OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used.
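As a rough illustration of what "a common language" can look like, the sketch below assembles a simplified run event as a plain Python dictionary. The field names are modeled on the general shape of an OpenLineage run event (an event type, a run, a job, and lists of input and output datasets), but this is not validated against the specification and leaves out details such as facets and the schema URL. The `build_run_event` helper, the `example` namespace, and the producer URL are all assumptions made for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example"):
    """Assemble a simplified, OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Datasets are identified by a namespace and a name.
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical identifier for the tool that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    "COMPLETE", "build_quarterly_sales", ["orders"], ["quarterly_sales"]
)
print(json.dumps(event, indent=2))
```

The point of the shared shape is that any tool that runs a job can describe what it read and what it wrote in the same terms, so the events can be collected in one place regardless of where they came from.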

Marquez

Marquez is the reference implementation of OpenLineage. While OpenLineage is a specification that defines how lineage is discussed and described, Marquez is a lineage metadata repository. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralizes dataset lifecycle management, and much more. It enables highly flexible data lineage queries across all datasets, while reliably and efficiently associating (upstream, downstream) dependencies between jobs and the datasets they produce and consume.

Data Lineage Use Cases

Data lineage has the potential to augment every part of the data ecosystem. Let’s have a look at some common use cases:

Data operations

When dealing with a pipeline issue, it’s important to have complete situational awareness. Data lineage can provide the context required to distinguish a root cause from a symptom, obviating the need for impromptu investigation during an outage, when time is critical.
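One way to picture the root-cause-versus-symptom distinction: given a set of datasets that currently look broken, the likely root causes are the ones with no broken ancestor. The sketch below assumes a hypothetical `PARENTS` mapping and a hand-supplied set of failing datasets; a real system would draw both from collected lineage metadata.

```python
# Hypothetical lineage: each dataset maps to the datasets it reads from.
PARENTS = {
    "quarterly_sales": ["orders"],
    "orders": ["products"],
    "products": [],
    "inventory": ["products"],
}


def upstream(dataset, parents):
    """Collect every dataset that the given dataset depends on, directly or not."""
    found = set()
    stack = list(parents.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def root_causes(failing, parents):
    """Failing datasets with no failing ancestor; the rest are likely symptoms."""
    return sorted(d for d in failing if not (upstream(d, parents) & failing))


failing = {"quarterly_sales", "orders"}
print(root_causes(failing, PARENTS))  # ['orders']
```

Here quarterly_sales is failing only because orders is, so attention goes to orders first.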

Data quality

A quality issue with a dataset can have far-reaching consequences, especially for downstream pipeline tasks that are likely unaware they are working with bad data. Data lineage can determine the full scope of the issue and provide a list of downstream datasets that must be refreshed after the quality issue is resolved.
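The "list of downstream datasets that must be refreshed" can be sketched as a graph traversal followed by a topological sort, so that each dataset is refreshed only after the datasets it reads from. The `PARENTS` mapping and both helper functions are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each dataset maps to the datasets it reads from.
PARENTS = {
    "quarterly_sales": ["orders"],
    "orders": ["products"],
    "products": [],
    "inventory": ["products"],
}


def downstream(dataset, parents):
    """Collect every dataset that depends on the given one, directly or not."""
    children = {}
    for child, ups in parents.items():
        for up in ups:
            children.setdefault(up, []).append(child)
    impacted, stack = set(), [dataset]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


def refresh_order(bad_dataset, parents):
    """Order impacted datasets so each comes after the impacted datasets it reads."""
    impacted = downstream(bad_dataset, parents)
    subgraph = {d: [p for p in parents[d] if p in impacted] for d in impacted}
    return list(TopologicalSorter(subgraph).static_order())


print(refresh_order("products", PARENTS))
```

For a quality issue in products, the result contains orders, inventory, and quarterly_sales, with orders always placed ahead of quarterly_sales.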

Data governance

When dealing with sensitive personal or financial information, regulations often require exact knowledge of the extent of its spread throughout various systems. Data lineage can show how datasets were created and consumed, automating the onerous and time-consuming process of compliance certification.

Task Hierarchy Versus Data Lineage

If you’re a user of Airflow, you might be asking yourself, “Isn’t this what I can see in the graph view of my DAG? If I use Airflow for everything, won’t that give me data lineage?”

While the Airflow graph view looks similar, there is a fundamental difference. The graph view shows a hierarchy of tasks, indicating the order in which they must be executed. It’s a task-to-task view, where each node represents a thing that must be done and each connection represents an execution-time dependency. Data lineage, on the other hand, maps dataset-to-dataset relationships, where each connection represents a dataset being consumed or produced.

So, what’s the practical difference?

Other than the dependency made explicit in the DAG, Airflow tasks don’t have to be related to one another in any way. They often are, but they don’t have to be. If you want, you could create a dependency between two tasks that operate on completely unrelated datasets and have completely unrelated purposes. That’s part of the flexibility of Airflow: you can define task relationships in a way that ensures smooth, optimized execution. And when it comes time to debug a DAG, you might not know where to start without first understanding its set of task dependencies. This task-to-task view is extremely useful when debugging an issue that is contained within a single DAG.

However, when troubleshooting a pipeline issue that spans multiple DAGs, or perhaps even multiple orchestration tools, it’s important to understand the extent of the impact. If a job fails to produce fresh data for downstream jobs, for example, the team operating that job might not be aware. Or a team might waste time troubleshooting a failure without knowing it was caused by an upstream job completely outside of their control. Using dataset-to-dataset mapping, these downstream and upstream jobs can easily be identified, even if they span architectural or organizational boundaries. In this way, data lineage is useful for troubleshooting issues that have widespread and unknown consequences.
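The contrast between the two views can be sketched with plain data structures. Everything here is hypothetical: the task names, the `TASK_EDGES` list standing in for one DAG's execution order, and the `JOBS` mapping standing in for the inputs and outputs a lineage system would observe. None of it uses Airflow's API.

```python
# Task-to-task view: execution order within a single DAG (hypothetical names).
TASK_EDGES = [
    ("load_products", "load_orders"),
    ("load_orders", "send_report_email"),
]

# What each job reads and writes, as a lineage system might observe it.
JOBS = {
    "load_products": {"inputs": [], "outputs": ["products"]},
    "load_orders": {"inputs": ["products"], "outputs": ["orders"]},
    "send_report_email": {"inputs": [], "outputs": []},
    # Assumed to live in a different DAG, so it never appears in TASK_EDGES.
    "build_quarterly_sales": {"inputs": ["orders"], "outputs": ["quarterly_sales"]},
}


def dataset_edges(jobs):
    """Derive dataset-to-dataset edges from each job's inputs and outputs."""
    edges = set()
    for job in jobs.values():
        for source in job["inputs"]:
            for target in job["outputs"]:
                edges.add((source, target))
    return sorted(edges)


print(dataset_edges(JOBS))
# [('orders', 'quarterly_sales'), ('products', 'orders')]
```

The task edges say that send_report_email runs after load_orders, even though no data flows between them. The dataset edges say that quarterly_sales depends on orders, even though the job that builds it is not in the same DAG.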

How Astronomer Can Help

Data lineage is a big topic, but it’s easy to get started. Watch this webinar to learn how Airflow and OpenLineage work together. Then, get hands-on with this tutorial that shows how data lineage can automate a frequently repetitive task: backfills.

Astronomer brings Airflow and OpenLineage together to give you unmatched visibility into your data ecosystem — no assembly required.

With our lineage support, you can:

Resolve Data Outages Faster Identify root causes, determine impacts, and remediate issues that cause data downtime with less effort.

Make Sense of Cross-Team Dependencies Explore and understand complex dependencies across pipelines, environments, and clouds.

Visualize Quality and Performance Over Time Pinpoint bad data and bottlenecks sooner, and quickly remediate impacts throughout your data ecosystem.

Read our blog "Ways to Extract Data Lineage with Airflow" to learn more about implementing these solutions.

To learn more about how Astronomer brings together Airflow and OpenLineage, connect with us here.