November 3, 2022

OpenLineage: Where It Came from and What Comes Next

Craig Hubert, Astronomer

Lineage is becoming an essential tool for data teams operating within distributed data ecosystems. The reason is simple: if one of your many DAGs fails, how are you supposed to solve the problem quickly without knowing what the problem is, where it’s located, and how to prevent it from happening in the future?

The answer for many, increasingly, is OpenLineage, a framework that helps make data pipelines observable by automatically collecting and correlating detailed information about their operation and the data movement within them. A better view of data pipelines saves time and resources across an organization by enabling data engineers to quickly find, fix, and prevent complex operational issues and explore and understand elaborate dependencies across pipelines, environments, and clouds.

The arrival of OpenLineage, which has been around since 2020, coincides with a shift in thinking around data pipelines. In recent years, the growth of the data industry — including an increasing number of tools that are used for managing data pipelines and the increasing maturity of open source projects — has led to a broader push toward democratization of the entire data ecosystem. Instead of questions around what kind of pipeline to build and how to build it, data teams are now asking, “What is going on inside my pipelines and what can I learn from them?”

Currently, OpenLineage is at an “inflection point,” according to Julien Le Dem, the Chief Architect at Astronomer and one of the founders of OpenLineage. “In the past six months, we’ve had more and more people reaching out to us about OpenLineage. It’s been a snowball effect — people are starting to see traction and adoption of the tool, and the more demand there is for lineage the more incentive it creates for the entire ecosystem to join.”

OpenLineage didn’t arrive out of thin air, of course. It’s been a long, winding road toward implementation, involving commitment by many people to a number of open source projects that eventually led to the creation of the OpenLineage standard. And Le Dem, who has spearheaded the project since its inception, says there’s more on the horizon.

The Road to Data Lineage

Le Dem, who grew up in Normandy, began engaging with the data ecosystem at Yahoo, where he worked from 2007 to 2011. “I was building platforms on top of Hadoop for supporting Yahoo verticals,” he says. “There was a lot of acquisition of data feeds, extracting information, and connecting things, and we built a platform on top of Hadoop to help with that.” While at Yahoo, Le Dem began contributing to Apache Pig, an open source platform for analyzing large datasets.

After moving to a new role at Twitter in 2011, where he served as the technical lead for processing tools on the company’s data platform, Le Dem ran into an issue. He and his team were responsible for storing all the analytics data on Hadoop, making it available, and processing it for data scientists and analysts. But Hadoop wasn’t always sufficient for their needs. “At the time, Hadoop could store a lot of data,” he says. “But it was high-latency querying: a person would launch a job, go get coffee, and come back.” Twitter was also using Vertica, a columnar data storage platform with a much faster query engine that didn’t scale as well as Hadoop, meaning it couldn’t hold as much data.

There was a need for something that combined the strengths of the two tools. “I began looking into how we can make Hadoop a little bit more like Vertica,” Le Dem says. “I reread Google’s Dremel paper and started prototyping the algorithm that’s described in that paper. Hadoop was inspired by Google papers about Google File System and MapReduce, so it seemed like a natural extension.” Le Dem called the project Red Elm — an anagram of “Dremel” — and began reaching out to others to gauge interest.

One of the first groups to respond to Le Dem was the team at Cloudera, which was already working on a similar project for Apache Impala, a distributed SQL query engine for Hadoop. “They were very aligned with what I was trying to do, which was build something like Vertica on top of Hadoop,” Le Dem says, and the teams were soon collaborating. “The Impala team at Cloudera brought actual query engine requirements. On my end, I brought all the compatibility with the existing Hadoop stack.”

The result, in 2013, was Apache Parquet, which quickly attracted contributors. “Criteo, an ad company in Europe, and Netflix started collaborating with us,” Le Dem says. “And from there, we reached escape velocity — it just kept growing because there was a critical mass of people using it.”

Building a Map for Data Pipelines

In 2017, when Le Dem was the Senior Principal Engineer at WeWork, he began thinking about the entire data platform, as opposed to the narrower focus of Parquet. “I was convinced there was a big missing piece in collecting all the jobs and all the datasets,” he says. “So that's where we started this open source project called Marquez,” a precursor to OpenLineage.

In retrospect, “we were trying to build Google Maps for data pipelines,” Le Dem says. “There was a need to build a map of how all the jobs and datasets depend on each other. It was about providing this visibility, understanding where the data you're consuming is coming from, and understanding where the data you’re producing is going to.” Le Dem describes it as finding the “root cause” of the problem as opposed to only identifying a symptom.
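
The “map” idea can be illustrated with a small sketch. This is a toy model for this article only, not the Marquez or OpenLineage data model: the `LineageGraph` class, its method names, and the job and dataset names are all hypothetical. It records which datasets each job reads and writes, then walks the graph to answer the two questions Le Dem describes: where does this data come from, and where does it go?

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy map of jobs and datasets (hypothetical, for illustration only)."""

    def __init__(self):
        self._upstream = defaultdict(set)    # node -> nodes it depends on
        self._downstream = defaultdict(set)  # node -> nodes that depend on it

    def record_run(self, job, inputs, outputs):
        """Record that `job` read the `inputs` datasets and wrote the `outputs`."""
        for dataset in inputs:
            self._add_edge(dataset, job)
        for dataset in outputs:
            self._add_edge(job, dataset)

    def _add_edge(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def _walk(self, start, edges):
        """Breadth-first traversal returning every node reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream_of(self, node):
        """Everything `node` depends on, directly or indirectly."""
        return self._walk(node, self._upstream)

    def downstream_of(self, node):
        """Everything that depends on `node`, directly or indirectly."""
        return self._walk(node, self._downstream)


# Hypothetical pipeline: two jobs, four datasets.
graph = LineageGraph()
graph.record_run("job:clean_orders", ["ds:raw_orders"], ["ds:orders"])
graph.record_run("job:daily_revenue", ["ds:orders", "ds:fx_rates"], ["ds:revenue"])

print(sorted(graph.upstream_of("ds:revenue")))
print(sorted(graph.downstream_of("ds:raw_orders")))
```

If `ds:revenue` looks wrong, the upstream walk lists every job and dataset that could be the root cause, and the downstream walk from `ds:raw_orders` shows everything a bad feed would affect.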

As with other projects Le Dem had worked on, it was important that Marquez be open source. “We didn’t want to just build Marquez from WeWork’s perspective,” he says. “We reached out to other companies and asked, ‘is this something everybody needs?’ Instead of doing something proprietary, let's collect feedback from other companies and make sure Marquez is generic in the right way, and we'll eventually get some adoption. We don’t have to carry the load of building it all on our own.”

Le Dem left WeWork in 2020, wanting “to push the vision of data observability” as the mission of a new company. This led to Datakin, which was acquired by Astronomer in 2022. When Datakin started, the product was using Marquez as its foundation, but the leadership team decided the best way forward was to spin off OpenLineage as a separate project. “OpenLineage was the most reusable part of Marquez,” Le Dem says. “We wanted to focus on the smallest, most reusable things, in order to create momentum and build adoption.”

How OpenLineage Is Becoming the Industry Standard

Le Dem says the “pie in the sky” vision for OpenLineage was for it to become the standard across the industry, a shift that is already underway with contributions from established companies like Microsoft, Snowflake, and Manta. But lineage also has wider implications for the entire data ecosystem. “With lineage, you start understanding how everything is being consumed and you understand the workload as a whole,” Le Dem says. With this knowledge, “you can optimize, reduce costs, and reduce duplication. It's really about capturing the state of the system so there's a lot we can build around it to enable people to understand, optimize, and be compliant with their data.”

Lineage has become important to the banking industry, for example, where regulation requires banks to provide column-level lineage, while privacy regulations such as GDPR necessitate knowing where private user data is going and when and how it should be deleted. “OpenLineage is very extensible by design, so we can enable the collecting of data for many different things,” Le Dem says.

In the near term, there’s an expectation that lineage will become more complex as data ecosystems grow, and that lineage will increasingly work alongside orchestration to automate operational tasks such as root cause analysis and backfills — in short, to make pipelines better.

Ultimately, Le Dem sees himself as playing one role among many in the continued growth of OpenLineage. To describe this process, he likes to tell a story about soup from a children’s book. In the story, a man sets up a huge pot of boiling water in a town square and puts a stone in it. When people ask him what he’s doing, he says he’s making stone soup for everyone in town. But stones don’t have taste, so he invites neighbors to add any ingredients they wish to improve the soup. They toss in carrots, onions, and spices, turning the soup into a collective effort that meets the needs of the entire town.

“The more people contribute, the more valuable it becomes for everybody,” Le Dem says. “It’s still stone soup — I’m just stirring the pot.”

