OpenLineage Is on the Rise in 2023

  • Ross Turk
  • Steve Swoyer

If you don’t know about Data Council, you’re missing out! Data Council conferences offer great opportunities to network with other data professionals, share knowledge, acquire new skills, and soak up expertise about data engineering, data science, and anything data-related.

Data Council Austin will take place March 28-30 at the AT&T Hotel and Conference Center in Austin, Texas. The schedule features tracks in data engineering and infrastructure, data science and algorithms, and applied / generative AI, among others.

Just as important, Data Council Austin also provides an excellent opportunity to learn about OpenLineage, the robust open-source standard for collecting and analyzing lineage metadata. If you’re already familiar with OpenLineage, join us at Data Council Austin for a chance to dig into it more deeply. And if you’re not, read on to see why OpenLineage helps with compliance and data governance, and provides a dependable foundation for data-driven decision-making.

We’d love to connect with you at Data Council Austin! Come hang out with fellow data practitioners at the Data Wranglers Happy Hour on Wednesday, March 29. Then join us the next day — Thursday, March 30, 12:15-1:30pm — for an overview of OpenLineage.

OpenLineage Explained

Achieving an end-to-end view of data lineage is one of the most elusive goals in all of data management and data governance. The heterogeneity of tools, practices, and processes and the inevitability of business change are well-known complicating factors. But for a long time, one of the biggest roadblocks to achieving an end-to-end view of data lineage was the lack of an open and shared standard for lineage metadata.

Enter OpenLineage, the open standard that integrates with a large — and growing! — number of data engineering, data cataloging, and metadata management products, tools, and technologies. Pipeline operators can use OpenLineage to capture lineage metadata from their workflows and persist it in a supported backend, such as Astro, Egeria, Manta, or Marquez, the last of which is a reference implementation of the OpenLineage specification.
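To make "lineage metadata" a bit more concrete, here is a minimal sketch of the kind of run event a producer might emit when a job starts and completes. It uses only the Python standard library and the core run-event fields (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL). The job, dataset, and producer names are made up for illustration, and the schemaURL version is an assumption, so check the current spec before relying on it.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumption: illustrative values only; use the spec version and producer URI
# that match your own deployment.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"
PRODUCER = "https://example.com/hypothetical-producer"


def build_run_event(event_type, run_id, job_namespace, job_name, inputs, outputs):
    """Build a run event as a plain dict, ready to serialize as JSON."""

    def dataset(namespace, name):
        # A dataset is identified by a namespace plus a name.
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ns, name) for ns, name in inputs],
        "outputs": [dataset(ns, name) for ns, name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# One run of a hypothetical job: the same runId ties START and COMPLETE together.
run_id = str(uuid.uuid4())
inputs = [("warehouse", "raw.orders")]
outputs = [("warehouse", "analytics.daily_orders")]

start = build_run_event("START", run_id, "demo", "daily_orders_rollup", inputs, outputs)
complete = build_run_event("COMPLETE", run_id, "demo", "daily_orders_rollup", inputs, outputs)

print(json.dumps(complete, indent=2))
```

A backend that receives both events can connect raw.orders to analytics.daily_orders through the job that ran between them, which is the basic building block of a lineage graph.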

Support for OpenLineage continues to build. The OpenLineage project just capped its most successful year ever, with a surge in new active contributors, built-in support for OpenLineage in product offerings from Manta, Keboola, Microsoft Purview, and other vendors, and, not least, the release of new OpenLineage Airflow extractors that support Trino, Amazon S3, and Amazon SageMaker, among other sources.

This year, OpenLineage is poised to get even better, with improved stability and performance via OpenLineage proxies, which reduce latency for OpenLineage clients and also buffer lineage events, should the OpenLineage backend become unavailable. Plus, the influx of new community participants is helping fuel extraordinarily important and foundational conversations about the future of the spec, along with the use cases it can apply to. Other changes on tap for 2023 and beyond include improved integration with Apache Airflow, and compatibility between OpenLineage and additional data sources, data catalogs, and metadata management technologies.
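The buffering idea can be sketched in a few lines. To be clear, this is not the OpenLineage proxy itself: BufferingProxy, its send callable, and the flaky backend below are hypothetical stand-ins, written only to show how events can be held while a backend is down and delivered in order once it returns.

```python
from collections import deque


class BufferingProxy:
    """Hypothetical sketch: queue lineage events and forward them in order."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # callable that delivers one event to the backend
        # With maxlen set, the oldest event is dropped once the buffer is full.
        self._buffer = deque(maxlen=max_buffered)

    def emit(self, event):
        """Accept an event from a client, then try to drain the buffer."""
        self._buffer.append(event)
        return self.flush()

    def flush(self):
        """Forward buffered events oldest first; stop at the first failure."""
        delivered = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # backend still unavailable, keep the event for later
            self._buffer.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self):
        return len(self._buffer)


# Simulated backend that is down at first and then recovers.
state = {"up": False}
received = []


def flaky_backend(event):
    if not state["up"]:
        raise ConnectionError("backend unavailable")
    received.append(event)


proxy = BufferingProxy(flaky_backend)
proxy.emit({"eventType": "START"})
proxy.emit({"eventType": "COMPLETE"})
pending_while_down = proxy.pending  # both events are held

state["up"] = True
delivered_after_recovery = proxy.flush()  # both events go out, in order
```

The client only ever talks to the proxy, so an outage in the backend does not have to slow down or fail the pipeline that is emitting events.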

2022 Was a Landmark Year for OpenLineage

According to data from the Linux Foundation, which hosts OpenLineage as one of 35 subprojects in its Artificial Intelligence and Data (AID) foundation, the number of new contributors to OpenLineage increased by 172% between February 2022 and February 2023, with a 163% increase in active contributors. During this same period, the average active contributor pushed 2,780 commits, while the 12-month growth rate for commits increased by a staggering 156%.

“Contributor strength” is a metric used to quantify the skills, abilities, and expertise of all of the people who contribute to an open-source project. OpenLineage’s contributor strength increased by 130% over this period. For all AID subprojects, inclusive of OpenLineage, contributor strength grew by 21% during the same period, and for all projects hosted by the Linux Foundation, it grew by just 9%.

Industry Support Coalesces around OpenLineage

In addition to Astronomer, 22 other organizations contribute to OpenLineage, including several Fortune 500 and Global 500 companies.

Many open-source projects and commercial software products now offer pre-built support for OpenLineage, either as consumers (Apache Egeria, Manta) or producers (Apache Airflow, Dagster, Databricks, dbt, Apache Egeria, Apache Flink, Great Expectations, Keboola, Snowflake, and Apache Spark) of OpenLineage metadata.

Astro, the fully managed, Airflow-powered orchestration platform from Astronomer, is both: with Astro, your Airflow operators and OpenLineage extractors automatically produce OpenLineage metadata, which gets consumed by Astro’s observability backend.

Airflow Primed to Deliver First-Class Integration with OpenLineage

Airflow, too, is ramping up to provide an enriched OpenLineage experience, courtesy of a proposed Airflow-OpenLineage provider that will be developed and maintained by the Apache Airflow project.

The new provider will be built into the base Airflow Docker image, making it easier to configure and use. Another proposed change (a new, optional API for Airflow operators) will eliminate the need for custom extractors: so long as an operator takes advantage of the new Airflow-OpenLineage integration (and includes a set of relevant unit tests), there will be no need to create extractors.

Not only will this make it easier to collect lineage events from your Airflow tasks, but it will also make this process much more reliable: the current OpenLineage-Airflow implementation is tightly coupled with Airflow’s internals, such that changes to those internals (or to Airflow’s existing OpenLineage provider) can break integration with the OpenLineage backend. The proposed changes aim to create a more flexible Airflow-OpenLineage implementation, with more robust integration.
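The difference between the two approaches can be sketched without importing Airflow at all. Everything here is hypothetical: the operator classes, the get_lineage method, and the extractor registry are invented for illustration and are not the API proposed for Airflow. The point is only to show why lineage described by the operator itself is less brittle than an external extractor that reaches into the operator's internals.

```python
class CopyTableOperator:
    """Hypothetical operator that describes its own lineage."""

    def __init__(self, source, target):
        self.source = source
        self.target = target

    def get_lineage(self):
        # Hypothetical method name: the operator reports its own datasets.
        return {"inputs": [self.source], "outputs": [self.target]}


class LegacyCopyOperator:
    """Hypothetical operator with no lineage support of its own."""

    def __init__(self, source, target):
        self._src = source  # private attributes that an extractor has to know about
        self._dst = target


def legacy_copy_extractor(operator):
    # Tightly coupled: renaming _src or _dst in the operator breaks this function.
    return {"inputs": [operator._src], "outputs": [operator._dst]}


# Registry mapping operator classes to their external extractors.
EXTRACTORS = {LegacyCopyOperator: legacy_copy_extractor}


def collect_lineage(operator):
    """Prefer lineage reported by the operator; fall back to an extractor."""
    describe = getattr(operator, "get_lineage", None)
    if callable(describe):
        return describe()
    extractor = EXTRACTORS.get(type(operator))
    if extractor is not None:
        return extractor(operator)
    return {"inputs": [], "outputs": []}  # no lineage available


modern = collect_lineage(CopyTableOperator("raw.orders", "analytics.orders"))
legacy = collect_lineage(LegacyCopyOperator("raw.users", "analytics.users"))
```

Both calls return the same shape of result, but only the first keeps working when the operator's internals change, because the operator and its lineage logic live in one place and can be covered by the same unit tests.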

See OpenLineage in Action

If last year was huge for OpenLineage, 2023 is shaping up to be even more exciting.

Catch the excitement at this month’s Data Council Austin conference, where, on Thursday, March 30, you can participate in a lunchtime session on OpenLineage. Come discover how OpenLineage simplifies compliance and auditing, promotes collaboration, improves data quality, accelerates troubleshooting, and provides a trusted foundation for data-driven decision-making.

Intrigued? Use the code OpenLineage20 to save 20% on registration.
