How we optimized the Registry for performance across millions of page views

  • Julian LaNeve
  • Ian Moritz

Part of the appeal of Airflow is its rich ecosystem: if you’re working with common data tools and products, chances are there’s already functionality for interacting with them in Airflow via the provider system. An ecosystem like this brings massive benefits, but also natural challenges. With over 1,000 operators available, it’s difficult to know which operator to use and whether a given operator has the functionality you require. As a data engineer looking for a new operator for your use case, you’re stuck Googling and searching GitHub, and once you find one, you typically have to read its source code to understand what it does and how to configure it.

In 2021, we first released the Astronomer Registry to help solve these challenges by making Airflow providers, operators, and example DAGs easier to discover and use. If you’re curious, you can learn more about the initial release in the original blog post.

Building the Registry was no small undertaking: we had to programmatically parse Airflow operator source code across many repositories and aggregate it in a way that is consistent, searchable, and easy to use. Early on, we accepted that there were likely to be data issues across the board, given that this was the first attempt at standardizing this information, so we wanted an architecture that would let us iterate quickly and patch data issues as we noticed them. Ultimately, we landed on the following:

Airtable was crucial here - as we discovered data quality issues with parameter descriptions, types, and names, we could immediately fix the issue in Airtable’s UI while we waited for the docs fix to be merged and released in the provider package itself. This worked really well! We scaled to millions of page views with no issues and were able to react very quickly to reported issues.

The big challenge with this architecture was that it was completely independent of everything else at Astronomer. That made new feature development difficult and made it hard to bring the Registry closer to Astro, since there was no traditional API and engineers needed to be onboarded to an unfamiliar tech stack. The data hosted in the Registry is very helpful, and tying into the authentication infrastructure we have at Astronomer would let us provide more personalization, so we decided to rebuild it.

The stack is similar, but there are a few key differences. Now, we have:

One of the guiding principles of the new Registry was performance. Since it’s responsible for serving relatively static data, we wanted it to respond as quickly as possible. We designed the database with normalized tables to minimize redundant data and make the schema easy to understand and maintain, but it wasn’t as quick as we wanted. The design called for multiple joins per read request, which slowed down our read operations. We didn’t want to change the schema, so we started looking for other options.
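To make the cost of a normalized design concrete, here is a minimal in-memory sketch in Go. It is not the Registry’s actual schema or code: the `Provider`, `Release`, and `Operator` tables, the `OperatorPageByName` function, and the sample rows are all hypothetical, and slices stand in for database tables. The point is only that one page read has to walk three tables before it can return anything.

```go
package main

import "fmt"

// Hypothetical normalized tables: each entity lives in its own table and
// points at its parent by ID, so no value is stored twice.
type Provider struct {
	ID   int
	Name string
}

type Release struct {
	ID         int
	ProviderID int
	Version    string
}

type Operator struct {
	ID        int
	ReleaseID int
	Name      string
}

// OperatorPage is the flattened shape a single page needs to render.
type OperatorPage struct {
	Provider string
	Version  string
	Operator string
}

// DB stands in for a relational database (an assumption for illustration).
type DB struct {
	Providers []Provider
	Releases  []Release
	Operators []Operator
}

// OperatorPageByName joins operators -> releases -> providers at read time.
// Every call repeats the same work, even though the data rarely changes.
func (db *DB) OperatorPageByName(name string) (OperatorPage, bool) {
	for _, op := range db.Operators {
		if op.Name != name {
			continue
		}
		for _, rel := range db.Releases {
			if rel.ID != op.ReleaseID {
				continue
			}
			for _, p := range db.Providers {
				if p.ID == rel.ProviderID {
					return OperatorPage{Provider: p.Name, Version: rel.Version, Operator: op.Name}, true
				}
			}
		}
	}
	return OperatorPage{}, false
}

// sampleDB returns made-up rows used only for this sketch.
func sampleDB() *DB {
	return &DB{
		Providers: []Provider{{ID: 1, Name: "amazon"}},
		Releases:  []Release{{ID: 10, ProviderID: 1, Version: "1.0.0"}},
		Operators: []Operator{{ID: 100, ReleaseID: 10, Name: "S3ListOperator"}},
	}
}

func main() {
	db := sampleDB()
	page, ok := db.OperatorPageByName("S3ListOperator")
	fmt.Println(page, ok)
}
```

A real database would use indexes rather than nested loops, but the shape of the problem is the same: the join is recomputed on every request.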

Materialized views are used pretty frequently in analytics applications, but they’re less common in more traditional applications. A materialized view stores the result of a query, joins included, so a read can fetch the precomputed rows instead of running the joins again; the tradeoff is that the view has to be refreshed when the underlying data changes. While the frequency of reads on the Registry is quite high, the frequency of writes is pretty low - they only happen when a user wants to publish a new provider, or a provider publishes a new release. That makes the refresh cost easy to absorb, and creating materialized views for the common read operations keeps those operations as quick as possible. Given these advantages (and the fact that we’re working with relatively small datasets), we decided materialized views were the way to go.
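The read-heavy, write-light tradeoff can be sketched in a few lines of Go. This is an in-memory analogy rather than a database materialized view, and it is not the Registry’s implementation: the `Store` type, its `Publish`, `Lookup`, and `refresh` methods, and the sample data are all hypothetical. The joined result is computed once per write and every read is a single lookup.

```go
package main

import "fmt"

// Hypothetical normalized base tables.
type Provider struct {
	ID   int
	Name string
}

type Operator struct {
	ID         int
	ProviderID int
	Name       string
}

// OperatorRow is one row of the precomputed, already-joined result.
type OperatorRow struct {
	Provider string
	Operator string
}

// Store keeps the base tables plus a "view" keyed by operator name.
// It is not safe for concurrent use; a real service would add locking.
type Store struct {
	providers map[int]Provider
	operators []Operator
	view      map[string]OperatorRow
	refreshes int // how many times the join has been recomputed
}

func NewStore() *Store {
	return &Store{
		providers: map[int]Provider{},
		view:      map[string]OperatorRow{},
	}
}

// refresh recomputes the join over the base tables and swaps in the result,
// playing the role of refreshing a materialized view.
func (s *Store) refresh() {
	view := make(map[string]OperatorRow, len(s.operators))
	for _, op := range s.operators {
		if p, ok := s.providers[op.ProviderID]; ok {
			view[op.Name] = OperatorRow{Provider: p.Name, Operator: op.Name}
		}
	}
	s.view = view
	s.refreshes++
}

// Publish is the rare write path: update the base tables, then refresh.
func (s *Store) Publish(p Provider, ops ...Operator) {
	s.providers[p.ID] = p
	s.operators = append(s.operators, ops...)
	s.refresh()
}

// Lookup is the frequent read path: one map access, no join.
func (s *Store) Lookup(operator string) (OperatorRow, bool) {
	row, ok := s.view[operator]
	return row, ok
}

func main() {
	s := NewStore()
	s.Publish(Provider{ID: 1, Name: "amazon"},
		Operator{ID: 10, ProviderID: 1, Name: "S3ListOperator"})

	// Many reads, none of which recompute the join.
	for i := 0; i < 1000; i++ {
		s.Lookup("S3ListOperator")
	}
	row, ok := s.Lookup("S3ListOperator")
	fmt.Println(row, ok, "refreshes:", s.refreshes)
}
```

Rebuilding the whole view on each write is deliberately naive. It is only reasonable here because writes are rare and the dataset is small, which are the same conditions described above.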

With the new database design and a Golang-based API, the new Registry is several times faster: most pages are down from ~1 second on the old Registry to just a few hundred milliseconds on the new version!

We’re going to keep making improvements to the Registry and build new features on top of it. Stay tuned for more, and in the meantime, go check it out!
