October 4, 2022

Expanding Data Access and Exchange Inside a Company

Craig Hubert, Astronomer

One of the major hurdles that companies are facing is how to balance the need for autonomy at the team level with the importance of data exchange across multiple teams. How do you empower teams to build their own data products and share insights with other teams? How do you make sure the data being shared is clean and trustworthy?

This desire for increased data awareness provides opportunities for data scientists to empower colleagues by creating a centralized ecosystem that handles operational tasks more efficiently. With operational burdens out of their hands, data teams have more time to focus on analysis instead of maintaining pipelines. And with the help of Astro, the fully managed service from Astronomer built on top of Apache Airflow®, siloed data teams are brought together under a common orchestration framework with shared, accessible data.

To explore some of these questions and opportunities, we spoke with Astronomer’s VP of Data & Machine Learning, Steven Hillion, and Taylor Merrick, a Senior Data Engineer, about the technical ecosystem the data science team at Astronomer has built, the challenges of integrating data across the company, and expanding data exchange.

Answers have been lightly edited for clarity and length.

How is the data science team expanding data access and exchange across Astronomer?

Taylor Merrick: What we’ve been working through is how different teams are enabled to use our data and interact with the data science team. At Astronomer there are many different types of data users, and many people are interested in making strategic business decisions based on data. So we’re really trying to enable every kind of user on their data journey, and define how we help them go from a very basic level of using data (where they might, for example, just want dashboards) to contributing to our general data warehouse.

What sparked these conversations?

TM: We’re at a company with a lot of data engineers who can build their own pipelines and use data like they’re siloed data teams. We want to ensure that there’s a source of truth and trust in the data and that the right people are using the correct data. Astronomer is moving so quickly that our small data science team can’t take on every team’s data needs. So, we’re at this point now where the data science team has operationalized a lot of things. We can now help a team like Customer Success, who have a lot of data needs and have started building out their own data team, as well as other teams who don’t have their own siloed data team.

Steven Hillion: This is what you want: a centralized team that is managing standards. If you look right now, there are more people here outside the data science team doing data analytics and data engineering than on the data science team, which is awesome. Twenty percent of the company is creating reports, and the people creating those reports are also building the pipelines. And 50% of the company regularly logs in to view the reports we and others create.

Can you describe what the data team has built for its internal use and how that is introduced to other teams?

SH: We’ve been creating a centralized data warehouse and technical ecosystem for running operational analytics. We felt like the technical ecosystem, and at the heart of that, the data warehouse, had reached a level of maturity where we could start inviting other people in, which would help us standardize best practices, data quality, and re-use of common metrics across the company and make them more efficient.

What does it mean to invite other teams into the ecosystem?

TM: One good example is a team of two people here at Astronomer handling any sales data requests. They’re building their own dashboards, building their own pipelines. For us, now that the data science team is at a point where we can look into what they’re doing and potentially leverage their data, a lot of the work is understanding what is going on with their data and how we can best integrate it.

What are some challenges of integrating data across the company?

TM: Using the previous example, the purpose and need on their side is that they want to perform data analysis, to build reports, but they don’t want to have to maintain their data pipelines and set up ingestion of new data sources. So that’s where we see helping them initially — we’ve already established that we can handle and maintain the operational part. A lot of our pipelines are repeatable. So let’s remove that burden for them so they’re focusing on data analysis work, which they probably prefer.

SH: What Taylor just said about the ingest side of things is really important because we have mechanisms for doing that, which make it pretty easy. We’re saying, “join us in this system we’ve built.” The price of admission is actually not that high. It means you have to use Airflow, but we’ve got tools and utilities that make that easy. Some of those tools the data science team has built ourselves, like custom operators and task groups, and others are components of Astro, like our soon-to-be-announced cloud-based IDE for building Airflow DAGs, and the Astro Python SDK. The value for other teams is that we will take care of all this new ingestion of data sources. So once you’re in the door, you have Salesforce data, you’ve got marketing data, and this is all clean, documented, and QA’d.

What specific analytic products or services is the data team delivering?

SH: Cleansed datasets are one type of product that other people can use. We have a multilayer process, where you ingest the raw data at the bottom, then it goes through an ETL process to get cleansed data and simple fields that have business meaning. Then there’s an aggregated metrics and reporting layer, the external interface of those data models. It’s very easy to navigate, and we’ve created mechanisms using Airflow to document all of this stuff. We’re also doing predictive modeling. Another data product we have is a well-documented graph of all the datasets in our system through lineage, which is really awesome.
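
As a rough illustration of the layering Hillion describes, the sketch below moves a handful of records from a raw layer, through a cleansing step, to an aggregated metric. It is plain Python rather than Airflow, and the field names, sample rows, and the `cleanse` and `aggregate` helpers are all hypothetical; nothing here is taken from Astronomer’s actual pipelines.

```python
from collections import defaultdict

# Hypothetical raw records, as they might land from an ingestion job.
RAW_ROWS = [
    {"acct": " Acme ", "amt": "120.50", "stage": "closed_won"},
    {"acct": "acme", "amt": "79.50", "stage": "closed_won"},
    {"acct": "Globex", "amt": "", "stage": "open"},
    {"acct": "Globex", "amt": "300", "stage": "closed_won"},
]


def cleanse(rows):
    """Cleansed layer: normalize names, cast types, drop unusable rows."""
    cleaned = []
    for row in rows:
        if not row["amt"]:
            continue  # skip rows with no amount (assumed rule)
        cleaned.append(
            {
                "account_name": row["acct"].strip().lower(),
                "amount_usd": float(row["amt"]),
                "is_won": row["stage"] == "closed_won",
            }
        )
    return cleaned


def aggregate(cleaned):
    """Metrics layer: total won amount per account."""
    totals = defaultdict(float)
    for row in cleaned:
        if row["is_won"]:
            totals[row["account_name"]] += row["amount_usd"]
    return dict(totals)


cleansed_rows = cleanse(RAW_ROWS)
won_by_account = aggregate(cleansed_rows)
```

In an orchestrated setting, each layer would typically be its own task, so that a failure during cleansing stops bad rows before they reach the reporting layer.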

What is the data science team doing now that they couldn’t do at the beginning of this year?

TM: Scale. Since our process is repeatable and easily understood, it gives us the ability to handle more requests and scale that way, but it also gives us the ability to scale outwards in terms of other people besides the data science team contributing.

SH: If we didn’t have Astro, I would have to rethink much of this activity regarding provisioning systems and extending the cloud infrastructure. With Astro, it just happens, which is amazing.

What best practices are you defining around the data used within Astronomer?

TM: There are general guidelines. But what we’re trying to figure out right now is the role of the data science team in establishing these practices versus allowing people to run on their own.

SH: There’s a set of standards that we’ve documented, but I don’t think we’re ruthless in adhering to them. Within the data science team, we are ruthless. But beyond that, it’s a bit through osmosis. We have code reviews — if you want to be under our umbrella, running in our environment, and being on our list of dashboards, then you have to do code reviews and ensure you’re adhering to our naming and schema standards.
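
The interview does not spell out what the naming and schema standards are. Purely as a hypothetical sketch, a lightweight check along these lines could run ahead of a code review. The allowed schema names, the snake_case rule, and the `check_table_name` helper are assumptions made for illustration only.

```python
import re

# Assumed conventions, for illustration only.
ALLOWED_SCHEMAS = {"raw", "cleansed", "metrics"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_table_name(qualified_name):
    """Return a list of problems found in a 'schema.table' name."""
    schema, sep, table = qualified_name.partition(".")
    if not sep:
        return [f"{qualified_name!r} is missing a schema prefix"]

    problems = []
    if schema not in ALLOWED_SCHEMAS:
        problems.append(f"schema {schema!r} is not one of {sorted(ALLOWED_SCHEMAS)}")
    if not SNAKE_CASE.match(table):
        problems.append(f"table {table!r} is not snake_case")
    return problems


ok = check_table_name("metrics.weekly_active_users")
bad = check_table_name("Sandbox.WeeklyUsers")
no_schema = check_table_name("orders")
```

Returning a list of problems, rather than raising on the first one, lets a reviewer see every deviation in a single pass.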

What’s on the horizon?

SH: I certainly think more of the same, meaning we want lots more people from other teams contributing more formally. I don’t mean “formal” in a sclerotic way, but with an understanding of our standards and best practices that makes contributing easier.

TM: Because we have so many data-driven people, our company is positioned to get to a point where everyone’s contributing to this and increasing our available, clean data, and where we, the data science team, aren’t solely responsible for it.

SH: Right now, if you want to get admission to our technical ecosystem, you are either just writing a DAG, which requires a certain skill set, or you are providing us with some code and we’ll build the DAG for you. We’re starting to encourage people to do that part themselves, encouraging them to use our cloud-based IDE, which is a really good way of handling the handover. Then we can think of our ecosystem as a portal that says: if you’ve got some SQL code or some Python code you want to run, don’t just hand us that raw code. Split it up into a sequence of tasks, so there’s a logical separation, and our cloud-based IDE will construct the DAG out of that. You make sure it runs, make sure there are no errors, and then our IDE will pass it to us. So, it’s a way to bring in people who literally have no Airflow experience. It’s almost like filling out a web form. This process is already beginning to happen, and using this to vastly expand the number of people contributing to our operational DAGs is part of the vision.
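
The idea of splitting raw code into a sequence of tasks with a logical separation can be sketched without any Airflow at all. The example below is neither the cloud-based IDE nor an Airflow DAG. It uses only Python’s standard-library `graphlib` module (Python 3.9+) to show how declared dependencies between small functions determine a run order. The task names, functions, and the `run_pipeline` helper are invented for illustration.

```python
from graphlib import TopologicalSorter


def extract():
    """Stand-in for reading from a source system."""
    return [3, 1, 2]


def transform(values):
    """Stand-in for a cleaning or reshaping step."""
    return sorted(values)


def load(values):
    """Stand-in for writing to a warehouse table."""
    return f"loaded {len(values)} rows"


# Each task lists the upstream tasks whose results it consumes.
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run tasks in dependency order, passing upstream results downstream."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
    return order, results


order, results = run_pipeline(TASKS)
```

Because each step is a separate unit with explicit inputs, a step that fails can be inspected or rerun on its own, which is the practical benefit of not handing over one monolithic script.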

If you’re looking to empower your teams to build their own data products and share trustworthy insights across the organization, explore Astro. Astro provides a common orchestration framework that data scientists, ML and data engineers, and others can use to acquire and engineer data at every phase of a data science or ML project’s lifecycle.