Apache Airflow for Data Leaders — How to Empower Data Teams

  • Steven Hillion

At Astronomer, we serve a wide range of data practitioners, including data engineers, data architects, machine learning engineers, business analysts, and data scientists. Our goal is to provide all of them with the most frictionless experience of Apache Airflow possible, which means understanding all their varied needs, and especially the needs of their team leaders.

To talk about data orchestration with Airflow from the perspective of a data leader, we reached out to someone intimately familiar with both the role and what Airflow can do: Steven Hillion, Astronomer’s own VP of Data.

What does your data team do, and how does it do it?

My data team analyzes data generated by the Astronomer and Airflow ecosystems to provide business insights that directly impact how other Astronomer teams do their work. For example, we tell the support engineers how quickly organizations are upgrading to the latest version of Airflow, and whether they’re running into any problems — information that lets them deliver more value to customers.

To be able to do this, we turned to the same modern data orchestration platform that Astronomer offers to customers. Astronomer doesn’t just offer a self-hosted and SaaS-managed Airflow with commercial support — we rely entirely on Airflow ourselves to manage all our data pipelines and analytics. It enables us to obtain the right data at the right time and in the right format, and then push the insights and models into our lines of business.

Additionally, as frontline users of Airflow within Astronomer, we feed information back to our Product team about the opportunities and challenges we encounter, and about how we can improve the project for ourselves and the community.

What are the main goals of data and analytics leaders?

Three things:

  1. They need to deliver accurate metrics in a reliable manner

This is the bread and butter of a data team, of course, and it’s still surprisingly hard to do well. You need consistent definitions of basic business metrics, and you need to make sure that those are well documented, accurate, and reliable. But beyond that, you need to get out of the way and empower each department to create its own metrics that meet those same standards.

  2. They have to create operational analytics and models that change the way the company does business

Data leadership is not about static insights and static metrics anymore, but about production models that plug directly into applications to improve people’s day-to-day work and enhance the end-user experience. It’s challenging enough just to deliver reliable metrics, but it’s even more of a challenge to deliver predictive models that yield accurate predictions — in an ethical and timely manner — directly into end-user applications!

  3. They should recognize valuable insights that emerge from data (above and beyond the insights that the business already demands)

Data teams don’t know what the data is going to tell them until they look at it. It may not even give clear answers to the questions they’ve posed, but instead suggest new ones. So data leaders have to be prepared to take projects in a different direction from what the business wanted. Indeed, data leaders and their teams sometimes even have to act as disruptors, challenging the way businesses are run based on new insights.

What pressures are data leaders under today?

There was a time — now long gone — when data teams were mostly concerned with generating PowerPoint decks and interpretive models for optimizing business strategy. Companies today demand more from these teams, which they see as integral to business operations. They want to use more sophisticated processing methods to produce insights that will drive proactive strategy.

Which means that data leaders, now seen as an essential part of executive leadership, are under increasing pressure to deliver results beyond basic reports and dashboards — even as they also have to keep up with an ever-growing variety of technologies, techniques, and data, and keep hiring and training new talent. It’s easy to get overwhelmed.

As a result, the maturity and sophistication of many data teams still lag behind expectations. Most organizations have moved beyond static insights, but surprisingly few have a robust infrastructure of machine learning and analytics workflows integrated into everyday operations.

What are the most common mistakes data leaders make?

First, as a data leader, you can be tempted to take on more innovative but less important projects. It’s impossible to do it all — there will always be more metrics or models to work on than you have time for. In my experience, it’s often as important to serve up the right set of business KPIs as it is to do something more advanced. Ask most organizations how many customers they have and how many they lost last week, and they won’t be able to answer you.

Second, if each team uses its own set of technologies and there is no shared infrastructure, it is difficult to obtain a clear, unified picture and make sense of data. On the other hand, if the data ecosystem is rigid and uses a limited set of technologies, the teams’ productivity suffers, and there is no room for innovation. Getting that balance right is difficult — most major organizations suffer from both proliferating silos and sclerotic data warehouses.

How can data leaders empower their organizations?

For an organization to make the most of its data, the data team needs to act as a force multiplier, empowering analysts across the organization through a self-service model.

As a data leader, you can empower these analysts, and through them the whole business, by providing them with the analytics — metrics, dashboards, and models. But beyond that, you also need to ensure that data is as open, available, scalable, secure, and observable as possible. This is not a simple task. Bringing data sources online, making sure they are clean, and integrating them with other sources pose an enormous challenge, because you’re rarely dealing with the same technologies or inputs — more typically, you’re faced with conflicting information, shifting schemas, ever-changing product catalogs, etc. Fortunately, Apache Airflow is a flexible data orchestration platform that can help you integrate disparate data sources, connect your tools, and keep up with those changes, regardless of the underlying technology.

In addition, leaders have to equip their teams for high productivity by making sure that the standards and tools they create are simple to use, well-integrated, and easy to operationalize. Airflow can help here, too, by allowing them to create reusable components and process standards. In a sense, Airflow can be a nexus where employees discover and use the “right” data and tools.
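
As an illustration, here is a minimal sketch (not Astronomer’s internal code) of one such reusable component: a task factory defined once in a shared module and imported by any team’s DAG, so every pipeline applies the same data-quality standard. The function name and the `fetch_count` callable are hypothetical.

```python
from airflow.decorators import task


def make_row_count_check(table: str, minimum: int, fetch_count):
    """Build a reusable data-quality task for any table in any DAG.

    `fetch_count` is any callable that returns the table's current row
    count -- e.g. a thin wrapper around a warehouse hook.
    """

    @task(task_id=f"check_row_count_{table}")
    def check_row_count() -> int:
        count = fetch_count(table)
        if count < minimum:
            # Failing the task fails the run and triggers Airflow's alerting.
            raise ValueError(f"{table}: {count} rows, expected >= {minimum}")
        return count

    return check_row_count
```

Any DAG author can then call something like `make_row_count_check("orders", 1_000, my_counter)()` to get a standards-compliant check without rewriting it.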

How does Airflow help data teams do what they need to do?

Once my team has developed a new dataset, report, or model, we usually embed it within an Airflow DAG and then… rarely have to think about it again. I have hundreds of pipelines and tasks running every day, delivering insights to the whole organization, and I hardly ever have to worry about dependencies, compute, or errors.
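
To illustrate that “set it and forget it” pattern, here is a minimal sketch of a daily TaskFlow DAG where retries and failure alerts are handled declaratively by Airflow; the DAG name, tasks, and data are invented for the example.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 3,                         # retry transient failures automatically
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,             # alert only when retries are exhausted
    },
)
def daily_customer_report():
    @task
    def extract() -> list[dict]:
        # In practice this would query a warehouse; hard-coded for the sketch.
        return [{"customer": "acme", "active": True}]

    @task
    def publish(rows: list[dict]) -> None:
        print(f"Publishing report with {len(rows)} rows")

    publish(extract())


daily_customer_report()
```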

As an orchestration framework, Apache Airflow is wired into the technical environment in which data teams work, making the process of creating ML models or writing DAGs very smooth. It’s enormously adaptable and flexible, which means it can work with most data professionals’ tools and technologies.

Airflow is completely agnostic with regard to data and ML techniques. And because it’s a mature open-source project with a rich ecosystem of provider packages and integrations, data teams don’t have to start from scratch when integrating Airflow within the modern data stack.
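
For example, a team scheduling warehouse queries doesn’t have to write its own SQL plumbing. A minimal sketch, assuming the apache-airflow-providers-common-sql package is installed and that the `warehouse` connection and `task_runs` table (both placeholders) exist:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="refresh_usage_metrics",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    SQLExecuteQueryOperator(
        task_id="aggregate_daily_runs",
        conn_id="warehouse",  # placeholder Airflow connection
        sql="""
            SELECT org_id, COUNT(*) AS runs
            FROM task_runs
            GROUP BY org_id;
        """,
    )
```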

Airflow has broad adoption by data engineers, but it’s been interesting to see how it’s used by data scientists and machine learning engineers. Apart from bringing data together and running production pipelines, another way we use Airflow is to run experiments — building and trying out new predictive models long before putting them into production.
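
A sketch of that experimentation pattern: a manually triggered DAG that trains a candidate model and logs a cross-validated score, long before anything reaches production. The scikit-learn model and dataset here are illustrative stand-ins, not what my team actually runs.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def model_experiment():
    @task
    def train_and_score() -> float:
        # Imported inside the task so only the worker needs scikit-learn.
        from sklearn.datasets import load_diabetes
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        X, y = load_diabetes(return_X_y=True)
        score = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()
        print(f"Candidate model mean R^2: {score:.3f}")
        return float(score)

    train_and_score()


model_experiment()
```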

Why do data leaders trust Astronomer?

Astronomer makes it easy for data teams to adopt and use Airflow. Here’s an example from personal experience: When I started at Astronomer, I wanted to deliver insights and reports in a matter of days — not to have to wait months for the whole infrastructure to be ready.

Luckily, I had Astro — our data orchestration platform powered by Airflow. As easily as one can spin up a data warehouse with Databricks or Snowflake, I fired up my Airflow deployments in Astronomer to run my SQL queries, Spark jobs, and ML models on a regular basis. I had them analyze all the company data and provide valuable insights within days.

Have questions about Apache Airflow and Astro? Schedule a meeting with one of our experts to learn more!
