We sat down with Bolke de Bruin, newly appointed VP of Enterprise Data Services at Astronomer, to talk about major data management trends, staying at the forefront of transformations, and the unique position of Apache Airflow in the greater tech ecosystem.
Before joining Astronomer, Bolke led data teams at ING Bank, where he co-founded and scaled an Artificial Intelligence department that built highly successful end-to-end products. Passionate about agile product development and cultural diversity, he transformed teams with measurable outcomes. He is also one of the top Apache Airflow committers, deeply involved in making it an integral part of the modern data stack across industries.
How would you describe the Airflow community?
The Apache Airflow community is genuinely welcoming. I remember my first PR—the reaction was amazing; my changes were welcomed and appreciated. It’s worth mentioning that it was around 12K lines of code! The community is also very vibrant. There are people starting out in the data engineering space who have just found out that cron doesn’t solve the problem anymore. Others come from small companies or large enterprises and are interested in maintaining Airflow at scale. They all have different needs and demands, but all of that shapes what Airflow is today. Also, I think that every company nowadays is becoming a data company—it simply has to deal with data orchestration one way or another.
We are so happy to welcome you on board!
Astronomer won me over with its stance on diversity. The team dynamics are truly unique. I’m fascinated by getting perspectives from different cultures that will make products, processes, strategies—and everything in between—better. This is also exemplified by the way Astronomer engages with a community as large as Airflow’s. It’s about striking a balance between the commercial aspect and making Airflow better serve many companies and purposes. We exist because of community efforts, and we acknowledge and nurture our roots.
How does Airflow fit into a bigger data industry picture?
I think the major thing is the evolving role of the data engineer in organizations. What Maxime Beauchemin wrote in his article “The Rise of the Data Engineer” perfectly illustrated the challenges and changes of a few years back. It is a good thing that Airflow retains most of the concepts that were new at that time. Airflow is an excellent workflow orchestration tool that “understands” that dealing with real-world data is pretty tough. It has your back with dependencies, retries, and backfills—all things we nowadays take for granted. However, I think the needs of teams using Airflow are changing. So to stay relevant and support these teams, we should aim for change: change without forgetting the past.
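Those orchestration primitives are worth making concrete. The sketch below is not Airflow’s actual API; it’s a minimal, hand-rolled illustration (all names are mine) of two of the behaviors mentioned above, retries for flaky tasks and backfills over historical date partitions, which Airflow’s scheduler automates declaratively:

```python
from datetime import date, timedelta
import time

def run_with_retries(task, retries=3, retry_delay=0.0):
    """Re-run a flaky task a few times before giving up: the behavior
    Airflow's retry settings automate for every task instance."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of attempts; surface the failure
            time.sleep(retry_delay)

def backfill(task, start, end):
    """Run a date-partitioned task once per day over a historical
    window: the behavior a backfill automates for missed intervals."""
    results = []
    day = start
    while day <= end:
        # bind the current day so each run processes its own partition
        results.append(run_with_retries(lambda d=day: task(d)))
        day += timedelta(days=1)
    return results
```

In Airflow itself you declare retry behavior on a task and trigger backfills through the scheduler, rather than writing loops like these by hand; the point is that the orchestrator owns this logic so your pipeline code doesn’t have to.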
What are the major trends at the intersection of data & business?
I would start with decentralization.
We now see enterprises move to DevOps teams and product teams that are relatively decentralized. They are empowered to do end-to-end product development. They may own only a part of the product, but they retain end-to-end responsibility. For example, look at Spotify: one team is responsible for playlists and another for recommendations, all the way from the backend to the user interface. These teams typically iterate very fast, and they like consuming services because that allows them to move at the right pace. They all share a hunger for data, and at the same time, they create a tremendous amount of it too. A central data team cannot keep up with this; it’s moving too fast. Therefore, the responsibility of maintaining datasets is shifting from central teams to product teams. So if I’m responsible for a part of the product, I’m going to be responsible for the data it generates as well. And as the products get more complicated, so does the data management around them.
Secondly, modern DevOps and product teams are also changing the game.
DevOps teams are generally multi-skilled. They’re responsible for end-to-end product development, and they act autonomously. They focus on product viability (making sure one can sell it), desirability (making it useful for the customer), and feasibility (the ability to build it). Modern product teams consist of UX researchers, analytics and machine learning engineers, data scientists, software engineers, product managers, etc. Because they’re multi-skilled and want to get the product out as soon as possible, tools like Airflow have to adapt to that way of fast development.
With Airflow, in particular, these shifts mean that we need to support that iterative process and cater to the needs of analytics and machine learning engineers to assist them on their journeys. All that while maintaining the strengths we already have! We do see this opportunity, and we’re going to move in that direction.
Thirdly, the productionization of data.
Well, if you’ve been reading between the lines, you can see that I think building operational, scalable, observable, and resilient data systems is only possible if the data itself is treated with the diligence of an evolving, iterative product. This requires tooling that is inherently data-aware. Data discovery with tools like Datakin and Amundsen can make transparent what data is driving your revenue. Integrating data quality tools like PopMon and Great Expectations helps us monitor this important data and stay within compliance boundaries. Having all these things in place will greatly speed up your business processes.
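To make the idea of data-quality gating concrete, here is a toy sketch (my own names, not the PopMon or Great Expectations API) of what an “expectation” amounts to: a named check evaluated against a batch of data before it is allowed downstream:

```python
def expect_column_values_not_null(rows, column, max_null_fraction=0.0):
    """A minimal 'expectation': fail the batch if too large a share of
    values in `column` are missing. Illustrative only; real libraries
    ship whole catalogs of checks like this."""
    total = len(rows)
    nulls = sum(1 for r in rows if r.get(column) is None)
    fraction = nulls / total if total else 0.0
    return {"success": fraction <= max_null_fraction,
            "null_fraction": fraction}

def validate_batch(rows, expectations):
    """Run every expectation; a batch is publishable only if all pass."""
    results = [check(rows) for check in expectations]
    return all(r["success"] for r in results), results
```

On top of this basic pattern, the real tools add profiling, drift monitoring, and reporting, which is exactly what lets an orchestrator halt a pipeline before bad data reaches a revenue-driving dashboard.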
What are the big changes behind those trends?
Any company that operates via knowledge-intensive decision-making processes needs to deal with data. This can be a financial institution like a bank calculating a mortgage. Or a medical company creating a Covid vaccine. Or simply an analyst working on a report or a pitch deck for a potential customer. This in itself is not new; what is new is that companies now realize how much data goes into these decisions, how dependent they are on that data, and how fast their data needs are changing.
The question now becomes how to get the right data into that decision-making process in time and with the right quality. We need more self-service tooling to ensure that highly skilled product teams and data teams do not have to wait for a central data team. The analyst should be able to create their pitch deck or report in Tableau, Superset, or Power BI, with a data pipeline being automatically generated when required. At the same time, as an enterprise, you want to set some checks and balances in place so that you know data can only be accessed by the right people, data leakage is prevented, and privacy is respected. That’s the shift we’ve been witnessing—and need to start supporting.
What would be the potential risks of mishandling data?
If you come from highly regulated industries like banking or healthcare, it’s all about risk mitigation and compliance. It is in the nature of their operations. You can’t “undo” a transaction and pivot. You can’t simply miscalculate someone’s mortgage or a dose of medicine and say “I’m sorry,” because it has real-life implications. That is why these types of companies spend a lot of effort ensuring the integrity of the process leading up to these calculations through, for example, four-eyes principles and manual checks. Data orchestration and tools like Airflow reduce these risks through automation and monitoring, and can therefore reduce cost. With Airflow, we’re putting data orchestration on steroids because of these requirements of the modern world.
What’s unique about Apache Airflow?
Airflow is uniquely situated because it’s the spider in the web: it helps you orchestrate your data, so you know where it’s coming from, what system it operates on, what processes are changing it, and where it’s going. You need to make these kinds of actions easy for engineers. You can drive the speed of these product teams because you actually drive data discovery and enable data monitoring.
Airflow is uniquely positioned also because of its heritage—it’s supported by the Apache Software Foundation and us at Astronomer. My advice? Follow our blogs, training materials, guides—Airflow is flexible; you can mold it according to your needs and focus on your goals with the support and power of the community. You’ll be able to deliver top-level services and products to your customers, and we’ll be here assisting you along the way. It’s a win-win situation.
If you want to dive deeper into the world of Airflow with our experts, sign up for one of the upcoming webinars!