VTEX Achieves Consistency Across Its Data Environments with Astro
Learn how a global digital commerce platform cut through the complexity of its data ecosystem and stabilized its pipelines with Astro, the fully managed, Airflow-powered orchestration service.
A Powerful — and Complex — Digital Commerce Platform
Founded in 2000, VTEX is an enterprise digital commerce platform, with more than 3,200 active online stores and 2,400 brands and retail customers in more than 38 countries worldwide. Customers can sell directly to consumers through its retail platform — taking advantage of an array of tools for processing orders, managing stock levels, and marketing — and can also use its seamless integrations to sell products efficiently through Amazon, eBay, and Alibaba marketplaces.
“There’s an extra level of complexity because we do this for thousands of customers with different business rules, who all require distinct, complex workflows,” says Igor Tavares, Principal Data Engineer at VTEX. When Tavares arrived at the company in 2021, he says, the situation was “challenging”: the company was using five different programming languages, and different teams, like People Ops and Growth Ops, had developed their own solutions for triggering, deploying, and monitoring jobs. The Data team had issues onboarding new hires because of the number of programming languages they needed to master, and team members were also finding it difficult to monitor dataflows across the company. They often had no idea whether a job had failed or not run at all, and sometimes didn’t even know when a job had been triggered, or by whom.
The team’s initial fix proved inefficient: “We started to connect the individual teams’ solutions, and it created a spider web of different technologies,” says Tavares. “It started to become too complex.”
The Need for Data Reliability
At the end of 2021, VTEX turned to a managed Airflow infrastructure service to bring coherence to their data ecosystem, but it ultimately proved difficult to maintain. They expected some upfront work to build on top of the infrastructure, but did not anticipate how much time it would take to keep everything running smoothly. “It required many hours from our team,” says Tavares.
The managed service, which lacked a pre-built Airflow runtime, could not easily be adapted to support a reproducible local development environment in which users could build, test, and debug their pipelines. Instead, VTEX created a development environment in the service’s cloud Airflow infrastructure, but found remote development painfully slow: tests took two to three minutes to start, and it could take dozens of test runs to debug common errors. And because of unavoidable software mismatches between their remote development and production environments, they couldn’t count on the pipelines they built and tested in the former to run reliably in the latter.
It wasn’t uncommon for VTEX’s data engineers to spend an entire night restarting their Airflow environments, struggling to troubleshoot software mismatches. “We were always playing this risky game of making the environments unstable and then taking hours to recover the healthy state again,” Tavares says.
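Mismatches of this kind typically come down to the same package being pinned at different versions in two environments. As a minimal sketch, and not a description of VTEX’s or Astronomer’s actual tooling, the Python below compares two lists of pinned requirements and reports every package whose version differs. The function names and the sample pins are hypothetical, invented purely for illustration.

```python
def parse_pins(requirements_text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "==" not in line:
            continue  # ignore blank lines and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(dev_pins, prod_pins):
    """Return {package: (dev_version, prod_version)} for packages that differ.

    A version of None means the package is missing from that environment.
    """
    mismatches = {}
    for name in sorted(set(dev_pins) | set(prod_pins)):
        dev_version = dev_pins.get(name)
        prod_version = prod_pins.get(name)
        if dev_version != prod_version:
            mismatches[name] = (dev_version, prod_version)
    return mismatches


# Hypothetical pins, invented for illustration only.
DEV = """
apache-airflow==2.2.4
pandas==1.4.1
boto3==1.21.0
"""

PROD = """
apache-airflow==2.2.2
pandas==1.4.1
"""

if __name__ == "__main__":
    for name, (dev_v, prod_v) in find_mismatches(parse_pins(DEV), parse_pins(PROD)).items():
        print(f"{name}: dev={dev_v} prod={prod_v}")
```

A check like this only detects drift; it does not prevent it. Building every environment from the same runtime image addresses the problem at the source.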
The difficulty of achieving consistency across environments eventually led VTEX to evaluate Astro, a fully managed orchestration service powered by Airflow, which they adopted in April 2022.
Support and Guidance from the Airflow Experts
When the company was considering Astro, Astronomer’s team of Airflow committers and other experts in the platform was a deciding factor. “That made us confident that things could be different with Astronomer,” Tavares says. “They move the community forward, so they should know more about keeping things stable than anybody else.”
Tavares and his team were immediately impressed by the field engineers they worked with during the initial setup of Astro. “They were hands-on,” he says. “They set up frequent sessions for code review, and worked with us to build a CI/CD process based on our requirements. We were able to migrate to Astro and have all of our old pipelines running in production a lot faster than we’d expected.”
The team at Astronomer was also able to help Tavares and Diogo Falcão, Senior Data Engineer on the Data Foundations team at VTEX, develop an operator they built from scratch to support Amazon AppFlow, a service for transferring data between SaaS applications. “It was our first contribution to the Airflow community,” Tavares says.
“We learned a lot over a short period of time, setting everything up and migrating all of our jobs,” he adds. “It was clear that we were in good hands.”
Immediate Access to the Latest Airflow Improvements
With their previous orchestration service, Tavares and Falcão found it frustrating that new releases of Airflow weren’t supported for weeks or months. Typically, VTEX’s cloud Airflow service was running at least one version behind the latest version of Apache Airflow®. And when upgrading to the latest supported versions of Airflow, VTEX couldn’t perform in-place updates to their existing Airflow environments; instead, they had to spin up new, separate deployments and clone their Airflow configurations over to these. Even after upgrading, VTEX couldn’t take advantage of important new Airflow capabilities, such as deferrable operators, unless the service fully supported them.
With Astro, VTEX enjoys same-day, in-place upgrades to the latest versions of Airflow; upgrading is as easy as editing a file — changing a single value — to run on current stable Airflow. “It’s great for us to have immediate access to these new versions of Airflow, so we can use many of the tools that were previously unavailable to us,” says Falcão.
The Benefits of a Solid, Managed Infrastructure
The biggest single benefit of moving to Astro was a local development environment — the Astro CLI — in which to safely and confidently experiment. “Now, we can test our DAGs and connect remotely to our Redshift cluster while being sure that everything we are doing here in our machine will also work on the production cluster,” Falcão says. “It’s been very significant to us.”
That confidence — and the time they saved on debugging — has allowed VTEX’s Data team to move much faster and expand the use of Airflow across the company. “The People Ops team now extracts insights from our recruitment pipelines, for example, so they understand who are the better recruiters and who are the better interviewers,” says Tavares. “The Growth Ops team now uses Astro to pull Salesforce data, which generates dashboards and reports for the sales team.”
“We’ve been able to make orchestration accessible to them in a way that we would never have had time to if we were still trying to keep the Airflow environments healthy,” adds Falcão.