CRED is a members-only bill-payment platform based in India, serving more than 9 million individuals with high credit scores. It provides its members with secure, seamless payment experiences, useful tools such as scheduling alerts, payment tracking, and detailed spending analyses, and rewards like deals on brands and travel from the online CRED Store.
CRED’s platform requires a robust infrastructure that can integrate data from both the finance and lifestyle sides of the company, as well as from public cloud services like credit reporting agencies and payment card processors. In addition to CRED’s data engineering team, four other internal teams work with and consume data from this platform: data science, data analytics, lending analytics, and operations.
“When CRED launched in 2018, we developed an analytics foundation that was small but really strong, so everything was manageable,” says Deepanshu Rai, who leads data infrastructure at CRED. “It was all in silos: if someone on the operations team wanted to run some script, they would spin up an EC2 machine and do that in a cron job, while the data analytics and data engineering teams were using Apache NiFi to schedule and manage dataflows.”
As CRED grew, it became apparent that the company needed a centralized data orchestration solution that could manage dependencies across teams and improve observability and pipeline reliability. “We had to build a lot of custom logic just to monitor pipelines, and we had to check the data constantly to be sure pipelines were running well,” Rai says, attributing these needs to the disconnect between NiFi and cron.
From NiFi to a Complete Orchestration Solution
A few members of CRED’s data engineering team had experience with Apache Airflow, and believed it might offer a solution. After the team spun up an Airflow proof of concept, the results convinced them to migrate from NiFi.
The team built an abstraction framework that they and other teams at CRED could use to quickly map their NiFi dataflows and translate them to Airflow DAGs. “We reached out to each team, saying: ‘We’ve created this easy-to-use migration framework,’” Rai says. “They could paste in their queries, specify their schedule parameters, and, if they wanted, configure alerting for their tasks. They didn’t need to know anything about Airflow.” As a result, the migration was “surprisingly smooth,” Rai says, and CRED’s Airflow-powered workflows were soon performing reliably, positioning the company to meet aggressive new service-level agreements.
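The source does not show the migration framework itself, so the following is only a minimal sketch of the idea, assuming a team describes its dataflow as plain data (a query, a schedule, optional alert recipients) and the framework translates that into arguments for an Airflow DAG. The names PipelineSpec and to_dag_kwargs are hypothetical; the keys in the output (dag_id, schedule, and default_args with retries, retry_delay, email, email_on_failure) are standard Airflow parameters, with schedule being the name used from Airflow 2.4 onward.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class PipelineSpec:
    """Hypothetical user-facing description of one migrated dataflow."""
    name: str
    query: str                      # SQL pasted in by the owning team
    schedule: str                   # cron expression, e.g. "0 2 * * *"
    alert_emails: list[str] = field(default_factory=list)  # optional alerting
    retries: int = 1


def to_dag_kwargs(spec: PipelineSpec) -> dict:
    """Translate a spec into keyword arguments for an Airflow DAG.

    Only plain data is produced here. A real framework would pass the
    result to airflow.DAG(...) and attach a SQL task that runs spec.query.
    """
    if len(spec.schedule.split()) != 5:
        raise ValueError(f"expected a 5-field cron expression, got {spec.schedule!r}")
    default_args = {
        "retries": spec.retries,
        "retry_delay": timedelta(minutes=5),
        "email": list(spec.alert_emails),
        # Only turn on failure emails when the team asked for alerting.
        "email_on_failure": bool(spec.alert_emails),
    }
    return {"dag_id": spec.name, "schedule": spec.schedule, "default_args": default_args}
```

A design like this keeps Airflow details out of the hands of the teams submitting pipelines: they supply data, and the framework owns the DAG construction.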
But running Airflow and the software stack it depends on presented its own challenges, particularly around performance tuning and around upgrading and managing Airflow and related software. The operations team wanted to experiment quickly with the performance-enhancing features that came with each new Airflow release, which meant continuous, time-consuming upgrades of the system. It also became clear that operating Airflow day to day would require several full-time employees, and that CRED would need at least one site reliability engineer (SRE) just to avoid unexpected downtime.
The team was spending considerable time troubleshooting software dependency issues and writing Airflow DAGs, and had no guarantee that DAGs built locally would run reliably in the Airflow production environment. It needed a dependable setup for local Airflow development, as well as a secure path for deploying DAGs from the local environment through to production. These and other factors convinced CRED to look for alternatives. “We quickly saw that we needed to look for a managed solution,” Rai says.
Astro, the fully managed orchestration service powered by Airflow, offered what CRED was looking for. “We found a trusted partner in Astronomer,” Rai says. “Migrating to Astro has released the team’s bandwidth and given us an effective solution for managing infrastructure software issues.”
A New and Intuitive View
The Astro UI eased onboarding for teams, including those with little Airflow knowledge. “It’s very intuitive in terms of creating a new deployment or workspace,” says Manas Bhardwaj, a data engineer at CRED. “Teams now find it easy to oversee everything when they log into Astro. When they go to a deployment space, they can see all the worker CPUs and worker memory over the last 24 hours, how many DAGs are running, and how many DAGs failed. They get a lot of metrics at the task level. So if something looks wrong, they can find the culprit and act accordingly.”
In addition, Astro filled a number of critical orchestration gaps for CRED, giving it the ability to implement role-based access control, integrating with its version control system and CI/CD processes, and enabling the company to offer teams a consistent, reproducible local development environment. With Astro, users can locally build, debug, and perform unit tests on their DAGs before deploying them.
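The source does not say which tests CRED's teams run locally. As a hedged illustration of the kind of check that can run before deployment, the sketch below validates a pipeline's task dependencies using only the Python standard library: every upstream task must be defined, and the graph must contain no cycles. The function name validate_tasks and the task names are hypothetical. In an actual Airflow project, a comparable test would typically load the DAG files with Airflow's DagBag and assert that its import_errors attribute is empty.

```python
from graphlib import CycleError, TopologicalSorter


def validate_tasks(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid run order for the tasks, or raise ValueError.

    `dependencies` maps each task name to the set of tasks that must
    finish before it (its upstream tasks).
    """
    known = set(dependencies)
    for task, upstream in dependencies.items():
        missing = upstream - known
        if missing:
            raise ValueError(f"{task} depends on undefined tasks: {sorted(missing)}")
    try:
        # static_order() yields tasks so that upstream tasks come first.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

Run as a unit test, a check like this fails fast on a developer's machine instead of surfacing as a broken pipeline in production.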
Alerting and Experts to Rely On
Astro also gives CRED the ability to integrate email and Slack with Airflow, enabling it to push alerts to users in real time. “Before, if a pipeline failed, we only knew about it because of proactive systemic data checks,” Rai explains. “Now we get built-in alerting capabilities with Astro, and the process is extremely reliable.”
This makes it possible for ops personnel, data engineers, and other stakeholders to immediately respond to pipeline failures. Airflow’s support for data lineage extraction — and Astro’s built-in ability to aggregate and analyze lineage metadata — make it much easier for these experts to diagnose and resolve data outages, which almost always result from undocumented changes to upstream sources.
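Astro's built-in alerting is configured in the product and is not shown in the source. For readers unfamiliar with how failure notifications work in Airflow itself, the sketch below illustrates the general mechanism under stated assumptions: Airflow calls a task's on_failure_callback with a context dictionary that includes the task_instance and the exception. The make_failure_callback helper and the send function are hypothetical; in practice send might post to a Slack webhook or send an email.

```python
from types import SimpleNamespace
from typing import Callable


def make_failure_callback(send: Callable[[str], None]) -> Callable[[dict], None]:
    """Build a callback suitable for an Airflow task's on_failure_callback.

    `send` is a hypothetical delivery function (Slack, email, and so on).
    """
    def on_failure(context: dict) -> None:
        ti = context["task_instance"]
        message = (
            f"Task failed: {ti.dag_id}.{ti.task_id}\n"
            f"Error: {context.get('exception')}\n"
            f"Logs: {ti.log_url}"
        )
        send(message)

    return on_failure


# Usage with a stand-in task instance and an in-memory "sender".
sent: list[str] = []
callback = make_failure_callback(sent.append)
callback({
    "task_instance": SimpleNamespace(
        dag_id="merchant_reports",
        task_id="build_report",
        log_url="http://localhost:8080/log",
    ),
    "exception": ValueError("boom"),
})
```

Injecting the sender keeps the callback easy to test locally, since no real Slack or email call is needed to confirm the message content.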
And when there are issues with Airflow or its infrastructure, the data teams know they can count on the experts at Astronomer. “Early on, I looked at the Astro UI and noticed that worker utilization was low — much lower than it had been in our self-hosted Airflow environment,” says Bhardwaj. “So I raised a ticket with Astro Support, asking if they could analyze our workloads and look at any worker conflicts. They quickly provided a solution — we didn’t need as many cloud resources to run our Airflow workers in Astro, so they actually recommended that we switch to a configuration where we were using fewer cloud resources.”
Orchestration that Drives Growth
The migration to Astro has allowed CRED to standardize on a single orchestration platform and helped improve the reliability and resilience of its distributed dataflows — especially those that crisscross multiple internal teams, or consume data (via APIs) from cloud services. By eliminating what were siloed data ingest and preparation steps, CRED has been able to accelerate the delivery of business-critical reports and analytics, like a daily report predicting potential credit defaults that gets distributed to CRED’s collection team for follow-up action. CRED has also created Airflow DAGs to ingest and prepare the data it consumes from credit bureaus, which it uses to establish credit lines and monthly loan payment rates for its customers. Previously, it had used a combination of NiFi and cron to accomplish this, with unpredictable results — and with almost no insight into pipeline failures.
Astro transformed CRED’s retail operations, too, making it much easier for the company to deliver rich analytics capabilities to the merchant partners that sell products to its members through CRED’s online store. “Each merchant gets a report at the end of the day telling them how their products are performing,” Rai says. “During the migration, we built a custom Airflow operator to completely automate this process. You can imagine how much time we saved — on average, the teams had been spending around three to four hours each day. As well as how much more attractive this is to our merchant partners.”
Going forward, Bhardwaj says, CRED is planning to take advantage of Astro’s accessible UI to enable more stakeholders from teams across the organization to create their own data pipelines, and to make use of Astro’s lineage capabilities for impact analyses. And having gotten so much out of Airflow and Astro, he adds, “we look forward to contributing back to the Airflow open source project.”
Based in India, CRED is a members-only bill-payment platform that serves more than 9 million people.
CRED’s growth led to a need for a centralized and fully managed data orchestration solution that could improve observability, boost pipeline reliability, and handle dependencies across teams.
The Astro Solution
Astro addressed CRED’s orchestration needs with its consistent and reproducible local dev environment, role-based access control, integration with version control and CI/CD processes, and built-in alerting.
Astro’s in-place upgrades let the CRED operations team experiment with new performance-enhancing features that come with every Airflow release.
CRED teams can locally build, debug, and perform unit tests on their DAGs before deploying them to production.
Astro’s ability to aggregate and analyze lineage metadata helps CRED’s experts diagnose and resolve data outages.