June 4, 2024

Introducing the First Generative AI Cookbook for Data Orchestration

Kenten Danas, Senior Manager, Developer Relations, Astronomer
Tamara Fingerlin, Developer Advocate, Astronomer

Generative AI's effectiveness relies heavily on the quality and orchestration of data. Though increasingly versatile and knowledgeable, Generative AI (GenAI) models are only useful to the business when they have access to rich, proprietary datasets and real-time operational data streams. That access is what makes truly differentiated applications possible.

As discussed in our earlier post, The Dividing Line Between Generative AI Success and Failure, Apache Airflow®, a standard for data orchestration, plays a crucial role in managing complex data and machine learning workflows, enabling teams to build GenAI apps grounded in their enterprise data. The post also highlighted a number of engineering teams already using Apache Airflow®, managed by Astronomer's Astro data platform, to release enterprise-grade GenAI apps faster, with higher quality, and at lower cost.

One of the most common questions we get asked when discussing data orchestration for GenAI is how to get started. That is what our new GenAI Cookbook is designed to answer.

Why a cookbook?

As the state of the art in AI advances, the stack of technologies needed to build an enterprise-grade GenAI application is complex and rapidly evolving. Understanding how and where data orchestration integrates into that stack was the primary driver behind developing the cookbook.

In the cookbook, we demonstrate how Airflow is the foundation for the reliable delivery of AI applications through six common GenAI use cases:

Support automation
E-commerce product discovery
Product insight from customer reviews
Customer churn risk analysis
Legal document summarization and categorization
Dynamic cluster provisioning for image creation

For each use case, we discuss its benefits to the business and the common technical challenges before presenting a detailed reference architecture.

Each reference architecture is built on a full stack of GenAI technologies, from embedding and foundation models to vector databases, search engines, retrieval frameworks, and cloud services. Don’t worry if your own preferred technology is not included in a specific reference architecture. Because the Astronomer Registry curates Airflow providers for many components of the AI stack, and because Airflow can run any custom Python code in a task, you can swap out one technology or cloud platform for your preferred option.

A few tasters from the cookbook

We’ve worked to incorporate a cross section of the most common generative AI use cases we encounter in the community. To give you a taster of what to expect, we’ve extracted two examples below.

Support Automation

The first use case is a conversational AI example: a user-facing chatbot, powered by GenAI, that answers support questions. Rather than showcasing a simple prototype, our reference architecture has the chatbot learn from interactions to continuously improve its performance.

This solution uses Airflow running on Astro to orchestrate an application that includes data retrieval, embedding, and reranking for Retrieval Augmented Generation (RAG). The pipeline periodically fine-tunes a Cohere Rerank model and uses a feedback loop to fine-tune answer generation with Cohere Command R+ over time. Vector embeddings of proprietary information are stored in a Weaviate database. Airflow’s provider ecosystem supplies specialized operators for interacting with Weaviate and Amazon EKS, plus a hook for connecting to Amazon Bedrock. Custom Python functions and operators transform the text data with LangChain.
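To make the retrieve-then-rerank flow a little more concrete, here is a minimal, self-contained sketch in plain Python. It is not the cookbook's implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for Weaviate, and the `feedback` dictionary is a hypothetical stand-in for the signal a fine-tuned reranker would learn from. In an Airflow DAG, each function could plausibly become its own task.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(w.strip(".,!?").lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query (stand-in for a vector DB query)."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def rerank(candidates, feedback):
    """Reorder candidates by accumulated user feedback (hypothetical signal)."""
    return sorted(candidates, key=lambda d: feedback.get(d, 0), reverse=True)


def build_prompt(query, context):
    """Assemble the grounded prompt that would be sent to the generation model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "Reset your password from the account settings page.",
        "Billing invoices are emailed monthly.",
        "To reset a password, contact support if the settings page fails.",
    ]
    query = "How do I reset my password?"
    candidates = retrieve(query, docs, k=2)
    # Pretend users found the third document more helpful in past sessions.
    ranked = rerank(candidates, feedback={docs[2]: 3})
    print(build_prompt(query, ranked))
```

The point of the sketch is the shape of the pipeline: retrieval narrows the corpus cheaply, reranking reorders the shortlist using a richer signal, and only the final context reaches the generation model.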

Customer Churn Risk Analysis: combining GenAI with traditional ML

Classifying customer churn risk using GenAI for sentiment analysis and traditional machine learning for classification is a hybrid approach that enhances prediction accuracy and helps businesses proactively target customers at risk of churning with personalized retention strategies. The LLM interprets customer interactions in-context to extract customer sentiment as a valuable feature, which is then used in combination with features derived from customer information by a traditional ML model to classify customers based on their churn risk.

This solution shows how Airflow can orchestrate both traditional MLOps, with the ingestion of user information and feature creation, and sentiment analysis via a language model. Customer messages from Slack channels and HubSpot are combined and aggregated. Llama 3, hosted on Amazon Bedrock, then conducts sentiment analysis in-context, which creates an additional feature for the XGBoost classifier model run with Amazon SageMaker. Churn risk scores are saved to an Amazon Redshift database, which powers a dashboard.
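As a rough illustration of the hybrid pattern, the sketch below joins an LLM-derived sentiment feature with tabular customer features before classification. Everything here is an assumption made for illustration: the `Message` class, the keyword-based `sentiment_score` (a placeholder for the Llama 3 call), the `months_active` feature, and the rule-based `churn_risk` (a placeholder for the trained XGBoost model) are all hypothetical and not taken from the cookbook.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:
    customer_id: str
    source: str  # e.g. "slack" or "hubspot"
    text: str


def aggregate_messages(messages):
    """Group message texts per customer across all sources."""
    grouped = defaultdict(list)
    for m in messages:
        grouped[m.customer_id].append(m.text)
    return dict(grouped)


# Hypothetical stand-in for the LLM call: a keyword count, NOT a real model.
NEGATIVE = {"cancel", "frustrated", "broken", "refund"}
POSITIVE = {"thanks", "great", "love", "works"}


def sentiment_score(texts):
    """Return a score in [-1, 1]; placeholder for an LLM sentiment call."""
    words = [w.strip(".,!?").lower() for t in texts for w in t.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def build_features(customer_info, messages):
    """Join tabular customer features with the sentiment feature."""
    texts = aggregate_messages(messages)
    return {
        cid: {**info, "sentiment": sentiment_score(texts.get(cid, []))}
        for cid, info in customer_info.items()
    }


def churn_risk(row):
    """Toy rule standing in for a trained classifier; returns a risk in [0, 1]."""
    risk = 0.5 - 0.4 * row["sentiment"]
    if row["months_active"] < 6:
        risk += 0.1
    return round(min(max(risk, 0.0), 1.0), 2)


if __name__ == "__main__":
    info = {"c1": {"months_active": 3}, "c2": {"months_active": 24}}
    msgs = [
        Message("c1", "slack", "I am frustrated, please cancel"),
        Message("c2", "hubspot", "Thanks, works great"),
    ]
    for cid, row in build_features(info, msgs).items():
        print(cid, churn_risk(row))
```

Keeping the sentiment step as a separate function mirrors the orchestration idea in the text: the language-model task and the classical feature-engineering tasks can run and fail independently, and only their outputs are joined before scoring.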

Next steps to getting started

The six reference architectures presented in the cookbook show Apache Airflow® providing a robust data orchestration framework to operationalize and scale Generative AI workflows. From ingesting and preprocessing training data, to deploying and monitoring models in production, to governing the full AI lifecycle, Airflow enables teams to reliably integrate, run, and govern the complex data pipelines that power AI-driven applications.

Our teams are ready to run workshops with you to discuss how Airflow and Astronomer can accelerate your GenAI initiatives, so go ahead and contact us to schedule your session.
