Introducing the First Generative AI Cookbook for Data Orchestration
Generative AI’s effectiveness is heavily reliant on the quality and orchestration of data. Though increasingly versatile and knowledgeable, Generative AI (GenAI) models need access to rich, proprietary datasets and real-time operational data streams to be useful to the business and to create truly differentiated applications.
As discussed in our earlier post, The Dividing Line Between Generative AI Success and Failure, Apache Airflow®, a standard for data orchestration, plays a crucial role in managing complex data and machine learning workflows, enabling teams to build GenAI apps grounded in their enterprise data. The post also highlighted a number of engineering teams already using Apache Airflow®, managed by Astronomer’s Astro data platform, to release enterprise-grade GenAI apps faster, with higher quality, and at lower cost.
One of the most common questions we get asked when discussing data orchestration for GenAI is how to get started. That is what our new GenAI Cookbook is designed to answer.
Why a cookbook?
As the state of the art in AI advances, the stack of technologies needed to build an enterprise-grade GenAI application is complex and rapidly evolving. Understanding how and where data orchestration fits into that stack was the primary driver behind developing the cookbook.
In the cookbook, we demonstrate how Airflow is the foundation for the
reliable delivery of AI applications through six common GenAI use cases:
- Support automation
- E-commerce product discovery
- Product insight from customer reviews
- Customer churn risk analysis
- Legal document summarization and categorization
- Dynamic cluster provisioning for image creation
For each use case, we discuss its benefits to the business along with common technical challenges before presenting a detailed reference architecture.
Each reference architecture is built on a full stack of GenAI technologies
— from embedding and foundation models to vector databases, search
engines, retrieval frameworks, and cloud services. Don’t worry if you
don’t see your own preferred technology included in a specific reference
architecture. Because the Astronomer Registry curates Airflow providers for many components of the AI stack, and Airflow allows any custom Python code to run in a task, you can easily swap out one technology or cloud platform for your preferred option.
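To illustrate what this swappability looks like in practice, here is a minimal sketch of a task callable that hides the choice of embedding backend behind a plain Python function. The function and registry names are hypothetical examples invented for this sketch, not APIs from the cookbook or any provider package:

```python
# Sketch: a plain Python callable that could run inside an Airflow task.
# The backend registry and function names below are hypothetical,
# invented for illustration only.

def embed_with_fake_backend(texts):
    """Stand-in for a real embedding call (e.g. to a hosted model API)."""
    return [[float(len(t))] for t in texts]

EMBEDDING_BACKENDS = {
    "fake": embed_with_fake_backend,
    # "cohere": embed_with_cohere,    # swap in a real provider here
    # "bedrock": embed_with_bedrock,
}

def embed_documents(texts, backend="fake"):
    """Task logic: pick a backend by name and embed a batch of texts.

    In an Airflow DAG this function would be the task's Python callable;
    the orchestration layer stays unchanged when the backend is swapped.
    """
    embed = EMBEDDING_BACKENDS[backend]
    return embed(texts)

print(embed_documents(["hello", "airflow"]))  # [[5.0], [7.0]]
```

Because the orchestration code only sees the function boundary, swapping Cohere for Bedrock (or any other service) is a change to the registry, not to the pipeline.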
A few tasters from the cookbook
We’ve worked to incorporate a cross section of the most common generative
AI use cases we encounter in the community. To give you a taster of what
to expect, we’ve extracted two examples below.
Support Automation
The first use case is an example of conversational AI using GenAI to power
a user-facing chatbot for answering support questions. Rather than
showcase a simple prototype, in our reference architecture the chatbot
learns from interactions to continuously improve its performance.
This solution uses Airflow running on Astro to orchestrate an application
that includes data retrieval, embedding, and reranking for Retrieval
Augmented Generation (RAG). The pipeline periodically fine-tunes a Cohere
Rerank model, and uses a feedback loop to fine-tune answer generation with
Cohere Command R+ over time. Vector embeddings of proprietary information
are stored in a Weaviate database. Airflow’s provider ecosystem offers specialized operators for interacting with Weaviate and Amazon EKS, as well as a hook for connecting to Amazon Bedrock, alongside custom Python functions and operators that transform text data with LangChain.
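To make the retrieval step of RAG concrete, here is a self-contained sketch of the nearest-neighbor lookup that a vector database such as Weaviate performs conceptually. The toy embeddings, store layout, and function names are illustrative assumptions, not the cookbook’s code or Weaviate’s API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, top_k=2):
    """Return the top_k chunks most similar to the query vector.

    `store` maps chunk text -> embedding. In the reference architecture
    this lookup happens inside Weaviate, and a reranker (Cohere Rerank)
    reorders the candidates before answer generation.
    """
    scored = sorted(
        store.items(),
        key=lambda kv: cosine_similarity(query_vec, kv[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:top_k]]

# Toy vector store with hand-made 2-D embeddings (illustrative only).
store = {
    "reset your password via the account page": [0.9, 0.1],
    "refunds are processed within 5 days": [0.1, 0.9],
    "contact support for billing issues": [0.4, 0.8],
}
print(retrieve([1.0, 0.0], store, top_k=1))
```

The retrieved chunks are then passed, along with the user’s question, to the generation model, which is the step the feedback loop fine-tunes over time.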
Customer Churn Risk Analysis: combining GenAI with traditional ML
Classifying customer churn risk with GenAI for sentiment analysis and traditional machine learning for classification is a hybrid approach that improves prediction accuracy and helps businesses proactively target at-risk customers with personalized retention strategies. The LLM interprets customer interactions in context to extract customer sentiment as a valuable feature, which is then combined with features derived from customer information and fed to a traditional ML model that classifies customers by their churn risk.
This solution shows how Airflow can orchestrate both traditional MLOps, such as ingesting user information and creating features, and sentiment analysis via a language model. Customer messages from Slack channels and HubSpot are combined and aggregated; Llama 3, hosted on Amazon Bedrock, performs in-context sentiment analysis, creating an additional feature for the XGBoost classifier model run on Amazon SageMaker. Churn risk scores are saved to an Amazon Redshift database, which powers a dashboard.
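The hybrid pattern boils down to appending an LLM-derived sentiment score to the structured feature vector before classification. The sketch below makes that concrete; the keyword rule, thresholds, and feature names are invented so the example runs locally, whereas the real pipeline calls Llama 3 on Bedrock for sentiment and XGBoost on SageMaker for classification:

```python
def fake_llm_sentiment(message):
    """Stand-in for Llama 3 on Amazon Bedrock: score a message in [-1, 1].

    A trivial keyword rule is used purely so this example runs locally;
    the real pipeline sends the message to the hosted model instead.
    """
    negative_signals = {"cancel", "frustrated", "slow"}
    return -1.0 if any(w in message.lower() for w in negative_signals) else 0.5

def build_feature_row(customer, message):
    """Append the LLM-derived sentiment feature to structured features.

    The resulting row is what the traditional classifier (XGBoost in the
    reference architecture) would consume at training and scoring time.
    """
    return [
        customer["tenure_months"],
        customer["monthly_spend"],
        fake_llm_sentiment(message),
    ]

row = build_feature_row(
    {"tenure_months": 4, "monthly_spend": 20.0},
    "I'm frustrated and thinking about cancelling",
)
print(row)  # [4, 20.0, -1.0]
```

Keeping the sentiment call behind its own function mirrors the pipeline’s structure: the feature-engineering task and the model-training task stay decoupled, which is exactly what Airflow orchestrates across Slack, HubSpot, Bedrock, and SageMaker.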
Next steps for getting started
The six reference architectures presented in the cookbook show Apache
Airflow® providing a robust data orchestration framework to operationalize
and scale Generative AI workflows. From ingesting and preprocessing
training data, to deploying and monitoring models in production, to
governing the full AI lifecycle, Airflow enables teams to reliably
integrate, run, and govern the complex data pipelines that power AI-driven
applications.
Our teams are ready to run workshops with you to discuss how Airflow and Astronomer can accelerate your GenAI initiatives, so go ahead and contact us to schedule your session.