Day 2 Operations for LLMs with Apache Airflow®: Going Beyond the Prototype, Part 1
In late 2023, the world is undergoing a clear shift from an “AI Summer” of endless possibilities with Large Language Models (LLMs) to an “AI Autumn” marked by pragmatism and hard work as prototypes run into the challenges of enterprise adoption. A new crop of LLM application development frameworks is simplifying development, but these frameworks lack features that operational teams have come to expect when building reliable, sustainable, and auditable workflows.
Meanwhile, Apache Airflow® is at the core of many of these teams’ technology stacks, and when combined with vector databases, LLMs, and the LLM development frameworks, it facilitates the creation of enterprise-grade workflows that feed a new category of applications that are creating real business value.
In this blog, we discuss the challenges of LLM application development beyond the prototype and why Apache Airflow® was highlighted by Andreessen Horowitz’s Emerging Architectures for LLM Applications for its ability to enable day-2 operations for LLM applications.
LLM Application Paradigms
A large and rapidly growing ecosystem of publicly available LLMs and LLM application development frameworks (e.g. LangChain, LlamaIndex, Unstructured, Haystack) enables organizations to build application prototypes at a fraction of the cost and time it took even a year earlier. Most current application development efforts can be bucketed into three high-level approaches:
- Prompt Engineering: Using the models as-is and optimizing applications based on different types of prompts.
- Retrieval Augmented Generation (RAG): Adding context to prompts based on retrieval and search results from a domain-specific corpus of data.
- Fine Tuning: Creating purpose-built or optimized models by further tuning model parameters on domain-specific data, adding application-specific model layers, or ensembling models.
This blog focuses on RAG-based LLM development as it allows enterprises to tap into their intellectual property and private data to add unique, value-added context to applications. RAG-based applications do, however, add operational complexity in the form of processes to load (i.e. extract, transform, vectorize, index, and import), as well as to maintain, the unstructured data which forms the basis of this competitive advantage.
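The load steps named above (extract, transform, vectorize, index, and import) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production loader: `chunk_text`, `embed`, and `load_documents` are hypothetical names, the "embedding" is a toy hashed bag-of-words vector standing in for a real embedding model, and the "vector store" is an ordinary dictionary.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    position: int
    text: str


def chunk_text(doc_id: str, text: str, max_words: int = 50, overlap: int = 10) -> list[Chunk]:
    """Transform: split a document into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        chunks.append(Chunk(doc_id, position, " ".join(words[start:start + max_words])))
        if start + max_words >= len(words):
            break  # The last window already reached the end of the document.
    return chunks


def embed(text: str, dims: int = 16) -> list[float]:
    """Vectorize: toy hashed bag-of-words vector, normalized to unit length.

    Assumption: a real pipeline would call an embedding model here instead.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def load_documents(documents: dict[str, str], store: dict[tuple[str, int], dict]) -> int:
    """Chunk, vectorize, and import already-extracted documents; return chunks written."""
    written = 0
    for doc_id, text in documents.items():
        for chunk in chunk_text(doc_id, text):
            # Index by (doc_id, position) so a re-run overwrites instead of duplicating.
            store[(chunk.doc_id, chunk.position)] = {
                "text": chunk.text,
                "vector": embed(chunk.text),
            }
            written += 1
    return written
```

Keying the store by document and chunk position is one simple way to make the import step safe to repeat, which matters for the operational concerns discussed in the next section.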
Day-2 Operations
LLM development frameworks, such as LangChain, provide simple and extensible building blocks for prototype document processing and loading pipelines. However, as these applications become business-critical or as the prototype moves to production, additional capabilities are required to deliver the reliability, availability, serviceability, scalability, explainability, and auditability expected of enterprise-grade applications.
Loading data: Document loading is an essential part of RAG-based development. Most frameworks include constructs for building pipelines to extract data from sources, chunk the data to match LLM context windows, vectorize it, and finally load it into a vector store. When operationalizing these prototypes, many enterprises will encounter challenges with:
- Scale: Many frameworks are single-threaded and have no built-in ability to parallelize. This can result in missed SLAs when processing very large documents, or large numbers of documents and data sources.
- Logging, auditing: Few frameworks offer anything beyond basic Python logging to keep track of what data was ingested and when, or which errors occurred and why. This places a heavy burden on operational teams, who must manually debug workflows or trace data origins for governance.
- Lineage: Understanding the heritage of a piece of data and how datasets relate to each other is an important part of all data operations. It is even more important for machine learning (including LLM) use cases, where explainability is sometimes vital for good governance and regulatory compliance. None of the frameworks researched for this piece allow a developer to easily trace generated LLM responses back to original documents and document versions.
- Atomicity and idempotency: During LLM development it is often necessary to experiment with different models, parameters, or chunking strategies. When pipelines are built without atomic and idempotent units of work, it is usually necessary to rerun the entire pipeline when only one task (e.g. chunking) is changed. Most LLM development frameworks lack specific features to capture and replay “state” between tasks and task runs, or the ability to easily pick individual tasks for replay. When processing a large corpus of unstructured documents, this can result in increased development time, slower experimentation iterations, and increased costs for both compute and model APIs.
- Error handling and retry: Lack of atomicity also creates challenges in error handling. This is especially important when dealing with API throttling from both the LLM itself and many data source systems. LLM development frameworks often don’t have built-in features for automatic retry, exponential backoff, or partial result processing.
- Inter-workflow dependencies and triggering: Vector store pipelines for RAG-based applications are often only one of many related pipelines within the scope of enterprise data operations. By themselves, the LLM development frameworks don’t account for triggering from or to other workflows. This is especially troublesome when different teams develop different workflows which must work together as a seamless, orchestrated whole.
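To make the error handling item concrete, here is a minimal sketch of automatic retry with exponential backoff and partial result processing, written in plain Python. Everything named here is an assumption for illustration: `ThrottledError` is a hypothetical stand-in for whatever rate-limit error a given API client raises, and `call_with_backoff` and `embed_all` are not part of any framework mentioned in this post.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when an API responds with a rate-limit status."""


def call_with_backoff(func, *, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # The delay ceiling doubles with each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(random.uniform(0, delay))


def embed_all(chunks, embed_one, **retry_options):
    """Embed each chunk independently so one failure does not discard the rest."""
    results, failures = {}, {}
    for chunk_id, text in chunks.items():
        try:
            results[chunk_id] = call_with_backoff(lambda: embed_one(text), **retry_options)
        except ThrottledError as error:
            # Keep the successful results and record the failure for a later replay.
            failures[chunk_id] = error
    return results, failures
```

Returning successes and failures separately means a rerun only needs to revisit the failed chunk IDs, which is the kind of partial result processing that a single monolithic loading script makes difficult.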
Maintaining data: Importantly, loading data is not a one-time process. With LLMs in particular, most frameworks do not yet accommodate the need to continually monitor and update vector data stores. As enterprise adoption of LLMs increases, operational teams may find that the frameworks don’t account for the following:
- Scheduling for freshness: Document embeddings can be thought of as a context-based summarization and, as documents change or new documents become available, it is important to be able to regularly update the vector store. This means that ingest pipelines need to address all of the challenges above and also run on periodic schedules (or in response to triggering events) to keep up with the speed of business changes.
- Data refill: The strategy and technology used to chunk the data and generate embeddings have a direct effect on LLM application performance (e.g. search accuracy, conversational quality, etc.). In nearly every case, it is necessary to try different chunking strategies over time. Likewise, new LLMs are becoming available almost monthly, and re-embedding documents can result in a considerable increase in application performance. When development frameworks don’t account for refilling vector stores, the operational team must duplicate all of the effort of loading data.
- Feedback processing: Many applications will have mechanisms for capturing user feedback, which is often extremely valuable for continuous improvement of both the interfaces and the underlying data. Tools like LangSmith show great promise for capturing and processing that data. However, additional tools are needed to automate and orchestrate these processes in the broader context of the enterprise’s data orchestration.
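One way to picture freshness and refill together is a small planning step that compares the current documents with fingerprints recorded at the last ingest. The sketch below is an illustration under assumptions, not a feature of any framework named here: `fingerprint` and `plan_refresh` are hypothetical helpers, and `pipeline_version` is an assumed label for the chunking strategy and embedding model in use.

```python
import hashlib


def fingerprint(text: str, pipeline_version: str) -> str:
    """Hash the document content together with the pipeline version.

    Including the version means a new chunking strategy or embedding model
    marks every document as stale, which is what a refill needs.
    """
    payload = f"{pipeline_version}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_refresh(
    current_docs: dict[str, str],
    indexed: dict[str, str],
    pipeline_version: str,
) -> dict[str, list[str]]:
    """Decide which documents to add, update, or delete in the vector store.

    current_docs maps doc_id to text; indexed maps doc_id to the fingerprint
    stored when that document was last embedded.
    """
    plan = {"add": [], "update": [], "delete": []}
    for doc_id, text in current_docs.items():
        if doc_id not in indexed:
            plan["add"].append(doc_id)
        elif indexed[doc_id] != fingerprint(text, pipeline_version):
            plan["update"].append(doc_id)
    # Documents that disappeared from the source should leave the store too.
    plan["delete"] = [doc_id for doc_id in indexed if doc_id not in current_docs]
    return {action: sorted(ids) for action, ids in plan.items()}
```

Run on a schedule, a plan like this keeps each ingest incremental: unchanged documents are skipped, while a bump to `pipeline_version` turns the same job into a full refill.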
To summarize, LLM application development frameworks are incredibly useful tools for building LLM prototypes but lack features for day-2 operations.
There’s an app for that…
It may be obvious at this point that these are all challenges the industry has seen before. In reality, LLM application loading is almost identical to the normal extract, transform, and load (ETL/ELT) processes of other data pipelines and should be treated with the same rigor and engineering discipline. Important differences arise, however, with pipelines for RAG-based applications due to the “living” nature of document embeddings which change over time, rapid evolution of models, standards and techniques, and the importance of feedback loops. These factors make it even more important to build operational workflows with an enterprise-grade orchestration framework.
Leading analysts highlight Apache Airflow® for its ability to bring day-2 operations to LLM development frameworks. In addressing the challenges listed above, Apache Airflow® provides:
- Atomicity and idempotency: DAGs and Tasks are the fundamental building blocks of data orchestration with Airflow. LLM application code prototypes can be broken into atomic units of work with little recoding. Airflow keeps track of history and state for both DAGs and Tasks.
- Error handling and retry: Airflow task state also makes it possible to rerun individual tasks (and their upstream or downstream dependencies). This allows LLM developers to quickly experiment with, for instance, different chunking strategies without needing to execute the entire pipeline or manually rebuild state for an experiment. Likewise, failed tasks can be set to retry automatically, which becomes very useful when encountering API rate limits, throttling, or network errors.
- Data refill: By building DAGs as idempotent units of work, the entire process of document embedding can be replayed in Airflow to account for multiple vector databases (e.g. dev and prod), experiment with different vector database providers, or perform A/B testing across different strategies or models.
- Scheduling for freshness: Airflow’s advanced scheduler system allows teams to build incremental ingest DAGs that regularly scan document stores and ingest or update vector stores to keep search results fresh and accurate.
- Integrations: Apache Airflow® has a massive number of integrations to nearly every data source needed for LLM document loading and also first-class integrations with vector databases like Weaviate for simple, maintainable code.
- Scale: Airflow tasks can be easily scaled horizontally to process large numbers of data sources dynamically. Likewise, multiple options for Airflow executors, as well as integrations with compute frameworks like Spark and Dask, bring vertical scalability for memory- and compute-intensive processing of large data.
- Logging, auditing: Every operation in Airflow is logged to facilitate not only auditing but also debugging. With centralized logging, LLM developers don’t need to worry about searching for error messages only to find they were not captured. Additionally, data lineage is captured for Airflow pipelines, simplifying not only governance but also the explainability of LLM results.
- Lineage: Apache Airflow® integrates with frameworks like OpenLineage to provide detailed tracking and observability. This makes it possible for developers to better understand and debug hallucinations and other errors by tracking LLM responses back to specific chunks, documents and data sources. This is also extremely important for informing experimentation with different chunking strategies.
- CI/CD: Airflow’s pipeline-as-code approach means that LLM developers can build RAG pipelines that snap into the enterprise’s existing frameworks and software development standards.
- Inter-workflow dependencies and triggering: In reality, the RAG-based pipelines are only one of many pipelines that will likely be needed to feed enterprise-grade LLM applications. Many enterprises rely on Airflow to consolidate all of their data orchestration, which means that RAG pipelines can snap into the existing frameworks and trigger, or be triggered by, other workflows. Data-aware scheduling makes it possible to trigger workflows based on updates to dependent data.
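The value of atomic, idempotent tasks with tracked state can be shown without any orchestrator at all. The sketch below is deliberately library-free and is not Airflow code: `TaskStore` is a hypothetical in-memory stand-in for the task state an orchestrator would persist, and the three task functions are toy placeholders. It only illustrates the property described in the list above, namely that changing one step (here, the chunk size) should rerun that step and its downstream work while reusing everything upstream.

```python
import hashlib
import json


class TaskStore:
    """In-memory stand-in for the task state an orchestrator would persist."""

    def __init__(self):
        self.results = {}
        self.executions = []  # Names of tasks that actually ran, in order.

    def run(self, name, func, inputs, params):
        # The key covers everything that can change the output, so a task
        # with the same name, inputs, and parameters is never recomputed.
        key_source = json.dumps([name, inputs, params], sort_keys=True)
        key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
        if key not in self.results:
            self.results[key] = func(inputs, **params)
            self.executions.append(name)
        return self.results[key]


def extract(_inputs):
    return "one two three four five six"  # Placeholder for reading a document.


def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(chunks):
    return [len(c) for c in chunks]  # Placeholder for a real embedding call.


def pipeline(store, chunk_size):
    text = store.run("extract", extract, None, {})
    chunks = store.run("chunk", chunk, text, {"size": chunk_size})
    return store.run("embed", embed, chunks, {})
```

Calling `pipeline(store, 3)` twice executes each task once; calling `pipeline(store, 2)` afterwards reruns only `chunk` and `embed`, because the extract step's inputs and parameters are unchanged.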
Apache Airflow® has become the glue that holds the modern data stack together. This is as true for MLOps as for traditional DataOps. As a Python-based tool, Airflow integrates well with all of the most popular LLM development frameworks and enables enterprises to not only prototype LLM applications quickly but also to operationalize them and build production-quality workflows.
If you’re interested in getting started with Apache Airflow® for MLOps, you can spin up Airflow in less than 5 minutes with a free trial of Astro.