Apache Airflow® Quickstart - Generative AI
Generative AI: An introduction to generative AI model development with Airflow.
Step 1: Clone the Astronomer Quickstart repository
- Create a new directory for your project and open it:

  mkdir airflow-quickstart-genai && cd airflow-quickstart-genai

- Clone the repository and open it:

  git clone -b generative-ai-3 --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/generative-ai
Your directory should have the following structure:
.
├── Dockerfile
├── README.md
├── airflow_settings.yaml
├── dags
│   └── example_vector_embeddings.py
├── include
│   ├── custom_functions
│   │   └── embedding_func.py
│   └── data
│       └── galaxy_names.txt
├── packages.txt
├── requirements.txt
├── solutions
│   └── example_vector_embeddings_solution.py
└── tests
    └── dags
        └── test_dag_integrity.py
Step 2: Start up Airflow and explore the UI
- Start the project using the Astro CLI:

  astro dev start

  The CLI will let you know when all Airflow services are up and running.

  At this time, Safari will not work properly with the UI. If Safari is your default browser, use Chrome to open Airflow 3.0.

- If it doesn't launch automatically, navigate your browser to localhost:8080 and sign in to the Airflow UI using username admin and password admin.

- Explore the DAGs view (landing page) and the individual DAG view page to get a sense of the metadata available about the DAG, its runs, and all task instances. For a deep dive into the UI's features, see An introduction to the Airflow UI.
Run the DAG a few times to watch the pipeline execute end to end and see exactly what is happening in each run. You can also open individual tasks and check their logs to see what each task is doing.
Step 3: Explore the project
Apache Airflow is one of the most common orchestration engines for AI/Machine Learning jobs, especially for retrieval-augmented generation (RAG). This project shows a simple example of building vector embeddings for text and then performing a semantic search on the embeddings.
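Before looking at the DAG itself, here is a minimal standalone sketch of the two core operations, embedding words and finding the semantically closest match. It assumes the sentence-transformers package and an arbitrary model name; the project's own include/custom_functions/embedding_func.py may use a different model or library.

```python
# Minimal sketch of embedding + semantic search, independent of Airflow.
# The model name is an illustrative choice, not necessarily what this
# project uses.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["galaxy", "nebula", "teacup", "asteroid"]
word_of_interest = "star cluster"

# Encode the candidate words and the word of interest into vectors.
word_vectors = model.encode(words)             # shape: (len(words), dim)
query_vector = model.encode(word_of_interest)  # shape: (dim,)

# Cosine similarity: higher means semantically closer.
similarities = word_vectors @ query_vector / (
    np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(query_vector)
)
closest = words[int(np.argmax(similarities))]
print(f"Closest match to '{word_of_interest}': {closest}")
```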
The DAG (directed acyclic graph) in the project demonstrates how to leverage Airflow's automation and orchestration capabilities to:
- Orchestrate a generative AI pipeline.
- Compute vector embeddings of words.
- Compare the embeddings of a word of interest to a list of words to find the semantically closest match.
This project uses DuckDB, an in-memory database, to store the word embeddings. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!
For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
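As a rough illustration of that difference, the snippet below shows how a DuckDB connection can target an in-memory database, a local file, or MotherDuck. The file path and database name are placeholders, not part of this project.

```python
import duckdb

# In-memory database: fast, but the data is gone when the process exits.
ephemeral = duckdb.connect(":memory:")

# File-backed database: persists between runs (path is a placeholder).
persistent = duckdb.connect("include/embeddings.duckdb")

# MotherDuck, DuckDB's hosted option: requires an account and access token.
# cloud = duckdb.connect("md:my_database")
```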
Pipeline structure
An Airflow project can have any number of DAGs (directed acyclic graphs), the main building blocks of Airflow pipelines. This project has one:
example_vector_embeddings
This DAG contains six tasks:
- get_words gets a list of words from the context to embed.
- create_embeddings creates embeddings for the list of words.
- create_vector_table creates a table in the DuckDB database and an HNSW index on the embedding vector.
- insert_words_into_db inserts the words and embeddings into the table.
- embed_word embeds a single word and returns the embeddings.
- find_closest_word_match finds the closest match to a word of interest.
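To make the pipeline structure more concrete, here is a hedged sketch of how two of these tasks could be written with the TaskFlow API and DuckDB's vss extension. The table name, column names, vector dimension, and distance function are assumptions; see dags/example_vector_embeddings.py and solutions/example_vector_embeddings_solution.py for the actual implementation.

```python
# Sketch only: table/column names, the vector dimension, and the distance
# metric are assumptions; the real DAG in this project may differ.
from airflow.decorators import task
import duckdb

DB_PATH = "include/embeddings.duckdb"  # placeholder path
VECTOR_DIM = 384                       # depends on the embedding model used


@task
def create_vector_table():
    con = duckdb.connect(DB_PATH)
    con.execute("INSTALL vss; LOAD vss;")
    # Needed to create an HNSW index in a file-backed DuckDB database;
    # HNSW persistence is still experimental in the vss extension.
    con.execute("SET hnsw_enable_experimental_persistence = true;")
    con.execute(
        f"CREATE TABLE IF NOT EXISTS word_embeddings "
        f"(word VARCHAR, embedding FLOAT[{VECTOR_DIM}])"
    )
    # The HNSW index speeds up nearest-neighbor search on the embedding column.
    con.execute(
        "CREATE INDEX IF NOT EXISTS word_idx "
        "ON word_embeddings USING HNSW (embedding)"
    )


@task
def find_closest_word_match(query_embedding: list[float]) -> str:
    con = duckdb.connect(DB_PATH)
    con.execute("LOAD vss;")
    row = con.execute(
        f"""
        SELECT word
        FROM word_embeddings
        ORDER BY array_distance(embedding, ?::FLOAT[{VECTOR_DIM}])
        LIMIT 1
        """,
        [query_embedding],
    ).fetchone()
    return row[0]
```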
Next Steps
Run Airflow on Astro
The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.
Further Reading
Here are a few guides that may help you learn more about the topics discussed in this quickstart:
- Learn how to implement dynamic tasks, for use cases like training multiple models (a short sketch follows this list).
- Check out our integrations and connections section for tutorials on using Airflow with many common AI and ML tools.
- Read about rerunning dags and tasks to learn how to manage failures, reprocess historical data, and more.
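For reference, here is a minimal, self-contained sketch of the dynamic task mapping pattern mentioned above; the DAG and task names are made up for illustration and are not part of this project. The .expand() call creates one mapped task instance per element of the upstream list, which is the same pattern you would use to train one model per dataset or per hyperparameter set.

```python
# Minimal sketch of dynamic task mapping. Names here are illustrative.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def dynamic_embedding_example():
    @task
    def get_words() -> list[str]:
        return ["galaxy", "nebula", "asteroid"]

    @task
    def embed_word(word: str) -> int:
        # Placeholder "embedding": just the word length.
        return len(word)

    # One embed_word task instance is created per word at runtime.
    embed_word.expand(word=get_words())


dynamic_embedding_example()
```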