Version: Airflow 3.x

Apache Airflow® Quickstart - Generative AI

Generative AI: An introduction to generative AI model development with Airflow.

Step 1: Clone the Astronomer Quickstart repository

  1. Create a new directory for your project and open it:

    mkdir airflow-quickstart-genai && cd airflow-quickstart-genai
  2. Clone the repository and open it:

    git clone -b generative-ai-3 --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/generative-ai

    Your directory should have the following structure:

    .
    ├── Dockerfile
    ├── README.md
    ├── airflow_settings.yaml
    ├── dags
    │   └── example_vector_embeddings.py
    ├── include
    │   ├── custom_functions
    │   │   └── embedding_func.py
    │   └── data
    │       └── galaxy_names.txt
    ├── packages.txt
    ├── requirements.txt
    ├── solutions
    │   └── example_vector_embeddings_solution.py
    └── tests
        └── dags
            └── test_dag_integrity.py

Step 2: Start up Airflow and explore the UI

  1. Start the project using the Astro CLI:

    astro dev start

    The CLI will let you know when all Airflow services are up and running.

tip

At this time, Safari will not work properly with the UI. If Safari is your default browser, use Chrome to open Airflow 3.0.

  2. If the Airflow UI doesn't launch automatically, open localhost:8080 in your browser and sign in using username admin and password admin.

  3. Explore the DAGs view (landing page) and individual DAG view page to get a sense of the metadata available about the DAG, run, and all task instances. For a deep-dive into the UI's features, see An introduction to the Airflow UI.

    For example, the DAGs view will look like this screenshot:

    Airflow UI DAGs view

Run the DAG a few times so that its tasks execute and the UI has run history to display. The DAG-specific view then shows what happened in each run.

Airflow UI DAG specific view

You can also open an individual task and check its logs, which show exactly what happened while that task ran.

Airflow UI DAG specific view

Step 3: Explore the project

Apache Airflow is one of the most common orchestration engines for AI/Machine Learning jobs, especially for retrieval-augmented generation (RAG). This project shows a simple example of building vector embeddings for text and then performing a semantic search on the embeddings.

The DAG (directed acyclic graph) in the project demonstrates how to leverage Airflow's automation and orchestration capabilities to:

  • Orchestrate a generative AI pipeline.
  • Compute vector embeddings of words.
  • Compare the embeddings of a word of interest to a list of words to find the semantically closest match.
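
The comparison in the last bullet can be sketched in plain Python. This is a minimal illustration and not the project's code: the three-dimensional vectors are made-up placeholders (a real embedding model produces much longer vectors), the helper names `cosine_similarity` and `closest_word` are hypothetical, and cosine similarity is assumed here as one common choice of metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_word(query: list[float], embeddings: dict[str, list[float]]) -> str:
    """Return the word whose embedding is most similar to the query vector."""
    return max(embeddings, key=lambda word: cosine_similarity(query, embeddings[word]))


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "sun": [0.9, 0.1, 0.0],
    "planet": [0.6, 0.7, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the word of interest.
query_vector = [0.8, 0.3, 0.0]

print(closest_word(query_vector, toy_embeddings))  # prints "sun"
```

In the project itself, the vectors would come from an embedding model and the search would run against the DuckDB table rather than a Python dictionary.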
warning

This project uses DuckDB, an in-memory database, to store the words and their embeddings. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!

For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).

Pipeline structure

An Airflow project can have any number of DAGs (directed acyclic graphs), the main building blocks of Airflow pipelines. This project has one:

example_vector_embeddings

This DAG contains six tasks:

  • get_words gets a list of words from the context to embed.

  • create_embeddings creates embeddings for the list of words.

  • create_vector_table creates a table in the DuckDB database and an HNSW index on the embedding vector.

  • insert_words_into_db inserts the words and embeddings into the table.

  • embed_word embeds a single word and returns the embeddings.

  • find_closest_word_match finds the closest match to a word of interest.
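
A DAG fixes which tasks must finish before others can start, so the six tasks can be viewed as a small dependency graph. The sketch below uses Python's standard-library `graphlib` rather than Airflow itself. The edges are an assumption inferred from the task descriptions above; the authoritative dependencies are defined in `dags/example_vector_embeddings.py`.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it waits on.
# These edges are assumed from the task descriptions, not read from the DAG file.
assumed_dependencies = {
    "get_words": set(),
    "create_embeddings": {"get_words"},
    "create_vector_table": set(),
    "insert_words_into_db": {"create_embeddings", "create_vector_table"},
    "embed_word": set(),
    "find_closest_word_match": {"insert_words_into_db", "embed_word"},
}

# static_order() raises graphlib.CycleError if the graph contains a cycle,
# which is the "acyclic" requirement in "directed acyclic graph".
execution_order = list(TopologicalSorter(assumed_dependencies).static_order())
print(execution_order)
```

Any order that respects these edges is valid. Under this assumed graph, tasks with no edge between them, such as `get_words` and `create_vector_table`, could run in parallel.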

Next Steps:

Run Airflow on Astro

The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.

Further Reading

Here are a few guides that may help you learn more about the topics discussed in this quickstart:
