Orchestrate Weaviate operations with Apache Airflow

Info

This page has not yet been updated for Airflow 3. The concepts shown are relevant, but some code may need to be updated. If you run any examples, take care to update import statements and watch for any other breaking changes.

Weaviate is an open source vector database that stores high-dimensional embeddings of objects like text, images, audio, or video. The Weaviate Airflow provider offers modules to easily integrate Weaviate with Airflow.

In this tutorial you’ll use Airflow to ingest movie descriptions into Weaviate, use Weaviate’s automatic vectorization to create vectors for the descriptions, and query Weaviate for movies that are thematically close to user-provided concepts.

Other ways to learn

There are multiple resources for learning about this topic. See also:

Webinar: Modern Infrastructure for World Class AI Applications.

Why use Airflow with Weaviate?

Weaviate allows you to store objects alongside their vector embeddings and to query these objects based on their similarity. Vector embeddings are key components of many modern machine learning models such as LLMs or ResNet.

Integrating Weaviate with Airflow into one end-to-end machine learning pipeline allows you to:

Use Airflow’s data-driven scheduling to run operations on Weaviate based on upstream events in your data ecosystem, such as when a new model is trained or a new dataset is available.
Run dynamic queries against Weaviate, based on upstream events in your data ecosystem or on user input via Airflow params, to retrieve objects with similar vectors.
Add Airflow features like retries and alerts to your Weaviate operations.

Time to complete

This tutorial takes approximately 30 minutes to complete.

Assumed knowledge

To get the most out of this tutorial, make sure you have an understanding of:

The basics of Weaviate. See Weaviate Introduction.
Airflow fundamentals, such as writing DAGs and defining tasks. See Get started with Apache Airflow.
Airflow decorators. See Introduction to the TaskFlow API and Airflow decorators.
Airflow hooks. See Hooks 101.
Airflow connections. See Managing your Connections in Apache Airflow.

Prerequisites

The Astro CLI.
(Optional) An OpenAI API key of at least tier 1 if you want to use OpenAI for vectorization. The tutorial can be completed using local vectorization with text2vec-transformers if you don’t have an OpenAI API key.

This tutorial uses a local Weaviate instance created as a Docker container. You do not need to install the Weaviate client locally.

Info

The example code from this tutorial is also available on GitHub.

Step 1: Configure your Astro project

Create a new Astro project:

$ mkdir astro-weaviate-tutorial && cd astro-weaviate-tutorial
$ astro dev init

Add build-essential to your packages.txt file to be able to install the Weaviate Airflow Provider.

build-essential

Add the following two packages to your requirements.txt file to install the Weaviate Airflow provider and the Weaviate Python client in your Astro project:

apache-airflow-providers-weaviate==2.0.0
weaviate-client==4.7.1

This tutorial uses a local Weaviate instance and a text2vec-transformers model, each running in its own Docker container. To add these containers to your Astro project, create a new file in your project’s root directory called docker-compose.override.yml and add the following:

version: '3.1'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.6
    command: "--host 0.0.0.0 --port '8081' --scheme http"
    ports:
      - "8081:8081"
      - "50051:50051"
    volumes:
      - ./include/weaviate/backup:/var/lib/weaviate/backup
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'readonlykey,adminkey'
      AUTHENTICATION_APIKEY_USERS: 'jane@doe.com,john@doe.com'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-openai'
      ENABLE_MODULES: 'text2vec-openai, backup-filesystem, qna-openai, text2vec-transformers'
      BACKUP_FILESYSTEM_PATH: '/var/lib/weaviate/backup'
      CLUSTER_HOSTNAME: 'node1'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
    networks:
      - airflow
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
    ports:
      - 8082:8080
    networks:
      - airflow
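Once the containers are running, you may want to confirm that Weaviate is reachable before triggering any DAG. The following is a minimal sketch using only the Python standard library. It assumes that you run it on the host machine, that port 8081 is mapped as in the file above, and that Weaviate's /v1/.well-known/ready endpoint answers without an API key in this setup (not verified here). The helper names readiness_url and is_ready are hypothetical.

```python
import urllib.error
import urllib.request


def readiness_url(host="localhost", port=8081):
    """Build the URL of Weaviate's readiness endpoint (host and port are assumptions)."""
    return f"http://{host}:{port}/v1/.well-known/ready"


def is_ready(host="localhost", port=8081, timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(readiness_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, or a non-2xx response: treat as not ready
        return False


if __name__ == "__main__":
    print("Weaviate ready:", is_ready())
```

If the check returns False, inspecting the container logs is usually the quickest way to find out why.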

To create an Airflow connection to the local Weaviate instance, add the following environment variable to your .env file. You only need to provide an X-OpenAI-Api-Key if you plan on using the OpenAI API for vectorization. To create a connection to your Weaviate Cloud instance, refer to the commented connection version below.

## Local Weaviate connection
AIRFLOW_CONN_WEAVIATE_DEFAULT='{
    "conn_type":"weaviate",
    "host":"weaviate",
    "port":"8081",
    "extra":{
        "token":"adminkey",
        "additional_headers":{"X-Openai-Api-Key":"<YOUR OPENAI API KEY>"},
        "grpc_port":"50051",
        "grpc_host":"weaviate",
        "grpc_secure":"False",
        "http_secure":"False"
    }
}'
## The Weaviate Cloud connection uses the following pattern:
# AIRFLOW_CONN_WEAVIATE_DEFAULT='{
#     "conn_type":"weaviate",
#     "host":"<YOUR HOST>.gcp.weaviate.cloud",
#     "port":"8081",
#     "extra":{
#         "token":"<YOUR WEAVIATE KEY>",
#         "additional_headers":{"X-Openai-Api-Key":"<YOUR OPENAI API KEY>"},
#         "grpc_port":"443",
#         "grpc_host":"grpc-<YOUR HOST>.gcp.weaviate.cloud",
#         "grpc_secure":"True",
#         "http_secure":"True"
#     }
# }'
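Because the whole connection is passed as one JSON string, a quoting mistake can be hard to spot. As a minimal sketch using only the Python standard library, the snippet below builds the local connection shown above as a dictionary and serializes it, so you can paste the printed line into your .env file. The helper name build_weaviate_conn and its openai_api_key parameter are hypothetical, and leaving out additional_headers when no key is given is an assumption based on the note that the header is only needed for OpenAI vectorization.

```python
import json


def build_weaviate_conn(openai_api_key=None):
    """Return the tutorial's local Weaviate connection as a JSON string."""
    extra = {
        "token": "adminkey",
        "grpc_port": "50051",
        "grpc_host": "weaviate",
        "grpc_secure": "False",
        "http_secure": "False",
    }
    # assumption: the OpenAI header is only added when a key is supplied
    if openai_api_key:
        extra["additional_headers"] = {"X-Openai-Api-Key": openai_api_key}
    conn = {
        "conn_type": "weaviate",
        "host": "weaviate",
        "port": "8081",
        "extra": extra,
    }
    return json.dumps(conn)


conn_json = build_weaviate_conn()
parsed = json.loads(conn_json)  # round trip to confirm the string is valid JSON
print(f"AIRFLOW_CONN_WEAVIATE_DEFAULT='{conn_json}'")
```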

Tip

See the Weaviate documentation on environment variables, models, and client instantiation for more information on configuring a Weaviate instance and connection.

Step 2: Add your data

The DAG in this tutorial runs a query on vectorized movie descriptions from IMDB. If you run the project locally, Astronomer recommends testing the pipeline with a small subset of the data. If you use a remote vectorizer like text2vec-openai, you can use larger parts of the full dataset.

Create a new file called movie_data.txt in the include directory, then copy and paste the following information:

1 ::: Arrival (2016) ::: sci-fi ::: A linguist works with the military to communicate with alien lifeforms after twelve mysterious spacecraft appear around the world.
2 ::: Don't Look Up (2021) ::: drama ::: Two low-level astronomers must go on a giant media tour to warn humankind of an approaching comet that will destroy planet Earth.
3 ::: Primer (2004) ::: sci-fi ::: Four friends/fledgling entrepreneurs, knowing that there's something bigger and more innovative than the different error-checking devices they've built, wrestle over their new invention.
4 ::: Serenity (2005) ::: sci-fi ::: The crew of the ship Serenity try to evade an assassin sent to recapture telepath River.
5 ::: Upstream Colour (2013) ::: romance ::: A man and woman are drawn together, entangled in the life cycle of an ageless organism. Identity becomes an illusion as they struggle to assemble the loose fragments of wrecked lives.
6 ::: The Matrix (1999) ::: sci-fi ::: When a beautiful stranger leads computer hacker Neo to a forbidding underworld, he discovers the shocking truth--the life he knows is the elaborate deception of an evil cyber-intelligence.
7 ::: Inception (2010) ::: sci-fi ::: A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster.
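Each line separates an index, a title with its release year, a genre, and a description with `:::`. To check how a line breaks down into fields before running the pipeline, you could use a small standalone sketch like the one below. The function name parse_movie_line is hypothetical, and the check for exactly four fields is an assumption that is stricter than what the DAG in the next step does.

```python
import re

SAMPLE_LINE = (
    "1 ::: Arrival (2016) ::: sci-fi ::: A linguist works with the military "
    "to communicate with alien lifeforms after twelve mysterious spacecraft "
    "appear around the world."
)


def parse_movie_line(line):
    """Split one line of movie_data.txt into fields, or return None if malformed."""
    parts = [part.strip() for part in line.split(":::")]
    if len(parts) != 4:
        return None
    # the second field holds the title followed by the year in parentheses
    match = re.match(r"(.+) \((\d{4})\)", parts[1])
    if match is None:
        return None
    title, year = match.groups()
    return {
        "title": title,
        "year": int(year),
        "genre": parts[2],
        "description": parts[3],
    }


movie = parse_movie_line(SAMPLE_LINE)
print(movie["title"], movie["year"], movie["genre"])
```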

Step 3: Create your DAG

In your dags folder, create a file called query_movie_vectors.py.

Copy the following code into the file. If you want to use text2vec-openai for vectorization, change the VECTORIZER variable to text2vec-openai and make sure you provide an OpenAI API key in the AIRFLOW_CONN_WEAVIATE_DEFAULT in your .env file.

"""
## Use the Airflow Weaviate Provider to generate and query vectors for movie descriptions

This DAG runs a simple MLOps pipeline that uses the Weaviate Provider to import
movie descriptions, generate vectors for them, and query the vectors for movies based on
concept descriptions.
"""

from airflow.decorators import dag, task
from airflow.models.param import Param
from airflow.operators.empty import EmptyOperator
from airflow.models.baseoperator import chain
from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
from airflow.providers.weaviate.operators.weaviate import WeaviateIngestOperator
from weaviate.util import generate_uuid5
import weaviate.classes.config as wvcc
from pendulum import datetime
import logging
import re

t_log = logging.getLogger("airflow.task")

WEAVIATE_USER_CONN_ID = "weaviate_default"
TEXT_FILE_PATH = "include/movie_data.txt"
# the base collection name is used to create a unique collection name for the vectorizer
# note that it is best practice to capitalize the first letter of the collection name
COLLECTION_NAME = "Movie"

# set the vectorizer to text2vec-openai if you want to use the openai model
# note that using the OpenAI vectorizer requires a valid API key in the
# AIRFLOW_CONN_WEAVIATE_DEFAULT connection.
# If you want to use a different vectorizer model
# (https://weaviate.io/developers/weaviate/model-providers)
# make sure to also add it to the weaviate configuration's `ENABLE_MODULES` list
# for example in the docker-compose.override.yml file
VECTORIZER = wvcc.Configure.Vectorizer.text2vec_transformers()
# VECTORIZER = wvcc.Configure.Vectorizer.text2vec_openai(model="ada")


@dag(
    start_date=datetime(2023, 9, 1),
    schedule=None,
    catchup=False,
    tags=["weaviate"],
    params={
        "movie_concepts": Param(
            ["innovation", "friends"],
            type="array",
            description=(
                "What kind of movie do you want to watch today?"
                + " Add one concept per line."
            ),
        ),
    },
)
def query_movie_vectors():
    @task.branch
    def check_for_collection(conn_id: str, collection_name: str) -> str:
        "Check if the provided collection already exists and decide on the next step."
        # connect to Weaviate using the Airflow connection `conn_id`
        hook = WeaviateHook(conn_id)

        # check if the collection exists in the Weaviate database
        collection = hook.get_conn().collections.exists(collection_name)

        # a branch task returns the task_id of the task to run next
        if collection:
            t_log.info(f"Collection {collection_name} already exists.")
            return "collection_exists"
        else:
            t_log.info(f"Collection {collection_name} does not exist yet.")
            return "create_collection"

    @task
    def create_collection(conn_id: str, collection_name: str, vectorizer):
        "Create a collection with the provided name and vectorizer."
        # `vectorizer` is a vectorizer config object such as VECTORIZER above
        hook = WeaviateHook(conn_id)

        hook.create_collection(name=collection_name, vectorizer_config=vectorizer)

    collection_exists = EmptyOperator(task_id="collection_exists")

    def import_data_func(text_file_path: str, collection_name: str):
        "Read the text file and create a list of dicts for ingestion to Weaviate."
        with open(text_file_path, "r") as f:
            lines = f.readlines()

            num_skipped_lines = 0
            data = []
            for line in lines:
                parts = line.split(":::")
                title_year = parts[1].strip()
                match = re.match(r"(.+) \((\d{4})\)", title_year)
                try:
                    title, year = match.groups()
                    year = int(year)
                # skip malformed lines (no "Title (YYYY)" match)
                except (AttributeError, ValueError):
                    num_skipped_lines += 1
                    continue

                genre = parts[2].strip()
                description = parts[3].strip()

                data.append(
                    {
                        "movie_id": generate_uuid5(
                            identifier=[title, year, genre, description],
                            namespace=collection_name,
                        ),
                        "title": title,
                        "year": year,
                        "genre": genre,
                        "description": description,
                    }
                )

            print(
                f"Created a list with {len(data)} elements while skipping {num_skipped_lines} lines."
            )
            return data

    import_data = WeaviateIngestOperator(
        task_id="import_data",
        conn_id=WEAVIATE_USER_CONN_ID,
        collection_name=COLLECTION_NAME,
        input_json=import_data_func(
            text_file_path=TEXT_FILE_PATH, collection_name=COLLECTION_NAME
        ),
        trigger_rule="none_failed",
    )

    @task
    def query_embeddings(weaviate_conn_id: str, collection_name: str, **context):
        "Query the Weaviate instance for movies based on the provided concepts."
        hook = WeaviateHook(weaviate_conn_id)
        movie_concepts = context["params"]["movie_concepts"]

        my_movie_collection = hook.get_collection(collection_name)

        movie = my_movie_collection.query.near_text(
            query=movie_concepts,
            return_properties=["title", "year", "genre", "description"],
            limit=1,
        )

        movie_title = movie.objects[0].properties["title"]
        movie_year = movie.objects[0].properties["year"]
        movie_genre = movie.objects[0].properties["genre"]
        movie_description = movie.objects[0].properties["description"]

        t_log.info(f"You should watch {movie_title}!")
        t_log.info(
            f"It was filmed in {int(movie_year)} and belongs to the {movie_genre} genre."
        )
        t_log.info(f"Description: {movie_description}")

    chain(
        check_for_collection(
            conn_id=WEAVIATE_USER_CONN_ID, collection_name=COLLECTION_NAME
        ),
        [
            create_collection(
                conn_id=WEAVIATE_USER_CONN_ID,
                collection_name=COLLECTION_NAME,
                vectorizer=VECTORIZER,
            ),
            collection_exists,
        ],
        import_data,
        query_embeddings(
            weaviate_conn_id=WEAVIATE_USER_CONN_ID, collection_name=COLLECTION_NAME
        ),
    )


query_movie_vectors()

This DAG consists of five tasks to make a simple ML orchestration pipeline.

The check_for_collection task uses the WeaviateHook to check if a collection named COLLECTION_NAME already exists in your Weaviate instance. The task is defined using the @task.branch decorator and returns the ID of the task to run next, depending on whether the collection exists. If the collection exists, the DAG runs the empty collection_exists task. If the collection does not exist, the DAG runs the create_collection task.
The create_collection task uses the WeaviateHook to create a collection with the COLLECTION_NAME and specified VECTORIZER in your Weaviate instance.
The import_data task is defined using the WeaviateIngestOperator and ingests the data into Weaviate. You can run any Python code on the data before ingesting it by passing the return value of a Python function, here import_data_func, to the input_json parameter. This makes it possible to create your own embeddings or complete other transformations before ingesting the data. In this example we use automatic schema inference and vector creation by Weaviate.
The query_embeddings task uses the WeaviateHook to connect to the Weaviate instance and run a query. The query returns the most similar movie to the concepts provided by the user when running the DAG in the next step.
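The import step derives each movie_id from the movie's own fields with generate_uuid5, so the same movie should map to the same ID every time the file is read. The sketch below illustrates that idea with name-based (version 5) UUIDs from the Python standard library. It is an illustration of the concept, not the Weaviate client's implementation, and the helper name deterministic_id is hypothetical.

```python
import uuid


def deterministic_id(fields, namespace):
    """Derive a version 5 UUID from a namespace string and a list of fields."""
    # assumption: concatenating namespace and fields is one simple way to build the name
    name = str(namespace) + str(fields)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))


first = deterministic_id(["Primer", 2004, "sci-fi"], "Movie")
second = deterministic_id(["Primer", 2004, "sci-fi"], "Movie")
other = deterministic_id(["Arrival", 2016, "sci-fi"], "Movie")

print(first == second)  # identical input yields the identical ID
print(first == other)   # different input yields a different ID
```

Deterministic IDs like these make it possible to recognize an object that has already been created from the same input.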

Step 4: Run your DAG

Run astro dev start in your Astro project to start Airflow and open the Airflow UI at localhost:8080.
In the Airflow UI, run the query_movie_vectors DAG by clicking the play button. Then, provide Airflow params for movie_concepts.

Note that if you are running the project locally on a larger dataset, the import_data task might take longer to complete because Weaviate generates the vector embeddings in this task.

View your movie suggestion in the task logs of the query_embeddings task:

[2024-08-15, 13:34:10 UTC] {query_movie_vectors.py:155} INFO - You should watch Primer!
[2024-08-15, 13:34:10 UTC] {query_movie_vectors.py:156} INFO - It was filmed in 2004 and belongs to the sci-fi genre.
[2024-08-15, 13:34:10 UTC] {query_movie_vectors.py:159} INFO - Description: Four friends/fledgling entrepreneurs, knowing that there's something bigger and more innovative than the different error-checking devices they've built, wrestle over their new invention.
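To build intuition for how a similarity query picks a result, the toy sketch below ranks two movies by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration and are not real embeddings. Real vectors from text2vec-transformers or OpenAI have hundreds of dimensions, and Weaviate performs the search internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up toy vectors, NOT real embeddings
toy_vectors = {
    "Primer": [0.9, 0.8, 0.1],
    "Arrival": [0.2, 0.1, 0.9],
}
# made-up vector standing in for the concepts "innovation" and "friends"
query_vector = [1.0, 0.7, 0.0]

# pick the title whose vector points in the direction closest to the query
best_match = max(
    toy_vectors,
    key=lambda title: cosine_similarity(toy_vectors[title], query_vector),
)
print(best_match)
```

With these toy numbers the Primer vector points in nearly the same direction as the query, so it ranks first.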

Conclusion

Congratulations! You used Airflow and Weaviate to get your next movie suggestion!