Use DuckDB with Apache Airflow
Info
This page has not yet been updated for Airflow 3. The concepts shown are relevant, but some code may need to be updated. If you run any examples, take care to update import statements and watch for any other breaking changes.
DuckDB is an open-source in-process SQL OLAP database management system. It allows you to run complex queries on relational datasets using either local, file-based DuckDB instances or the cloud service MotherDuck. The ability to create a local DuckDB instance is useful for testing complex Airflow pipelines without the need to connect to a remote database.
Airflow can interact with DuckDB in two key ways:
- Use the DuckDB Python package directly in `@task` decorated tasks. This method is useful if you want to do ad-hoc analysis in-memory or combine information stored in various DuckDB files.
- Connect to DuckDB via the DuckDB Airflow provider. The DuckDB Airflow provider is ideal if you access the same DuckDB database from many tasks in your Airflow environment and want to standardize this connection in a central place. You can also use the DuckDBHook to create custom operators to modularize your DuckDB interactions from within Airflow.
Other ways to learn
There are multiple resources for learning about this topic. See also:
- Webinar: How to use DuckDB with Airflow.
- Example repository: Astronomer’s DuckDB example repository.
Time to complete
This tutorial takes approximately 15 minutes to complete.
Assumed knowledge
To get the most out of this tutorial, make sure you have an understanding of:
- The basics of DuckDB. See the DuckDB documentation.
- Airflow fundamentals, such as writing DAGs and defining tasks. See Get started with Apache Airflow.
- Airflow decorators. See Introduction to Airflow decorators.
- Airflow connections. See Manage connections in Apache Airflow.
Prerequisites
- The Astro CLI.
Step 1: Configure your Astro project
To use DuckDB with Airflow, install the DuckDB Airflow provider in your Astro project. This will also install the newest version of the DuckDB Python package.
- Create a new Astro project:
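With the Astro CLI installed, run the following in an empty directory:

```sh
astro dev init
```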
- Add the DuckDB Airflow provider to your Astro project `requirements.txt` file.
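For example, add the provider package (shown unpinned here; you can pin a specific version):

```text
airflow-provider-duckdb
```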
- If you are connecting to MotherDuck, the DuckDB cloud service, you need to use the amd64 version of Astro Runtime to prevent package conflicts. In this case, replace the `FROM` statement in your Dockerfile with the line shown below. If you are only using DuckDB locally, you do not need to modify your Dockerfile.
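A sketch of the replacement `FROM` statement; the Runtime version tag is a placeholder, so keep the version your project already uses:

```dockerfile
FROM --platform=linux/amd64 quay.io/astronomer/astro-runtime:<version>
```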
Step 2: Create a DAG using the DuckDB Python package
You can use the `duckdb` Python package directly in your `@task` decorated tasks. This method does not require you to configure an Airflow connection.
- Start Airflow by running `astro dev start`.
- Create a new file in your `dags` folder called `duckdb_tutorial_dag_1.py`.
- Copy and paste the following DAG code into the file:
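The snippet below is a minimal sketch of such a DAG; the database path `include/my_garden_ducks.db` and the table name `ducks` are placeholders, so substitute a DuckDB file and table that exist in your project:

```python
from airflow.decorators import dag, task
from pendulum import datetime

import duckdb


@dag(start_date=datetime(2023, 6, 1), schedule=None, catchup=False)
def duckdb_tutorial_dag_1():
    @task
    def query_duckdb():
        # Connect to a local, file-based DuckDB instance.
        # Placeholder path: call duckdb.connect() with no arguments
        # for an ephemeral in-memory instance instead.
        conn = duckdb.connect("include/my_garden_ducks.db")

        # Query all information from an existing table (placeholder name).
        ducks = conn.sql("SELECT * FROM ducks").fetchall()

        conn.close()
        return ducks

    query_duckdb()


duckdb_tutorial_dag_1()
```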
This simple DAG queries all information from a table in a DuckDB instance. Make sure the table you are querying exists in the DuckDB instance your task connects to.
- Open Airflow at `http://localhost:8080/`. Run the DAG manually by clicking the play button.
Info
You can use the DuckDBHook to create custom operators to modularize your interactions with DuckDB. You can find an example of a custom DuckDB operator for ingesting Excel files in Astronomer's DuckDB example repository.
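As a rough sketch of how the hook can be used inside a task (this assumes the provider exposes the hook at `duckdb_provider.hooks.duckdb_hook` and that an Airflow connection with the ID `my_duckdb_conn` exists; check the provider documentation for the exact import path):

```python
from airflow.decorators import task

# Assumed import path for the provider's hook.
from duckdb_provider.hooks.duckdb_hook import DuckDBHook


@task
def query_via_hook():
    # my_duckdb_conn is a placeholder Airflow connection ID.
    hook = DuckDBHook.get_hook("my_duckdb_conn")
    conn = hook.get_conn()  # returns a duckdb connection object
    # ducks is a placeholder table name.
    return conn.sql("SELECT * FROM ducks").fetchall()
```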
Conclusion
Congratulations! You successfully used DuckDB with Airflow. Quack!