Clean up the Airflow metadata database using Dags
In addition to storing configurations about your Airflow environment, the Airflow metadata database stores data about past and present task runs. Airflow never automatically removes metadata, so the longer you use it, the more task run data is stored in your metadata DB. Over a long enough time, this can result in a bloated metadata DB, which can affect performance across your Airflow environment.
When a table in the metadata DB is larger than 50GiB, you might start to experience degraded scheduler performance. This can result in:
- Slow task scheduling
- Slow Dag parsing
- Gunicorn timing out when using the Celery executor
- Slower Airflow UI load times
The following tables in the database are at risk of becoming too large over time:
- `dag_run`
- `job`
- `log`
- `rendered_task_instance_fields`
- `task_instance`
- `xcom`
To keep your Airflow environment running at optimal performance, you can clean the metadata DB using the Airflow CLI `airflow db clean` command. This command was created as a way to safely clean up your metadata DB without querying it directly.
In Airflow 3, this command cannot be called from a Dag because tasks can no longer directly access the metadata DB. Instead, you can expose the utility functions used by the command via an HTTP API using an Airflow plugin. This tutorial describes how to implement the cleanup Dag and corresponding plugin in Airflow so that you can clean your database directly from the Airflow UI.
Even when using Airflow’s DB clean utilities, deleting data from the metadata database can destroy important data. Read the Warnings section carefully before implementing this tutorial Dag in any production Airflow environment.
Warnings
Deleting data from the metadata database can be an extremely destructive action. If you delete data that future task runs depend on, it’s difficult to restore the database to its previous state without interrupting your data pipelines. Before implementing the Dag in this tutorial, consider the following:
- When specifying the `clean_before_timestamp` value, use as old a date as possible. The older the deleted data, the less likely it is to affect your currently running Dags.
- The Dag in this tutorial drops the archived tables it creates in the cleanup process by default, using the `skip_archive=True` argument, and does not maintain any history. If the task fails (for example, if it runs for longer than five minutes), the archive tables are not cleared. By calling `drop_archived_tables` in the second task of the Dag, we ensure all archive tables are dropped even if the first task fails.
Prerequisites
- An Airflow project. This Dag has been designed and optimized for Airflow environments running on Astro. Consider adjusting the parameters and code if you're running the Dag in any other type of Airflow environment.
- The HTTP Airflow provider installed.
Step 1: Create your Dag and plugin
1. In your `dags` folder, create a file called `db_cleanup.py`.
2. Copy the following code into the Dag file.
Rather than running on a schedule, this Dag is triggered manually by default and includes params so that you’re in full control over how you clean the metadata DB.
It includes three tasks:
- `get_chunked_timestamps`: Creates a list of timestamps to process in batches.
- `db_cleanup`: Calls the `run_cleanup` utility.
- `clean_archive_tables`: Calls the `drop_archived_tables` utility.
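To make the batching idea concrete, the chunking step can be sketched in plain Python. This is a minimal illustration, not the tutorial's actual task code: the function name mirrors the task, and the argument names and return shape are assumptions.

```python
from datetime import datetime, timedelta

def get_chunked_timestamps(
    oldest_ts: datetime,
    clean_before_ts: datetime,
    batch_size_days: int,
) -> list[dict]:
    """Split the window between the oldest record and the cutoff into
    batch-sized chunks, so each cleanup call deletes a bounded slice of rows.
    Illustrative sketch only; argument names are assumptions."""
    chunks = []
    cursor = oldest_ts
    step = timedelta(days=batch_size_days)
    while cursor < clean_before_ts:
        chunk_end = min(cursor + step, clean_before_ts)
        chunks.append({"clean_before_timestamp": chunk_end.isoformat()})
        cursor = chunk_end
    return chunks
```

Each chunk carries its own cutoff, so a 60-day window with `batch_size_days=30` yields two cleanup calls instead of one large delete.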
These three tasks run with params you specify at runtime. The params let you specify:
- `clean_before_timestamp`: What age of data to delete. Any data created before the specified time is deleted. The default is to delete all data older than 90 days.
- `tables`: Which tables to delete data from. By default, all tables supported by the DB cleanup utilities are included except for the `dag` and `dag_version` tables.
- `dry_run`: Whether to run the cleanup as a dry run, meaning that no data is deleted. The Dag instead returns the SQL that would be executed based on the other parameters you specified. The default is to run the deletion without a dry run.
- `batch_size_days`: What batch size to use when cleaning up data in batches.
- `http_conn_id`: Which HTTP connection to use for calling the API that exposes the DB cleanup utilities.
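As a rough sketch of how such a param set might be assembled, the following uses only the tables and the 90-day default named in this tutorial; the helper name, candidate table list, and batch size are hypothetical, and the full set of supported tables depends on your Airflow version.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 90          # tutorial default: delete data older than 90 days
EXCLUDED_TABLES = {"dag", "dag_version"}
# Hypothetical candidate list; only tables mentioned in this tutorial.
CANDIDATE_TABLES = [
    "dag", "dag_run", "dag_version", "job", "log",
    "rendered_task_instance_fields", "task_instance", "xcom",
]

def build_cleanup_params(now: datetime) -> dict:
    """Assemble a param set like the one the Dag's trigger form produces.
    Illustrative only; not the tutorial's actual code."""
    return {
        "clean_before_timestamp": (
            now - timedelta(days=DEFAULT_RETENTION_DAYS)
        ).isoformat(),
        "tables": [t for t in CANDIDATE_TABLES if t not in EXCLUDED_TABLES],
        "dry_run": False,
        "batch_size_days": 30,       # hypothetical default
        "http_conn_id": "http_default",
    }
```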
3. In your `plugins` folder, create a file called `db_cleanup.py`.
4. Copy the following code into the plugin file.
The plugin exposes the following API endpoints:
- `GET /db_cleanup/api/info`: Provides a list of tables with their corresponding sizes and row count estimates. This endpoint is not used by the Dag, but can be useful for getting insights into table sizes.
- `GET /db_cleanup/api/oldest_timestamp`: Returns the oldest timestamp for the tables to clean up, which is used for calculating batches.
- `DELETE /db_cleanup/api/records`: Calls the `run_cleanup` utility.
- `DELETE /db_cleanup/api/archived`: Calls the `drop_archived_tables` utility.
Because the DB cleanup utilities run on the API server, the corresponding logs show up in the api-server logs.
Step 2: Configure an HTTP connection
Add an HTTP connection to use for calling the API endpoints.
- `host`: Set this to the deployment's URL. For example, on Astro this looks something like `https://cmls9yey09fpw01ncvse41m4n.4n.astronomer.run/dse41m4n`. When running locally with `astro dev`, set this to `http://api-server:8080`.
- `extra`: If needed, set the authorization header. On Astro with an API token, this looks something like `{"Authorization": "Bearer mytoken1234...abc1234"}`.
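As a rough illustration of what these connection fields amount to, the following stdlib-only sketch builds (without sending) the authenticated DELETE request to the plugin's `/db_cleanup/api/records` endpoint. The host, token, helper name, and request body fields are placeholders; in the Dag, the HTTP connection handles this for you.

```python
import json
import urllib.request

# Hypothetical values; substitute your deployment's URL and API token.
HOST = "https://your-deployment.astronomer.run/your-suffix"
TOKEN = "mytoken"

def build_records_delete_request(
    clean_before_timestamp: str,
    tables: list[str],
    dry_run: bool,
) -> urllib.request.Request:
    """Build (but do not send) a DELETE request against the plugin's
    records endpoint. Illustrative sketch; body fields are assumptions."""
    body = json.dumps({
        "clean_before_timestamp": clean_before_timestamp,
        "tables": tables,
        "dry_run": dry_run,
    }).encode()
    return urllib.request.Request(
        url=f"{HOST}/db_cleanup/api/records",
        data=body,
        method="DELETE",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
```

The `Authorization` header here corresponds to the connection's `extra` field, and the URL prefix corresponds to its `host`.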
Step 3: Practice running the Dag
In this step, run the Dag in a local Airflow environment to practice the workflow for cleaning metadata DB records. If you completed Step 1 in your production environment, repeat it here before starting your local Airflow project. A fresh local Airflow environment typically has little to clean up; a production environment that has been running for a while has more historic records to clean up.
1. Run `astro dev start` in your Astro project to start Airflow, then open the Airflow UI at `localhost:8080`.
2. Ensure the Airflow connection `http_default` with host `http://api-server:8080` is set. Instead of creating an Airflow connection, you can also define it as an environment variable `AIRFLOW_CONN_HTTP_DEFAULT=http://api-server:8080` in your local `.env` file.
3. In the Airflow UI, run the `astronomer_db_cleanup` Dag by clicking the play button, and configure the following params:
   - `dry_run` is enabled
   - Choose an appropriate cutoff date for `clean_before_timestamp`
4. Click Trigger.
5. In a local terminal, run `astro dev logs --api-server -f` to show the api-server logs.
6. Check that the `run_cleanup` utility completed successfully. Note that if you created a new Astro project for this tutorial, the run will not show much data to be deleted.
You can now use this Dag to periodically clean data from the Airflow metadata DB as needed.