Clean up and delete task metadata from Airflow DB

You can run a cleanup job to automatically delete task and DAG metadata from your Deployments. The job runs an Astronomer custom cleanup script for all of your Deployments and exports the results as CSV files to your configured external storage service. You can trigger the job on a schedule or on demand with the Houston API GraphQL query, cleanupAirflowDb.

You can enable this feature by setting astronomer.houston.cleanupAirflowDb.enabled to true in your values.yaml file.

There are two ways to use this feature:

  • Scheduled Cleanup: You can configure a Kubernetes CronJob to run the cleanup job at regular intervals by defining the schedule and job parameters in the astronomer.houston.cleanupAirflowDb section of your values.yaml file.
  • Manual Cleanup: The Houston API GraphQL query, cleanupAirflowDb, manually triggers the Airflow metadata cleanup job for immediate execution.
The cleanup job deletes any data that’s older than the number of days specified in your olderThan configuration. Ensure that none of your historical data is required to run current DAGs or tasks before enabling this feature.

Prerequisites

  • System admin user privileges
  • External storage credentials that allow read/write permissions to your storage
  • (AWS Cloud Provider) The AWS CLI

Step 1: Configure your external storage credentials

  1. Provision a GCP Service Account with read/write permissions to your bucket, and export its credentials as a JSON key file.

  2. Create a Kubernetes secret in your Astronomer platform namespace with a name such as astronomer-gcs-keyfile. Then, run the following commands to update your environment:

kubectl annotate secret astronomer-gcs-keyfile "astronomer.io/commander-sync"="platform=astronomer"

kubectl create job --from=cronjob/astronomer-config-syncer runconfigsyncer-job-001

You use this Kubernetes secret to configure providerEnvSecretName when you configure the cleanup job and env.name when you set the storage provider secret.
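
If you have not yet created the secret referenced above, the following is a minimal sketch of one way to create it from your exported key file. The local file name key.json is an example; the secret key astronomer-gcs-keyfile matches the subPath used in the volume mount examples later in this guide.

# A minimal sketch: create the platform-namespace secret from your exported
# service account key. The local file name "key.json" is an example.
kubectl create secret generic astronomer-gcs-keyfile \
  --from-file=astronomer-gcs-keyfile=./key.json \
  -n <your-platform-namespace>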

Step 2: (Optional) Configure a connection ID

If you want to run jobs for specific Deployments or within a Workspace, or run manually triggered jobs using an API query, you can configure an Airflow connection to your external storage service so that the connection is stored as an environment variable. Use the service account credentials to authenticate to your storage service when you configure the connection.

  1. Provision a GCP Service Account with read/write permissions to your bucket, and export its credentials as a JSON key file.

  2. Create an Airflow connection using these credentials. See Airflow documentation to learn how to configure your connection.

This strategy is not secure because the secret is stored base64-encoded in your config.yaml, and base64 encoding can be decoded.

You can use this connection as the connectionId when you make API queries to trigger the cleanup job, but it is not required.
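
For example, you could define the connection with the Airflow CLI. This is only a sketch: the connection ID gcs_cleanup and the key file path are hypothetical, and the exact extra field names depend on your Airflow and Google provider versions.

# Hypothetical example: register a Google Cloud connection that points at the
# exported service account key file. Run this where the Airflow CLI can reach
# your Airflow metadata database (for example, inside a scheduler container).
airflow connections add 'gcs_cleanup' \
  --conn-type 'google_cloud_platform' \
  --conn-extra '{"key_path": "/path/to/<your-keyfile>.json"}'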

Step 3: Configure the cleanup job

The cronjob configuration provides the default values that your cleanup job uses whether you run a scheduled or manual cleanup job.

The following example shows an automatic cleanup job configuration that runs at 5:23 AM every day and deletes metadata older than one year (365 days).

astronomer:
  houston:
    cleanupAirflowDb:
      # Enable cleanup CronJob
      enabled: true

      # Default run is at 5:23 every morning https://crontab.guru/#23_5_*_*_*
      schedule: "23 5 * * *"

      # Clean up data older than this many days
      olderThan: 365

      # Output path of the archived data CSV export
      outputPath: "/tmp"

      # Delete archived tables
      dropArchives: true

      # If set to true, prints the Deployments that would be cleaned up and skips the actual cleanup
      dryRun: false

      # Name of the file storage provider. Supported providers: gcp/azure/aws/local
      provider: <gcp/azure/aws/local>

      # Name of the provider bucket / local file path
      bucketName: "/tmp"

      # The name of the Kubernetes Secret containing your cloud provider connection secret
      providerEnvSecretName: "<your-secret-name>"

      # Run cleanup on a specific table or list of tables in a comma-separated format
      tables: "callback_request,celery_taskmeta,celery_tasksetmeta,dag,dag_run,dataset_event,import_error,job,log,session,sla_miss,task_fail,task_instance,task_reschedule,trigger,xcom"

Step 4: Set the storage provider secret

In the Houston config section of your values.yaml file, set the storage provider secret that you configured in Step 1, so that the cleanup job can export your cleanup results to your cloud storage.

You can configure the task metadata cleanup in different sections of the Helm chart, depending on your scope and use case. However, you can't set cleanupAirflowDb.enabled: true at multiple levels. You can only enable the job at one of the three scope levels.

The env.name value must match the secret name that you configured for providerEnvSecretName in your values.yaml file.

Deprecated Configuration: Using the Scheduler

Setting the provider secret in the Scheduler is the only method for configuring the secret for versions 0.37.0-0.37.3.

astronomer:
  houston:
    deployments:
      helm:
        airflow:
          scheduler:
            extraVolumes:
              - name: dbcleanup
                secret:
                  defaultMode: 420
                  optional: true
                  secretName: astronomer-gcs-keyfile
            extraVolumeMounts:
              - mountPath: /tmp/creds/astronomer-gcs-keyfile
                name: dbcleanup
                subPath: astronomer-gcs-keyfile
            env:
              - name: GCP_PASS
                value: /tmp/creds/astronomer-gcs-keyfile

Configure the provider secret in Houston

astronomer:
  houston:
    cleanupAirflowDb:
      enabled: true
      extraVolumes:
        - name: dbcleanup
          secret:
            defaultMode: 420
            optional: true
            secretName: astronomer-gcs-keyfile
      extraVolumeMounts:
        - mountPath: /tmp/creds/astronomer-gcs-keyfile
          name: dbcleanup
          readOnly: false
          subPath: astronomer-gcs-keyfile
      extraEnv:
        - name: GCP_PASS
          value: /tmp/creds/astronomer-gcs-keyfile

Configure the storage provider secret in a Deployment

astronomer:
  houston:
    deployments:
      cleanupAirflowDb:
        enabled: true
        extraVolumes:
          - name: dbcleanup
            secret:
              defaultMode: 420
              optional: true
              secretName: astronomer-gcs-keyfile
        extraVolumeMounts:
          - mountPath: /tmp/creds/astronomer-gcs-keyfile
            name: dbcleanup
            readOnly: false
            subPath: astronomer-gcs-keyfile
        extraEnv:
          - name: GCP_PASS
            value: /tmp/creds/astronomer-gcs-keyfile

Step 5: (Optional) Set container CPU and memory limits or requests

You can set CPU and memory limits and requests for the cleanup container by adding the following to your cleanupAirflowDb configuration. These values become the new defaults for your cleanup job if you do not pass any additional configuration in your GraphQL query. If you use the cleanup CronJob instead of the manual trigger, these resources also become the default used when scheduling cleanup jobs.

cleanupAirflowDb:
  resources:
    requests:
      cpu: 200m
      memory: 768Mi
    limits:
      cpu: 500m
      memory: 1536Mi

You can override these resource definitions, or configure resources if you do not define any, by using resourceSpec in an API query. See Scenario 4: Configure custom Pod Resources.

Step 6: Apply your configuration

Apply your platform configuration changes to enable cleanup jobs and to set your cronjob schedule.

helm upgrade <your-platform-release-name> astronomer/astronomer -f <your-updated-config-yaml-file> -n <your-platform-namespace> --set astronomer.houston.upgradeDeployments.enabled=false

If you want to upgrade all Deployments while updating your configuration, you can set astronomer.houston.upgradeDeployments.enabled to true.
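
After the upgrade completes, you can confirm that the cleanup CronJob exists in your platform namespace. The exact CronJob name depends on your platform release name, so the following is only a sketch.

# List CronJobs in the platform namespace and look for the Airflow DB cleanup job.
kubectl get cronjobs -n <your-platform-namespace>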

Step 7: (Optional) Manually trigger the cleanup job

The following configuration enables you to trigger a cleanup job manually using a Houston API query. When you trigger the cleanup job this way, the values you include in the query are used instead of the defaults set in your values.yaml configuration, so you must specify the Deployments or Workspace that you want to clean up in your query.

astronomer:
  houston:
    config:
      deployments:
        cleanupAirflowDb:
          enabled: true

Restrict cleanup to manual-only triggers

In Step 3, you set an automatic schedule for your platform to clean up task metadata by setting astronomer.houston.cleanupAirflowDb.enabled to true. To allow only manually triggered cleanup jobs, set astronomer.houston.cleanupAirflowDb.enabled to false instead. Manually triggered cleanup jobs require a Houston API query that specifies the Deployments where you want to archive metadata.

The following examples show different queries that you can use depending on your needs. See Houston API examples for all examples and scenarios that you can use to work with the Houston API.

Houston API Parameters

  • olderThan (Int): Cleans up data in Deployments that is older than the number of days defined in this parameter.
  • dryRun (Bool): When set to true, the job does not make changes; it only logs which data would be cleaned up. Defaults to false.
  • outputPath (String): The path in your storage bucket or local storage where the job saves the archived CSV data.
  • dropArchives (Bool): If true, deletes any previously archived tables after export. Use with caution. Defaults to false.
  • provider (String): The cloud provider you use for archiving. Supported values: aws, azure, gcp, local. Defaults to local.
  • bucketName (String): Name of the cloud storage bucket or local directory where the job saves the archived CSV export.
  • providerEnvSecretName (String): Name of the Kubernetes Secret that contains the credentials or configuration for the storage provider, if you used a Kubernetes secret.
  • deploymentIds (String): List of the specific Deployment IDs to target for cleanup.
  • workspaceId (String): Restricts cleanup to Deployments within the configured Workspace.
  • tables (String): Comma-separated list of tables to target for cleanup. If you do not configure this parameter, all supported tables are cleaned.
  • resourceSpec (JSON, optional): A JSON object that defines a Pod resource configuration.
  • connectionId (String, optional): Airflow connection ID used for accessing the underlying data warehouse. Can be left empty if no connection is defined.
Set dryRun: true to test this feature without deleting any data. When dry runs are enabled, the cleanup job will only print the data that it plans to modify in the serial output of the webserver Pod. To view the dryRun events of the cleanup job, check the logs of your webserver Pod for each Deployment.
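
You can submit the queries in the following scenarios with any GraphQL client. The curl sketch below assumes that the Houston API is reachable at https://houston.<your-basedomain>/v1 and that a System Admin API token is passed in the Authorization header; verify both details for your installation.

# A minimal sketch of manually triggering a dry-run cleanup through the Houston API.
# The endpoint URL, token header format, and variable values are assumptions to adapt.
HOUSTON_URL="https://houston.<your-basedomain>/v1"
API_TOKEN="<your-system-admin-api-token>"

curl -s "$HOUSTON_URL" \
  -H "Content-Type: application/json" \
  -H "Authorization: $API_TOKEN" \
  --data @- <<'EOF'
{
  "query": "query cleanupAirflowDb($olderThan: Int!, $dryRun: Boolean!, $outputPath: String!, $dropArchives: Boolean!, $provider: String!, $bucketName: String!, $providerEnvSecretName: String!, $workspaceId: Uuid, $tables: String!) { cleanupAirflowDb(olderThan: $olderThan, dryRun: $dryRun, outputPath: $outputPath, dropArchives: $dropArchives, provider: $provider, bucketName: $bucketName, providerEnvSecretName: $providerEnvSecretName, workspaceId: $workspaceId, tables: $tables) }",
  "variables": {
    "olderThan": 1,
    "dryRun": true,
    "outputPath": "",
    "dropArchives": false,
    "provider": "gcp",
    "bucketName": "",
    "providerEnvSecretName": "GCP_PASS",
    "workspaceId": "<your-workspace-id>",
    "tables": "dag_run,task_instance,log"
  }
}
EOF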

Scenario 1: Cleanup Deployments per Workspace

You can use the following query to clean up Deployments in a specific Workspace. Configure the workspaceId parameter with the Workspace whose Deployments you want to clean up. You can also find Workspace IDs with the sysWorkspaces Houston API query.

query cleanupAirflowDb(
  $olderThan: Int!
  $dryRun: Boolean!
  $outputPath: String!
  $dropArchives: Boolean!
  $provider: String!
  $bucketName: String!
  $providerEnvSecretName: String!
  $deploymentIds: [Id]
  $workspaceId: Uuid
  $tables: String!
  $connectionId: String
) {
  cleanupAirflowDb(
    olderThan: $olderThan
    dryRun: $dryRun
    outputPath: $outputPath
    dropArchives: $dropArchives
    provider: $provider
    bucketName: $bucketName
    providerEnvSecretName: $providerEnvSecretName
    workspaceId: $workspaceId
    tables: $tables
    connectionId: $connectionId
  )
}

The following example shows configured query variables that clean up data older than 1 day in all Deployments within the Workspace cma40n66l000008l89nye86o1, which uses GCP as its cloud provider.

{
  "olderThan": 1,
  "dryRun": true,
  "outputPath": "",
  "dropArchives": true,
  "provider": "gcp",
  "bucketName": "",
  "connectionId": "",
  "tables": "callback_request,celery_taskmeta,celery_tasksetmeta,dag,dag_run,dataset_event,import_error,job,log,session,sla_miss,task_fail,task_instance,task_reschedule,trigger,xcom",
  "providerEnvSecretName": "GCP_PASS",
  "workspaceId": "cma40n66l000008l89nye86o1"
}

Scenario 2: Cleanup Deployments across your system

You can use the following query to clean up specific Deployments within a Workspace. Configure the workspaceId parameter with the Workspace and the deploymentIds parameter with the specific Deployments. You can also find Workspace IDs with the sysWorkspaces Houston API query.

query cleanupAirflowDb(
  $olderThan: Int!
  $dryRun: Boolean!
  $outputPath: String!
  $dropArchives: Boolean!
  $provider: String!
  $bucketName: String!
  $providerEnvSecretName: String!
  $deploymentIds: [Id]
  $workspaceId: Uuid
  $tables: String!
  $connectionId: String
) {
  cleanupAirflowDb(
    olderThan: $olderThan
    dryRun: $dryRun
    outputPath: $outputPath
    dropArchives: $dropArchives
    provider: $provider
    bucketName: $bucketName
    providerEnvSecretName: $providerEnvSecretName
    deploymentIds: $deploymentIds
    workspaceId: $workspaceId
    tables: $tables
    connectionId: $connectionId
  )
}

The following example shows configured query variables that clean up data older than 1 day in the specified Deployments within the Workspace cma42z570000008l8f6rpc72f.

{
  "olderThan": 1,
  "dryRun": true,
  "outputPath": "",
  "dropArchives": true,
  "provider": "gcp",
  "bucketName": "",
  "connectionId": "",
  "tables": "callback_request,celery_taskmeta,celery_tasksetmeta,dag,dag_run,dataset_event,import_error,job,log,session,sla_miss,task_fail,task_instance,task_reschedule,trigger,xcom",
  "providerEnvSecretName": "GCP_PASS",
  "deploymentIds": ["cma42zc67000108l89eb37iy5", "cma42zjdp000208l8g16ygm6m"],
  "workspaceId": "cma42z570000008l8f6rpc72f"
}

Scenario 3: Clean up Deployment using Airflow connection ID

This scenario requires an Airflow connection ID, connectionId, configured from the Airflow UI or CLI.

You can clean up all Deployments in a Workspace or specific Deployments, and export the cleanup logs to the storage provider configured in an Airflow connection.

query cleanupAirflowDb(
  $olderThan: Int!
  $dryRun: Boolean!
  $outputPath: String!
  $dropArchives: Boolean!
  $provider: String!
  $bucketName: String!
  $deploymentIds: [Id]
  $workspaceId: Uuid
  $tables: String!
  $connectionId: String
) {
  cleanupAirflowDb(
    olderThan: $olderThan
    dryRun: $dryRun
    outputPath: $outputPath
    dropArchives: $dropArchives
    provider: $provider
    bucketName: $bucketName
    deploymentIds: $deploymentIds
    workspaceId: $workspaceId
    tables: $tables
    connectionId: $connectionId
  )
}

The following code example cleans up data older than 1 day in the specified Deployments and exports the logs to the storage provider defined in the connectionId.

{
  "olderThan": 1,
  "dryRun": true,
  "outputPath": "",
  "dropArchives": true,
  "provider": "gcp",
  "bucketName": "",
  "connectionId": "<airflow_connection_id>",
  "tables": "callback_request,celery_taskmeta,celery_tasksetmeta,dag,dag_run,dataset_event,import_error,job,log,session,sla_miss,task_fail,task_instance,task_reschedule,trigger,xcom",
  "deploymentIds": ["cm6q3jpn61741517mhonzgcgz7", "cm6q3jpn61741517mhonzgcgz7"],
  "workspaceId": "cm5nj9wly007617iox80beute"
}

Scenario 4: Configure custom Pod resources

If you do not configure a specific default Pod CPU or memory resource amount, or if you want to override a configuration, you can make a GraphQL query to set a user-defined resource configuration.

query cleanupAirflowDb(
  $olderThan: Int!
  $dryRun: Boolean!
  $outputPath: String!
  $dropArchives: Boolean!
  $provider: String!
  $bucketName: String!
  $providerEnvSecretName: String!
  $tables: String!
  $resourceSpec: JSON
) {
  cleanupAirflowDb(
    olderThan: $olderThan
    dryRun: $dryRun
    outputPath: $outputPath
    dropArchives: $dropArchives
    provider: $provider
    bucketName: $bucketName
    providerEnvSecretName: $providerEnvSecretName
    tables: $tables
    resourceSpec: $resourceSpec
  )
}

The following query variables show an example of configuring the resource requests and limits for the cleanup run.

{
  "resourceSpec": {
    "requests": {
      "cpu": "100m",
      "memory": "5000Mi"
    },
    "limits": {
      "cpu": "100m",
      "memory": "5000Mi"
    }
  },
  "olderThan": 1,
  "dryRun": false,
  "outputPath": "/abc",
  "dropArchives": false,
  "provider": "aws",
  "bucketName": "test",
  "providerEnvSecretName": "test-secret",
  "tables": "dag"
}
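
After a non-dry-run cleanup completes, you can confirm that the archived CSV export reached your configured bucket. The following sketch uses the gsutil CLI for the GCP provider; the object layout inside the bucket may vary.

# List objects recursively in the export bucket to confirm the archived CSV files arrived.
gsutil ls -r gs://<your-bucket-name>/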

Access your cleanup logs

You can access your cleanup logs through the UI or with your Pod logs.

Pod Logs

You can access the Pod logs for the <release-name>-meta-cleanup-job Pod in the Airflow namespace through Vector sidecar logging or FluentD.
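
If you prefer to read the logs directly, a kubectl sketch like the following can retrieve them. The job name follows the pattern described above, and the namespace is your Deployment's Airflow namespace.

# Read logs from the most recent metadata cleanup job in a Deployment namespace.
kubectl get pods -n <your-airflow-namespace> | grep meta-cleanup-job
kubectl logs job/<release-name>-meta-cleanup-job -n <your-airflow-namespace>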

UI access

Go to the Logs tab on your Deployment page and select the AirflowMetaCleanup tab to access the logs. This option is only supported for FluentD at this time.