Configure cleanup jobs

Configure automated cleanup jobs to maintain database health by removing old data. Astro Private Cloud (APC) includes several cleanup jobs that run as CronJobs on configurable schedules to manage storage growth and query performance.

Cleanup jobs summary

| Job | Default schedule | Default retention | Purpose |
| --- | --- | --- | --- |
| cleanupDeployments | Daily @ 00:00 | 14 days | Removes soft-deleted deployments |
| cleanupDeployRevisions | Daily @ 23:11 | 90 days | Archives deploy history |
| cleanupTaskUsageData | Daily @ 23:40 | 90 days | Purges task metrics |
| cleanupClusterAudits | Daily @ 23:49 | 90 days | Removes cluster audit logs |
| cleanupAirflowDb | Daily @ 05:23 | 365 days | Cleans Airflow metadata (disabled by default) |

cleanupDeployments

Permanently removes deployments that have been soft-deleted after the retention period.

What gets cleaned

  • Deployment database records marked with deletedAt
  • Associated Docker registry images
  • Deployment metadata database

Configuration

houston:
  cleanupDeployments:
    enabled: true
    schedule: "0 0 * * *"  # Midnight daily
    olderThan: 14          # Days since deletion
    dryRun: false          # Set true to preview

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deployments --older-than=14 --dry-run=false

cleanupDeployRevisions

Removes old deployment revision records to reduce database size.

What gets cleaned

  • deployRevision records older than retention period
  • Historical deployment configuration snapshots

Configuration

houston:
  cleanupDeployRevisions:
    enabled: true
    schedule: "11 23 * * *"  # 23:11 daily
    olderThan: 90            # Days to retain

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deploy-revisions --older-than=90

Per-deployment cleanup

Run this command from a machine with access to the underlying Kubernetes cluster to clean revisions for a specific deployment:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-deploy-revisions --older-than=90 --deploymentUuid=<uuid>

cleanupTaskUsageData

Purges task usage metrics and audit logs.

What gets cleaned

  • TaskUsage records (daily aggregated metrics)
  • TaskUsageAuditLog records (raw task data)

Configuration

houston:
  cleanupTaskUsageData:
    enabled: true
    schedule: "40 23 * * *"  # 23:40 daily
    olderThan: 90            # Minimum 90 days
    dryRun: false

Minimum retention is 90 days and cannot be reduced.

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-task-usage-data --older-than=90 --dry-run=false

GraphQL trigger

query {
  cleanupTaskUsageDataJob(olderThan: 90)
}
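If you prefer to trigger the job from a script rather than a GraphQL client, you can build the same request body programmatically. A minimal sketch is below; the Houston GraphQL endpoint URL and the required authentication header vary by installation and are not shown here.

```python
import json

# Build the JSON payload for the cleanupTaskUsageDataJob GraphQL query.
def build_cleanup_payload(older_than: int) -> str:
    query = "query { cleanupTaskUsageDataJob(olderThan: %d) }" % older_than
    return json.dumps({"query": query})

payload = build_cleanup_payload(90)
```

POST this body, with an appropriate Authorization header, to your Houston GraphQL endpoint.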

cleanupClusterAudits

Removes cluster audit log entries.

What gets cleaned

  • ClusterAudit records tracking cluster configuration changes
  • Historical cluster state snapshots

Configuration

houston:
  cleanupClusterAudits:
    enabled: true
    schedule: "49 23 * * *"  # 23:49 daily
    olderThan: 90            # Days to retain

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-cluster-audit --older-than=90

Filter by cluster

Run this command from a machine with access to the underlying Kubernetes cluster to clean audits for specific clusters:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-cluster-audit --older-than=90 --cluster-ids=<id1>,<id2>

cleanupAirflowDb

Cleans Airflow metadata from individual Deployment databases.

This job is disabled by default due to potential impact on running Deployments.

What gets cleaned

Default tables:

  • callback_request - Task callback requests
  • celery_taskmeta, celery_tasksetmeta - Celery metadata
  • dag - Dag definitions
  • dag_run - Dag execution history
  • dataset_event - Dataset events
  • import_error - Import errors
  • job - Job records
  • log - Task execution logs
  • session - Session data
  • sla_miss - SLA violations
  • task_fail - Task failures
  • task_instance - Task execution records
  • task_reschedule - Reschedule events
  • trigger - Trigger records
  • xcom - Cross-communication data

Configuration

houston:
  cleanupAirflowDb:
    enabled: false          # Must explicitly enable
    schedule: "23 5 * * *"  # 05:23 daily
    olderThan: 365          # Days to retain
    outputPath: "/tmp"      # Archive location
    dropArchives: true      # Delete after archiving
    dryRun: false
    provider: local         # Storage: local/aws/azure/gcp
    bucketName: "/tmp"      # Cloud bucket or local path
    tables: ""              # Specific tables (empty = all)

Cloud storage export

Export archived data to cloud storage:

houston:
  cleanupAirflowDb:
    enabled: true
    provider: aws           # aws, azure, or gcp
    bucketName: "my-archive-bucket"
    providerEnvSecretName: "aws-credentials-secret"

Specific tables only

Clean only specific tables:

houston:
  cleanupAirflowDb:
    enabled: true
    tables: "log,task_instance,xcom"

Manual trigger

Run this command from a machine with access to the underlying Kubernetes cluster:

$ kubectl -n <namespace> exec -it deploy/<release-name>-houston -- yarn cleanup-airflow-db-data \
>   --older-than=365 \
>   --provider=local \
>   --bucket-name=/tmp \
>   --tables="log,task_instance"

Schedule reference

Default schedules are staggered to avoid simultaneous execution:

| Time | Job |
| --- | --- |
| 00:00 | cleanupDeployments |
| 05:23 | cleanupAirflowDb |
| 23:11 | cleanupDeployRevisions |
| 23:40 | cleanupTaskUsageData |
| 23:49 | cleanupClusterAudits |
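The stagger can be sanity-checked with a small script that converts each job's 5-field cron expression ("minute hour day-of-month month day-of-week") to a clock time. This sketch only handles the fixed-time schedules above; it does not parse `*`, ranges, or step values in the minute and hour fields.

```python
# Convert a fixed-time 5-field cron expression to HH:MM.
def cron_to_time(expr: str) -> str:
    minute, hour = expr.split()[:2]
    return f"{int(hour):02d}:{int(minute):02d}"

schedules = {
    "cleanupDeployments": "0 0 * * *",
    "cleanupAirflowDb": "23 5 * * *",
    "cleanupDeployRevisions": "11 23 * * *",
    "cleanupTaskUsageData": "40 23 * * *",
    "cleanupClusterAudits": "49 23 * * *",
}
start_times = {job: cron_to_time(expr) for job, expr in schedules.items()}

# No two jobs share a start time, so they never run simultaneously.
assert len(set(start_times.values())) == len(start_times)
```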

Common configuration options

All cleanup jobs share these options:

houston:
  cleanup<JobName>:
    enabled: true/false          # Enable/disable the job
    schedule: "cron-expression"  # When to run
    olderThan: <days>            # Retention period
    dryRun: false                # Preview without deleting
    readinessProbe: {}           # Optional health probes
    livenessProbe: {}

Kubernetes CronJob behavior

All cleanup CronJobs use:

  • Concurrency policy: Forbid (prevents overlapping runs)
  • Backoff limit: 1 retry on failure
  • Restart policy: Never
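In manifest terms, these settings correspond to the following fields on each generated CronJob (a sketch only; the metadata name and schedule shown are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: <release-name>-houston-cleanup-deployments  # illustrative name
spec:
  schedule: "0 0 * * *"
  concurrencyPolicy: Forbid    # prevents overlapping runs
  jobTemplate:
    spec:
      backoffLimit: 1          # one retry on failure
      template:
        spec:
          restartPolicy: Never
```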

Monitor cleanup jobs

Check job status

# List all cleanup CronJobs
$ kubectl get cronjobs -n astronomer | grep cleanup

# View recent job runs
$ kubectl get jobs -n astronomer | grep cleanup

# Check job logs
$ kubectl logs job/<job-name> -n astronomer

# Trigger an individual job manually
$ kubectl create job --from=cronjob/<cronjob-name> <job-name> -n astronomer

Verify data cleanup

-- Check remaining records by date
SELECT DATE(created_at), COUNT(*)
FROM deploy_revision
GROUP BY DATE(created_at)
ORDER BY DATE(created_at) DESC;

Troubleshooting

Job not running

  1. Check that the CronJob exists:

    $ kubectl get cronjob houston-cleanup-deployments -n astronomer
  2. Check that the job is enabled in your Helm values

  3. Verify that the schedule is a valid cron expression
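A quick way to catch malformed schedule values is a minimal shape check like the sketch below. It validates field count and plain numeric ranges only; it does not handle steps (`*/5`), ranges (`1-5`), or lists (`1,2`), so a `False` result for those forms is a limitation of the sketch, not of cron.

```python
# Allowed numeric ranges for minute, hour, day-of-month, month, day-of-week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

def is_valid_cron(expr: str) -> bool:
    """Check that expr is a 5-field cron expression of '*' or in-range numbers."""
    fields = expr.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        f == "*" or (f.isdigit() and lo <= int(f) <= hi)
        for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
```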

Job failing

  1. Check job logs:

    $ kubectl logs job/houston-cleanup-deployments-<timestamp> -n astronomer
  2. Database connectivity: Ensure Houston can reach the database

  3. Permissions: Verify service account has required database permissions

Data not being cleaned

  1. Check retention period: Data younger than olderThan won’t be deleted
  2. Verify timestamps: Check createdAt/deletedAt values in database
  3. Run with dry-run: Preview what would be deleted

Best practices

  1. Monitor database size before and after cleanup jobs
  2. Start with dry-run when adjusting retention periods
  3. Stagger schedules if adding custom cleanup jobs
  4. Archive before delete for cleanupAirflowDb in production
  5. Set alerts for failed cleanup jobs
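For the alerting point, one lightweight approach is to parse `kubectl get jobs -o json` output and flag cleanup jobs that report failed pods. The sketch below assumes the payload follows the Kubernetes Job API shape, where `status.failed` counts failed pods and is absent when none have failed; the job names in the sample are illustrative.

```python
import json

def failed_cleanup_jobs(jobs_json: str) -> list:
    """Return names of cleanup jobs with at least one failed pod."""
    items = json.loads(jobs_json)["items"]
    return [
        job["metadata"]["name"]
        for job in items
        if "cleanup" in job["metadata"]["name"]
        and job.get("status", {}).get("failed", 0) > 0
    ]

# Hand-written example payload in the Job API shape (names are illustrative):
sample = json.dumps({"items": [
    {"metadata": {"name": "houston-cleanup-deployments-28500480"},
     "status": {"failed": 1}},
    {"metadata": {"name": "houston-cleanup-task-usage-data-28500480"},
     "status": {"succeeded": 1}},
]})
```

Feed the function's output to whatever notification channel your monitoring stack uses.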