Astronomer Software disaster recovery
All applications are vulnerable to service-interrupting events - a network outage, a team member accidentally deleting a namespace, a critical bug introduced by your latest application push or even a natural disaster. All are rare and undesirable events that modern teams running enterprise-grade software need to protect against. At Astronomer, we encourage all customers to have a robust, targeted, and well-tested DR (Disaster Recovery) plan.
The doc below will provide guidelines for how to:
- Backup the Astronomer Platform
- Restore the Astronomer Platform in case of an incident
At Astronomer, we strongly recommend Velero for both backup and restore operations. Velero is an open source tool acquired by VMWare that is built for Kubernetes backups and migrations.
Why Velero
Unlike other tools that directly access the Kubernetes etcd database to perform backups and restores, Velero uses the Kubernetes API to capture the state of cluster resources and to restore them when necessary. This API-driven approach has a number of key benefits:
- Backups can capture subsets of the cluster’s resources, filtering by namespace, resource type, and/or label selector, providing a high degree of flexibility around what’s backed up and restored.
- Users of managed Kubernetes offerings often do not have access to the underlying etcd database, so direct backups/restores of it are not possible.
- Resources exposed through aggregated API servers can easily be backed up and restored even if they’re stored in a separate etcd database.
Backup
To recover the Astronomer platform in the case of an incident, there are two main components that need to be backed up:
- The Kubernetes cluster state
- The Astronomer Postgres database
Read below for specific instructions for how to backup both components.
Kubernetes cluster backup
With Velero, you can back up or restore all objects in your cluster, or you can filter objects by type, namespace, and/or label. There are two types of backups:
- On-Demand
- Scheduled
Generally speaking, the backup operation does the following:
- Uploads a tarball of copied Kubernetes objects into cloud object storage.
- Calls the cloud provider API to make disk snapshots of persistent volumes, if specified.
We’ll cover both on-demand and scheduled backups below. For more information on the above, refer to “How Velero Works.”
Prerequisites
The following instructions assume you have:
- Velero installed in your cluster
- The Velero CLI
kubectl
access to your cluster
If you do not already have both, reference Velero's documentation.
On-demand backup
If you need to create a backup on demand, run the following via the Velero CLI:
velero backup create <BACKUP NAME>
By default, the command above makes disk snapshots of any persistent volumes. You can adjust the snapshots by specifying additional flags. To see available flags, run:
velero backup create --help
Snapshots can be disabled with the option --snapshot-volumes=false.
Scheduled backup
Production environments should have scheduled backups enabled. The frequency of this backup depends on your needs and constraints.
We recommend that you start with at least daily backups and adjust the frequency from there as needed. To schedule a backup for a specific time, run:
velero schedule create <SCHEDULE NAME> --schedule "0 1 * * *"
The command above will schedule a daily backup of the entire cluster at 1am UTC. Velero uses standard Unix cron syntax to specify the schedule frequency and occurrence.
Database backup
There are two ways to backup the Astronomer Database:
- Enable Automatic Backups via your Cloud Provider (Preferred)
- Traditional Backup Tools (e.g. pg_dump for Postgresql)
Enable automatic backups with your cloud provider
The easiest and most reliable way to ensure the database is backed up is to enable automatic backups via your cloud provider. This will create daily backups of your Astronomer Postgresql database.
Refer to the following links to Cloud Provider documentation for creating Postgres Database Backups:
- AWS: Database backup and restore in AWS
- GCP: Create automatic backups in GCP
- Azure: Azure Postgres backup and restore
Similar to Velero, one-off snapshots can also be created that will represent the database at that specific time, rather than at the normal scheduled intervals.
Traditional backup tool (pg_dump
):
To run pg_dump
successfully, someone with “read” access to the Astronomer Database will need to collect the following (stored as a Kubernetes Secret):
- Database connection string
- Username
- Password
The connection string, which includes username and password, can be obtained via the following command:
kubectl -n astronomer get secret astronomer-houston-backend -o jsonpath='{.data.connection}' | base64 -D
Restore
In the case of an incident, you’re always free to restore either:
- A single Airflow Deployment
- The Whole Platform (all Airflow Deployments)
The guidelines below will cover both, including specifics for restoring both deleted and non-deleted Airflow Deployments.
Single Deployment
The steps below are valid for the Astronomer Platform on Helm3 (Astronomer v0.14+).
Non-deleted Airflow Deployment
To restore a previous version of a deployment that has NOT been deleted via the Software UI (or CLI/API) and that has been backed up with Velero, follow the steps below.
-
Identify the Velero backup you intend to use by running:
velero backup get
-
Identify the Kubernetes namespace in question, which corresponds to your Airflow Deployment’s “release name” and has your platform’s namespace (typically “astronomer”) prepended to the front.
For example, the namespace for an Airflow Deployment with the release name
weightless-meteor-5042
would beastronomer-weightless-meteor-5042
. -
Run:
velero restore create --from-backup <BACKUP NAME> --include-namespaces <NAMESPACE NAME>
Deleted Airflow Deployment
To restore a single Airflow Deployment that was deleted via the Software UI (or CLI/API), first perform the steps detailed above for restoring its namespace with Velero.
Once that is complete, the Astronomer Database needs to be updated to mark that release as not deleted. Follow the steps below.
-
Grab your database connection string (stored as a Kubernetes secret)
kubectl -n astronomer get secret astronomer-houston-backend -o jsonpath='{.data.connection}' | base64 -D
-
To connect to the database, launch a container into your cluster with the Postgres client:
kubectl run pgclient -ti --image=postgres --restart=Never --image-pull-policy=Always -- bash
-
Then run the following command to connect to the database:
psql <YOUR CONNECTION STRING>
Example:
psql postgres://airflow:XXXXXXX@database1.cloud.com:5432/astronomer_houston
-
Update the record for the deployment you wish to restore.
UPDATE houston$default."Deployment" SET "deletedAt" = NULL WHERE "releaseName" = '<YOUR RELEASE NAME>';
Following these steps, the restored Airflow Deployment should render in the Software UI with its corresponding Workspace. All associated pods should be running in the cluster.
Whole platform
In case your team ever needs to migrate to new infrastructure or your existing infrastructure is no longer accessible and you need to restore the Astronomer Platform in its entirety, including all Airflow Deployments within it, follow the steps below.
- Create a new Kubernetes cluster without Astronomer installed
- Install Velero into the new cluster, ensuring that it can reach the previous backups in their storage location (e.g. S3 storage or GCS Bucket)
- Set the Velero backup storage location to
readonly
to prevent accidentally overwriting any backups by running:
kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
--namespace velero \
--type merge \
--patch '{"spec":{"accessMode":"ReadOnly"}}'
From here,
-
Restore database snapshots to a new Postgres database or create a new database and restore from
pg_dump
backups. -
Perform velero full cluster restore by running:
velero restore create --from-backup <BACKUP NAME>
-
If the database endpoint has changed (e.g. it has a new hostname), it needs to be provided to the platform.
- AWS - Update the
astronomer-bootstrap
secret to have the new connection string. Then the pods in the astronomer namespace will need to be restarted to pick up this change. Thepgbouncer-config
secret in each release namespace will also need to be updated with the new endpoint in the connection string. - GCP - The
pg-sqlproxy-gcloud-sqlproxy
deployment needs to be updated to put the new database instance name in theinstances
argument passed to the container
- AWS - Update the