All applications are vulnerable to service-interrupting events: a network outage, a team member accidentally deleting a namespace, a critical bug introduced by your latest application push, or even a natural disaster. These events are rare and undesirable, but modern teams running enterprise-grade software need to protect against them. At Astronomer, we encourage all customers to have a robust, targeted, and well-tested disaster recovery (DR) plan.
This guide provides guidelines for how to back up the Astronomer platform and its database, and how to restore a single Airflow Deployment or the entire platform.
At Astronomer, we strongly recommend Velero for both backup and restore operations. Velero is an open source tool acquired by VMware that is built for Kubernetes backups and migrations.
Unlike other tools that directly access the Kubernetes etcd database to perform backups and restores, Velero uses the Kubernetes API to capture the state of cluster resources and to restore them when necessary. This API-driven approach has a number of key benefits:
To recover the Astronomer platform in the case of an incident, back up the following resources in order of priority:
You should never back up Redis PVCs. Restoring Redis can result in conflicting Airflow and Celery task state information.
Read below for specific instructions for how to back up these components.
With Velero, you can back up or restore all objects in your cluster, or you can filter objects by type, namespace, and/or label. There are two types of backups: on-demand and scheduled.
Generally speaking, the backup operation does the following:
We’ll cover both on-demand and scheduled backups below. For more information, see How Velero Works.
The following instructions assume you have:
kubectl access to your cluster
If you do not have Velero or the Velero CLI installed, see How Velero Works.
If you need to create a backup on demand, run the following in the Velero CLI:
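A minimal sketch of such an on-demand backup, wrapped in a shell function so that the backup name is an explicit argument. The function name `backup_now` is hypothetical; `velero backup create` is the Velero CLI subcommand that creates a backup.

```shell
# Create an on-demand Velero backup of the whole cluster.
# backup_now is a hypothetical helper; $1 is a backup name of your choice.
backup_now() {
  velero backup create "$1"
}
```

For example, `backup_now astronomer-manual-backup` would request a backup named astronomer-manual-backup.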
By default, the command above makes disk snapshots of any persistent volumes. You can adjust the snapshots by specifying additional flags. To see available flags, run:
Snapshots can be disabled with the option --snapshot-volumes=false.
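The flag lookup and the filtering options described above might be combined as in the following sketch. The helper names are hypothetical, and the namespace and label values are placeholders you would supply; `--include-namespaces` and `--selector` are the Velero flags for filtering by namespace and by label.

```shell
# Print every flag accepted by `velero backup create`.
show_backup_flags() {
  velero backup create --help
}

# Back up one namespace, filtered by label, without volume snapshots.
# Arguments: backup name, namespace, label selector (for example "tier=airflow").
backup_filtered() {
  velero backup create "$1" \
    --include-namespaces "$2" \
    --selector "$3" \
    --snapshot-volumes=false
}
```

Disabling snapshots keeps the backup to Kubernetes objects only, which can be useful when persistent volumes are backed up by other means.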
Production environments should have scheduled backups enabled. The frequency of this backup depends on your needs and constraints.
We recommend that you start with at least daily backups and adjust the frequency from there as needed. To schedule a backup for a specific time, run:
The command above will schedule a daily backup of the entire cluster at 1am UTC. Velero uses standard Unix cron syntax to specify the schedule frequency and occurrence.
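As a sketch of a schedule matching that description, the function below creates a daily backup at 01:00. The helper name `schedule_daily_backup` is hypothetical; the five cron fields are minute, hour, day of month, month, and day of week.

```shell
# Schedule a daily full-cluster backup at 01:00.
# Cron fields: minute=0, hour=1, every day of month, every month, every weekday.
# schedule_daily_backup is a hypothetical helper; $1 is the schedule name.
schedule_daily_backup() {
  velero schedule create "$1" --schedule="0 1 * * *"
}
```

Changing the expression to `0 */6 * * *`, for instance, would run the backup every six hours instead.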
You can use one of the following methods to back up the Astronomer database:
The easiest and most reliable way to ensure the database is backed up is to enable automatic backups with your cloud provider. This creates daily backups of your Astronomer PostgreSQL database.
Refer to your cloud provider's documentation for creating Postgres database backups:
As with Velero, you can also create one-off snapshots that capture the database at a specific point in time, outside the normal schedule.
pg_dump:
To run pg_dump successfully, someone with “read” access to the Astronomer database needs to collect the following (stored as a Kubernetes Secret):
Run the following command to return the connection string with the username and password:
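A minimal sketch of that lookup, assuming the connection string is stored under a `connection` key in the astronomer-bootstrap secret and that the platform namespace is `astronomer`. Both are assumptions about your installation; adjust them if your secret is laid out differently. The helper name is hypothetical.

```shell
# Print the decoded database connection string from the bootstrap secret.
# Assumptions: secret name "astronomer-bootstrap", data key "connection".
# $1 is the platform namespace (defaults to "astronomer").
get_connection_string() {
  namespace="${1:-astronomer}"
  kubectl get secret astronomer-bootstrap \
    --namespace "$namespace" \
    -o jsonpath='{.data.connection}' | base64 --decode
}
```

Kubernetes stores secret values base64-encoded, which is why the output is piped through `base64 --decode`.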
In the case of an incident, you can restore either a single Airflow Deployment or the entire Astronomer platform.
The guidelines below will cover both, including specifics for restoring both deleted and non-deleted Airflow Deployments.
The steps below are valid for the Astronomer Platform on Helm 3 (Astronomer v0.14+).
To restore a previous version of a deployment that has not been deleted in the Astronomer Software UI (or CLI/API) and that has been backed up with Velero, follow the steps below.
Identify the Velero backup you intend to use by running:
Identify the Kubernetes namespace in question, which is your Airflow Deployment’s “release name” with your platform’s namespace (typically “astronomer”) prepended.
For example, the namespace for an Airflow Deployment with the release name weightless-meteor-5042 would be astronomer-weightless-meteor-5042.
Run:
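The three steps above can be sketched as follows. The helper names are hypothetical; `velero backup get` and `velero restore create --from-backup` are the Velero CLI subcommands for listing backups and restoring from one.

```shell
# Step 1: list the available Velero backups and their status.
list_backups() {
  velero backup get
}

# Step 2: build the Deployment namespace from the platform namespace
# and the release name, joined with a hyphen.
deployment_namespace() {
  printf '%s-%s\n' "$1" "$2"
}

# Step 3: restore only that namespace from the chosen backup.
# Arguments: backup name, namespace to restore.
restore_deployment() {
  velero restore create --from-backup "$1" \
    --include-namespaces "$2"
}
```

For example, `restore_deployment <backup-name> "$(deployment_namespace astronomer weightless-meteor-5042)"` would restore the astronomer-weightless-meteor-5042 namespace from the named backup.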
To restore a single Airflow Deployment that was deleted in the Astronomer Software UI (or CLI/API), first follow the previous steps to restore its namespace from a Velero backup.
Once that is complete, the Astronomer Database needs to be updated to mark that release as not deleted. Follow the steps below.
Grab your database connection string (stored as a Kubernetes secret).
To connect to the database, launch a container in your cluster with the Postgres client:
Then run the following command to connect to the database:
Example:
Update the record for the deployment you wish to restore.
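The client-pod and connection steps above might look like the following sketch. The pod name `psql-client` and the `postgres` image are assumptions (any image that ships the psql client would do), and the helper names are hypothetical. The update statement itself is not sketched here because it depends on the schema of your Astronomer database.

```shell
# Start a throwaway pod with the Postgres client tools and open a shell in it.
# Assumptions: pod name "psql-client", image "postgres". $1 is the namespace.
launch_psql_pod() {
  kubectl run psql-client --rm -it --restart=Never \
    --namespace "$1" \
    --image=postgres \
    --command -- bash
}

# Inside the pod: connect using the connection string retrieved earlier.
# $1 is the full connection string, including the database name.
connect_to_database() {
  psql "$1"
}
```

The `--rm` flag removes the pod when the shell exits, so no client container is left behind in the cluster.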
Following these steps, the restored Airflow Deployment should render in the Software UI with its corresponding Workspace. All associated pods should be running in the cluster.
If your team needs to migrate to new infrastructure, or your existing infrastructure is no longer accessible and you need to restore the entire Astronomer Platform, including all Airflow Deployments within it, follow the steps below.
Set the Velero backup storage location to read-only to prevent accidentally overwriting any backups by running:
From here:
Restore database snapshots to a new Postgres database or create a new database and restore from pg_dump backups.
Perform a full cluster restore with Velero by running:
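A sketch of the read-only step and the full restore, assuming Velero is installed in the `velero` namespace. The helper names are hypothetical; `accessMode: ReadOnly` is the field on Velero's BackupStorageLocation resource that blocks writes to the backup store.

```shell
# Mark a Velero backup storage location as read-only so that the restore
# cannot create or delete backup objects in it.
# Assumption: Velero runs in the "velero" namespace. $1 is the location name.
set_backup_location_read_only() {
  kubectl patch backupstoragelocation "$1" \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'
}

# Restore every resource captured in the given backup.
# $1 is the backup name, as listed by `velero backup get`.
restore_full_cluster() {
  velero restore create --from-backup "$1"
}
```

Once the restore has finished and been verified, the same patch with `"ReadWrite"` returns the storage location to normal operation.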
If the database endpoint has changed (e.g. it has a new hostname), it needs to be provided to the platform.
Update the astronomer-bootstrap secret to have the new connection string. Then restart the pods in the astronomer namespace so that they pick up this change. The pgbouncer-config secret in each release namespace must also be updated with the new endpoint in its connection string.
The pg-sqlproxy-gcloud-sqlproxy deployment needs to be updated to put the new database instance name in the instances argument passed to the container.