Version: v0.10.0

Data Management


Design Philosophy

Astronomer Enterprise ships with a collection of microservices that run in groups of Kubernetes pods. The only component not included with the platform by default is the Postgres instance that sits behind the platform API and the Airflow deployments that you spin up.

There are a few reasons for this.

  1. Managed Postgres Benefits: Although you can choose between a managed and a vanilla Postgres instance, we strongly recommend a managed one. This can be hosted by a cloud provider (Google Cloud SQL, Amazon RDS, etc.) or be another managed Postgres solution that your organization provides. Managed systems come out of the box with benefits such as improved availability, failover, and security.

  2. Backup: Keeping Postgres separate from the Kubernetes environment gives the platform an extra layer of stability. If a disaster takes down your k8s cluster, the database persists, and the platform can be fully redeployed with its most recent configuration and metadata store rather than being re-installed from scratch.

Postgres Schema Structure

The Postgres instance will contain one database for the Astronomer API. The data model for this database is defined here using GraphQL SDL; each Type in that file represents an individual table in our Houston API schema. That file will always be up to date with the latest table structure.

Houston ERD

Additionally, each Airflow deployment that you create will be given a new and separate database in Postgres to use for its metadata. These databases will contain the stock Airflow schema. Any sensitive data entered through the Airflow UI will be encrypted using the standard Airflow fernet key mechanisms and stored in the appropriate databases.

Data Persistence

Airflow

All Airflow databases persist until the deployment is deleted via the Astronomer UI or API. When the user deletes a deployment, its database continues to persist for 15 days to allow a degree of restorability. After those 15 days pass, it is assumed that the user no longer needs access to that database, and a Kubernetes CronJob hard deletes the database and the registry tags associated with that deployment.

Note that this 15-day timeframe is configurable via Helm.
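
The grace-period rule above can be illustrated with a minimal Python sketch. This is not the platform's actual cleanup job: the function name, the `deleted_at` field, and the `RETENTION_DAYS` constant are hypothetical names chosen for illustration, and the real window is whatever is configured via Helm.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed default, mirroring the 15-day window described above.
RETENTION_DAYS = 15


def is_due_for_hard_delete(
    deleted_at: Optional[datetime],
    now: datetime,
    retention_days: int = RETENTION_DAYS,
) -> bool:
    """Return True once a soft-deleted deployment has outlived its grace period."""
    if deleted_at is None:
        # The deployment was never deleted, so its database persists.
        return False
    return now - deleted_at >= timedelta(days=retention_days)


now = datetime(2019, 6, 30, tzinfo=timezone.utc)
# Active deployment: never eligible.
print(is_due_for_hard_delete(None, now))
# Deleted 10 days ago: still restorable.
print(is_due_for_hard_delete(datetime(2019, 6, 20, tzinfo=timezone.utc), now))
# Deleted 20 days ago: eligible for hard deletion.
print(is_due_for_hard_delete(datetime(2019, 6, 10, tzinfo=timezone.utc), now))
```

A periodic job built on a check like this would only act on deployments for which the function returns True, leaving everything inside the window untouched.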

Log and Metrics Data

Astronomer also puts logging and high-level k8s metrics data to use through an EFK (Elasticsearch, Fluentd, Kibana) stack and a Prometheus/Grafana stack.

Logs generated by the Airflow deployments are scraped and stored in an Elasticsearch cluster for 15 days by default. We restrict log access on a per-deployment basis, but we take no steps to ensure this data is encrypted, so we urge users not to log any sensitive data.
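
Because stored logs are not encrypted, one way to reduce the risk of leaking secrets is to mask them before a record is ever emitted. The following is a minimal sketch using Python's standard `logging` module; the `SENSITIVE` pattern, the `redact` helper, the `RedactingFilter` class, and the `example_dag` logger name are all hypothetical, and this is not something the platform applies on your behalf.

```python
import logging
import re

# Hypothetical pattern: matches key=value pairs for a few common secret names.
SENSITIVE = re.compile(r"(?i)\b(password|token|secret|api_key)=\S+")


def redact(text: str) -> str:
    """Replace the value of each sensitive key=value pair with ***."""
    return SENSITIVE.sub(lambda m: f"{m.group(1)}=***", text)


class RedactingFilter(logging.Filter):
    """Mask sensitive values on records logged through the attached logger."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges msg with its args, so redact the final text.
        record.msg = redact(record.getMessage())
        record.args = ()  # args are already merged into msg
        return True  # keep the record; only its content changes


logger = logging.getLogger("example_dag")
logger.addFilter(RedactingFilter())
logger.warning("connecting with password=%s", "hunter2")
```

Note that a filter attached to a logger only applies to records logged directly through that logger; attaching it to a handler instead covers every record the handler processes. A pattern-based mask is a safety net, not a substitute for keeping secrets out of log statements.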

Metrics are stored in Prometheus and contain no sensitive information. This data is also pruned after 15 days by default. This time window is also configurable via Helm.
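
To make the retention window concrete, here is a small Python sketch that selects date-suffixed indices older than the window. It assumes a hypothetical naming scheme in which each index name ends in `YYYY.MM.DD`; the source does not specify how indices are named or how the platform performs pruning, so treat `expired_indices` and the sample names as illustrative only.

```python
from datetime import date, datetime, timedelta
from typing import List


def expired_indices(
    names: List[str], today: date, retention_days: int = 15
) -> List[str]:
    """Return names whose trailing YYYY.MM.DD date is older than the window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            # Assumed scheme: the last 10 characters are the index date.
            day = datetime.strptime(name[-10:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a dated index, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


names = ["logs-2019.06.01", "logs-2019.06.15", "logs-2019.06.29", ".kibana"]
print(expired_indices(names, today=date(2019, 6, 30)))
```

With a 15-day window and a reference date of 2019-06-30, only indices dated before 2019-06-15 are selected, and names without a parseable date are skipped rather than deleted.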