Install Astronomer Certified on a virtual machine
The Astronomer Certified Python wheel is a distribution of Apache Airflow maintained by Astronomer. While functionally identical to Apache Airflow, the Astronomer Certified Python wheel includes additional bug and security fixes, as well as extended support from the Astronomer team.
If you want to run Astronomer’s distribution of Airflow without using Docker, you can install the Astronomer Certified Python wheel on one or more virtual machines.
This guide provides steps for installing the Python wheel onto a single virtual machine. By the end of the setup, you’ll have a simple development environment for Airflow running on your local machine.
Note that this setup represents one possible configuration of Astronomer Certified and uses optional tools such as systemd and PostgreSQL. After successfully starting Airflow for the first time, we recommend reviewing this configuration and adjusting it based on the functional requirements for your project.
Prerequisites
The machines where you install Astronomer Certified must be Debian-based. CentOS and Windows Server are currently unsupported.
Once you’ve decided which machine you’ll be installing Astronomer Certified on, ensure that the following OS-level packages are installed on the machine:
- sudo
- python3
- python3-dev
- python3-venv
- python3-psycopg2
- gcc
- postgresql
- systemd
If you’re running on a Debian-based OS, you can install these with the following command:
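The exact invocation depends on your distribution; on Debian or Ubuntu, an `apt-get` command along these lines should cover the packages listed above:

```bash
# Install the OS-level dependencies listed above (Debian/Ubuntu package names)
sudo apt-get update
sudo apt-get install -y sudo python3 python3-dev python3-venv python3-psycopg2 gcc postgresql systemd
```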
You also need a database on the machine that will run your Airflow instance. This guide walks through the process for configuring a PostgreSQL database, which is our recommended implementation, but Airflow is compatible with all of the following databases:
- PostgreSQL: 9.6, 10, 11, 12, 13
- MySQL: 5.7, 8
- SQLite: 3.15.0+
Lastly, this guide assumes that you are installing Airflow 2.0+. The differences for installing pre-2.0 versions of Airflow are noted throughout the guide.
Step 1: Set up Airflow’s metadata database
In Airflow, the metadata database is responsible for keeping a record of all tasks across DAGs and their corresponding status (queued, scheduled, running, success, failed, etc). To set up the metadata database:
- Create a database user named `airflow`. The command (shown in the sketch after this list) will prompt you for a password. Create one, and make a note of it for later.
- Create a database named `airflow` and set the `airflow` user as its owner.
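A minimal sketch of both commands, assuming a local PostgreSQL server and the standard `createuser`/`createdb` client tools:

```bash
# Create the airflow role (prompts for a password) ...
sudo -u postgres createuser airflow --pwprompt

# ... and the airflow database, owned by that role
sudo -u postgres createdb --owner airflow airflow
```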
This guide assumes that your database server is local to where you run these commands and that you’re on a Debian-like OS. If your setup is different, you will need to tweak these commands.

If Airflow will connect to the database from another host, you might also need to edit the `/var/lib/postgresql/data/pg_hba.conf` file and restart Postgres. Editing this file will vary for each individual database setup. Before editing this file, consider the security implications for your team.

If you’d like to use an existing PostgreSQL database instead of creating a new one, you can do so as long as both of the following are true:
- The database is compatible with Airflow as described in Prerequisites.
- A user named `airflow` has ownership access to the database.
When you specify the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` environment variable in Step 7, replace the connection string with one that corresponds to your database.
Step 2: Create a system user to run Airflow
Airflow can run as any user, but for this setup we configure a new user called `astro`. Run the following command to add this user to your machine:
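A sketch using `useradd`; the exact flags (home directory, shell, and so on) are up to you:

```bash
# Create a dedicated astro user with a home directory
sudo useradd --create-home astro
```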
Step 3: Create an Astro project directory
You also need to configure an `AIRFLOW_HOME` directory (not to be confused with the user’s home directory) where you’ll store your DAGs and other necessary files. We recommend using the path `/usr/local/airflow` as your project directory and `/usr/local/airflow/dags` as your DAG directory, but you can choose any path as long as the `astro` user has write access to it. To do this, run the following commands:
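Assuming the recommended paths, the directories can be created and handed to the `astro` user like this:

```bash
# Create the project and DAGs directories, then give the astro user ownership
sudo mkdir -p /usr/local/airflow/dags
sudo chown -R astro /usr/local/airflow
```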
Step 4: Create a virtual environment
To isolate your Airflow components from changes to the system, create a virtual environment in a directory named `astro/airflow-venv` using the following command:
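A sketch, assuming the virtual environment lives under the `astro` user’s home directory at `/home/astro/airflow-venv` (adjust the path if you place it elsewhere):

```bash
# Create a Python virtual environment owned by the astro user
sudo -u astro python3 -m venv /home/astro/airflow-venv
```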
venv is a tool to create lightweight, isolated Python environments without affecting systemwide configuration. For more information, read Python’s venv documentation.
Step 5: Install Astronomer Certified
To install the AC Python wheel onto your machine, run one of the following commands depending on your chosen Airflow version and executor:
- For the Local executor: the first command in the sketch after this list.
- For the Celery executor: the second command in the sketch after this list.
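The commands below are a sketch: they assume the `astronomer-certified` package published to Astronomer’s pip index (`https://pip.astronomer.io/simple/`), the virtual environment path used above, and a placeholder Airflow version.

```bash
# Local executor: include the postgres extra
sudo -u astro /home/astro/airflow-venv/bin/pip install \
  --extra-index-url=https://pip.astronomer.io/simple/ \
  'astronomer-certified[postgres]==<airflow-version>.*'

# Celery executor: also include the celery and redis extras
sudo -u astro /home/astro/airflow-venv/bin/pip install \
  --extra-index-url=https://pip.astronomer.io/simple/ \
  'astronomer-certified[postgres,celery,redis]==<airflow-version>.*'
```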
For example, to install the latest patch version of Apache Airflow 2.0.1 with support for the Celery executor, this command would be:
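Under the same assumptions, that would look like:

```bash
sudo -u astro /home/astro/airflow-venv/bin/pip install \
  --extra-index-url=https://pip.astronomer.io/simple/ \
  'astronomer-certified[postgres,celery,redis]==2.0.1.*'
```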
This command includes the optional `postgres`, `celery`, and `redis` dependencies so that all libraries for those tools are also installed. If your environment requires extra functionality, specify additional dependencies in a comma-delimited list:
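For example (the `jdbc` and `ssh` extras here are purely illustrative):

```bash
sudo -u astro /home/astro/airflow-venv/bin/pip install \
  --extra-index-url=https://pip.astronomer.io/simple/ \
  'astronomer-certified[postgres,celery,redis,jdbc,ssh]==2.0.1.*'
```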
For a list of all optional dependencies, refer to the AC pip index.
Step 6: Configure a process supervisor
To ensure that Airflow is always running when your machine is on, we recommend implementing a process supervisor. Systemd is used in this example, though any process supervisor works here.
To use systemd as a process supervisor:
- Create a systemd unit file for the scheduler (first command in the sketch after this list).
- Using a text editor, create and edit a file at `${AIRFLOW_HOME}/sys-config` to contain the environment variables and values shown in the sketch. If you want to configure environment variables for a single Airflow service, we recommend doing so in the `sys-config` file for the machine on which the service is running.
- Add the unit definition shown in the sketch to your systemd unit file.
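The following is a minimal sketch. The unit name `airflow-scheduler.service`, the file paths, and the variable values are assumptions based on the paths used earlier in this guide; adjust them for your setup.

```bash
# Create an empty unit file for the scheduler
sudo touch /etc/systemd/system/airflow-scheduler.service
```

A possible `${AIRFLOW_HOME}/sys-config`:

```ini
# /usr/local/airflow/sys-config -- environment shared by the Airflow services
AIRFLOW_HOME=/usr/local/airflow
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/astro/airflow-venv/bin
```

A possible unit definition:

```ini
# /etc/systemd/system/airflow-scheduler.service
[Unit]
Description=Airflow scheduler
After=network.target postgresql.service

[Service]
EnvironmentFile=/usr/local/airflow/sys-config
User=astro
Group=astro
ExecStart=/home/astro/airflow-venv/bin/airflow scheduler
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```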
Step 7: Configure Airflow for database access
To connect your Airflow environment to the metadata database you created in Step 1, add the following environment variables to your `sys-config` file depending on your chosen executor:

- For the Local executor: the first group of variables in the sketch after this list.
- For the Celery executor: the second group of variables in the sketch after this list.
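A sketch of the variables, assuming the database from Step 1 on localhost and, for Celery, a Redis broker on localhost; substitute your own password, hosts, and broker:

```ini
# Local executor
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:<password>@localhost:5432/airflow

# Celery executor
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:<password>@localhost:5432/airflow
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:<password>@localhost:5432/airflow
AIRFLOW__CELERY__BROKER_URL=redis://localhost:6379/0
```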
The password you specify here should be the same one you specified when prompted by the `createuser` command in Step 1. If your password contains `%`, `/`, or `@`, you will need to URL-escape it: replace `%` with `%25`, `/` with `%2F`, and `@` with `%40`.
When you’ve finished configuring environment variables, run the following command to add your environment variables to your `astro` user’s shell environment:
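One possible approach (an assumption, not the only way) is to copy each line of `sys-config` into the `astro` user’s `.bashrc` as an `export` statement:

```bash
# Append export statements for every variable in sys-config to the astro user's .bashrc
sudo bash -c 'sed "s/^/export /" /usr/local/airflow/sys-config >> /home/astro/.bashrc'
```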
Optional: Configure a secret backend for database access
Your Airflow user password is stored in your `sys-config` file (owned by `root:root` with `0600` permissions) on your nodes. If you’d rather use an existing credential store, such as HashiCorp Vault, you can instead specify a command to obtain the connection string when the service starts up. For example:
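Airflow can read certain sensitive options, including the SQL Alchemy connection, from a `_CMD`-suffixed variable that names a command to run at startup; the Vault invocation below is purely illustrative:

```ini
# Fetch the connection string from Vault at service startup instead of storing it in sys-config
AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=vault kv get -field=sql_alchemy_conn secret/airflow
```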
Step 8: Set up the scheduler
In Airflow, the scheduler is responsible for reading from the metadata database to check on the status of each task and for deciding the order in which tasks should be completed. To get your scheduler running:
- Enable the scheduler by running the first command in the sketch after this list.
- Edit the override file for the machine by running the second command in the sketch.
- In the override file, add the lines shown in the sketch. Note: If you’re running Airflow 1.10, the command specified here will instead be `airflow upgradedb`.
- Start the service by running the final command in the sketch.
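A sketch of these commands, assuming the `airflow-scheduler.service` unit and virtual environment path from Step 6:

```bash
# Enable the scheduler service so it starts on boot
sudo systemctl enable airflow-scheduler

# Open the override file for editing
sudo systemctl edit airflow-scheduler
```

In the override file, an `ExecStartPre` line can run the database migrations before the scheduler starts:

```ini
[Service]
ExecStartPre=/home/astro/airflow-venv/bin/airflow db upgrade
```

```bash
# Start the scheduler
sudo systemctl start airflow-scheduler
```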
Step 9: Set up the webserver
The webserver is a core Airflow component that is responsible for rendering the Airflow UI. To configure it on its own machine, follow the steps below.
- Enable the webserver by running the first command in the sketch after this list.
- Start the webserver by running the second command in the sketch.
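Assuming you created an `airflow-webserver.service` unit analogous to the scheduler unit (with `ExecStart` pointing at `airflow webserver`), these commands would be:

```bash
# Enable and start the webserver service
sudo systemctl enable airflow-webserver
sudo systemctl start airflow-webserver
```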
Step 10: Set up workers (Celery only)
Workers are an essential component for running Airflow with the Celery executor. To set up Celery workers on your machine:
- Create a new systemd unit file specifically for your Celery workers by running the first command in the sketch after this list.
- In the unit file, add the lines shown in the sketch.
- Enable the worker service by running the corresponding command in the sketch.
- Start the service by running the final command in the sketch.
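A sketch, assuming an `airflow-worker.service` unit name and the same paths as the scheduler unit:

```bash
# Create an empty unit file for the Celery workers
sudo touch /etc/systemd/system/airflow-worker.service
```

```ini
# /etc/systemd/system/airflow-worker.service
[Unit]
Description=Airflow Celery worker
After=network.target

[Service]
EnvironmentFile=/usr/local/airflow/sys-config
User=astro
Group=astro
ExecStart=/home/astro/airflow-venv/bin/airflow celery worker
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start the worker service
sudo systemctl enable airflow-worker
sudo systemctl start airflow-worker
```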
Step 11: Create an Airflow user
To log in to the Airflow UI, you need to first create an Airflow user:
- Switch to your system `astro` user (first command in the sketch after this list). All Airflow CLI commands must be run from your `astro` user.
- Create a new `admin` Airflow user (second command in the sketch).
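A sketch; the username, password, and contact details are placeholders, and the `airflow` command assumes the virtual environment’s `bin` directory is on the `astro` user’s `PATH` (see Step 7):

```bash
# Switch to the astro user
sudo su - astro

# Create an admin user for the Airflow UI
airflow users create \
  --role Admin \
  --username admin \
  --password <password> \
  --firstname <first-name> \
  --lastname <last-name> \
  --email <email>
```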
Step 12: Confirm the installation
To confirm that you successfully installed Apache Airflow, open `http://localhost:8080` in your web browser. You should see the login screen for the Airflow UI.
Log in with your `admin` user. From there, you should see Airflow’s primary ‘DAGs’ view.