Running DAGs and Scaling
Once you've created your deployment, you can configure it for the use case at hand.
The second half of the Configure tab allows you to adjust your resource components, empowering you to freely scale your deployment up or down. To that end, you can:
- Choose your Executor (Local or Celery)
- Adjust resources to your Scheduler and Webserver
- Adjust worker count (Celery only)
- Adjust your Worker Termination Grace Period (Celery only)
- Add Extra Capacity (Kubernetes only)
In the Components section, you can adjust the AUs (Astronomer Units of CPU and memory) you want to allocate to your Scheduler, Webserver, and Celery Workers, if applicable.
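As a rough sketch of what an AU buys you (assuming the commonly documented mapping of 1 AU to 0.1 CPU and 384 MiB of memory — verify against your platform's current AU definition, since this value is an assumption here):

```python
# Sketch: convert an AU count into approximate CPU/memory totals.
# ASSUMPTION: 1 AU = 0.1 CPU core and 384 MiB of memory; check your
# Astronomer platform's current AU definition before relying on this.
CPU_PER_AU = 0.1   # cores
MEM_PER_AU = 384   # MiB

def resources_for(au_count):
    """Return (cpu_cores, memory_mib) for a given number of AUs."""
    return au_count * CPU_PER_AU, au_count * MEM_PER_AU

cpu, mem = resources_for(10)  # e.g. a scheduler allocated 10 AUs
print(f"{cpu} cores, {mem} MiB")
```

This is only back-of-the-envelope arithmetic for capacity planning; the sliders in the Configure tab are the source of truth for what your deployment actually receives.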
If you're running Astronomer Enterprise, you can watch these in real time with your Grafana dashboards.
Airflow Executors 101
Check out this guide for a summary of each executor.
Which executor should I be using?
Generally speaking, we recommend the local executor for any "dev" environments and the Celery executor for any "production" environments.
The local executor will execute your DAGs in the same pod as the scheduler. If you are only running a few light tasks a day that don't pull much into memory, you might be able to get away with just the local executor. As you scale up the number of tasks or the resources your workflows require, we recommend moving over to Celery.
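On Astronomer the executor is chosen in the Configure tab, but for reference, the same choice expressed as standard Airflow configuration is the `executor` option in the `[core]` section, which Airflow also reads from an environment variable (a sketch, not the recommended way to change it on Astronomer):

```python
import os

# Airflow reads any [core] config option from an environment variable
# named AIRFLOW__CORE__<KEY>. The executor setting looks like this:
os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"

# The "dev" counterpart would be:
#   os.environ["AIRFLOW__CORE__EXECUTOR"] = "LocalExecutor"

print(os.environ["AIRFLOW__CORE__EXECUTOR"])
```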
Regardless of which executor you are using, each task runs in a temporary container. No task has access to any locally stored file created by a separate task.
Scaling the Scheduler and Webserver
If you are seeing delays in tasks being scheduled (check the Gantt Chart), it's usually time to scale up your scheduler. You can also receive email alerts when your scheduler is underprovisioned (more on this in the Alerting section).
If your Airflow UI is really slow or crashes when you try to load a large DAG, you'll want to scale up your webserver.
The Extra Capacity setting is tied to several dimensions related to the KubernetesPodOperator and the Kubernetes Executor, as it maps to extra pods created in the cluster. Namely, the slider affects (1) CPU and memory quotas and (2) database connection limits.
Database connections shows how many actual connections to Astronomer's database (not yours) are actively being used, whereas client connections refers to all Airflow connections opened against PgBouncer (a lightweight connection pool manager for Postgres) for a particular deployment. The latter will normally be a higher and more variable number.
Importantly, these connections do NOT have any impact on the way you write your DAGs or on how many concurrent connections you hold to your own databases. They only reflect how the Webserver, Scheduler, and Workers connect to Astronomer's Postgres to update the state of variables, DAGs, tasks, etc. Unless you're using the KubernetesPodOperator or the soon-to-come Kubernetes Executor, don't worry about it.
Environment Variables ("Env Vars") are a set of configurable values that allow you to dynamically fine-tune your Airflow deployment - they encompass everything from email alerts to the number of tasks that can run at once (concurrency). They're traditionally defined in your airflow.cfg, but you can now insert them directly via Astronomer's UI.
For a full list of Environment Variables you can configure, go here. Every environment variable you set will be stored as a Kubernetes secret in your deployment.
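Airflow maps every airflow.cfg option to an environment variable named `AIRFLOW__{SECTION}__{KEY}` (section and key upper-cased, with double underscores as separators), which is the name format you'd enter in the UI. A small sketch of the convention, using the hypothetical example of raising parallelism:

```python
import os

# Any airflow.cfg option can be overridden by an environment variable
# named AIRFLOW__{SECTION}__{KEY}. For example, [core] parallelism becomes:
os.environ["AIRFLOW__CORE__PARALLELISM"] = "32"

def env_var_name(section: str, key: str) -> str:
    """Build the environment-variable name for an airflow.cfg option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"

print(env_var_name("core", "parallelism"))
```

The value `32` above is purely illustrative; pick limits that match the resources you've allocated to your deployment.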
Note: Environment Variables are distinct from Airflow Variables/XComs, which you can configure directly via the Airflow UI/our CLI/your DAG code and are used for inter-task communication.