1. What are Astronomer Providers?
Astronomer Providers are a new set of open source providers created and maintained by Astronomer to support workloads and use cases that benefit from running asynchronously. These Apache 2-licensed providers are built for compatibility with OSS Airflow and will be supported and maintained long-term by Astronomer. All Astronomer Providers will soon be available on the Astronomer registry.
2. What is an Async Operator?
An async operator (also known as a deferrable operator) is an operator or a sensor that creates efficiencies in the utilization of Airflow worker slots. Unlike normal operators and sensors, which take up a full worker slot for the entire time they are running (even if they are idle), async operators have the ability to vacate the worker slot and free the worker to complete other tasks.
Example: If you have 100 worker slots available and 100 DAGs are each waiting on a sensor that is running but idle, you cannot run anything else, even though your entire Airflow cluster is effectively idle. Async operators solve this problem by deferring the task, which releases its worker slot for other tasks. The async operator is also equipped with a callback function that resumes the deferred task. This process of deferral and resumption is managed by the triggerer, an Airflow component run separately from the scheduler; more detail on what this process can and cannot do is explained in the video.
Below is an example structure of an async operator:
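A minimal sketch of that structure follows, using only the standard library so it runs without Airflow installed. `TriggerEvent`, `TaskDeferred`, and the `run_task` helper are simplified stand-ins written for this example, not Airflow's own code, and `WaitForCountTrigger` and `WaitForCountOperatorAsync` are hypothetical names. In a real provider the trigger would subclass `BaseTrigger` from `airflow.triggers.base`, and the operator would subclass `BaseOperator`, which already provides `defer()`.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Optional, Tuple


class TriggerEvent:
    """Stand-in for Airflow's TriggerEvent: carries a payload back to the task."""

    def __init__(self, payload: Any) -> None:
        self.payload = payload


class TaskDeferred(Exception):
    """Stand-in for the signal raised when a task defers itself."""

    def __init__(self, trigger: "WaitForCountTrigger", method_name: str) -> None:
        super().__init__(f"deferred until {method_name}")
        self.trigger = trigger
        self.method_name = method_name


class WaitForCountTrigger:
    """Hypothetical trigger: polls until a counter reaches a target."""

    def __init__(self, target: int, poll_interval: float = 0.0) -> None:
        self.target = target
        self.poll_interval = poll_interval

    def serialize(self) -> Tuple[str, Dict[str, Any]]:
        # Everything needed to rebuild the trigger elsewhere goes in these kwargs.
        kwargs = {"target": self.target, "poll_interval": self.poll_interval}
        return (f"{__name__}.WaitForCountTrigger", kwargs)

    async def run(self) -> AsyncIterator[TriggerEvent]:
        count = 0
        while count < self.target:
            await asyncio.sleep(self.poll_interval)  # non-blocking wait
            count += 1
        yield TriggerEvent({"status": "success", "count": count})


class WaitForCountOperatorAsync:
    """Hypothetical async operator: defers in execute, resumes in execute_complete."""

    def __init__(self, target: int) -> None:
        self.target = target

    def defer(self, trigger: WaitForCountTrigger, method_name: str) -> None:
        raise TaskDeferred(trigger, method_name)

    def execute(self, context: Optional[dict]) -> Any:
        # Hand the waiting over to the trigger; the worker slot is released here.
        self.defer(
            trigger=WaitForCountTrigger(target=self.target),
            method_name="execute_complete",
        )

    def execute_complete(self, context: Optional[dict], event: Optional[dict] = None) -> Any:
        # Callback that runs once the trigger has fired.
        if not event or event.get("status") != "success":
            raise RuntimeError(f"Trigger failed: {event}")
        return event["count"]


def run_task(operator: WaitForCountOperatorAsync, context: Optional[dict] = None) -> Any:
    """Toy stand-in for the hand-off between worker and triggerer."""
    try:
        return operator.execute(context)
    except TaskDeferred as deferred:
        # Rebuild the trigger from its serialized kwargs only.
        _classpath, kwargs = deferred.trigger.serialize()
        trigger = WaitForCountTrigger(**kwargs)

        async def first_event() -> TriggerEvent:
            async for event in trigger.run():
                return event
            raise RuntimeError("trigger finished without an event")

        event = asyncio.run(first_event())
        resume = getattr(operator, deferred.method_name)
        return resume(context, event=event.payload)


result = run_task(WaitForCountOperatorAsync(target=3))
print(result)  # 3
```

The key design point is that `execute` does no waiting itself: it only builds a trigger and names the method to resume in, so nothing on the worker stays busy while the trigger polls.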
3. Why Use Async Operators and Sensors?
- Reduced resource consumption: Using async operators and sensors reduces the number of workers needed to run tasks during periods of high concurrency. Your Airflow cluster wastes fewer resources on idle tasks, which can yield substantial savings, especially with long-running operators.
- Resiliency against restarts: Deferred tasks will not be set to a failure state if a triggerer process needs to be restarted due to a deployment or infrastructure issue.
4. When Should You Make an Operator or Sensor Async?
Consider creating an async version of any synchronous operator that takes more than a few minutes to complete.
For example, you won’t create an async operator for a BigQueryCreateEmptyTableOperator, which should run quite quickly, but you will create one for BigQueryInsertJobOperator, which actually runs queries and can take hours (in the worst-case scenario) for task completion.
Some example use cases:
- File system-based operations
- Network-backed operations
- Time-consuming tasks that can be executed asynchronously
- Database operations, such as executing a long-running query
- Poke operations
5. Best Practices
When considering using Astronomer Providers, we recommend taking the following best practices into consideration.
- Ensure you are using Python 3.7 or higher, as async operators and triggers rely on more recent asyncio features.
- Ensure your Airflow deployment runs at least one triggerer and one scheduler.
- Check whether the service's official client library supports async calls, and if not, find a third-party library that does.
- Make the async version of the operator easily swappable: no DAG-facing changes should be required apart from changing import paths.
- Write an async hook that the operator will use.
- Remember that your operator must defer itself with a trigger.
- No operator state persists automatically across a deferral. Pass any values you need to keep as kwargs so they are available when the task resumes.
- Any operation in a trigger that would otherwise block must be asynchronous: define it with async and call it with await.
- Do not run synchronous (blocking) operations inside the triggerer; a single blocking call stalls every trigger sharing that event loop.
- Pass the trigger, through its parameters, whatever information it needs to run in the triggerer process.
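The rule about blocking calls is easier to see in a small experiment. This is a toy illustration using only `asyncio`, not triggerer code; `polite_trigger`, `blocking_trigger`, and `run_pair` are hypothetical names. Two trigger-like coroutines share one event loop, and the log records the order in which their polls complete.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def polite_trigger(name: str, log: List[str], polls: int) -> None:
    for i in range(polls):
        await asyncio.sleep(0)  # yields control so other triggers can run
        log.append(f"{name}:{i}")


async def blocking_trigger(name: str, log: List[str], polls: int) -> None:
    for i in range(polls):
        time.sleep(0.01)  # synchronous sleep: the whole event loop is stuck here
        log.append(f"{name}:{i}")


async def run_pair(trigger_fn: Callable[[str, List[str], int], Awaitable[None]]) -> List[str]:
    """Run two triggers of the same kind on one loop and return the poll order."""
    log: List[str] = []
    await asyncio.gather(trigger_fn("a", log, 2), trigger_fn("b", log, 2))
    return log


interleaved = asyncio.run(run_pair(polite_trigger))
serialized = asyncio.run(run_pair(blocking_trigger))
print(interleaved)  # ['a:0', 'b:0', 'a:1', 'b:1']
print(serialized)   # ['a:0', 'a:1', 'b:0', 'b:1']
```

The awaiting version interleaves the two triggers, while the blocking version forces trigger `b` to wait until `a` has finished entirely. With many triggers on one loop, that delay applies to all of them.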