Remote Execution Agent failure and recovery scenarios
When the heartbeat between the API server and a Remote Execution Agent is disrupted, the Astro executor prevents task duplication by marking that agent's queued tasks as failed, which makes them eligible for reassignment to healthy agents. To ensure safe task execution, an agent must receive explicit confirmation from the API server before starting any task. If an agent loses connectivity with the API server, it continues executing any tasks that the API server already confirmed and marked as running, but it does not start new tasks until heartbeat communication is restored.
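As a mental model, the confirm-before-start rule looks roughly like the following sketch. This is illustrative only: `AgentState`, `on_heartbeat_lost`, and `may_start` are hypothetical names, not Astro or Airflow APIs.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Hypothetical model of an agent's local state (not an Astro API)."""
    heartbeat_ok: bool = True
    running: set = field(default_factory=set)  # task IDs the API server confirmed

def on_heartbeat_lost(agent: AgentState) -> None:
    # Losing the heartbeat never interrupts tasks the API server already
    # confirmed and marked as running; it only blocks new work.
    agent.heartbeat_ok = False

def may_start(agent: AgentState, api_confirmed: bool) -> bool:
    # A task starts only with explicit API-server confirmation, and never
    # while heartbeat communication is down.
    return agent.heartbeat_ok and api_confirmed
```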
Agent failure
The API server marks an agent as failed after it misses three consecutive heartbeat intervals. When that happens, the API server checks whether the agent has any queued tasks, or tasks that the agent picked up and started running but has not yet reported as complete. If a worker agent fails, the API server marks those tasks as failed and makes them available for reassignment. If a triggerer agent fails, the API server reassigns its tasks immediately, since triggerer tasks are short-lived and idempotent.
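A compact sketch of this detection rule and the worker/triggerer distinction is below. The heartbeat interval and the function names are assumptions made for illustration; the real values and code paths are internal to Astro.

```python
import time

HEARTBEAT_INTERVAL_S = 10  # assumed interval; the real value is Astro-internal
MISSED_LIMIT = 3           # an agent is marked failed after 3 missed intervals

def agent_failed(last_heartbeat_at: float, now: float | None = None) -> bool:
    """True once three consecutive heartbeat intervals pass with no heartbeat."""
    now = time.time() if now is None else now
    return now - last_heartbeat_at > MISSED_LIMIT * HEARTBEAT_INTERVAL_S

def handle_failed_agent(agent_kind: str) -> str:
    # Worker tasks are first marked failed, then become eligible for
    # reassignment; triggerer tasks are short-lived and idempotent, so the
    # API server reassigns them right away.
    return "mark_failed_then_reassign" if agent_kind == "worker" else "reassign_immediately"
```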
Dag scheduling and retention during agent disconnection
The Airflow scheduler retains the most recent set of dags parsed and sent by the dag processor agent. If the dag processor agent or any Remote Execution Agent disconnects or fails, the scheduler continues to use these previously parsed dags and keeps creating dag runs on schedule or in response to events, such as dataset updates, for all retained dags. However:
- New or updated dags are not detected until a healthy dag processor agent reconnects and provides an updated set of dags.
- All tasks and dag runs remain pending until a healthy Remote Execution Agent (worker or triggerer) is available to execute them.
If no healthy Remote Execution Agents are connected, the scheduler continues to create dag runs for known dags, but their tasks remain in the queued state and do not execute until an agent becomes available.
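As a rough, self-contained sketch (hypothetical names, not the scheduler's actual code path), the behavior during an agent outage looks like this:

```python
from datetime import datetime, timedelta

def scheduler_tick(cached_dags: dict, queued: list,
                   agents_healthy: bool, now: datetime) -> None:
    """Hypothetical sketch: scheduling continues from the cached dag set
    even while no healthy Remote Execution Agent is connected."""
    for dag_id, next_run_at in list(cached_dags.items()):
        if now >= next_run_at:
            queued.append(dag_id)  # the dag run is still created on schedule
            cached_dags[dag_id] = now + timedelta(hours=1)  # assumed hourly schedule
    if agents_healthy:
        queued.clear()  # stand-in for dispatching queued tasks to agents
    # Otherwise tasks accumulate in the queued state until an agent
    # reconnects, or until the queued-task timeout described next.
```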
If a task stays in the queued state longer than the queued-task timeout, Airflow marks it as failed. The timeout defaults to 600 seconds and can be overridden with the AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT environment variable on your Astro Deployment.
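For illustration, the snippet below reads the timeout the way any process can, assuming only that the variable overrides Airflow's `[scheduler] task_queued_timeout` setting, which defaults to 600 seconds. On Astro, you would typically set it as a Deployment environment variable.

```python
import os

# AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT overrides [scheduler] task_queued_timeout;
# Airflow falls back to 600 seconds when the variable is unset.
timeout_s = int(os.environ.get("AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT", "600"))
print(f"Tasks queued longer than {timeout_s}s will be marked as failed.")
```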
API server failure
If an agent’s heartbeats can’t reach the API server, the agent assumes that the API server and other agents remain healthy. In this case:
- A worker continues running any tasks that the API server already marked as running, but it doesn't start new tasks until it reconnects with the API server. This prevents two agents from running the same task.
- A triggerer stops processing tasks entirely until it restores connectivity. Because triggerer workloads are designed to be reassigned immediately on disconnect, trigger execution stops during the partition.
This behavior preserves task safety and prevents duplication for both workers and triggerers, even during partial failures or network partitions.
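The contrast between the two agent kinds can be summarized in one hypothetical function; the name and return shape are illustrative only, not an Astro API.

```python
def on_api_server_unreachable(agent_kind: str, confirmed_running: set) -> dict:
    """Illustrative only: each agent kind's behavior when heartbeats fail."""
    if agent_kind == "worker":
        # Keep API-server-confirmed tasks running; refuse to start new ones.
        return {"keep_running": confirmed_running, "start_new_tasks": False}
    # Triggerer: stop entirely. Its tasks are reassigned as soon as it
    # disconnects, so continuing locally would risk duplicate execution.
    return {"keep_running": set(), "start_new_tasks": False}
```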