April 3, 2019

7 Common Errors to Check When Debugging Airflow DAGs

Paola Peraza Calderon Co-Founder Astronomer
Ben Gregory Astronomer
Jake Witz Senior Technical Writer Astronomer

Apache Airflow® is the industry standard for workflow orchestration. It’s an incredibly flexible tool that powers mission-critical projects, from machine learning model training to traditional ETL at scale, for startups and Fortune 50 teams alike.

Airflow’s breadth and extensibility, however, can make it challenging to adopt — especially for those looking for guidance beyond day-one operations. In an effort to provide best practices and expand on existing resources, our team at Astronomer has collected some of the most common issues we see Airflow users face.

Whether you’re new to Airflow or an experienced user, check out this list of common errors and some corresponding fixes to consider.

Note: Following the Airflow 2.0 release in December of 2020, the open-source project has addressed a significant number of pain points commonly reported by users running previous versions. We strongly encourage your team to upgrade to Airflow 2.x.

If your team is running Airflow 1 and would like help establishing a migration path, reach out to us.

1. Your DAG Isn’t Running at the Expected Time

You wrote a new DAG that needs to run every hour and you’re ready to turn it on. You set an hourly interval beginning today at 2pm, setting a reminder to check back in a couple of hours. You hop on at 3:30pm to find that your DAG did in fact run, but your logs indicate that there was only one recorded execution at 2pm. Huh — what happened to the 3pm run?

Before you jump into debugging mode (you wouldn’t be the first), rest assured that this is expected behavior. The functionality of the Airflow Scheduler can be counterintuitive, but you’ll get the hang of it.

The two most important things to keep in mind about scheduling are:

By design, an Airflow DAG will run at the end of its schedule_interval.
Airflow operates in UTC by default.

Airflow's Schedule Interval

As stated above, an Airflow DAG will execute at the completion of its schedule_interval, which means one schedule_interval AFTER the start date. An hourly DAG, for example, will execute its 2:00 PM run when the clock strikes 3:00 PM. This happens because Airflow can’t ensure that all of the data from 2:00 PM - 3:00 PM is present until the end of that hourly interval.
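
To make the end-of-interval rule concrete, here is a minimal plain-Python sketch that does not use Airflow itself. The function name triggered_runs is hypothetical. It lists which hourly runs have been triggered by a given wall-clock time, assuming a run is triggered only once its interval has fully elapsed.

```python
from datetime import datetime, timedelta


def triggered_runs(start_date, interval, now):
    """Return the logical dates of the runs triggered by `now`.

    A run labelled D covers [D, D + interval) and is triggered only
    once that whole interval has elapsed.
    """
    runs = []
    logical_date = start_date
    while logical_date + interval <= now:
        runs.append(logical_date)
        logical_date += interval
    return runs


# Hourly DAG starting at 2:00 PM, checked at 3:30 PM.
start = datetime(2019, 4, 3, 14, 0)
runs = triggered_runs(start, timedelta(hours=1), datetime(2019, 4, 3, 15, 30))
print(runs)  # only the 2:00 PM run; the 3:00 PM run waits until 4:00 PM
```

At 3:30 PM the only run on record is the one labelled 2:00 PM, which matches the scenario described at the start of this section.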

This quirk is specific to Apache Airflow®, and it’s important to remember, especially if you’re using default variables and macros. Thankfully, Airflow 2.2+ simplifies DAG scheduling with the introduction of timetables.

Use Timetables for Simpler Scheduling

There are some data engineering use cases that are difficult or even impossible to address with Airflow’s original scheduling method. Scheduling DAGs to skip holidays, run only at certain times, or otherwise run on varying intervals can cause major headaches if you’re relying solely on cron jobs or timedeltas.

This is why Airflow 2.2 introduced timetables as the new default scheduling method. Essentially, timetable is a DAG-level parameter that you can set to an instance of a Python class (a Timetable subclass) that defines your execution schedule.

A timetable is significantly more customizable than a cron job or timedelta. You can program varying schedules, conditional logic, and more, directly within your DAG schedule. And because timetables are imported as Airflow plugins, you can use community-developed timetables to quickly — and literally — get your DAG up to speed.

We recommend using timetables as your de facto scheduling mechanism in Airflow 2.2+. You might be creating timetables without even knowing it: if you define a schedule_interval, Airflow 2.2+ will convert it to a timetable behind the scenes.
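
As an illustration of the kind of logic a custom timetable can encode, here is a plain-Python sketch that picks the next run date while skipping weekends and holidays. It deliberately does not use Airflow's Timetable base class or plugin registration. The HOLIDAYS set and the next_run_date function are hypothetical names chosen for this example.

```python
from datetime import date, timedelta

# Hypothetical holiday calendar, for illustration only.
HOLIDAYS = {date(2019, 7, 4)}


def next_run_date(after, holidays=HOLIDAYS):
    """Return the first weekday after `after` that is not a holiday."""
    candidate = after + timedelta(days=1)
    # weekday() returns 5 for Saturday and 6 for Sunday.
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


print(next_run_date(date(2019, 7, 3)))  # skips the Thursday holiday
print(next_run_date(date(2019, 7, 5)))  # skips the weekend
```

A schedule like this is awkward to express as a single cron expression, which is the gap timetables are meant to fill.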

Airflow Time Zones

Airflow stores datetime information in UTC internally and in the database. This behavior is shared by many databases and APIs, but it’s worth clarifying.

You should not expect your DAG executions to correspond to your local timezone. If you’re based in US Pacific Time, a DAG run of 19:00 UTC will correspond to 12:00 local time during daylight saving time (11:00 during standard time).
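
If you want to double-check a conversion, the sketch below translates a UTC run time into US Pacific time using only the standard library. It assumes fixed UTC offsets (-7 hours for daylight time, -8 for standard time) rather than a full time zone database, and utc_to_local is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used for illustration; a real time zone database
# would switch between these two automatically.
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def utc_to_local(hour_utc, local_tz, day=(2019, 4, 3)):
    """Convert a UTC run time on `day` to the given local time zone."""
    run_utc = datetime(*day, hour_utc, 0, tzinfo=timezone.utc)
    return run_utc.astimezone(local_tz)


print(utc_to_local(19, PDT).strftime("%H:%M %Z"))
print(utc_to_local(19, PST).strftime("%H:%M %Z"))
```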

In recent releases, the community has added more time zone-aware features to the Airflow UI. For more information, refer to the Airflow documentation.

2. One of Your DAGs Isn’t Running

If workflows on your Deployment are generally running smoothly but you find that one specific DAG isn’t scheduling tasks or running at all, it might have something to do with how you set it to schedule.

Make sure you don't have `datetime.now()` as your `start_date`

It’s intuitive to think that if you tell your DAG to start “now” that it’ll execute immediately. But that’s not how Airflow reads datetime.now().

For a DAG to be executed, the start_date must be a time in the past; otherwise, Airflow will assume that it’s not yet ready to execute. When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it’s not ready to run. Because Airflow re-evaluates the file continuously, the start_date keeps moving forward and the first schedule interval never completes.

To properly trigger your DAG to run, make sure to insert a fixed time in the past and set catchup=False if you don’t want to perform a backfill.
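
The effect of a moving start_date can be sketched in plain Python, without Airflow. The helper first_run_is_due is a hypothetical name. It applies the rule that a first run becomes due only once start_date plus one interval is in the past.

```python
from datetime import datetime, timedelta


def first_run_is_due(start_date, interval, now):
    """A first run is due once one full interval has passed since start_date."""
    return start_date + interval <= now


interval = timedelta(hours=1)
now = datetime(2019, 4, 3, 15, 30)

# start_date=datetime.now(): every parse sees start_date == now.
moving = first_run_is_due(start_date=now, interval=interval, now=now)

# A fixed start_date in the past.
fixed = first_run_is_due(start_date=datetime(2019, 4, 1), interval=interval, now=now)

print(moving, fixed)  # False True
```

In a DAG file, the fixed variant corresponds to passing something like start_date=datetime(2019, 4, 1) together with catchup=False to the DAG.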

Note: You can manually trigger a DAG run via Airflow’s UI directly on your dashboard (it looks like a “Play” button). A manual trigger executes immediately and will not interrupt regular scheduling, though it will be limited by any concurrency configurations you have at the deployment level, DAG level, or task level. When you look at corresponding logs, the run_id will show manual__ instead of scheduled__.

3. You’re Seeing a 503 Error on Your Deployment

If your Airflow UI is entirely inaccessible via web browser, you likely have a Webserver issue.

If you’ve already refreshed the page once or twice and continue to see a 503 error, read below for some Webserver-related guidelines.

Your Webserver Might Be Crashing

A 503 error might indicate an issue with your Deployment’s Webserver, which is the Airflow component responsible for rendering task state and task execution logs in the Airflow UI. If it’s underpowered or otherwise experiencing an issue, you can expect it to affect UI loading time or web browser accessibility.

In our experience, a 503 often indicates that your Webserver is crashing. If you push up a deploy and your Webserver takes longer than a few seconds to start, it might hit a timeout period (10 secs by default) that “crashes” the Webserver before it has time to spin up. That triggers a retry, which crashes again, and so on and so forth.

If your Deployment is in this state, your Webserver might be hitting a memory limit when loading your DAGs even as your Scheduler and Worker(s) continue to schedule and execute tasks.

Increase Webserver Resources

If your Webserver is hitting the timeout limit, a bump in Webserver resources usually does the trick.

If you’re using Astronomer, we generally recommend running the Webserver with a minimum of 5 AUs (Astronomer Units), which is equivalent to 0.5 CPUs and 1.88 GiB of memory. Even if you’re not running anything particularly heavy, underprovisioning your Webserver will likely return some funky behavior.

Increase the Webserver Timeout Period

If bumping Webserver resources doesn’t seem to have an effect, you might want to try increasing web_server_master_timeout or web_server_worker_timeout.

Raising those values will tell your Airflow Webserver to wait a bit longer to load before it hits you with a 503 (a timeout). You might still experience slow loading times if your Webserver is underpowered, but you’ll likely avoid hitting a 503.

Avoid Making Requests Outside of an Operator

If you’re making API calls, JSON requests, or database requests outside of an Airflow operator at a high frequency, your Webserver is much more likely to time out.

When Airflow interprets a file to look for any valid DAGs, it first runs all code at the top level (i.e. outside of operators). Even if the operator itself only gets executed at execution time, everything outside of an operator is called every heartbeat, which can be very taxing on performance.

We’d recommend taking the logic you currently have running outside of an operator and moving it inside of a PythonOperator if possible.
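
The difference can be simulated in plain Python, without Airflow. In this sketch, fetch_rows is a hypothetical stand-in for an API or database request, and parse imitates Airflow re-evaluating a DAG file. Each "DAG file" is modeled as a function whose body plays the role of top-level code.

```python
calls = {"api": 0}


def fetch_rows():
    """Hypothetical stand-in for an API or database request."""
    calls["api"] += 1
    return ["row_1", "row_2"]


def dag_file_with_top_level_request():
    rows = fetch_rows()  # top-level code: runs on every parse

    def task():
        return len(rows)

    return task


def dag_file_with_request_in_task():
    def task():
        return len(fetch_rows())  # runs only when the task executes

    return task


def parse(dag_file, times):
    """Simulate the DAG file being re-parsed `times` times."""
    task = None
    for _ in range(times):
        task = dag_file()
    return task


parse(dag_file_with_top_level_request, 10)
print(calls["api"])  # 10 requests, and no task has run yet
```

With the request moved inside the task callable, ten parses trigger no requests at all, and a single request happens when the task actually runs.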

4. Sensor Tasks are Failing Intermittently

If your sensor tasks are failing, it might not be a problem with your task. It might be a problem with the sensor itself.

Be Careful When Using Sensors

By default, Airflow sensors run continuously and occupy a task slot in perpetuity until they find what they’re looking for, often causing concurrency issues. We recommend avoiding them unless you rarely have more than a few tasks running concurrently, or you know the sensor won’t take long to exit.

For example, if a worker can only run X number of tasks simultaneously and you have three sensors running, then you’ll only be able to run X-3 tasks at any given point. Keep in mind that if you’re running a sensor at all times, that limits how and when a scheduler restart can occur (or else it will fail the sensor).

Depending on your use case, we’d suggest considering the following:

Create a DAG that runs at a more frequent interval.
Trigger a Lambda function.
Set mode='reschedule'. If you have more sensors than worker slots, the sensor will now get thrown into an up_for_reschedule state, which frees up its worker slot.
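
To get a feel for how much slot time the reschedule mode can save, here is a rough back-of-the-envelope sketch in plain Python. It assumes the sensor checks on a fixed grid starting at time zero, that each check takes poke_duration seconds, and that the condition becomes true after wait_seconds. The function slot_seconds and its parameters are hypothetical and only approximate real sensor behavior.

```python
import math


def slot_seconds(wait_seconds, poke_interval, poke_duration, mode):
    """Estimate how long a sensor holds a worker slot, in seconds."""
    # Checks happen at 0, poke_interval, 2 * poke_interval, ... until
    # the first check at or after wait_seconds succeeds.
    pokes = math.ceil(wait_seconds / poke_interval) + 1
    if mode == "poke":
        # The slot is held from the first check until the last one ends.
        return (pokes - 1) * poke_interval + poke_duration
    if mode == "reschedule":
        # The slot is held only while a check is actually running.
        return pokes * poke_duration
    raise ValueError(f"unknown mode: {mode}")


# One-hour wait, checking every 5 minutes, 2 seconds per check.
print(slot_seconds(3600, 300, 2, "poke"))        # 3602
print(slot_seconds(3600, 300, 2, "reschedule"))  # 26
```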

Replace Sensors with Deferrable Operators

If you’re running Airflow 2.2+, we recommend almost always using Deferrable Operators instead of sensors. These operators never use a worker slot when waiting for a condition to be met. Instead of using workers, deferrable operators poll for a status using a new Airflow component called the triggerer. Compared to using sensors, tasks with deferrable operators use a fraction of the resources to poll for a status.

As the Airflow community continues to adopt deferrable operators, the number of available deferrable operators is quickly growing. For more information on how to use deferrable operators, see our Deferrable Operators Guide.
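
As a conceptual analogy only, the sketch below uses Python's asyncio to show how one process can wait on many conditions at once without dedicating a worker to each. It is not Airflow's triggerer API. The names wait_for_condition, run_triggers, and demo are hypothetical, and asyncio.sleep stands in for polling an external system.

```python
import asyncio
import time


async def wait_for_condition(name, delay):
    """Stand-in for an asynchronous wait on an external system."""
    await asyncio.sleep(delay)
    return name


async def run_triggers(count, delay):
    """Wait on `count` conditions concurrently in a single event loop."""
    waits = (wait_for_condition(f"task_{i}", delay) for i in range(count))
    return await asyncio.gather(*waits)


def demo(count=100, delay=0.05):
    """Run all waits and report how many finished and how long it took."""
    started = time.monotonic()
    finished = asyncio.run(run_triggers(count, delay))
    return len(finished), time.monotonic() - started


count, elapsed = demo()
print(f"{count} waits finished in {elapsed:.2f}s")
```

All of the waits finish in roughly the time of a single wait, because none of them blocks the others.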

5. Tasks are Executing Slowly

If your tasks are stuck in a bottleneck, we’d recommend taking a closer look at:

Environment variables and concurrency configurations
Worker and Scheduler resources

Update Concurrency Settings

The potential root cause for a bottleneck is specific to your setup. For example, are you running many DAGs at once, or one DAG with hundreds of concurrent tasks?

Regardless of your use case, configuring a few settings as parameters or environment variables can help improve performance. Use this section to learn what those variables are and how to set them.

Most users can set parameters in Airflow’s airflow.cfg file. If you’re using Astro, you can also set environment variables via the Astro UI or your project’s Dockerfile. We’ve formatted these settings as parameters for readability. The corresponding environment variables are formatted as AIRFLOW__SECTION__PARAMETER_NAME, for example AIRFLOW__CORE__PARALLELISM for parallelism in the [core] section. For all default values, refer to the Airflow configuration reference.
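
The naming convention is mechanical, so it can be captured in a small helper. This is a minimal sketch. The function airflow_env_var is a hypothetical name, and the section for each setting is the one it lives under in airflow.cfg.

```python
def airflow_env_var(section, key):
    """Build the environment variable name for an airflow.cfg setting."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


print(airflow_env_var("core", "parallelism"))
print(airflow_env_var("celery", "worker_concurrency"))
print(airflow_env_var("webserver", "web_server_master_timeout"))
```

Note the double underscores on both sides of the section name.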

Parallelism

parallelism determines how many task instances can run in parallel across all DAGs given your environment resources. Think of this as “maximum active tasks anywhere.” To increase the limit of tasks set to run in parallel, set this value higher than its default of 32.

DAG Concurrency

max_active_tasks_per_dag (formerly dag_concurrency) determines how many task instances your Scheduler is able to schedule at once per DAG. Think of this as “maximum tasks that can be scheduled at once, per DAG.” The default is 16, but you should increase this if you’re not noticing an improvement in performance after provisioning more resources to Airflow.

Max Active Runs per DAG

max_active_runs_per_dag determines the maximum number of active DAG runs per DAG. This setting is most relevant when backfilling, as all of your DAGs are immediately vying for a limited number of resources. The default value is 16.

Pro-tip: If you consider setting DAG or deployment-level concurrency configurations to a low number to protect against API rate limits, we’d recommend instead using “pools” - they’ll allow you to limit parallelism at the task level and won’t limit scheduling or execution outside of the tasks that need it.

Worker Concurrency

worker_concurrency determines how many tasks each Celery Worker can run at any given time. Think of this as "how many tasks each of my workers can take on at any given time." You can set it with an environment variable, for example AIRFLOW__CELERY__WORKER_CONCURRENCY=9. By default, each Celery Worker runs a maximum of 16 tasks concurrently.

It's important to note that this number will naturally be limited by max_active_tasks_per_dag (formerly dag_concurrency). If you have 1 Worker and want it to match your Deployment's capacity, worker_concurrency should be equal to parallelism.
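
To see how these settings interact, here is a simplified sketch of the limits described above. It treats the effective ceiling as the smallest applicable limit and ignores pools, max_active_runs_per_dag, and available resources. The function max_concurrent_tasks is a hypothetical name, and its defaults mirror the default values quoted in this section.

```python
def max_concurrent_tasks(parallelism=32, worker_concurrency=16,
                         workers=1, max_active_tasks_per_dag=16):
    """Return (deployment-wide limit, per-DAG limit) on running tasks."""
    # Workers cannot run more tasks than they have slots for, and the
    # deployment cannot exceed parallelism.
    deployment_limit = min(parallelism, workers * worker_concurrency)
    # A single DAG is additionally capped by max_active_tasks_per_dag.
    per_dag_limit = min(deployment_limit, max_active_tasks_per_dag)
    return deployment_limit, per_dag_limit


print(max_concurrent_tasks())                       # (16, 16)
print(max_concurrent_tasks(workers=2))              # (32, 16)
print(max_concurrent_tasks(worker_concurrency=32))  # (32, 16)
```

With the defaults and a single worker, the worker is the bottleneck. Adding a second worker, or raising worker_concurrency to match parallelism, lifts the deployment-wide limit to 32.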

Try Scaling Up Your Scheduler or Adding a Worker

If tasks are getting bottlenecked and your concurrency configurations are already optimized, the issue might be that your Scheduler is underpowered or that your Deployment could use another worker. If you’re running on Astro, we generally recommend 5 AU (0.5 CPUs and 1.88 GiB of memory) as the default minimum for the Scheduler and 10 AU (1 CPU and 3.76 GiB of memory) for workers.

Whether or not you scale your current resources or add an extra Celery Worker depends on your use case, but we generally recommend the following:

If you’re running a relatively high number of light tasks across DAGs and at a relatively high frequency, you’re likely better off having 2 or 3 “light” workers to spread out the work.
If you’re running fewer but heavier tasks at a lower frequency, you’re likely better off with a single but “heavier” worker that can more efficiently execute those tasks.

For more information on the differences between Executors, we recommend reading Airflow Executors: Explained.

6. You’re Missing Task Logs

Generally speaking, logs fail to show up because of a process that died on your Scheduler or one or more of your Celery Workers.

If you’re missing logs, you might see something like this under “Log by attempts” in the Airflow UI:

Failed to fetch log file from worker. Invalid URL 'http://:8793/log/staging_to_presentation_pipeline_v5/redshift_to_s3_Order_Payment_17461/2019-01-11T00:00:00+00:00/1.log': No host supplied

A few things to try:

Clear the task instance via the Airflow UI to see if logs show up. This will prompt your task to run again.
Change the log_fetch_timeout_sec to something greater than 5 seconds. Defined in seconds, this setting determines the amount of time that the Webserver will wait for an initial handshake while fetching logs from other workers.
Give your workers a little more power. If you’re using Astro, you can do this in the Configure tab of the Astro UI.
Are you looking for a log from over 15 days ago? If you’re using Astro, the log retention period is an Environment Variable we have hard-coded on our platform. For now, you won’t have access to logs over 15 days old.
Exec into one of your Celery workers to look for the log files. If you’re running Airflow on Kubernetes, you can open a shell in the worker with kubectl exec -it {worker_name} bash; on Docker, use the equivalent docker exec command. Log files should be in ~/logs. From there, they’ll be split up by DAG/TASK/RUN.
Try checking your Scheduler and Webserver logs to see if there are any errors that might tell you why your task logs are missing.
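
When reading an error like the one above, it can help to pull the log URL apart. The sketch below uses only the standard library to show that the URL has a port but no host, and to recover the DAG, task, and run it refers to. The function describe_log_url is a hypothetical name, and the path layout it expects is inferred from the example message rather than from a documented contract.

```python
from urllib.parse import urlparse


def describe_log_url(url):
    """Split a worker log URL into host, port, DAG, task, run, and file."""
    parsed = urlparse(url)
    # Layout seen in the example: /log/<dag_id>/<task_id>/<run date>/<try>.log
    _, dag_id, task_id, run_date, log_file = parsed.path.strip("/").split("/")
    return {
        "host": parsed.hostname,  # None when no host was supplied
        "port": parsed.port,
        "dag_id": dag_id,
        "task_id": task_id,
        "run_date": run_date,
        "log_file": log_file,
    }


url = ("http://:8793/log/staging_to_presentation_pipeline_v5/"
       "redshift_to_s3_Order_Payment_17461/2019-01-11T00:00:00+00:00/1.log")
info = describe_log_url(url)
print(info["host"], info["port"], info["dag_id"])
```

A missing host means there is no worker address to fetch the log from, which is consistent with a worker process having died.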

7. Tasks are Slow to Schedule and/or Have Stopped Being Scheduled Altogether

If your tasks are slower than usual to get scheduled, you might need to update Scheduler settings to increase performance and optimize your environment.

Just like with concurrency settings, users can set parameters in Airflow’s airflow.cfg file. If you’re using Astro, you can also set environment variables via the Astro UI or your project’s Dockerfile. We’ve formatted these settings as parameters for readability. The settings below live in the [scheduler] section, so their environment variables are formatted as AIRFLOW__SCHEDULER__PARAMETER_NAME. For all default values, refer to the Airflow configuration reference.

min_file_process_interval: The Scheduler parses your DAG files every min_file_process_interval number of seconds. Airflow starts using your updated DAG code only after this interval ends. Because the Scheduler will parse your DAGs more often, setting this value to a low number will increase Scheduler CPU usage. If you have dynamic DAGs or otherwise complex code, you might want to increase this value to avoid poor Scheduler performance. By default, it’s set to 30 seconds.
dag_dir_list_interval: This setting determines how often Airflow should scan the DAGs directory in seconds. A lower value here means that new DAGs will be processed faster, but this comes at the cost of CPU usage. By default, this is set to 300 seconds (5 minutes). You might want to check how long it takes to parse your DAGs (dag_processing.total_parse_time) to know what value to choose for dag_dir_list_interval. If your dag_dir_list_interval is less than this value, then you might see performance issues.
parsing_processes: (formerly max_threads) The Scheduler can run multiple processes in parallel to parse DAGs, and this setting determines how many of those processes can run in parallel. We recommend setting this to 2x your available vCPUs. Increasing this value can help to serialize DAGs if you have a large number of them. By default, this is set to 2.
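
The two rules of thumb above (2x vCPUs for parsing_processes, and a dag_dir_list_interval no shorter than the total parse time) can be written down as a quick check. This is a minimal sketch. The function suggest_scheduler_settings is a hypothetical name, and total_parse_time is whatever you read from the dag_processing.total_parse_time metric, in seconds.

```python
import os


def suggest_scheduler_settings(total_parse_time, dag_dir_list_interval=300,
                               vcpus=None):
    """Apply the rules of thumb from this section to your own numbers."""
    # Fall back to the local CPU count if no vCPU count is given.
    vcpus = vcpus or os.cpu_count() or 1
    return {
        "parsing_processes": 2 * vcpus,
        # Scanning the DAGs folder more often than a full parse takes
        # is a sign the interval may be too low.
        "dag_dir_list_interval_ok": dag_dir_list_interval >= total_parse_time,
    }


print(suggest_scheduler_settings(total_parse_time=45.0, vcpus=4))
```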

Note: Scheduler performance was a critical part of the Airflow 2 release and has seen significant improvements since December of 2020. If you are experiencing Scheduler issues, we strongly recommend upgrading to Airflow 2.x. For more information, read our blog post: The Airflow 2.0 Scheduler.

Was this helpful?

This list was curated by our team and based on our experience helping Astro customers, but we want to hear from you.

If you have follow up questions or are looking for Airflow support from our team, reach out to us.