Webinar Recap

Scheduling In Airflow

1. Scheduling basics before and after the Airflow 2.2 release

Airflow 2.2 is out, and it brings a new scheduling feature that is a real game changer.

a. Before:

Most important concepts:

  • start_date ⮕ Date at which tasks start being scheduled

    • Must be defined within your DAG
    • Always specify a static, non-dynamic start date
  • schedule_interval ⮕ Interval of time, counted from min(start_date), at which the DAG is triggered

    • Can be monthly, daily, weekly - defines the frequency
    • Can be defined with a cron expression or a timedelta object
  • end_date ⮕ Date at which your DAG stops being scheduled

The DAG [X] starts being scheduled from the start_date and is triggered after every schedule_interval.

Important to remember! The DAG gets triggered at start_date PLUS schedule_interval. Yet the execution_date of that run was the start of the interval, not the moment the DAG actually ran. Confusing!

Execution flow assuming a start_date at 10:00 AM and a schedule interval of 10 minutes:

trigger-dags-any-schedule-image1
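
The flow above can be reproduced with plain datetime arithmetic. The sketch below is not Airflow code: the helper name legacy_runs and the calendar date are made up for illustration. For each run it lists the old execution_date and the moment the run would actually be triggered.

```python
from datetime import datetime, timedelta


def legacy_runs(start_date, schedule_interval, count):
    """Return (execution_date, triggered_at) pairs for the first `count` runs."""
    runs = []
    for i in range(count):
        # execution_date marks the START of the interval being processed...
        execution_date = start_date + i * schedule_interval
        # ...but the run is only triggered once that interval has ENDED.
        triggered_at = execution_date + schedule_interval
        runs.append((execution_date, triggered_at))
    return runs


# Assumed example values: start_date at 10:00 AM, 10-minute interval.
runs = legacy_runs(datetime(2021, 1, 1, 10, 0), timedelta(minutes=10), 3)
for execution_date, triggered_at in runs:
    print(f"execution_date={execution_date:%H:%M}  triggered at {triggered_at:%H:%M}")
```

Note that the first run carries an execution_date of 10:00 but does not start until 10:10.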

b. After: New concepts!

data_interval_start = logical_date = execution_date

  • data_interval_start ⮕ Start date of the data interval = the actual execution date
  • data_interval_end ⮕ End date of the data interval
  • logical_date ⮕ New name of the old execution_date

Execution flow:

trigger-dags-any-schedule-image2

2. Off-topic definitions

{{ data_interval_start }} ⮕ Start of the data interval (pendulum.Pendulum)
{{ data_interval_end }} ⮕ End of the data interval (pendulum.Pendulum)
{{ ds }} ⮕ Start of the data interval as YYYY-MM-DD. Same as {{ data_interval_start | ds }}
{{ prev_data_interval_start_success }} ⮕ Start of the data interval from prior successful DAG run (pendulum.Pendulum or None).
{{ prev_data_interval_end_success }} ⮕ End of the data interval from prior successful DAG run (pendulum.Pendulum or None).
{{ prev_start_date_success }} ⮕ Start date from prior successful DAG run (if available) (pendulum.Pendulum or None).
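
To make these variables concrete, here is a small sketch in plain Python (no Airflow import) that computes the values a scheduled run would expose. The function name template_values and the example dates are hypothetical. The only rule it relies on is the one from section 1: logical_date equals the start of the data interval.

```python
from datetime import datetime, timedelta, timezone


def template_values(data_interval_start, interval):
    """Mimic a few template variables for one scheduled run (illustration only)."""
    return {
        "data_interval_start": data_interval_start,
        "data_interval_end": data_interval_start + interval,
        "logical_date": data_interval_start,  # new name of the old execution_date
        "ds": data_interval_start.strftime("%Y-%m-%d"),
    }


# Assumed example: a daily run covering 11 October 2021 (UTC).
values = template_values(datetime(2021, 10, 11, tzinfo=timezone.utc), timedelta(days=1))
print(values["ds"])
```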

IMPORTANT! By default, all dates are converted to UTC. Stick with UTC and don’t mix time zones, or you will only have a bad time :)

3. Do you know the difference between:

Defining the start date in default args and...

trigger-dags-any-schedule-image10

...defining the start date in the DAG object?

trigger-dags-any-schedule-image7

Have a look!

This is possible - each process has a different start date:

trigger-dags-any-schedule-image8

What will be the date of the first DAG Run?

The first DagRun to be created will be based on the min(start_date) for all your tasks - in this case, the first task.

Takeaway: Always define the start date within your DAG! There is no point in using start_date in default_args. Set start_date at the task level only if you want a different start date for that particular task.

If start_date is specified at both the DAG level and the task level, the later of the two is used.
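
The two rules above can be written down as a tiny sketch. This is plain Python that restates the rules from this recap, not Airflow's internal code. The function names and dates are invented for illustration.

```python
from datetime import datetime


def effective_start(dag_start, task_start=None):
    """A task's start date: the later of the DAG-level and task-level values."""
    if task_start is None:
        return dag_start
    return max(dag_start, task_start)


def first_run_basis(task_starts):
    """The first DAG run is based on the earliest start date among all tasks."""
    return min(task_starts)


dag_start = datetime(2021, 1, 1)
task_a = effective_start(dag_start)                        # inherits the DAG value
task_b = effective_start(dag_start, datetime(2021, 1, 5))  # task-level override
print(first_run_basis([task_a, task_b]))
```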

4. What is the difference between a cron expression and a timedelta object?

Schedule interval can be defined by:

A Cron expression or…

trigger-dags-any-schedule-image5

...a Timedelta object:

trigger-dags-any-schedule-image12

What is the difference?

Cron every three days (stateless):

trigger-dags-any-schedule-image11

Timedelta every three days (stateful):

trigger-dags-any-schedule-image9

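A cron expression is stateless: it only matches calendar dates, so an every-three-days pattern starts over at each month boundary. A timedelta is stateful: each run is scheduled relative to the previous one, so the three-day interval is always kept.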
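
The difference shows up at a month boundary. The sketch below uses only the standard library. It assumes a cron expression whose day-of-month field is */3 (for example 0 0 */3 * *), which matches days 1, 4, 7 and so on, and compares it with a three-day timedelta. The function names and the start date are illustrative.

```python
from datetime import date, timedelta


def cron_every_third_day(start, count):
    """Dates matched by a day-of-month field of */3 (days 1, 4, 7, ...)."""
    matches, day = [], start
    while len(matches) < count:
        if (day.day - 1) % 3 == 0:  # the pattern restarts at day 1 of each month
            matches.append(day)
        day += timedelta(days=1)
    return matches


def timedelta_every_third_day(start, count):
    """Dates spaced exactly three days apart, counted from `start`."""
    return [start + timedelta(days=3 * i) for i in range(count)]


start = date(2021, 10, 28)
cron_runs = cron_every_third_day(start, 4)
delta_runs = timedelta_every_third_day(start, 4)
print("cron:     ", [d.isoformat() for d in cron_runs])
print("timedelta:", [d.isoformat() for d in delta_runs])
```

With cron, the run on 31 October is followed by one on 1 November, only a day later. With timedelta, the next run lands on 3 November.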

5. Daylight Saving Time

trigger-dags-any-schedule-image3

The main reason to be careful with time zones is that many countries use Daylight Saving Time (DST), where clocks are moved forward in spring and backward in autumn.

Time zone aware DAGs that use cron schedules respect daylight saving time.

Time zone aware DAGs that use timedelta or relativedelta schedules respect daylight saving time for the start date but do not adjust for daylight saving time when scheduling subsequent runs.

In short, if you set a local time zone: cron respects DST, timedelta doesn’t.
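
Here is a rough illustration of that behaviour using the standard library's zoneinfo module (Python 3.9+, with time zone data available on the system). It is not Airflow code. The time zone, the date and the function names are assumptions chosen so that the runs straddle the end of DST in Amsterdam on 31 October 2021.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("Europe/Amsterdam")  # assumed example time zone
first_run = datetime(2021, 10, 30, 9, 0, tzinfo=LOCAL_TZ)  # day before DST ends


def next_cron_like(run):
    """Cron-style: same wall-clock time on the next day."""
    return run + timedelta(days=1)


def next_timedelta_like(run):
    """Timedelta-style: exactly 24 elapsed hours later, computed in UTC."""
    return (run.astimezone(timezone.utc) + timedelta(days=1)).astimezone(LOCAL_TZ)


print("cron-like:     ", next_cron_like(first_run))
print("timedelta-like:", next_timedelta_like(first_run))
```

The cron-like run stays at 09:00 local time, while the timedelta-like run drifts to 08:00 local time after the clocks go back.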

6. What if you wanted to…

Schedule a DAG at different times on different days?

trigger-dags-any-schedule-image13

Schedule a DAG daily except for holidays?

trigger-dags-any-schedule-image6

Schedule a DAG at multiple times daily with uneven intervals (e.g. 1pm and 4:30pm)?

trigger-dags-any-schedule-image4

7. The answer is: The New Timetables!

All the scheduling flexibility and freedom you ever dreamed of.

Timetable steps:

  • 1st: Define your constraints
  • 2nd: Register your timetable as a plugin
  • 3rd: Restart your web server and scheduler to implement modifications
  • 4th: Implement your timetable
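
As a sketch of the first step, here is the constraint logic for the uneven-interval case mentioned earlier (1pm and 4:30pm every day). This is plain Python, not the Airflow Timetable API: in a real setup, logic like this would live inside a custom timetable class registered as a plugin. The names RUN_TIMES and next_run_after are hypothetical, and naive datetimes are used for brevity.

```python
from datetime import datetime, time, timedelta

RUN_TIMES = [time(13, 0), time(16, 30)]  # 1pm and 4:30pm, in ascending order


def next_run_after(moment):
    """Return the first scheduled run strictly after `moment`."""
    day = moment.date()
    while True:
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time)
            if candidate > moment:
                return candidate
        day += timedelta(days=1)  # nothing left today, try tomorrow


print(next_run_after(datetime(2021, 10, 11, 9, 0)))   # morning: next is 1pm
print(next_run_after(datetime(2021, 10, 11, 17, 0)))  # evening: next is 1pm tomorrow
```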

Code examples used at the webinar: Customizing DAG Scheduling with Timetables
