Webinar Recap

Airflow 101: Essential Tips For Beginners

By Kenten Danas, Lead Developer Advocate at Astronomer


1. What is Apache Airflow?

Apache Airflow is one of the world’s most popular data orchestration tools — an open-source platform that lets you programmatically author, schedule, and monitor your data pipelines.

Apache Airflow was created by Maxime Beauchemin in late 2014, and brought into the Apache Software Foundation’s Incubator Program two years later. In 2019, Airflow was announced as a Top-Level Apache Project, and it is now considered the industry’s leading workflow orchestration solution.

Key benefits of Airflow:

  • Proven core functionality for data pipelining
  • An extensible framework
  • Scalability
  • A large, vibrant community

Apache Airflow Core principles

Airflow is built on a set of core principles — and written in a highly flexible language, Python — that allow for enterprise-ready flexibility and reliability. It is highly secure and was designed with scalability and extensibility in mind.

2. Airflow core components

The infrastructure consists of a few long-running components: the scheduler, which triggers scheduled work; the webserver, which serves the UI; the metadata database, which stores the state of DAGs and tasks; and the executor, which determines how and where tasks run.

[Image: Airflow's core components]

3. Airflow core concepts

DAGs

A DAG (Directed Acyclic Graph) is the structure of a data pipeline: a collection of tasks plus the dependencies that define the order in which they run. A DAG that, say, extracts, transforms, and loads data is essentially your pipeline expressed as code.

DAGs must flow in one direction: the graph of task dependencies can never contain a cycle, so a task can't depend, directly or indirectly, on itself.
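The "acyclic" requirement can be checked mechanically. As a rough illustration in plain Python (not Airflow itself), the standard library's graphlib module refuses to order a dependency graph as soon as a cycle appears:

```python
from graphlib import TopologicalSorter, CycleError

# extract -> transform -> load: a valid, one-directional DAG.
# Each key maps a task to the set of tasks it depends on.
valid = {"transform": {"extract"}, "load": {"transform"}}
print(list(TopologicalSorter(valid).static_order()))
# ['extract', 'transform', 'load']

# Adding extract -> load as a dependency creates a cycle,
# so this graph is no longer a DAG.
invalid = {"transform": {"extract"}, "load": {"transform"}, "extract": {"load"}}
try:
    list(TopologicalSorter(invalid).static_order())
except CycleError as err:
    print("cycle detected:", err.args[1])
```

Airflow performs an equivalent check when it parses your DAG files, which is why a looping dependency shows up as an import error rather than a running pipeline.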

[Image: an example DAG]

Each task in a DAG is defined by an operator, and tasks are connected by explicit upstream and downstream dependencies.

[Image: tasks and dependencies in a DAG]
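Put together, a minimal DAG file might look like the sketch below. The DAG id, schedule, and task logic are illustrative, and this assumes a recent Airflow 2.x installation (where the "schedule" argument is available):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _transform():
    print("transforming data")


with DAG(
    dag_id="example_etl",           # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Upstream/downstream dependencies: extract -> transform -> load
    extract >> transform >> load
```

The `>>` operator on the last line is how Airflow expresses that one task runs after another; chaining three tasks this way yields the one-directional flow described above.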

Tasks

[Image: tasks in a DAG]

A task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and then have upstream and downstream dependencies set between them to express the order in which they should run. Best practice: keep your tasks atomic by making sure they only do one thing.

A task instance is a specific run of a task for a given DAG run (and thus for a given data interval). A task instance also records which stage of the lifecycle the task is currently in, such as queued, running, or success. You will hear a lot about task instances (TIs) when working with Airflow.

Operators

Operators are the building blocks of Airflow. They determine what actually executes when your DAG runs. When you create an instance of an operator in a DAG and provide it with its required parameters, it becomes a task.

[Image: operator categories]

  • Action Operators execute pieces of code. For example, the PythonOperator runs a Python function, the BashOperator runs a bash command, etc.
  • Transfer Operators are more specialized, and designed to move data from one place to another.
  • Sensor Operators, frequently called “sensors,” are designed to wait for something to happen — for example, for a file to land on S3, or for another DAG to finish running.
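As an example of the sensor category, the Amazon provider's S3KeySensor waits for a key to appear in a bucket before downstream tasks start. A sketch, assuming the Amazon provider package is installed (bucket and key names here are made up):

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_name="my-bucket",         # hypothetical bucket
    bucket_key="data/{{ ds }}.csv",  # templated key: one file per data interval
    poke_interval=60,                # check every 60 seconds
    timeout=60 * 60,                 # fail if nothing arrives within an hour
)
```

Like any other task, a sensor is wired into the DAG with dependencies, so everything downstream of it simply waits until the condition is met.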

Providers

[Image: Airflow providers]

Airflow providers are Python packages that contain all of the relevant Airflow modules for interacting with external services. Airflow is designed to fit into any stack: you can use it to run your workloads in AWS, Snowflake, Databricks, or whatever else your team uses.

Most tools already have community-built Airflow modules, giving Airflow spectacular flexibility. Check out the Astronomer Registry to find available providers.
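Providers install like any other Python package, so adding support for a new service is typically a single pip command (the package names below are real examples; pick whichever match your stack):

```shell
# Each provider ships separately; install only what you need.
pip install apache-airflow-providers-amazon     # AWS hooks, operators, sensors
pip install apache-airflow-providers-snowflake  # Snowflake support
```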

The following diagram shows how these concepts work in practice. As you can see, by writing a single DAG file in Python using an existing provider package, you can begin to define complex relationships between data and actions.

[Diagram: a DAG combining operators from provider packages]

4. Best practices for beginners

  1. Design Idempotent DAGs
    DAG runs should produce the same result regardless of how many times they are run.
  2. Use Providers
    Don’t reinvent the wheel with the PythonOperator unless you need to; use provider packages for service-specific tasks. The Astronomer Registry is the best place to find them.
  3. Keep Tasks Atomic
    When designing your DAG, each task should do a single unit of work. Use dependencies and trigger rules to schedule as needed.
  4. Keep Clean DAG Files
    Define one DAG per .py file. Keep any code that isn’t part of the DAG definition (e.g., SQL, Python scripts) in an /include directory.
  5. Use Connections
    Use Airflow’s Connections feature to keep sensitive information out of your DAG files.
  6. Use Template Fields
    Airflow’s variables and macros can be used to update DAGs at runtime and keep them idempotent.
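Tips 1 and 6 reinforce each other: parameterize each run by its data interval (for example via the {{ ds }} template) and make writes overwrite rather than append, so reruns produce the same state. A plain-Python sketch of the idea (no Airflow required; all names here are illustrative):

```python
import datetime

results = {}

def load_partition(run_date: datetime.date, rows: list) -> None:
    # Idempotent: overwrite the partition keyed by the run's logical date
    # rather than appending, so re-running the same interval yields the
    # same final state.
    results[run_date.isoformat()] = rows

load_partition(datetime.date(2023, 1, 1), ["a", "b"])
load_partition(datetime.date(2023, 1, 1), ["a", "b"])  # rerun: no duplicates
print(results)  # {'2023-01-01': ['a', 'b']}
```

An append-based load, by contrast, would duplicate rows on every retry or backfill, which is exactly what idempotent DAG design avoids.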

For more, check out the written guide on DAG best practices and watch the webinar, including the Q&A session.

5. Demo

Watch the Demo to learn:

  • How to get Airflow running: the Astro CLI from Astronomer is one of the easiest ways to get started with Airflow.
  • The basics of the Airflow UI.
  • Error notifications in the UI.
  • An example of DAGs and task instances.
  • How to initialize a project in the terminal.
  • How to use providers.
  • How to define a DAG from scratch.
  • How to create task dependencies.
  • Kenten’s secret best practices.

You can find the code from the webinar in this GitHub repo.

Get Apache Airflow Certified

Join the thousands of data engineers who have earned the Astronomer Certification for Apache Airflow Fundamentals. The exam assesses your understanding of the basics of the Airflow architecture and your ability to create simple data pipelines for scheduling and monitoring tasks.

RELATED CONTENT:

  • Airflow Components - An Overview
  • 10 Best Practices for Airflow Users
  • The Airflow UI - Most Common Features

Keep Your Data Flowing with Astro

Get a demo that’s customized around your unique data orchestration workflows and pain points.

Get Started