Webinar Recap

Intro to Apache Airflow

Topics to be discussed:

  • What is Airflow
  • Core Components
  • Core Concepts
  • Flexibility of Pipelines as Code
  • Getting Airflow up and Running
  • Demo DAGs

What is Airflow?

Definition: Apache Airflow is a way to programmatically author, schedule and monitor data pipelines.

  • Airflow is the De-facto Standard for Data Orchestration
  • Born inside AirBnB, open-sourced, and graduated to a Top-Level Apache Software Foundation Project
  • Leveraged by 1M+ data engineers around the globe to programmatically author, schedule, and monitor data pipelines
  • Deployed by 1000s of companies as the unbiased data control plane, translating business rules to power their data processing fabric

Core Components

intro-airflow-1

Executors

  • Local- good for local and development environments
  • Celery- good for high volume of short tasks
  • Kubernetes- good for autoscaling and task-level configuration

Plus Sequential, but no parallelism with this one

intro-airflow-2

intro-airflow-3

Task

  • Instance of an Operator

Task Instance

  • Represents a specific run of a task: DAG + TASK + Point in time

Flexibility of Data Pipelines-as-Code

intro-airflow-4

intro-airflow-5

Getting Started with Apache Airflow

The easiest way to get started with providers and Apache Airflow 2.0 is by using the Astronomer CLI. To make it easy you can get up and running with Airflow by following our Quickstart Guide.

Join the 1000’s of other data engineers who have received the Astronomer Certification for Apache Airflow Fundamentals. This exam assesses an understanding of the basics of the Airflow architecture and the ability to create basic data pipelines for scheduling and monitoring tasks.

To access the demo DAGs used in the Intro to Airflow webinar, visit this Github repository