Version: Airflow 3.x

Apache Airflow® Quickstart - ETL

ETL: An introduction to modern, enhanced ETL development with Airflow.

Step 1: Clone the Astronomer Quickstart repository

  1. Create a new directory for your project and open it:

    mkdir airflow-quickstart-etl && cd airflow-quickstart-etl
  2. Clone the repository and open it:

    git clone -b etl-3 --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/etl

    Your directory should have the following structure:

    .
    ├── Dockerfile
    ├── README.md
    ├── dags
    │   └── example_etl_galaxies.py
    ├── include
    │   ├── astronomy.db
    │   ├── custom_functions
    │   │   └── galaxy_functions.py
    │   └── data
    │       └── galaxy_names.txt
    ├── packages.txt
    ├── requirements.txt
    ├── solutions
    │   └── example_etl_galaxies_solution.py
    └── tests
        └── dags
            └── test_dag_example.py

Step 2: Start up Airflow and explore the UI

  1. Start the project using the Astro CLI:

    astro dev start

    The CLI will let you know when all Airflow services are up and running.

  2. If it doesn't launch automatically, navigate your browser to localhost:8080 and sign in to the Airflow UI using username admin and password admin.

tip

At this time, Safari will not work properly with the UI. If Safari is your default browser, use Chrome to open Airflow 3.0.

  3. Explore the Home screen and DAGs page to get a sense of the metadata available about the DAG, its runs, and all task instances. For a deep dive into the UI's features, see An introduction to the Airflow UI.

    For example, the DAGs view will look like this screenshot:

    Airflow UI DAGs view

  4. Create a connection for the Postgres database that this quickstart sets up. To do this, go to Admin -> Connections, then click Add Connection. Use the following connection information:

  • Connection ID: postgres_default
  • Connection Type: Postgres
  • Host: host.docker.internal
  • Login: example
  • Password: example
  • Port: 5435
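
    Once the connection exists, tasks can reference it by its connection ID. As a rough illustration (not code from this project, and the task name is hypothetical), a task could use the postgres_default connection through the Postgres provider's hook like this:

      from airflow.providers.postgres.hooks.postgres import PostgresHook
      from airflow.sdk import task


      @task
      def check_postgres_connection():
          # The hook looks up "postgres_default", the connection you just created
          # in the Airflow UI, and uses its host, login, password, and port.
          hook = PostgresHook(postgres_conn_id="postgres_default")
          # Run a trivial query to confirm the connection works.
          print(hook.get_records("SELECT 1;"))
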
  5. Now that the Postgres connection is created, let's unpause the example_etl_galaxies_load DAG. Do this by clicking the toggle next to the Trigger Run button.

    Airflow UI Unpause DAG

  6. Once you have triggered a few runs of the example_etl_galaxies DAG, you should notice that it has also triggered runs of the example_etl_galaxies_load DAG. If you go to the Assets screen, you'll see the galaxy_data Asset, which has 1 consuming DAG and 1 producing task.

    Example Asset View

    If you click on the galaxy_data Asset, you can see the graph view showing that the example_etl_galaxies DAG feeds into the galaxy_data Asset, which then provides data to the example_etl_galaxies_load DAG.

    Example Asset Graph View

Step 3: Explore the project

Building Extract, Transform, and Load (ETL) workloads is a common pattern in Apache Airflow. This project shows an example pattern for defining an ETL workload using DuckDB as the data warehouse of choice.

As you try out this project, you'll see how easy Airflow makes it to:

  • Write responsive pipelines that change based on user inputs.
  • Perform database operations using SQL.
  • Access and extract data from local files.
  • Execute transformations with Pandas.

You'll write a lean ETL pipeline in easy-to-read Python code!

warning

This project uses DuckDB, an in-memory database. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!

For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
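
Whether your data survives between runs comes down to how DuckDB is opened. A minimal illustration with the duckdb Python package (the paths here are just for demonstration):

    import duckdb

    # In-memory: data disappears when the Python process exits.
    ephemeral = duckdb.connect()

    # File-backed: data is written to the .db file and survives restarts,
    # as long as the file itself is kept around.
    persistent = duckdb.connect("include/astronomy.db")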

Pipeline structure

An Airflow project can have any number of DAGs (directed acyclic graphs), the main building blocks of Airflow pipelines. This project has one:

example_etl_galaxies

This DAG contains five @task-decorated Python functions:

  • create_galaxy_table_in_duckdb uses a hook to create a database connection and a SQL query to create a database table.

  • extract_galaxy_data returns a dataframe created using a modularized function imported from the project's include directory.

  • transform_galaxy_data gets a user-specified value from the DAG context and uses it to execute a simple data transformation on the dataframe, returning another dataframe.

  • load_galaxy_data uses a database hook to load the dataframe into the database. You can load the dataframe directly in the context of a SQL query. No conversion of the dataframe is required.

  • print_loaded_galaxies executes a SELECT query on the database and prints the data to the logs.
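
Condensed, the pattern looks roughly like the sketch below. This is an illustrative approximation rather than the project file itself: the table name, columns, and params key are placeholders, and it assumes the duckdb and pandas packages from requirements.txt plus the Airflow 3 Task SDK imports.

    import duckdb
    import pandas as pd
    from airflow.sdk import Asset, dag, task

    DUCKDB_PATH = "include/astronomy.db"  # placeholder path for this sketch


    @dag(params={"closest_galaxies": 20})  # user-tunable input, placeholder key
    def example_etl_galaxies_sketch():
        @task
        def create_galaxy_table_in_duckdb():
            con = duckdb.connect(DUCKDB_PATH)
            con.execute(
                "CREATE TABLE IF NOT EXISTS galaxies (name VARCHAR, distance_mpc DOUBLE)"
            )

        @task
        def extract_galaxy_data() -> pd.DataFrame:
            # In the project this comes from include/custom_functions/galaxy_functions.py.
            return pd.DataFrame({"name": ["Andromeda"], "distance_mpc": [0.78]})

        @task
        def transform_galaxy_data(df: pd.DataFrame, **context) -> pd.DataFrame:
            # Read the user-specified value from the DAG context (params).
            limit = context["params"]["closest_galaxies"]
            return df.sort_values("distance_mpc").head(limit)

        @task(outlets=[Asset("galaxy_data")])
        def load_galaxy_data(galaxy_df: pd.DataFrame):
            con = duckdb.connect(DUCKDB_PATH)
            # DuckDB can read the local DataFrame directly inside a SQL statement.
            con.execute("INSERT INTO galaxies SELECT * FROM galaxy_df")

        @task
        def print_loaded_galaxies():
            con = duckdb.connect(DUCKDB_PATH)
            print(con.sql("SELECT * FROM galaxies ORDER BY distance_mpc").df())

        create = create_galaxy_table_in_duckdb()
        loaded = load_galaxy_data(transform_galaxy_data(extract_galaxy_data()))
        create >> loaded >> print_loaded_galaxies()


    example_etl_galaxies_sketch()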

example_etl_galaxies_load

This DAG extracts the data from the project's DuckDB database, creates a table in the Postgres DB, and loads the data into the table. Using an Airflow Asset trigger, it will run when example_etl_galaxies updates the galaxy_data asset.

This DAG contains four @task-decorated Python functions:

  • create_galaxy_table_postgres creates a table in the Postgres database.

  • extract_galaxy_data_duckdb extracts the data from the DuckDB database.

  • create_sql_query builds the query that loads the data into Postgres.

  • load_galaxy_data_postgres runs the query to load the data into Postgres.
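
Put together, the Asset-driven schedule and the Postgres load look something like the sketch below. Again, this is an approximation with placeholder table and column names (it folds the create_sql_query step into the load for brevity), assuming the Postgres provider from requirements.txt.

    import duckdb
    from airflow.providers.postgres.hooks.postgres import PostgresHook
    from airflow.sdk import Asset, dag, task


    # schedule=[Asset(...)] makes the DAG run whenever the asset is updated
    # by a producing task in another DAG.
    @dag(schedule=[Asset("galaxy_data")])
    def example_etl_galaxies_load_sketch():
        @task
        def create_galaxy_table_postgres():
            hook = PostgresHook(postgres_conn_id="postgres_default")
            hook.run(
                "CREATE TABLE IF NOT EXISTS galaxies (name VARCHAR, distance_mpc DOUBLE PRECISION)"
            )

        @task
        def extract_galaxy_data_duckdb() -> list:
            con = duckdb.connect("include/astronomy.db")
            return con.sql("SELECT name, distance_mpc FROM galaxies").fetchall()

        @task
        def load_galaxy_data_postgres(rows: list):
            hook = PostgresHook(postgres_conn_id="postgres_default")
            hook.insert_rows(
                table="galaxies", rows=rows, target_fields=["name", "distance_mpc"]
            )

        create_galaxy_table_postgres() >> load_galaxy_data_postgres(extract_galaxy_data_duckdb())


    example_etl_galaxies_load_sketch()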

For more information on connections, see the Airflow documentation on managing connections.

Next Steps

Run Airflow on Astro

The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.

Further Reading

Here are a few guides that may help you learn more about the topics discussed in this quickstart:
