Apache Airflow® Quickstart - ETL
ETL: An introduction to modern, enhanced ETL development with Airflow.
Step 1: Clone the Astronomer Quickstart repository
- Create a new directory for your project and open it:
  mkdir airflow-quickstart-etl && cd airflow-quickstart-etl
- Clone the repository and open it:
  git clone -b etl-3 --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/etl
Your directory should have the following structure:
.
├── Dockerfile
├── README.md
├── dags
│ └── example_etl_galaxies.py
├── include
│ ├── astronomy.db
│ ├── custom_functions
│ │ └── galaxy_functions.py
│ └── data
│ └── galaxy_names.txt
├── packages.txt
├── requirements.txt
├── solutions
│ └── example_etl_galaxies_solution.py
└── tests
└── dags
└── test_dag_example.py
Step 2: Start up Airflow and explore the UI
- Start the project using the Astro CLI:
  astro dev start
  The CLI will let you know when all Airflow services are up and running.
- If it doesn't launch automatically, navigate your browser to localhost:8080 and sign in to the Airflow UI using the username admin and the password admin.
At this time, Safari will not work properly with the UI. If Safari is your default browser, use Chrome to open Airflow 3.0.
- Explore the Home screen and DAGs page to get a sense of the metadata available about the DAG, run, and all task instances. For a deep-dive into the UI's features, see An introduction to the Airflow UI.
- You need to create a connection for the Postgres DB that is created with this quickstart. To do this, go to Admin -> Connections, then click Add Connection. Use the following connection information; a short code sketch showing how tasks reference this connection follows the list:
- Connection ID: postgres_default
- Connection Type: Postgres
- Host: host.docker.internal
- Login: example
- Password: example
- Port: 5435
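The sketch below is a minimal, hypothetical example (not part of the project code) of how a task function can reference this connection by its Connection ID through the Postgres provider's hook. It assumes the apache-airflow-providers-postgres package is installed in the Airflow environment.

```python
# A minimal, hypothetical sketch: reference the connection by its Connection ID.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_postgres_connection():
    # The conn_id must match the Connection ID entered in the Airflow UI above.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    # Run a trivial query to confirm the connection details are valid.
    return hook.get_first("SELECT 1;")
```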
- Now that the Postgres connection is created, let's unpause the example_etl_galaxies_load DAG. Do that by clicking the toggle next to the Trigger Run button.
- Once you have triggered a few runs of the example_etl_galaxies DAG, you should notice that it has also triggered runs of the example_etl_galaxies_load DAG. If you go into the Assets screen, you'll be able to see the galaxy_data Asset, which has 1 consuming DAG and 1 producing Task.
  If you click on the galaxy_data Asset, you can see the graph view showing that the example_etl_galaxies DAG feeds into the galaxy_data Asset, which then provides data to the example_etl_galaxies_load DAG.
Step 3: Explore the project
Building Extract, Transform, and Load (ETL) workloads is a common pattern in Apache Airflow. This project shows an example pattern for defining an ETL workload using DuckDB as the data warehouse of choice.
As you try out this project, you'll see how easy Airflow makes it to:
- Write responsive pipelines that change based on user inputs.
- Perform database operations using SQL.
- Access and extract data from local files.
- Execute transformations with Pandas.
You'll write a lean ETL pipeline in easy-to-read Python code!
This project uses DuckDB, an in-memory database. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!
For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
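For reference, here is a small sketch (not project code) of the difference between an ephemeral in-memory DuckDB connection and a file-backed one like the project's include/astronomy.db:

```python
import duckdb

# An in-memory database disappears when the process that created it exits.
ephemeral = duckdb.connect()  # same as duckdb.connect(":memory:")

# A file-backed database persists on disk between runs, for as long as the
# database file itself is kept around.
persistent = duckdb.connect("include/astronomy.db")
```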
Pipeline structure
An Airflow project can have any number of DAGs (directed acyclic graphs), the main building blocks of Airflow pipelines. This project has two:
example_etl_galaxies
This DAG contains five @task-decorated Python functions (a simplified code sketch follows the list):
- create_galaxy_table_in_duckdb uses a hook to create a database connection and a SQL query to create a database table.
- extract_galaxy_data returns a dataframe created using a modularized function imported from the project's include directory.
- transform_galaxy_data gets a user-specified value from the DAG context and uses it to execute a simple data transformation on the dataframe, returning another dataframe.
- load_galaxy_data uses a database hook to load the dataframe into the database. You can load the dataframe directly in the context of a SQL query. No conversion of the dataframe is required.
- print_loaded_galaxies executes a SELECT query on the database and prints the data to the logs.
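The following is a simplified, illustrative sketch of this pattern, not the project's actual code: table names, columns, and sample values are invented, and it assumes the duckdb and pandas packages are available in the Airflow environment.

```python
# A simplified, illustrative sketch of the pattern described above (not the
# project's actual code). Table names, columns, and sample values are invented.
from airflow.sdk import Asset, dag, task  # Airflow 3.x public API
import duckdb
import pandas as pd

DB_PATH = "include/astronomy.db"  # file-backed DuckDB database used by the project


@dag(schedule=None, catchup=False)
def sketch_etl_galaxies():
    @task
    def create_galaxy_table():
        # Create the target table if it does not exist yet.
        duckdb.connect(DB_PATH).execute(
            "CREATE TABLE IF NOT EXISTS galaxies (name VARCHAR, distance_mpc DOUBLE)"
        )

    @task
    def extract_galaxy_data() -> pd.DataFrame:
        # In the project, the data comes from include/data/galaxy_names.txt via a
        # helper in include/custom_functions/galaxy_functions.py.
        return pd.DataFrame(
            {"name": ["Andromeda", "Triangulum"], "distance_mpc": [0.78, 0.97]}
        )

    @task
    def transform_galaxy_data(df: pd.DataFrame, max_distance: float = 1.0) -> pd.DataFrame:
        # Filter the dataframe using a user-supplied threshold.
        return df[df["distance_mpc"] <= max_distance]

    @task(outlets=[Asset("galaxy_data")])  # marks this task as the producer of the Asset
    def load_galaxy_data(df: pd.DataFrame):
        # DuckDB can read the pandas dataframe directly inside the SQL query.
        duckdb.connect(DB_PATH).execute("INSERT INTO galaxies SELECT * FROM df")

    create_galaxy_table() >> load_galaxy_data(transform_galaxy_data(extract_galaxy_data()))


sketch_etl_galaxies()
```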
example_etl_galaxies_load
This DAG extracts the data from the project's DuckDB database, creates a table in the Postgres DB, and loads the data into the table. Using an Airflow Asset trigger, it will run when example_etl_galaxies updates the galaxy_data asset (see the sketch after the task list below).
This DAG contains four @task-decorated Python functions:
- create_galaxy_table_postgres creates a table in the Postgres DB.
- extract_galaxy_data_duckdb extracts the data from the DuckDB database.
- create_sql_query creates the query to load the data into Postgres.
- load_galaxy_data_postgres runs the query to load the data into Postgres.
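Below is an illustrative sketch, not the project's code, of how a DAG can be scheduled on an Airflow Asset the way example_etl_galaxies_load is scheduled on galaxy_data. The task body is a placeholder; the real DAG reads from DuckDB and writes to Postgres using the postgres_default connection from Step 2.

```python
# An illustrative sketch of asset-driven scheduling (not the project's code).
from airflow.sdk import Asset, dag, task


# Scheduling on an Asset means: run this DAG whenever a producing task
# (in this project, a task in example_etl_galaxies) updates "galaxy_data".
@dag(schedule=[Asset("galaxy_data")], catchup=False)
def sketch_etl_galaxies_load():
    @task
    def load_to_postgres():
        # Placeholder: the real tasks extract from DuckDB, build a SQL query,
        # and load the rows into Postgres via the postgres_default connection.
        ...

    load_to_postgres()


sketch_etl_galaxies_load()
```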
For more information on connections, see the Connections guide linked under Further Reading below.
Next Steps:
Run Airflow on Astro
The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.
Further Reading
Here are a few guides that may help you learn more about the topics discussed in this quickstart:
- The Connections guide will help you learn more about how to create and manage connections in Airflow 3.0.
- The Assets and data-aware scheduling guide will teach you more about asset-oriented pipelines and using assets to schedule DAGs.