Intro to Airflow tutorial: Get started and run your first pipeline
This tutorial will get you started as quickly as possible while explaining the core concepts of Apache Airflow. You will explore galaxies 🌌 while extending an existing workflow with modern Airflow features, setting you up for diving into the world of data orchestration with Apache Airflow.
Whether you are an absolute Airflow beginner or already know some of the concepts, five minutes from now you will have your first data pipeline, also known as a Dag, running in a fully functional Airflow environment!
Get a fully functional Airflow environment running in your browser with zero local setup using Astro IDE.
Create and run an ETL pipeline that processes galaxy data with extraction, transformation, and loading steps.
Learn Dags, tasks, operators, dependencies, and asset-aware scheduling through hands-on practice.
The first step is to start a free Astro trial.
All Astro accounts have access to the Astro IDE, which is the easiest way to develop Airflow Dags right in your browser. You can directly deploy your Dags from the Astro IDE to an Astro Deployment, an Airflow environment running in the cloud. After entering your email address, starting the trial includes 4 steps:
Choose between professional and personal. The choice has no impact on this tutorial.
Enter an organization and workspace name. Each customer has a dedicated organization on Astro. Each team or project has a workspace, which is a collection of deployments. A deployment is an Airflow environment hosted on Astro. For this tutorial, you can use any names.
You can choose to upload Dags, use a template, or start with an empty workspace. For this tutorial, choose start with a template.
Choose the ETL template.

Once your environment is created, you will find yourself in the Astro IDE with your very first ETL Dag, ready to be deployed! The Python code is a programmatic representation of your workflow, and by clicking the Start Test Deployment button on the top right, a fully functional Airflow environment will be started and your code will be deployed.
Click the Start Test Deployment button and wait for the deployment to finish.

Your first Airflow Dag is deployed and ready to be executed. Click on the dropdown next to Sync to Test and select Open Airflow.

The Airflow UI home dashboard of your Airflow instance will open in a new browser tab.

Within the navbar on the left, click on Dags.
This view shows all your Dags defined in your Python code. The ETL template comes with one Dag named example_etl_galaxies.

This ETL (Extract, Transform, Load) pipeline retrieves data about galaxies, filters them based on their distance from the Milky Way, and stores the results in a DuckDB database.
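As a rough mental model of the extract, transform, and load steps described above, here is a minimal plain-Python sketch. It is not the template's code: it uses no Airflow at all, the standard-library sqlite3 module stands in for DuckDB, a list of dictionaries stands in for the pandas DataFrame, and the sample rows, function names, and column names are hypothetical.

```python
import sqlite3

# Hypothetical sample rows; distances are in light years and purely illustrative.
RAW_GALAXIES = [
    {"name": "Galaxy A", "distance_from_milkyway": 70_000},
    {"name": "Galaxy B", "distance_from_milkyway": 2_500_000},
    {"name": "Galaxy C", "distance_from_milkyway": 160_000},
]


def extract():
    """Return the raw records (the real Dag returns a pandas DataFrame)."""
    return list(RAW_GALAXIES)


def transform(rows, closeness_threshold=500_000):
    """Keep only galaxies within the threshold distance from the Milky Way."""
    return [r for r in rows if r["distance_from_milkyway"] <= closeness_threshold]


def load(conn, rows):
    """Create the table if needed and insert the filtered rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS galaxy_data "
        "(name TEXT, distance_from_milkyway INTEGER)"
    )
    conn.executemany(
        "INSERT INTO galaxy_data VALUES (:name, :distance_from_milkyway)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB here
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT name FROM galaxy_data ORDER BY distance_from_milkyway"
).fetchall()
print(loaded)
```

In the actual Dag each of these functions is an Airflow task, so Airflow can retry, log, and schedule every step on its own.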

Tasks breakdown
create_galaxy_table_in_duckdb: Creates a table in DuckDB with columns for galaxy name, distances, type, and characteristics.
extract_galaxy_data: Retrieves raw data about 20 galaxies and returns it as a pandas DataFrame.
transform_galaxy_data: Filters the galaxy data to keep only galaxies within a specified distance from the Milky Way (default: 500,000 light years).
load_galaxy_data: Inserts the filtered galaxy data into the DuckDB table and produces an Airflow Asset update.
print_loaded_galaxies: Queries and prints all stored galaxies from DuckDB, sorted by distance from the Milky Way.
Task dependencies
create_galaxy_table_in_duckdb → load_galaxy_data (table must exist before loading)
extract_galaxy_data → transform_galaxy_data (raw data is needed for filtering)
transform_galaxy_data → load_galaxy_data (filtered data is needed for loading)
load_galaxy_data → print_loaded_galaxies (data must be loaded before printing)
Let’s run the pipeline! Click the play button next to the Dag.

The button will open a trigger dialog, allowing you to trigger a single run or a backfill to process a range of dates right from the UI. Dags can also have parameters that can be used within the implementation to keep certain parts of your pipeline configurable.
Select Single Run, keep the parameters at their defaults, and click the Trigger button.

Your Dag will start, and under Latest Run in the Dags view you will see the currently running instance.
Click that run date to go to the individual Dag run view.

Watch how the Dag run finishes and explore the grid and graph views (buttons on the top left), two different representations of your pipeline.
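The graph view is essentially a drawing of the task dependencies listed earlier. As an illustration only, the following sketch uses Python's standard-library graphlib to compute which tasks may run in each "wave" given those dependencies. This is not how Airflow's scheduler is implemented; it only shows why some tasks can start in parallel while others must wait.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "create_galaxy_table_in_duckdb": set(),
    "extract_galaxy_data": set(),
    "transform_galaxy_data": {"extract_galaxy_data"},
    "load_galaxy_data": {"create_galaxy_table_in_duckdb", "transform_galaxy_data"},
    "print_loaded_galaxies": {"load_galaxy_data"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    # Tasks whose upstream tasks have all been marked as done.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The first wave contains both create_galaxy_table_in_duckdb and extract_galaxy_data because neither has an upstream task, which matches the two entry points you see in the graph view.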
Once all tasks have finished successfully, open the grid view and click the print_loaded_galaxies task, the last step in your pipeline graph.

This opens the logs of the task instance, where you can see the output: a table of galaxies with their distance from the Milky Way and from our solar system, as well as the type of galaxy.

You just set up your Airflow development environment, started your first Airflow environment, and deployed and ran your first Dag. Take a moment to check the time and internalize what happened. Isn’t this amazing?
Take your time to explore the UI, trigger more runs, check the logs of other tasks, and make yourself familiar with the interface. Feel free to read the Airflow UI guide for a deep dive into its different views and functionality.
Once you’ve finished your exploration, switch back to the Astro IDE and have a look at the Python code inside example_etl_galaxies.py. The code contains a lot of comments explaining each step in detail. However, let’s get an overview before you dive into details.
The Python file contains the following key elements:
airflow.sdk, as this is the user-facing SDK.
schedule.
task = PythonOperator(...) → returns operator directly.
@task def my_task(): ... → creates operator, wrapped in XComArg.
@task, @task.bash, @task.docker, @task.kubernetes, etc.
XComArg: Wrapper enabling automatic data passing and dependency inference.
Let’s level up! Now that you’ve run your first Dag, we’ll extend the project by adding a second Dag that builds on top of the first one.
We’ll create a galaxy_maintenance Dag that allows you to manually enter new galaxy data through an interactive form. The data will be automatically added to the database and validated with automated quality checks.
What you’ll learn:
Add provider packages to extend Airflow with new operators and integrations for databases and external systems.
Set up proper Airflow connections to manage credentials and configurations for external tools.
Implement human-in-the-loop workflows that pause for manual data entry and human decision-making.
Use common SQL operators to run parameterized queries across different database systems.
Add automated data quality checks to ensure data integrity throughout your pipelines.
Trigger Dags based on asset-aware scheduling rather than time schedules for data-driven workflows.
By the end of this section, you’ll have a powerful toolbox of concepts to explore Airflow further and confidently jump into your first real-world ETL/ELT project!
The example_etl_galaxies Dag currently connects directly to the DuckDB database using:
While this works, Airflow offers a better approach: common SQL operators that execute queries using Airflow connections. This unifies and simplifies SQL workloads across your pipelines. Let’s set this up properly.
Airflow’s core functionality can be extended with provider packages for specific use cases. We need two providers for our DuckDB connection.
Open the requirements.txt file in the Astro IDE.
Add the following lines at the bottom:
Since we added new dependencies, we need to sync the changes. Click on Sync to Test and wait for the changes to be deployed.
An Airflow connection stores configuration details for connecting to external tools in your data ecosystem. Most hooks (what is a hook?) and operators that interact with external systems require a connection.
To create the connection:
Open Airflow and click Admin in the left navbar
Select Connections
Click Add Connection (top right)
Enter the following details:
Connection ID: duckdb_astronomy
Path to the database file: include/astronomy.db
Save the connection and you’re now ready to connect! You can find the Airflow task that uses this connection in the example code in Step 4.3.
We just added a connection to our deployment (a single Airflow instance). If we deployed our Dags to another environment or recreated the test deployment, we’d need to add the connection again. Astro offers a helpful solution: under Environment → Connections in the Astro platform, you can set up workspace-wide connections that are available across all your Airflow instances. See Manage Airflow connections and variables in the Astro documentation.
The test deployment is a fully functional but minimal Airflow setup. To enable advanced features like asset-aware scheduling (explained later), we need to apply a quick configuration change.
Delete the environment variable AIRFLOW__SCHEDULER__USE_JOB_SCHEDULE by clicking the trash bin icon next to it.
Within the Astro IDE, create a new file by right-clicking on the dags folder → New File… and name it galaxy_maintenance.py.
Paste the following content:
This maintenance pipeline is triggered automatically whenever the galaxy data table is updated. It allows manual entry of new galaxy data through a human-in-the-loop interface, inserts the data into DuckDB, and runs data quality checks to ensure the values are within acceptable ranges.

Tasks Breakdown
enter_galaxy_details: Pauses the pipeline and prompts a user to manually enter galaxy information (name, distances, type, and characteristics) through a form interface.
insert_galaxy_details: Inserts the user-provided galaxy data into the DuckDB table using the values collected from the previous task.
dq_checks: Validates the data quality by checking that distance values are within acceptable ranges (between 10,000 and 900,000 light years).
Task Dependencies
enter_galaxy_details → insert_galaxy_details (user input needed before insertion)
insert_galaxy_details → dq_checks (data must be inserted before validation)
parameters to have dynamic queries with placeholders. These will be handled on database-driver level.
Click Sync to Test (top right) to sync your changes to the test deployment.
Once the sync process finishes, head back to the Airflow UI.
Open the Dags view, and a new Dag should appear in the list.
Notice how the schedule is set to be triggered whenever the asset named duckdb://include/astronomy.db/galaxy_data is updated.

Our first Dag updates this asset when data is loaded to DuckDB by using the outlets parameter:
@asset decorator). It is an abstract representation of data.
schedule of a Dag to one or more assets, optionally with a logical expression using AND (&) and OR (|) operators, so that the Dag is triggered when these assets receive asset update events.
outlets parameter, creating asset events when it completes successfully.
Time to see asset-aware scheduling and your new Dag in action!
We store our DuckDB database in a project file (include/astronomy.db). The example_etl_galaxies Dag creates a table in this database, but the file isn’t included in our auto-generated project repository. As a result, each time we sync changes to the deployment, the database file disappears. To handle this, the insert_galaxy_details task in the second Dag uses CREATE TABLE IF NOT EXISTS in case the database file was removed between runs. To improve this, we could use a persistent database service, for example Snowflake or BigQuery.
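To see why CREATE TABLE IF NOT EXISTS makes the insert step safe to run whether or not the database file survived, consider the following sketch. It is an illustration under assumptions, not the Dag's code: sqlite3 from the standard library stands in for DuckDB, the helper name insert_galaxy and the column names are hypothetical, and the file is written to a temporary directory rather than include/astronomy.db. The insert also uses placeholders so that the database driver binds the values.

```python
import os
import sqlite3
import tempfile


def insert_galaxy(db_path, name, distance):
    """Insert one galaxy, creating the table first if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS galaxy_data "
            "(name TEXT, distance_from_milkyway INTEGER)"
        )
        # The ? placeholders are filled in by the database driver, not by string formatting.
        conn.execute("INSERT INTO galaxy_data VALUES (?, ?)", (name, distance))
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM galaxy_data").fetchone()[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "astronomy.db")  # hypothetical location
    first = insert_galaxy(db_path, "Galaxy A", 70_000)    # file and table get created
    second = insert_galaxy(db_path, "Galaxy B", 160_000)  # table already exists
    os.remove(db_path)                                    # simulate a sync removing the file
    after_reset = insert_galaxy(db_path, "Galaxy C", 250_000)

print(first, second, after_reset)
```

After the simulated reset the row count starts again at one: the statement keeps the task from failing, but it does not bring back lost data, which is why a persistent database service is the more robust option.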
Trigger example_etl_galaxies and observe what happens.
You’ll notice that galaxy_maintenance starts when example_etl_galaxies finishes or, more precisely, when example_etl_galaxies updates the asset that triggers the other Dag.
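The idea behind this hand-off can be modeled in a few lines of plain Python. The sketch below is a toy model, not Airflow's Asset class or scheduler: the class names, the second asset URI, and the needs_both and needs_either schedules are all hypothetical, and the real scheduler tracks update events over time rather than evaluating a single snapshot.

```python
class Expr:
    """Base class giving asset expressions the & (AND) and | (OR) operators."""

    def __and__(self, other):
        return Combined(all, self, other)

    def __or__(self, other):
        return Combined(any, self, other)


class AssetSketch(Expr):
    """Hypothetical stand-in for an asset, identified only by its URI."""

    def __init__(self, uri):
        self.uri = uri

    def satisfied(self, updated_uris):
        return self.uri in updated_uris


class Combined(Expr):
    """Two expressions joined with all() for AND or any() for OR."""

    def __init__(self, combine, left, right):
        self.combine, self.left, self.right = combine, left, right

    def satisfied(self, updated_uris):
        return self.combine(
            part.satisfied(updated_uris) for part in (self.left, self.right)
        )


galaxy_data = AssetSketch("duckdb://include/astronomy.db/galaxy_data")
other_table = AssetSketch("duckdb://include/astronomy.db/other_table")  # made-up asset

# Which Dag waits for which asset expression (only the first entry is from the tutorial).
schedules = {
    "galaxy_maintenance": galaxy_data,
    "needs_both": galaxy_data & other_table,
    "needs_either": galaxy_data | other_table,
}

# Pretend load_galaxy_data just finished successfully and updated its outlet.
updated = {"duckdb://include/astronomy.db/galaxy_data"}
triggered = sorted(name for name, expr in schedules.items() if expr.satisfied(updated))
print(triggered)
```

Only the schedules whose expression is satisfied by the updated asset are triggered, which mirrors why galaxy_maintenance starts right after the load step rather than at a fixed time.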
Once galaxy_maintenance is running, open the latest run and you’ll notice there’s a required action. This is part of the human-in-the-loop feature: your task is waiting for user input.

Take time to explore the Airflow UI and see where these required actions are visible!
Open the required action to see the form we defined in the code, and enter the following details:

Click OK and observe how the pipeline proceeds. Pay close attention to the dq_checks task, which successfully validates the data.
Try it again by running galaxy_maintenance once more. This time, enter 42 as the distance and observe how the dq_checks task fails because the data quality check detected an issue with your galaxy data.
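The logic of such a range check can be sketched in plain Python. This is only an illustration of the rule described in the tasks breakdown (distances between 10,000 and 900,000 light years), with a hypothetical function name; the actual dq_checks task runs its validation against the database.

```python
def run_dq_check(distance_ly, lower=10_000, upper=900_000):
    """Raise ValueError if the distance falls outside the accepted range."""
    if not lower <= distance_ly <= upper:
        raise ValueError(f"distance {distance_ly} is outside [{lower}, {upper}]")
    return True


outcomes = {}
for distance in (250_000, 42):
    try:
        run_dq_check(distance)
        outcomes[distance] = "passed"
    except ValueError as error:
        # A raised exception is what would mark the task as failed.
        outcomes[distance] = f"failed: {error}"

print(outcomes)
```

A value of 42 fails because it is below the lower bound, just as the dq_checks task fails in the second run.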
Congratulations 🎉! You’ve just built two interconnected data pipelines using Apache Airflow, and along the way you’ve learned the fundamental concepts that power modern data orchestration.
In this tutorial, you:
Ready to dive deeper?
Explore more guides:
Join the academy and get certified: