For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
      • AstroFully-managed data operations, powered by Apache Airflow.
      • Astro Private CloudRun Airflow-as-a-service in your environment.
      • Professional ServicesExpert Airflow services for your enterprise's success.
    • Tools
      • Cosmos
      • Orbiter
      • CLI
      • AI SDK
      • Agents
      • Blueprint
      • UpdatesThe State of Airflow 2026See the insights from over 5,800 data practitioners in the full report. Download Now ➔
  • Customers
  • Docs
    • Insights
      • Blog
      • Webinars
      • Resource Library
      • Events
    • Education
      • Academy
      • What is Airflow?
  • Pricing
Get Started Free
    • Overview
      • Anyscale
      • Azure Blob Storage
      • Azure Container Instances
      • Azure Data Factory integration
      • BigQuery
      • Cohere
      • Common AI
      • dbt
      • DuckDB
      • Entra Workload Identity
      • Execute notebooks
      • Fivetran
      • Great Expectations
      • Kafka
      • Marquez
      • MongoDB
      • MS SQL Server
      • OpenAI
      • OpenSearch
      • pgvector
      • Pinecone
      • PostgreSQL
      • Qdrant
      • Ray
      • SageMaker
      • Soda data quality
      • Weaviate
      • Weights and Biases
    • Glossary

Product

  • Platform Overview
  • Astro
  • Astro Observe
  • Astro Private Cloud
  • Security & Trust
  • Pricing

Tools & Services

  • Cosmos
  • Docs
  • Professional Services
  • Product Updates

Use Cases

  • AI Ops
  • Data Observability
  • ETL/ELT
  • ML Ops
  • Operational Analytics
  • All Use Cases

Industries

  • Financial Services
  • Gaming
  • Retail
  • Manufacturing
  • Healthcare
  • All Industries

Resources

  • Academy
  • eBooks & Guides
  • Blog
  • Webinars
  • Events
  • The Data Flowcast Podcast
  • All Resources

Airflow

  • What is Airflow
  • Airflow on Astro
  • Airflow 3.0
  • Airflow Upgrades
  • Airflow Use Cases
  • Airflow 2.x End of Life

Company

  • Our Story
  • Customers
  • Newsroom
  • Careers
  • Contact

Support

  • Knowledge Base
  • Status
  • Contact Support
GitHubYouTubeLinkedInx
  • Legal
  • Privacy
  • Terms of Service
  • Consent Preferences

  • Do Not Sell or Share My Personal information
  • Limit the Use Of My Sensitive Personal Information

Apache Airflow®, Airflow, and the Airflow logo are trademarks of the Apache Software Foundation. Copyright © Astronomer 2026. All rights reserved.

LogoLogo
On this page
  • Time to complete
  • Assumed knowledge
  • Prerequisites
  • Step 1: Configure your Astro project
  • Step 2: Create the configuration file
  • Step 3: Create the checks file
  • Step 4: Create your DAG
  • Step 5: Run your DAG and review check results
  • How it works
Integrations & connections

Run Soda Core checks with Airflow

Edit this page
Built with

Info

This page has not yet been updated for Airflow 3. The concepts shown are relevant, but some code may need to be updated. If you run any examples, take care to update import statements and watch for any other breaking changes.

Soda Core is an open source framework for checking data quality. It uses the Soda Checks Language (SodaCL) to run checks defined in a YAML file.

Soda Core lets you:

  • Define checks as YAML configuration, including many preset checks.
  • Provide a SQL query within the YAML file and check against a returned value if no preset checks fit your use case.
  • Integrate data quality checks with commonly used data engineering tools such as Airflow, Apache Spark, PostgreSQL, Snowflake and more.

In this tutorial, you’ll learn about the key features of Soda Core and how to use Airflow to run data quality checks on a database.

Time to complete

This tutorial takes approximately 30 minutes to complete.

Assumed knowledge

To get the most out of this tutorial, make sure you have an understanding of:

  • How to design a data quality process. See Data quality and Airflow.
  • The basics of Soda Core. See How Soda Core works.
  • How to use the BashOperator. See Using the BashOperator.
  • Relational Databases. See IBM’s “Relational Databases Explained”.
  • Familiarity with writing YAML configurations. See yaml.org.

Prerequisites

To complete this tutorial, you need:

  • The Soda Core package for your database backend. The Soda documentation provides a list of supported databases and how to configure them. This tutorial uses Snowflake.
  • The Astro CLI.

Step 1: Configure your Astro project

Configure a new Astro project to run Airflow locally.

  1. Create a new Astro project:

    1$ mkdir astro-soda-tutorial && cd astro-soda-tutorial
    2$ astro dev init
  2. Add the following line to the requirements.txt file of your Astro project:

    soda-core-snowflake

    This installs the relevant Soda Core Python package. If you are using a different database backend, replace snowflake with your backend. See the Prerequisites section for more details.

  3. Run the following command to start your project in a local environment:

    1astro dev start

Step 2: Create the configuration file

Create a configuration file to connect to your database backend. The following example uses the template from the Soda documentation to create the configuration file for Snowflake.

1# the first line names the datasource "MY_DATASOURCE"
2data_source MY_DATASOURCE:
3 type: snowflake
4 connection:
5 # provide your snowflake username and password in double quotes
6 username: "MY_USERNAME"
7 password: "MY_PASSWORD"
8 # provide the account in the format xy12345.eu-central-1
9 account: my_account
10 database: MY_DATABASE
11 warehouse: MY_WAREHOUSE
12 # if your connection times out you may need to adjust the timeout value
13 connection_timeout: 300
14 role: MY_ROLE
15 client_session_keep_alive:
16 session_parameters:
17 QUERY_TAG: soda-queries
18 QUOTED_IDENTIFIERS_IGNORE_CASE: false
19 schema: MY_SCHEMA

Save the YAML instructions in a file named configuration.yml and place the file into the /include directory of your Astro project.

Step 3: Create the checks file

Define your data quality checks using the many preset checks available for SodaCL. For more details on creating checks, see How it works. If you cannot find a preset check that works for your use case, you can create a custom one using SQL as shown in the following example.

1checks for example_table:
2 # check that MY_EMAIL_COL contains only email addresses according to the
3 # format name@domain.extension
4 - invalid_count(MY_EMAIL_COL) = 0:
5 valid format: email
6 # check that all entries in MY_DATE_COL are unique
7 - duplicate_count(MY_DATE_COL) = 0
8 # check that MY_TEXT_COL has no missing values
9 - missing_count(MY_TEXT_COL) = 0
10 # check that MY_TEXT_COL has at least 10 distinct values using SQL
11 - distinct_vals >= 10:
12 distinct_vals query: |
13 SELECT DISTINCT(MY_TEXT_COL) FROM example_table
14 # check that MY_NUM_COL has a minimum between 90 and 110
15 - min(MY_NUM_COL) between 90 and 110
16 # check that example table has at least 1000 rows
17 - row_count >= 1000
18 # check that the sum of MY_COL_2 is bigger than the sum of MY_COL_1
19 - sum_difference > 0:
20 sum_difference query: |
21 SELECT SUM(MY_COL_2) - SUM(MY_COL_1) FROM example_table
22 # checks that all entries in MY_COL_3 are part of a set of possible values
23 - invalid_count(MY_COL_3) = 0:
24 valid values: [val1, val2, val3, val4]

Save the YAML instructions in a file named checks.yml and place the file in the /include directory of your Astro project.

Step 4: Create your DAG

In your Astro project dags/ folder, create a new file called soda-pipeline.py. Paste the following code into the file:

1from airflow.models.dag import DAG
2from datetime import datetime
3
4from airflow.operators.bash import BashOperator
5
6SODA_PATH="<filepath>" # can be specified as an env variable
7
8with DAG(
9 dag_id="soda_example_dag",
10 schedule='@daily',
11 start_date=datetime(2022,8,1),
12 catchup=False
13) as dag:
14
15 soda_test = BashOperator(
16 task_id="soda_test",
17 bash_command=f"soda scan -d MY_DATASOURCE -c \
18 {SODA_PATH}/configuration.yml {SODA_PATH}/checks.yml"
19 )

In this DAG, Soda Core checks are executed by using the BashOperator to run the soda scan command referencing the configuration and check YAML files.

Step 5: Run your DAG and review check results

Go to the Airflow UI, unpause your soda_example_dag DAG, and trigger it to run the Soda Core data quality checks. Go to the task log to see a list of all checks that ran and their results.

This is an example of what your logs might look like when 3 out of 3 checks pass:

[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - Scan summary:
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - 3/3 checks PASSED:
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - MY_TABLE in MY_DATASOURCE
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - duplicate_count(MY_ID_COLUMN) = 0 [PASSED]
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - missing_count(MY_ID_COLUMN) = 0 [PASSED]
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - min(MY_NUM_COL) between 0 and 10 [PASSED]
[2022-08-04, 13:07:22 UTC] {subprocess.py:92} INFO - All is good. No failures. No warnings. No errors.

In the case of a check failure, the logs show which check failed and the check_value that caused the failure:

[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - Scan summary:
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - 2/3 checks PASSED:
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - MY_TABLE in MY_DATASOURCE
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - duplicate_count(MY_ID_COLUMN) = 0 [PASSED]
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - missing_count(MY_ID_COLUMN) = 0 [PASSED]
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - 1/3 checks FAILED:
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - MY_TABLE in MY_DATASOURCE
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - max(MY_NUM_COL) between 10 and 20 [FAILED]
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - check_value: 3
[2022-08-04, 13:23:59 UTC] {subprocess.py:92} INFO - Oops! 1 failures. 0 warnings. 0 errors. 2 pass.
[2022-08-04, 13:24:00 UTC] {subprocess.py:96} INFO - Command exited with return code 2
[2022-08-04, 13:24:00 UTC] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airflow/operators/bash.py", line 194, in execute
raise AirflowException(
airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 2.

How it works

Soda Core uses the Soda Checks Language (SodaCL) to run data quality checks defined in a YAML file. Integrating Soda Core into your Airflow data pipelines lets you use the results of data quality checks to influence downstream tasks. For an overview of SodaCL, see the SodaCL documentation.

As shown in the following example, you can use Soda Core to run checks on different properties of your dataset against a numerically defined threshold:

1checks for MY_TABLE_1:
2 # MY_NUM_COL_1 has a minimum of above or equal 0
3 - min(MY_NUM_COL_1) >= 0
4 # MY_TEXT_COL has less than 10% missing values
5 - missing_percent(MY_TEXT_COL) < 10
6checks for MY_TABLE_2:
7 # MY_NUM_COL_2 has an average between 100 and 1000
8 - avg(MY_NUM_COL_2) is between 100 and 1000
9 # MY_ID_COL has no duplicates
10 - duplicate_count(MY_ID_COL) = 0

You can add optional configurations, such as custom names for checks and error levels:

1checks for MY_TABLE_1:
2 # fail the check when MY_TABLE_1 has less than 10 or more than a million rows
3 # warn if there are less than 100 (but 10 or more) rows
4 - row_count:
5 warn: when < 100
6 fail:
7 when < 10
8 when > 1000000
9 name: Wrong number of rows!

You can use the following methods to check the validity of data:

  • List of valid values
  • Predefined valid format
  • Regex
  • SQL query
1checks for MY_TABLE_1:
2 # MY_CATEGORICAL_COL has no other values than val1, val2 and val3
3 - invalid_count(MY_CATEGORICAL_COL) = 0:
4 valid values: [val1, val2, val3]
5 # MY_NUMERIC_COL has no other values than 0, 1 and 2.
6 # Single quotes are necessary for valid values checks involving numeric
7 # characters.
8 - invalid_count(MY_NUMERIC_COL) = 0:
9 valid values: ['0', '1', '2']
10 # less than 10 missing valid IP addresses
11 - missing_count(IP_ADDRESS_COL) < 10:
12 valid format: ip address
13 # WEBSITE_COL has less than 5% entries that don't contain "astronomer.io"
14 - invalid_percent(WEBSITE_COL) < 5:
15 valid regex: astronomer\.io
16 # The average of 3 columns for values of category_1 is between 10 and 100
17 - my_average_total_for_category_1 between 10 and 100:
18 my_average_total_for_category_1 query: |
19 SELECT AVG(MY_COL_1 + MY_COL_2 + MY_COL_3)
20 FROM MY_TABLE_1
21 WHERE MY_CATEGORY = 'category_1'

Three more unique features of Soda Core are:

  • Freshness checks: Set limits to the age of the youngest row in the table.
  • Schema checks: Run checks on the existence of columns and validate data types.
  • Reference checks: Ensure parity in between columns in different datasets in the same data source.
1checks for MY_TABLE_1:
2 # MY_DATE's youngest row is younger 10 days
3 - freshness(MY_DATE) < 10d
4 # The schema has to have the MY_KEY column
5 - schema:
6 fail:
7 when required column missing: [MY_KEY]
8 # all names listed in MY_TABLE_1's MY_NAMES column have to also exist
9 # in the MY_CUSTOMER_NAMES column in MY_TABLE_2
10 - values in (MY_NAMES) must exist in MY_TABLE_2 (MY_CUSTOMER_NAMES)