Info
This page has not yet been updated for Airflow 3. The concepts shown are relevant, but some code may need to be updated. If you run any examples, take care to update import statements and watch for any other breaking changes.
Soda Core is an open source framework for checking data quality. It uses the Soda Checks Language (SodaCL) to run checks defined in a YAML file.
Soda Core lets you:
In this tutorial, you’ll learn about the key features of Soda Core and how to use Airflow to run data quality checks on a database.
This tutorial takes approximately 30 minutes to complete.
To get the most out of this tutorial, make sure you have an understanding of:
To complete this tutorial, you need:
Configure a new Astro project to run Airflow locally.
Create a new Astro project:
Add the following line to the requirements.txt file of your Astro project:
This installs the relevant Soda Core Python package. If you are using a different database backend, replace snowflake with your backend. See the Prerequisites section for more details.
Run the following command to start your project in a local environment:
Create a configuration file to connect to your database backend. The following example uses the template from the Soda documentation to create the configuration file for Snowflake.
Save the YAML instructions in a file named configuration.yml and place the file into the /include directory of your Astro project.
Define your data quality checks using the many preset checks available for SodaCL. For more details on creating checks, see How it works. If you cannot find a preset check that works for your use case, you can create a custom one using SQL as shown in the following example.
Save the YAML instructions in a file named checks.yml and place the file in the /include directory of your Astro project.
In your Astro project dags/ folder, create a new file called soda-pipeline.py. Paste the following code into the file:
In this DAG, Soda Core checks are executed by using the BashOperator to run the soda scan command referencing the configuration and check YAML files.
Go to the Airflow UI, unpause your soda_example_dag DAG, and trigger it to run the Soda Core data quality checks. Go to the task log to see a list of all checks that ran and their results.
This is an example of what your logs might look like when 3 out of 3 checks pass:
In the case of a check failure, the logs show which check failed and the check_value that caused the failure:
Soda Core uses the Soda Checks Language (SodaCL) to run data quality checks defined in a YAML file. Integrating Soda Core into your Airflow data pipelines lets you use the results of data quality checks to influence downstream tasks. For an overview of SodaCL, see the SodaCL documentation.
As shown in the following example, you can use Soda Core to run checks on different properties of your dataset against a numerically defined threshold:
You can add optional configurations, such as custom names for checks and error levels:
You can use the following methods to check the validity of data:
Three more unique features of Soda Core are: