Data Quality Use Cases with Airflow and Great Expectations


Hosted By

  • Benji Lampel
  • Tal Gluck

Note: For more information check out the How to Improve Data Quality with Airflow’s Great Expectations Operator webinar and our Orchestrate Great Expectations with Airflow tutorial.


1. About Great Expectations

Great Expectations is a shared, open standard for data quality that helps data teams eliminate pipeline debt through data testing, documentation, and profiling.

With Great Expectations, you can express expectations about your data, e.g. that a column contains no nulls, or that a table has twelve columns.

You can then test those expectations against your data and take action based on the results of those tests.

This tool allows you both to use your tests as a form of documentation and to keep your documentation in the form of tests, so that the implicit assumptions about your data can be made explicit and shared across your organization.
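The idea can be sketched in plain Python. This is illustrative only, not the Great Expectations API; the function names simply mirror two real Expectations to show the pattern of declarative, verifiable checks:

```python
# Illustrative sketch of "expectations" as declarative, testable checks.
# NOT the Great Expectations API; the names mirror two real Expectations.

def expect_column_values_to_not_be_null(rows, column):
    """Check that no row has a null value in `column`."""
    unexpected = [row for row in rows if row.get(column) is None]
    return {"success": len(unexpected) == 0, "unexpected_count": len(unexpected)}

def expect_table_column_count_to_equal(rows, expected_count):
    """Check that the table has exactly `expected_count` columns."""
    actual = len(rows[0]) if rows else 0
    return {"success": actual == expected_count, "observed_value": actual}

# Example data with one null in the "amount" column:
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
]

print(expect_column_values_to_not_be_null(rows, "id"))      # success: True
print(expect_column_values_to_not_be_null(rows, "amount"))  # success: False
print(expect_table_column_count_to_equal(rows, 2))          # success: True
```

In the real library, each check returns a structured validation result like this, which downstream tooling can act on.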

2. Why Airflow + Great Expectations?

There’s a strong, documented story of using Airflow and Great Expectations together. It’s an excellent way to add data quality checks into your organization’s data ecosystem.

Use case: In a transformation pipeline, you can run Great Expectations at the start, before transforming your data, to make sure everything loaded correctly, and again after your transformations, to confirm that they succeeded.
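A sketch of that pipeline shape as an Airflow DAG is below. The DAG id, Data Context path, and checkpoint names are hypothetical; it assumes the airflow-provider-great-expectations package is installed and the two checkpoints already exist in the project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

# Hypothetical location of the Great Expectations Data Context.
GE_ROOT = "/usr/local/airflow/great_expectations"


def run_transform(**context):
    # Placeholder for your actual transformation logic.
    ...


with DAG(
    dag_id="transform_with_quality_checks",  # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Validate the raw data before transforming it.
    validate_load = GreatExpectationsOperator(
        task_id="validate_load",
        data_context_root_dir=GE_ROOT,
        checkpoint_name="load_checkpoint",  # hypothetical checkpoint
    )

    transform = PythonOperator(task_id="transform", python_callable=run_transform)

    # Validate the output to confirm the transformations succeeded.
    validate_transform = GreatExpectationsOperator(
        task_id="validate_transform",
        data_context_root_dir=GE_ROOT,
        checkpoint_name="transform_checkpoint",  # hypothetical checkpoint
    )

    validate_load >> transform >> validate_transform
```

By default the operator fails the task when validation fails, so a bad load stops the pipeline before the transformation ever runs.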

3. Great Expectations Vocabulary Cheat Sheet

Datasources: A Datasource is a configuration that describes where data is, how to connect to it, and which execution engine to use when running a Checkpoint.

Expectations: Expectations, stored within Expectation Suites, provide a flexible, declarative language for describing expected behavior and verifiable properties of data.

Data Context: A Data Context represents a Great Expectations project, organizing storage and access for Expectation Suites, Datasources, notification settings, and data fixtures.

Batch Request: A Batch Request defines a batch of data from a given Datasource to run Expectations on; it can describe data in a file, database, or dataframe.

Checkpoint Config: A Checkpoint Config describes which Expectation Suite should be run against which data, and what actions to take during and after a run.

Checkpoints: Checkpoints provide an abstraction for bundling the validation of a batch of data against an Expectation Suite, together with the actions that should be taken after the validation.
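To make the Checkpoint Config and Batch Request concepts concrete, here is a sketch of a V3-style config expressed as a Python dict. All names (checkpoint, datasource, data asset, suite) are hypothetical placeholders:

```python
# Sketch of a V3-style Checkpoint Config as a Python dict.
# The Batch Request names the data; the suite names the Expectations to run.
# All names below are hypothetical placeholders.
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": "my_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "my_table",
            },
            "expectation_suite_name": "my_suite",
        }
    ],
}

# In a real project, you would register this with the Data Context, e.g.:
# context.add_checkpoint(**checkpoint_config)
```

The `validations` list is where a Batch Request and an Expectation Suite are paired, which is exactly the bundling the Checkpoint abstraction provides.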

4. Great Expectations Operator v0.1.x - V3 API Upgrade

What changed from V2 to V3?

The V3 upgrade includes:

Checkpoint Model: The new operator takes a simpler approach to running Great Expectations suites by running only Checkpoints.

Data Sources: Any Great Expectations-compatible Datasource can be added to a Data Context and run with the operator.

Configurations: Default checkpoint values can be overridden per-checkpoint at runtime with checkpoint kwargs.
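A sketch of that runtime override: the operator's `checkpoint_kwargs` argument can replace parts of the stored Checkpoint Config for a single task. All names and the path below are hypothetical, and the snippet assumes the airflow-provider-great-expectations package:

```python
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

# Run a stored checkpoint, but override which data it validates at runtime.
# Checkpoint, datasource, asset, suite names, and the path are hypothetical.
validate_daily_partition = GreatExpectationsOperator(
    task_id="validate_daily_partition",
    data_context_root_dir="/usr/local/airflow/great_expectations",
    checkpoint_name="my_checkpoint",
    checkpoint_kwargs={
        # Replaces the checkpoint's stored `validations` for this run only.
        "validations": [
            {
                "batch_request": {
                    "datasource_name": "my_datasource",
                    "data_connector_name": "default_inferred_data_connector_name",
                    "data_asset_name": "my_table",
                },
                "expectation_suite_name": "my_suite",
            }
        ],
    },
)
```

This lets one stored checkpoint serve many tasks, each pointing it at different data.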


Discussed and presented:

Write-Audit-Publish: The "why" behind this kind of DAG: what happens when things fail, who is affected, and how to prevent that pain.

MLflow: A use case for machine learning practitioners: what they would want to protect their data and machine learning models from, and why.

See more examples in the video, and find all the code in this GitHub repo.

And don’t miss the Q&A!
