Orchestrate Great Expectations with Airflow
Great Expectations (GX) is an open source Python-based data validation framework. You can test your data by expressing what you “expect” from it as simple declarative statements in JSON or YAML, then run validations using those Expectation Suites against data SQL data, Filesystem Data or a pandas DataFrame. The airflow-provider-great-expectations package provides operators for running Great Expectations validations directly in your Dags.
Version 1.0.0 of the provider introduces three specialized operators that replace the legacy GreatExpectationsOperator:
Requirements: Python 3.10+, Great Expectations 1.7.0+, Apache Airflow 2.1+.
Install with:
Each operator has its own import path:
Choosing the right operator
When deciding which operator fits your use case, consider:
- Where is your data? In memory as a DataFrame, or in an external data source?
- Do you need to trigger actions? Such as sending notifications or updating external systems based on validation results.
- What Data Context do you need? Ephemeral for stateless validations, or persistent to track results over time.
GXValidateDataFrameOperator
Use this operator when your data is already in memory as a Pandas or Spark DataFrame. This is the simplest option, you only need a DataFrame and your expectations.
GXValidateDataFrameOperator example
configure_expectations signature
The configure_expectations callable receives the DataFrame as its first argument and must return an Expectation or ExpectationSuite.
GXValidateBatchOperator
Use this operator when your data is not in memory. You configure GX to connect directly to your data source by defining a BatchDefinition. This approach works with databases, warehouses, and file systems.
The provider includes helper functions to build connection strings from Airflow connections, so you do not need to duplicate credentials.
GXValidateBatchOperator example (SQL)
GXValidateBatchOperator example (file-based)
For file-based data (CSV, Parquet), use add_pandas_filesystem instead:
configure_expectations signature
For GXValidateBatchOperator, the configure_expectations callable receives the GX context as its first argument (not the DataFrame). Use context.suites.add_or_update() to register your suite.
GXValidateCheckpointOperator
Use this operator when you need the full power of GX Core, including the ability to trigger actions based on validation results. This requires the most configuration. You define a Checkpoint with a BatchDefinition, ExpectationSuite, and ValidationDefinition.