Maximizing Data Workflow Efficiency: The Advantages of Using Airflow with Azure Data Factory

  • George Yates

In data engineering, combining the strengths of different tools is a must-have, not a nice-to-have. Azure Data Factory (ADF) is the go-to, user-friendly tool for building data pipelines within the Azure ecosystem. Paired with Apache Airflow, the leading open-source workflow management tool, the duo lays the foundation for data orchestration that spans the full breadth of an enterprise.

This blog post explores how Airflow and ADF can be used side by side through a real-world example: a retail company that needs to regularly extract, transform, and load customer data from various sources into its database for analysis and reporting.

Why Integrate Airflow with ADF?

ADF excels at creating quick, low-code data jobs with an intuitive UX. By layering Airflow’s orchestration capabilities on top of ADF workflows, developers get end-to-end visibility into their pipelines without needing to migrate any jobs. Airflow’s Pythonic expressiveness makes it possible to build more complex, conditional workflows that can evolve over time, as shown in the sketch below. Additionally, you can start managing ADF pipelines through Airflow without altering or replacing existing ADF pipelines, making the integration purely additive to ADF’s existing capabilities.
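
Airflow’s branching is one simple way to express that kind of conditional logic around ADF. The sketch below is a minimal, hypothetical example: it reuses the same connection, resource group, and factory names as the example later in this post, while the FullCustomerLoad and IncrementalCustomerLoad pipelines are placeholders invented for illustration.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)


def choose_pipeline(**context):
    # Hypothetical rule: run a full load on Mondays, an incremental load otherwise
    if context["logical_date"].weekday() == 0:
        return "run_full_load"
    return "run_incremental_load"


with DAG(
    "adf_conditional_example",
    start_date=datetime(2023, 10, 31),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="choose_pipeline",
        python_callable=choose_pipeline,
    )

    run_full_load = AzureDataFactoryRunPipelineOperator(
        task_id="run_full_load",
        azure_data_factory_conn_id="azure_conn",
        resource_group_name="DemoGroup",
        factory_name="DemoDFYates",
        pipeline_name="FullCustomerLoad",          # hypothetical ADF pipeline
    )

    run_incremental_load = AzureDataFactoryRunPipelineOperator(
        task_id="run_incremental_load",
        azure_data_factory_conn_id="azure_conn",
        resource_group_name="DemoGroup",
        factory_name="DemoDFYates",
        pipeline_name="IncrementalCustomerLoad",   # hypothetical ADF pipeline
    )

    # Only the branch that choose_pipeline returns will run
    branch >> [run_full_load, run_incremental_load]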

Another major benefit of using Airflow alongside ADF is its ability to connect to a wide array of services outside the Azure environment and seamlessly link them to ADF workflows. With Airflow, you can integrate operations before or after ADF pipelines are triggered, as well as transfer information between ADF and external platforms. This expanded integration lets data pipelines extend beyond Azure services and sets the stage for hybrid and multi-cloud workloads.
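
For instance, a short sketch like the one below could wait for a nightly export to land in an AWS S3 bucket before kicking off an ADF ingestion pipeline. The bucket, key, and aws_default connection are assumptions for illustration; the ADF pipeline and connection names mirror the example later in this post.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    "cross_cloud_example",
    start_date=datetime(2023, 10, 31),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for a nightly export to appear in an AWS S3 bucket (hypothetical bucket and key)
    wait_for_export = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="customer-drops",
        bucket_key="exports/{{ ds }}/customers.csv",
        aws_conn_id="aws_default",
        poke_interval=300,
    )

    # Once the file exists, trigger the ADF pipeline that ingests it into Azure
    ingest_into_azure = AzureDataFactoryRunPipelineOperator(
        task_id="ingest_into_azure",
        azure_data_factory_conn_id="azure_conn",
        resource_group_name="DemoGroup",
        factory_name="DemoDFYates",
        pipeline_name="CustomerExtraction",
    )

    wait_for_export >> ingest_into_azure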

Utilizing Airflow as a control plane for ADF jobs brings together the best of both worlds, ultimately enabling users to monitor their ADF jobs through the Airflow UI alongside any other pipelines they’re running. This synergy ensures efficiency and cohesion in workflow management, providing users with a single pane of glass for both monitoring and alerting.

ADF and Airflow In Practice

The following DAG demonstrates a real-world ETL and analysis process in which ADF handles the transfer operations between Azure services, before Airflow triggers an Azure Synapse job that consumes that data for an analysis script. The DAG includes multiple instances of the AzureDataFactoryRunPipelineOperator as well as a GreatExpectationsOperator, showcasing how Airflow can implement a cross-platform ETL and data quality check workflow from a single pane of glass. These operators demonstrate Airflow’s ability to trigger and manage multiple ADF pipelines – CustomerExtraction, CleanRawCustomerData, and LoadCleanData – alongside other tools like Great Expectations in one seamless experience. The ability to run multiple ADF pipelines in parallel within the same DAG illustrates how Airflow can enhance ADF’s data processing capabilities.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.dummy import DummyOperator
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

RESOURCE_GROUP = 'DemoGroup'
FACTORY_NAME = 'DemoDFYates'

# Dataset that downstream DAGs can subscribe to once the load completes
example_dataset = Dataset("AzureSet")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
}

with DAG('azure_services_dag',
         default_args=default_args,
         description='An Airflow DAG that uses Azure services',
         schedule_interval=timedelta(days=1),
         start_date=datetime(2023, 10, 31),
         catchup=False) as dag:

    start = DummyOperator(task_id="start")

    # Dynamically map one task per ADF extraction pipeline so they run in parallel
    ingest_data = AzureDataFactoryRunPipelineOperator.partial(
        task_id='ingest_data',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
    ).expand(pipeline_name=['CustomerExtraction', 'CustomerExtraction1', 'CustomerExtraction2'])

    # Trigger the ADF pipeline that cleans the raw customer data
    transform_data = AzureDataFactoryRunPipelineOperator(
        task_id='transform_data',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
        pipeline_name='CleanRawCustomerData',
    )

    # Trigger the ADF pipeline that loads the cleaned data, and publish the Dataset
    store_data = AzureDataFactoryRunPipelineOperator(
        task_id='load_into_mssql',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
        pipeline_name='LoadCleanData',
        outlets=[example_dataset],
    )

    # Run a Great Expectations validation suite against the loaded data
    GXAnalysis = GreatExpectationsOperator(
        task_id='gx_analysis',
        conn_id='greatexpectations',
        schema='Sample',
        data_asset_name='CleanCustomerData',                 # assumed table produced by LoadCleanData
        data_context_root_dir='include/great_expectations',  # assumed location of the GX project
        expectation_suite_name='ConformstoNorm',
        execution_engine='default',
        run_name='testrun',
    )

    # Define dependency chain
    start >> ingest_data >> transform_data >> store_data >> GXAnalysis
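
One design choice worth calling out: because store_data declares example_dataset as an outlet, other DAGs can be scheduled off that Dataset rather than on a fixed timetable. A minimal, hypothetical consumer DAG might look like the sketch below; the reporting task and its logic are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# Must match the URI published by the producer DAG above
example_dataset = Dataset("AzureSet")

with DAG(
    "customer_reporting_dag",
    schedule=[example_dataset],   # runs whenever 'AzureSet' is updated
    start_date=datetime(2023, 10, 31),
    catchup=False,
) as dag:
    build_report = PythonOperator(
        task_id="build_report",
        python_callable=lambda: print("Refreshing customer report..."),  # placeholder logic
    )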

In summary, the synergy of Airflow and ADF provides a comprehensive toolkit for modern data engineers and businesses. It epitomizes the concept of using the right tool for the right job, combining ease of use with powerful functionality, thus enabling the creation of more efficient, scalable, and flexible data pipelines.

To experience the seamless integration of Airflow with ADF firsthand, try Astro on Azure today!
