High-Level Best Practices for Writing DAGs
Data pipelines are a messy business, with many components that can fail. Designing your DAGs to be idempotent means a task can be rerun safely after a failure: you recover and deliver results faster when something breaks, and you avoid duplicating or losing data down the road.
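One common way to make a load task idempotent is the delete-then-insert (overwrite-by-partition) pattern: each run replaces the data for its logical date instead of appending to it, so a retry leaves the table in exactly the same state. A minimal sketch, using the standard-library `sqlite3` module as a stand-in for a warehouse connection (table and function names are illustrative, not from any specific library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, amount REAL)")

def load_partition(conn, ds, rows):
    """Idempotent load: delete-then-insert for one logical date.

    Rerunning for the same ds replaces the partition instead of
    appending duplicates, so a retried task leaves identical state.
    """
    with conn:  # one transaction: delete and insert commit together or not at all
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?)",
            [(ds, amount) for amount in rows],
        )

# Running the load twice (e.g. after an Airflow task retry) yields the same table.
load_partition(conn, "2024-01-01", [10.0, 5.0])
load_partition(conn, "2024-01-01", [10.0, 5.0])
count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
assert count == 2  # not 4: the rerun replaced the partition, it didn't append
```

Wrapping the delete and insert in a single transaction matters: if the task dies midway, the partition is never left half-written.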
Use Airflow as an Orchestrator
Airflow was designed to be an orchestrator, not an execution framework.
In practice, this means:
- DO use Airflow to orchestrate jobs that run in other tools and services
- DO offload heavy processing to execution frameworks (e.g. Spark)
- DO use an ELT pattern wherever possible, pushing transformations down to the system that stores the data
- DO use intermediary data storage between tasks
- DON’T pull large datasets into a task and process them with Pandas (it’s tempting, we know)
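To make the ELT point concrete, here is a minimal sketch contrasting the two approaches, again using `sqlite3` as a stand-in for a warehouse (in a real DAG the connection would come from an Airflow hook; table and variable names are illustrative):

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 7.5)],
)

# DON'T: pull every row into the Airflow worker and aggregate in Python.
# At warehouse scale this ships the whole table over the network into
# the worker's memory.
rows = conn.execute("SELECT customer, amount FROM orders").fetchall()
totals_in_python = {}
for customer, amount in rows:
    totals_in_python[customer] = totals_in_python.get(customer, 0.0) + amount

# DO: push the aggregation down to the engine that already holds the data.
# Only the small result set ever reaches the worker.
totals_pushed_down = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

assert totals_in_python == totals_pushed_down  # same answer, very different cost at scale
```

The results are identical, but the pushed-down version keeps the heavy lifting in the engine built for it, leaving Airflow to do what it does best: orchestrate.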