Intro to Apache Airflow DAGs
Data pipeline as a term is pretty straightforward. It leverages a common occurrence (i.e. plumbing) to illustrate what would otherwise be an unviewable process.
In practice, this analogy is a bit misleading, and if anything, fits for streaming architectures more than batch architectures.
Data isn’t literally in a single tube starting on one side and coming out of the other but it is isolated from other data during this time (as if in a physical pipe). Unlike water through a pipe data is transformed through a workflow and produces valuable metadata.
Directed Graph: A directed graph is any graph where the vertices and edges have some sort of order or direction associated with them.
Directed Acyclic Graph: Finally, a directed acyclic graph is a directed graph without any cycles. A cycle is just a series of vertices that connect back to each other in a closed chain.
In Airflow, each node in a DAG (soon to be known as a task) represents some form of data processing:
Node A could be the code for pulling data out of an API.
Node B could be the code for anonymizing the data and dropping any IP address.
Node D could be the code for checking that no duplicate record ids exist.
Node E could be putting that data into a database.
Node F could be running a SQL query on the new tables to update a dashboard.
Each of the verticies have a specific direction showing the relationship between nodes - data can only follow the direction of the vertices (from the example above, the IP addresses cannot be anonymized until the data has been pulled).
Node B is downstream from Node A - it won't execute until Node A finishes.
Workflows, particularly around those processing data, have to have a point of "completion." This especially holds true in batch architectures to be able to say that a certain "batch" ran successfully.
DAGs are a natural fit for batch architecture - they allow you to model natural dependencies that come up in data processing without and force you to architect your workflow with a sense of "completion."
Directed - If multiple tasks exist, each must have at least one defined upstream (previous) or downstream (subsequent) tasks, although they could easily have both.
Acyclic - No task can create data that goes on to reference itself. This could cause an infinite loop that would be, um, it’d be bad. Don’t do that.
Graph - All tasks are laid out in a clear structure with discrete processes occurring at set points and clear relationships made to other tasks.
Repeatable inputs for repeatable output
Best practices emerging around data engineering are pointing towards concepts found in functional programming. Ideas around immutability, isolation, and idempotency that are defining characteristics of good functional programming are very natural fits into good ETL architecture.
This concept will be stressed throughout everything - idempotency is one of, if not the most, important characteristic of good ETL architecture.
Something is idempotent if it will produce the same result regardless of how many times it is run. In mathematical terms:
$f(x) = f(f(x)) = f(f(f(x)))...$
Idempotency usually hand in hand with reproducability - a set of inputs always produces the same set of outputs
Making ETL jobs idempotent can be easier for workflows more than others - if it's just a SQL load for a days data, implementing upsert logic is pretty easy. If it's a file that's dropped on an externally controlled FTP that is not there for long, it is a little more difficult.
However, the initial investment is usually worth it for safety, operability, and modularity.
There should be a clear and intuitive association between tables, intermediate files, and all other levels of your data. DAGs are a natural fit here, as every task can have an exclusive target that does not propagate into another tasks target without a direct dependency being set.
Furthermore, this association should filter down into all metadata - logs, runtimes,
Mathematically, this idea of clarity and directness is intuitive:
$f(x) = y$
Do Airflow the easy way.