Why should you consider Airflow for your data platform?
Having wrapped up three weeks of research and development in our team, I'd like to share the criteria that led us to choose Airflow over other interesting workflow orchestration platforms.
When building a process for data transfer, scanning, wrangling and cleaning, it's essential to have an easy learning path and to be able to treat the orchestration of workflows and the releasing of tasks as separate procedures. Since this is a new component in our ecosystem, we need to make sure that the total cost of ownership (TCO) is low and that we won't invest too much of our limited resources in maintaining it. We also wanted a tool capable of orchestrating workflows in a language-agnostic way, literally meaning that tasks written in different languages can be chained into the same workflow.
I should also describe the infrastructure environment we took into consideration while researching the available tools and techniques. We don't use Kubernetes and cannot adopt tools that depend on its features. We also don't have a dedicated DevOps engineer who could build out infrastructure, so our deployment pipelines have to be short, clean and quite simple. We're a purely cross-functional team that takes everything on its own shoulders.
Before the research, we selected several candidates from our technical stack:
- Cron jobs
- .NET tools such as Hangfire and Quartz.NET
And analyzed them against the following criteria:
- separated concerns of workflow orchestration and task releasing;
- extensive control, logging and monitoring of tasks;
- a short learning curve for starting development;
- minimal effort to understand, deploy and maintain the configuration.
In order to stay within our technical stack, we even discussed building such a tool in .NET, but the required effort was too great. In the end, we concluded the following about the three possible solutions:
- Jenkins as a workflow runner platform is good. But it is in no way a tool for orchestrating complex scheduled workflows, like those required for data processing. It also requires integrating other tools to implement your workflows.
- Kubernetes and cron jobs offer scheduled routines. That's it. You'd need to develop your own subsystem to organize workflows: almost the same problem as with Jenkins, just from the opposite side.
- Hangfire, Quartz.NET and similar tools are in the same position as Kubernetes jobs: they offer scheduling capability without real orchestration.
All of these solutions would require additional investment in developing an orchestration platform that provides monitoring and logging and allows you to run routines in a distributed manner, chain steps, create dependencies between them, and much more.
Airflow and alternatives
Because the tools we could choose from did not offer the required functionality, Airflow came to the table. Before making the final choice, we had to check a few more things, specifically its alternatives. The short list was as follows: Luigi, Argo, MLFlow, KubeFlow. While comparing these tools I read a lot of articles, but one I liked more than the others, and the link to it is below.
Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow
Choosing a task orchestration tool
I won't review those tools here, since there are plenty of such reviews on the internet. But I'll add a few words about Airflow vs. Luigi. Both are very good tools, and in the beginning I even considered Luigi as the tool to go with. But Airflow has a clearer, DSL-like manner of defining DAGs, which is very important for development and maintenance. For us it is also important that there are ready-made Docker images and manuals for containerizing an Airflow deployment, and that it is supported on GCP via the Composer service. These facts, along with other aspects such as a more extensive feature set, a bigger community and popularity in the industry, made Airflow the winner for us.
How we use Airflow
There were several constraints our team had to comply with while deploying Airflow: #1 not using Kubernetes and #2 limited human resources.
IaC & Automation
We have a custom Docker image and a docker-compose specification. With them we can deploy Airflow to any machine in a few minutes, so we have two instances, one for production and one for testing, and we can also deploy it locally. The deployment requires a database for Airflow metadata; we use PostgreSQL, which is also deployed in Docker. Several Jenkins pipelines build Airflow itself and deploy shared Python packages and DAGs.
LocalExecutor vs. CeleryExecutor vs. KubernetesExecutor
As their names suggest, the different executors run jobs in different environments. LocalExecutor runs workflows on the same machine where Airflow is deployed. This is the smallest possible production-ready deployment; we started with it and moved to a distributed deployment later.
The difference between the Celery and Kubernetes executors is that the former requires a message broker, which can be provisioned with Redis, RabbitMQ or another backend, while the latter launches each task in its own Kubernetes pod.
Executor - Airflow Documentation
In later articles I'll touch more on our experience with Airflow.
Airflow in a few words
If you haven’t heard a lot about Airflow then below you can get some insights.
Some important concepts:
- DAGs — scheduled workflows are called DAGs (Directed Acyclic Graphs). A DAG may consist of many tasks; basically, only resources limit their number (for example, in real life there are DAGs of 200+ tasks).
- Tasks — they determine how a single routine runs within a DAG. You have full freedom to chain tasks and control the data flow through them; tasks can produce and consume data from related tasks. They are implemented in Python and usually utilize standard or custom Operators.
- XComs — small messages (no more than 48 KB) which can be passed from task to task during workflow execution.
- Operators — they determine what should be done, the “work” that needs to be executed. There are multiple standard operators which allow you to execute bash or Python scripts, SQL statements, Docker images and much more, and you can write your own operators.
Complete information about these concepts is available at https://airflow.apache.org/docs/apache-airflow/stable/concepts.html
The DAG itself is described in Python, but it may start a Docker image containing a routine written in any language. In this sense Airflow is just an orchestration tool which executes stateless steps. This aspect allows you to separate the releasing and deploying of routines from their orchestration.
Another useful aspect is that Airflow helps to keep task configuration outside of the tasks themselves. Connection strings and settings can be stored in Airflow metadata, and tasks can retrieve them via the Airflow API.
And of course, Airflow can be integrated with the ELK stack, and its metrics can be captured by a Prometheus exporter and shown in Grafana.