Airflow is a platform to programmatically author, schedule, and monitor workflows. These functions are achieved with directed acyclic graphs (DAGs) of tasks. It is an open-source project, started in 2014 under the umbrella of Airbnb and still in the Apache Incubator stage at the time of writing; since then it has earned an excellent reputation, with approximately 800 contributors and 13,000 stars on GitHub.
Apache Airflow is a workflow (data-pipeline) management system developed by Airbnb. It is used by more than 200 companies, such as Airbnb, Yahoo, PayPal, Intel, Stripe, and many more.
In Airflow, everything revolves around workflow objects implemented as _directed acyclic graphs_ (DAGs). For example, such a workflow can involve the merging of multiple data sources and the subsequent execution of an analysis script. Airflow takes care of scheduling the tasks while respecting their internal dependencies, and it orchestrates the systems involved.
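As a hedged sketch of such a workflow, the merge-then-analyze example above could be expressed as a two-task DAG like this (the DAG id, task names, and callables are hypothetical, and import paths differ slightly across Airflow versions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def merge_sources():
    # Placeholder for merging multiple data sources.
    print("merging data sources")


def run_analysis():
    # Placeholder for the analysis script.
    print("running analysis")


with DAG(
    dag_id="merge_and_analyze",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    merge = PythonOperator(task_id="merge_sources", python_callable=merge_sources)
    analyze = PythonOperator(task_id="run_analysis", python_callable=run_analysis)

    # The >> operator declares the dependency: analysis runs only after the merge.
    merge >> analyze
```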
What is a Workflow?
A workflow is a sequence of tasks that is started on a schedule or triggered by an event. It is frequently used to handle big-data processing pipelines.
A typical workflow diagram
There are five phases in total in any typical workflow.
First, we download data from the source.
Then, we send that data somewhere else for processing.
When the processing is complete, we get the result, a report is generated, and the report is sent out by email.
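Assuming Airflow 2.x-style imports, this pipeline might be sketched as a DAG along these lines; every task id, callable, and the recipient address are hypothetical, and EmailOperator additionally needs SMTP settings configured in airflow.cfg:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="typical_workflow",           # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    download = PythonOperator(
        task_id="download_data",
        python_callable=lambda: print("downloading from source"),
    )
    process = PythonOperator(
        task_id="process_data",
        python_callable=lambda: print("processing data"),
    )
    report = PythonOperator(
        task_id="generate_report",
        python_callable=lambda: print("generating report"),
    )
    email = EmailOperator(
        task_id="email_report",
        to="team@example.com",            # hypothetical recipient
        subject="Workflow report",
        html_content="The report has been generated.",
    )

    download >> process >> report >> email
```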
Working of Apache Airflow
There are four main components that make up this robust and scalable workflow scheduling platform:
Scheduler: The scheduler monitors all DAGs and their associated tasks. It periodically checks for active tasks that are ready to be initiated.
Web server: The web server is Airflow’s user interface. It shows the status of jobs and allows the user to interact with the databases and read log files from remote file stores, like Google Cloud Storage, Microsoft Azure blobs, etc.
Database: The state of the DAGs and their associated tasks is saved in the database to ensure the schedule remembers metadata information. Airflow uses SQLAlchemy and object-relational mapping (ORM) to connect to the metadata database (see the configuration sketch after this list). The scheduler examines all of the DAGs and stores pertinent information, like schedule intervals, statistics from each run, and task instances.
Executor: There are different types of executors to use for different use cases; which one Airflow uses is set in its configuration, as sketched after this list. Examples of executors:
SequentialExecutor: This executor can run a single task at any given time. It cannot run tasks in parallel. It’s helpful in testing or debugging situations.
CeleryExecutor: This executor is the favored way to run a distributed Airflow cluster.
KubernetesExecutor: This executor calls the Kubernetes API to create temporary pods for each of the task instances to run.
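As a hedged sketch, both the executor and the metadata-database connection are chosen in airflow.cfg; the connection string below is a placeholder, and the exact option names and sections vary between Airflow releases:

```ini
[core]
# Which executor runs the tasks; SequentialExecutor is the default
# when Airflow is installed with its bundled SQLite database.
executor = CeleryExecutor

# SQLAlchemy connection string for the metadata database (this option
# lives in a separate [database] section in newer Airflow releases).
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
```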
So, how does Airflow work?
Airflow examines all the DAGs in the background at a certain interval.
This interval is set using the processor_poll_interval config option and is equal to one second.
Task instances are instantiated for tasks that need to be performed, and their status is set to SCHEDULED in the metadata database.
The scheduler queries the database, retrieves tasks in the SCHEDULED state, and distributes them to the executors.
The state of those tasks changes to QUEUED.
Workers pull the queued tasks from the queue and execute them.
When this happens, the task status changes to RUNNING.
When a task finishes, the worker marks it as failed or finished, and the scheduler then updates the final status in the metadata database.
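As a hedged sketch, that polling period lives in airflow.cfg; the option sits in the [scheduler] section in the releases this article describes and has been renamed in more recent ones:

```ini
[scheduler]
# How often (in seconds) the scheduler wakes up to scan DAGs for work.
processor_poll_interval = 1
```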
Features
Easy to Use: If you have a bit of Python knowledge, you are good to go and can deploy workflows on Airflow.
Open Source: It is free and open-source with a lot of active users.
Robust Integrations: It gives you ready-to-use operators so that you can work with Google Cloud Platform, Amazon AWS, Microsoft Azure, etc. (see the sensor sketch after this list).
Amazing User Interface: You can monitor and manage your workflows. It will allow you to check the status of completed and ongoing tasks.
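As one hedged illustration of such an integration, the Amazon provider package ships an S3KeySensor that waits for an object to appear in S3; the bucket, key, and connection id below are placeholders, and the import path has moved between provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="s3_integration_example",     # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-bucket",          # hypothetical bucket
        bucket_key="incoming/data.csv",   # hypothetical object key
        aws_conn_id="aws_default",        # connection configured in Airflow
    )
```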
Principles
Dynamic: Airflow pipelines are configuration as code (Python), which means you can write ordinary code that instantiates pipelines dynamically.
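A minimal sketch of what that enables: because a DAG file is plain Python, a loop can stamp out one task per item in a list (the table names and DAG id here are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["users", "orders", "payments"]  # hypothetical source tables

with DAG(
    dag_id="dynamic_exports",             # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Ordinary Python drives task creation, so the pipeline grows and
    # shrinks with the TABLES list without rewriting the DAG by hand.
    for table in TABLES:
        PythonOperator(
            task_id=f"export_{table}",
            python_callable=lambda t=table: print(f"exporting {t}"),
        )
```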