
Update databricks provider to use TriggerOperator #18999

@chinwobble

Description


The databricks provider can be updated to use the deferrable tasks introduced in Airflow 2.2.0. This will significantly reduce CPU and memory usage on LocalExecutor and improve reliability, since task state is stored in the Airflow metastore.

Use case/motivation

Currently the Databricks operators work by calling the Databricks REST API to submit a Spark job and then polling it every 20 seconds until it finishes. This creates the following inefficiencies:

  • If the Airflow executor process crashes, a duplicate job can be created in Databricks, since Airflow doesn't persist the Databricks job run id.
  • Each task instance of the Databricks operators runs in its own process, yet that process is idle roughly 95% of the time, waiting to re-poll the Databricks API. If you want to run 50 jobs in parallel, you will use a non-trivial amount of memory/CPU just to poll a REST API.
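A deferrable operator avoids this by handing the wait over to an async trigger, so one event loop multiplexes all the polling coroutines instead of one process per task. Below is a minimal pure-Python sketch of that trigger-style polling loop, with no Airflow or Databricks dependency: `get_run_state` is a hypothetical stand-in for the `jobs/runs/get` API call, and the fake state machine simply reports `TERMINATED` after three polls.

```python
import asyncio

# Hypothetical stand-in for the Databricks "get run" API call; in a real
# trigger this would be an async HTTP request to /api/2.1/jobs/runs/get.
# It fakes a run that finishes on the third poll.
async def get_run_state(run_id, _poll_counts={}):
    _poll_counts[run_id] = _poll_counts.get(run_id, 0) + 1
    return "TERMINATED" if _poll_counts[run_id] >= 3 else "RUNNING"

async def wait_for_run(run_id, poll_interval=0.01):
    # One lightweight coroutine per run; asyncio.sleep yields control,
    # so many runs share a single process instead of one process each.
    while True:
        state = await get_run_state(run_id)
        if state == "TERMINATED":
            return run_id, state
        await asyncio.sleep(poll_interval)

async def main():
    # Poll 50 runs concurrently inside one event loop.
    return await asyncio.gather(*(wait_for_run(i) for i in range(50)))

results = asyncio.run(main())
print(len(results), results[0])  # → 50 (0, 'TERMINATED')
```

In Airflow terms, the operator would submit the job, defer itself with a trigger wrapping a loop like `wait_for_run`, and resume only when the trigger fires, freeing the worker slot for the entire wait.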

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct
