Description
The databricks provider can be updated to use deferrable tasks, introduced in Airflow 2.2.0. This will significantly reduce CPU and memory usage on LocalExecutor and improve reliability, since state is stored in the Airflow metastore.
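A minimal sketch of what the deferrable operator could look like; the class name, the `DatabricksRunTrigger` (sketched further below), and the `_submit_run` helper are all hypothetical illustrations, not the provider's actual API:

```python
from airflow.models.baseoperator import BaseOperator


class DatabricksSubmitRunDeferrableOperator(BaseOperator):
    """Hypothetical sketch: submit a Databricks run, then defer,
    freeing the worker slot while the run is in progress."""

    def execute(self, context):
        # Submit via the REST API; the run_id is handed to the trigger
        # and serialized into the metastore, so a crash after this point
        # cannot cause a duplicate submission.
        run_id = self._submit_run()  # hypothetical helper
        self.defer(
            trigger=DatabricksRunTrigger(run_id=run_id),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Invoked when the trigger fires; no worker process was held
        # while the Databricks run was executing.
        if event["state"] != "SUCCESS":
            raise RuntimeError(f"Databricks run failed: {event}")
```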
Use case/motivation
Currently the databricks operators work by calling the Databricks REST API to submit a Spark job and polling it every 20 seconds to check whether it is done. This creates the following inefficiencies (see the trigger sketch after this list):
- If the Airflow executor process crashes, a duplicate job can be created in Databricks, since Airflow doesn't save the Databricks job run id.
- Each task instance of the databricks operators runs in its own process, yet 95% of the time that process is idle, waiting to re-poll the Databricks API. If you want to run 50 jobs in parallel, you will use a non-trivial amount of memory and CPU just to poll a REST API.
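For contrast, a minimal sketch of the matching trigger; `DatabricksRunTrigger`, its module path, and the `_get_run_state` helper are hypothetical. The point is that all runs poll on the triggerer's shared asyncio event loop instead of one idle process each, and the run id survives restarts via `serialize()`:

```python
import asyncio

from airflow.triggers.base import BaseTrigger, TriggerEvent


class DatabricksRunTrigger(BaseTrigger):
    """Hypothetical sketch: polls a Databricks run asynchronously in
    the triggerer, so hundreds of runs share one process."""

    def __init__(self, run_id: int, poll_interval: int = 20):
        super().__init__()
        self.run_id = run_id
        self.poll_interval = poll_interval

    def serialize(self):
        # Classpath plus kwargs, persisted in the Airflow metastore;
        # this is why the run_id is never lost on a crash or restart.
        return (
            "databricks_triggers.DatabricksRunTrigger",  # hypothetical path
            {"run_id": self.run_id, "poll_interval": self.poll_interval},
        )

    async def run(self):
        while True:
            state = await self._get_run_state()  # hypothetical async API call
            if state in ("SUCCESS", "FAILED"):
                yield TriggerEvent({"run_id": self.run_id, "state": state})
                return
            # Non-blocking sleep: other triggers keep running meanwhile.
            await asyncio.sleep(self.poll_interval)
```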
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct