Description
Apache Airflow version
2.2.2
What happened
When using apache-airflow-providers-databricks version 2.2.0, I am sending a request to Databricks to submit a job:
https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate -> api/2.0/jobs/runs/submit
Databricks expects a boolean for the enable_elastic_disk property, while the Databricks provider sends a string. The new_cluster spec passed to the operator is:
new_cluster = {
    "autoscale": {"min_workers": 2, "max_workers": 5},
    "spark_version": "10.4.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
    },
    "enable_elastic_disk": True,
    "driver_node_type_id": "r5a.large",
    "node_type_id": "c5a.xlarge",
    "cluster_source": "JOB",
}
As a result, the enable_elastic_disk property is not set on the Databricks side. I also sent the same request to Databricks from Postman, and there the property was set to true, which means the problem does not lie on the Databricks side. The Postman request body was:
{
    "name": "test",
    "tasks": [
        {
            "task_key": "test-task-key",
            "notebook_task": {
                "notebook_path": "path_to_notebook"
            },
            "new_cluster": {
                "autoscale": {"min_workers": 1, "max_workers": 2},
                "cluster_name": "",
                "spark_version": "10.4.x-scala2.12",
                "aws_attributes": {
                    "first_on_demand": 1,
                    "availability": "SPOT_WITH_FALLBACK",
                    "zone_id": "auto",
                    "spot_bid_price_percent": 100
                },
                "driver_node_type_id": "r5a.large",
                "node_type_id": "c5a.xlarge",
                "enable_elastic_disk": true,
                "cluster_source": "JOB"
            }
        }
    ]
}
I have tried to track down the problem, and it apparently is this line. Before the line executes, enable_elastic_disk is True (of type boolean), but afterwards it becomes the string 'True', which Databricks does not parse:
self.json = deep_string_coerce(self.json)
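For illustration, here is a minimal sketch of how such a coercion produces the string (my simplified reconstruction, not the provider's exact code). The key point is that bool is a subclass of int in Python, so a branch that stringifies numeric values also stringifies booleans:

def deep_string_coerce(content):
    if isinstance(content, str):
        return content
    elif isinstance(content, (int, float)):
        # isinstance(True, int) is True, so booleans land in this branch
        # and True is stringified to 'True'
        return str(content)
    elif isinstance(content, (list, tuple)):
        return [deep_string_coerce(e) for e in content]
    elif isinstance(content, dict):
        return {k: deep_string_coerce(v) for k, v in content.items()}
    return content

print(deep_string_coerce({"enable_elastic_disk": True}))
# {'enable_elastic_disk': 'True'}, the string Databricks cannot parse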
What you think should happen instead
After setting the enable_elastic_disk property, its boolean value should be propagated to Databricks, but it is not.
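For reference, a possible fix (my assumption of what a patch could look like, not a confirmed change) would be to return booleans untouched before the numeric branch of the coercion sketched above:

def deep_string_coerce(content):
    if isinstance(content, (str, bool)):
        # leave strings and booleans as-is so the API receives true/false
        return content
    elif isinstance(content, (int, float)):
        return str(content)
    elif isinstance(content, (list, tuple)):
        return [deep_string_coerce(e) for e in content]
    elif isinstance(content, dict):
        return {k: deep_string_coerce(v) for k, v in content.items()}
    return content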
How to reproduce
Try to run:
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

new_cluster = {
    "autoscale": {"min_workers": 2, "max_workers": 5},
    "spark_version": "10.4.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
    },
    "enable_elastic_disk": True,
    "driver_node_type_id": "r5a.large",
    "node_type_id": "c5a.xlarge",
    "cluster_source": "JOB",
}
env = "dev"  # example value; any string works for the repro
notebook_task = {
    "notebook_path": "/Repos/path_to_notebook/main_asset_information",
    "base_parameters": {"env": env},
}
asset_information = DatabricksSubmitRunOperator(
    task_id="task_id",
    databricks_conn_id="databricks",
    new_cluster=new_cluster,
    notebook_task=notebook_task,
)
Make sure an Airflow connection named databricks is set. After executing, check whether the property is set on the Databricks side by calling the endpoint:
https://DATABRICKS_HOST/api/2.1/jobs/runs/get?run_id=123
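For example, with the requests library (DATABRICKS_HOST, DATABRICKS_TOKEN and run_id=123 are placeholders for your workspace values):

import requests

# Placeholder host, token and run_id; substitute your own values.
resp = requests.get(
    "https://DATABRICKS_HOST/api/2.1/jobs/runs/get",
    params={"run_id": 123},
    headers={"Authorization": "Bearer DATABRICKS_TOKEN"},
)
# Inspect the returned new_cluster spec for enable_elastic_disk.
print(resp.json())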
Operating System
MWAA
Versions of Apache Airflow Providers
apache-airflow-providers-databricks==2.2.0
Deployment
MWAA
Deployment details
No response
Anything else
This is a permanent and reproducible problem. It would be great if the fix could also be backported to lower provider versions, for example 2.2.1, because I am not sure when AWS will upgrade MWAA to the latest Airflow code, and I am also not sure whether installing higher versions of the Databricks provider on Airflow 2.2.2 would cause issues.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct