
enable_elastic_disk property incorrectly mapped when making a request to Databricks #25232

@aru-trackunit

Description

Apache Airflow version

2.2.2

What happened

When using apache-airflow-providers-databricks version 2.2.0, I am sending a request to Databricks to submit a job:
https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate -> api/2.0/jobs/runs/submit

Databricks expects a boolean for the enable_elastic_disk property, while the Databricks provider sends a string. This is the new_cluster spec I pass to the operator:

new_cluster = {
    "autoscale": {"min_workers": 2, "max_workers": 5},
    "spark_version": "10.4.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
    },
    "enable_elastic_disk": True, 
    "driver_node_type_id": "r5a.large",
    "node_type_id": "c5a.xlarge",
    "cluster_source": "JOB",
}

As a result, the enable_elastic_disk property is not set on the Databricks side. I also sent the same request to Databricks from Postman with the body below, and there the property was set to true, which means the problem does not lie on the Databricks side:

{
    "name": "test",
    "tasks": [
        {
            "task_key": "test-task-key",
            "notebook_task": {
                "notebook_path": "path_to_notebook"
            },
            "new_cluster": {
                "autoscale": {"min_workers": 1, "max_workers": 2},
                "cluster_name": "",
                "spark_version": "10.4.x-scala2.12",    
                "aws_attributes": {
                    "first_on_demand": 1,
                    "availability": "SPOT_WITH_FALLBACK",
                    "zone_id": "auto",        
                    "spot_bid_price_percent": 100                    
                },
                "driver_node_type_id": "r5a.large",
                "node_type_id": "c5a.xlarge",
                "enable_elastic_disk": true,                     
                "cluster_source": "JOB"                
            }
        }
    ]    
}
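
For completeness, here is roughly how to reproduce the working Postman call outside Airflow. This is a sketch: DATABRICKS_HOST and DATABRICKS_TOKEN are placeholder environment variables for the workspace host and a personal access token, and the payload is truncated to the relevant part of the body above.

import os
import requests

payload = {
    "tasks": [
        {
            "task_key": "test-task-key",
            "notebook_task": {"notebook_path": "path_to_notebook"},
            "new_cluster": {
                "autoscale": {"min_workers": 1, "max_workers": 2},
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "c5a.xlarge",
                "enable_elastic_disk": True,  # stays a boolean when sent this way
            },
        }
    ],
}

resp = requests.post(
    f"https://{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the run_id of the submitted run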

I tried to track down the problem, and it appears to be this line. Before the line executes, enable_elastic_disk is True of type bool, but afterwards it becomes the string 'True', which Databricks does not parse:

self.json = deep_string_coerce(self.json)
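
The likely mechanism: bool is a subclass of int in Python, so the helper's numeric branch catches True and stringifies it. A minimal sketch of that behavior (simplified for illustration, not the provider's exact source):

def deep_string_coerce_sketch(content):
    if isinstance(content, str):
        return content
    elif isinstance(content, (int, float)):
        # True lands here because isinstance(True, int) is True in Python,
        # so the boolean becomes the string 'True'.
        return str(content)
    elif isinstance(content, (list, tuple)):
        return [deep_string_coerce_sketch(e) for e in content]
    elif isinstance(content, dict):
        return {k: deep_string_coerce_sketch(v) for k, v in content.items()}
    return content

print(deep_string_coerce_sketch({"enable_elastic_disk": True}))
# {'enable_elastic_disk': 'True'}  <- Databricks silently ignores the string

One possible fix would be to special-case isinstance(content, bool) before the numeric branch so booleans pass through unchanged.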

What you think should happen instead

After setting the enable_elastic_disk property, it should be propagated to Databricks, but it is not.

How to reproduce

Try to run:

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

new_cluster = {
    "autoscale": {"min_workers": 2, "max_workers": 5},
    "spark_version": "10.4.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
    },
    "enable_elastic_disk": True, 
    "driver_node_type_id": "r5a.large",
    "node_type_id": "c5a.xlarge",
    "cluster_source": "JOB",
}

notebook_task = {
    "notebook_path": "/Repos/path_to_notebook/main_asset_information",
    "base_parameters": {"env": env},  # `env` is defined elsewhere in the DAG
}

asset_information = DatabricksSubmitRunOperator(
    task_id="task_id",
    databricks_conn_id="databricks",
    new_cluster=new_cluster,
    notebook_task=notebook_task,
)

Make sure an Airflow connection named databricks is set.

After executing, check whether the property is set on the Databricks side by using the endpoint:
https://DATABRICKS_HOST/api/2.1/jobs/runs/get?run_id=123
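
For example, with requests (a sketch; the same DATABRICKS_HOST/DATABRICKS_TOKEN placeholders as above, and 123 standing in for the run_id returned by the submit call):

import os
import requests

resp = requests.get(
    f"https://{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"run_id": 123},
)
resp.raise_for_status()
# Inspect the new_cluster spec in the response: when the bug triggers,
# enable_elastic_disk is missing instead of true.
print(resp.json())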

Operating System

MWAA

Versions of Apache Airflow Providers

apache-airflow-providers-databricks==2.2.0

Deployment

MWAA

Deployment details

No response

Anything else

This is a permanent and reproducible problem. It would be great if the fix could also be backported to lower provider versions, for example 2.2.1, because I am not sure when AWS will upgrade MWAA to the latest Airflow code, and I am also not sure whether installing a higher version of the Databricks provider on Airflow 2.2.2 would cause issues.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
