Introduce ObjectFailure with step, object_type and object_id fields to find the causes of failures quicker for different stages of the workflow #445

@nfx

Description

For example, there is currently a failure in cluster policy retrieval:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/framework/tasks.py", line 143, in trigger
    current_task.fn(cfg)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/runtime.py", line 153, in assess_azure_service_principals
    crawler.snapshot()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/assessment/crawlers.py", line 369, in snapshot
    return self._snapshot(self._try_fetch, self._crawl)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/framework/crawlers.py", line 244, in _snapshot
    loaded_records = list(loader())
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/assessment/crawlers.py", line 182, in _crawl
    all_relevant_service_principals = self._get_relevant_service_principals()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/assessment/crawlers.py", line 277, in _get_relevant_service_principals
    temp_list = self._list_all_jobs_with_spn_in_spark_conf()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/assessment/crawlers.py", line 297, in _list_all_jobs_with_spn_in_spark_conf
    policy = self._ws.cluster_policies.get(cluster_config.policy_id)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/service/compute.py", line 3589, in get
    res = self._api.do('GET', '/api/2.0/policies/clusters/get', query=query, headers=headers)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/core.py", line 1061, in do
    return retryable(self._perform)(method,
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/retries.py", line 47, in wrapper
    raise err
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/retries.py", line 29, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/core.py", line 1150, in _perform
    raise self._make_nicer_error(response=response, **payload) from None
databricks.sdk.core.DatabricksError: Can't find a cluster policy with id: XXXXX.

However, the error does not tell us whether the policy issue belongs to a cluster, a job, or a pipeline, nor which specific object is affected. We need to refactor the exceptions for #406 to be effective.
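A minimal sketch of what such an exception could look like is shown here. Only the name `ObjectFailure` and the fields `step`, `object_type` and `object_id` come from this issue; the `message` argument, the helper functions `get_policy` and `check_job_cluster`, and the use of `LookupError` as a stand-in for `DatabricksError` are assumptions made purely for illustration.

```python
class ObjectFailure(Exception):
    """Hypothetical exception that records which workflow step and object failed."""

    def __init__(self, step: str, object_type: str, object_id: str, message: str = ""):
        self.step = step                # e.g. the task name from the workflow
        self.object_type = object_type  # e.g. "cluster", "job" or "pipeline"
        self.object_id = object_id      # identifier of the affected object
        detail = f"[{step}] {object_type} {object_id} failed"
        if message:
            detail = f"{detail}: {message}"
        super().__init__(detail)


def get_policy(policy_id: str) -> dict:
    # Stand-in for the real policy lookup; it always fails in this demo.
    raise LookupError(f"Can't find a cluster policy with id: {policy_id}.")


def check_job_cluster(job_id: str, policy_id: str) -> dict:
    # Wrap the low-level error so the caller learns which job triggered it.
    try:
        return get_policy(policy_id)
    except LookupError as err:
        raise ObjectFailure(
            "assess_azure_service_principals", "job", job_id, str(err)
        ) from err
```

Chaining with `raise ... from err` keeps the original traceback attached, so the underlying error is still visible while the step and object identity are added on top.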

Metadata

Assignees

No one assigned

    Labels

    cloud/azure: issues related to Azure
    enhancement: New feature or request
    migrate/clusters: go/uc/upgrade Upgrade Interactive Clusters
    step/assessment: go/uc/upgrade - Assessment Step
    tech debt: chores and design flaws
