Skip to content

Adding persistence for partial job failure information#477

Closed
larsgeorge-db wants to merge 12 commits intomainfrom
fix/issue_406_persist_errors
Closed

Adding persistence for partial job failure information#477
larsgeorge-db wants to merge 12 commits intomainfrom
fix/issue_406_persist_errors

Conversation

@larsgeorge-db
Copy link
Copy Markdown
Contributor

@larsgeorge-db larsgeorge-db commented Oct 19, 2023

This pull request adds a new file failures.py which defines two classes, ObjectFailureError and ObjectFailure, and a class FailureReporter. The ObjectFailureError class is a custom exception that takes an object type, object ID, and a root cause exception as arguments. The ObjectFailure class is also a custom exception that takes the same arguments as ObjectFailureError, as well as several optional arguments for additional metadata. The FailureReporter class is used to report failures by persisting them in a table using a provided SqlBackend instance. The ObjectFailure instances are converted to strings and persisted using the save_table method of the SqlBackend instance.

In addition to the changes to failures.py, there are also changes to several existing files. In tasks.py, there are new global environment variables WORKFLOW_NAME_ENVKEY, TASK_NAME_ENVKEY, PARENT_RUN_ID_ENVKEY, and JOB_ID_ENVKEY added. These environment variables are used to set the task name, workflow name, parent run ID, and job ID when triggering a task. The trigger function is also updated to set these environment variables before running the task.

In grants.py, there are changes to the GrantsCrawler class. A new FailureReporter instance is created and stored as an instance variable. In the snapshot method, if there are any errors while crawling grants, they are passed to the report method of the FailureReporter instance.

In tables.py, there are changes to the TablesCrawler class. A new FailureReporter instance is created and stored as an instance variable. In the snapshot method, if there are any errors while crawling tables, they are passed to the report method of the FailureReporter instance. In the migrate_tables method, if there are any errors while migrating tables, they are passed to the report method of the FailureReporter instance.

In generic.py, there are changes to the _applier_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In redash.py, there are changes to the _applier_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In scim.py, there are changes to the _applier_task method. If there are any errors while applying properties, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In secrets.py, there are changes to the get_apply_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the get_apply_task method.

In test_failures.py, there are new integration tests added for testing the FailureReporter class. The tests create a FailureReporter instance and then use it to report a failure. The test then checks that the failure was persisted in the table by checking the rows written to the table.

In test_failures.py, there are new unit tests added for testing the FailureReporter class. The tests create a MockBackend instance and a FailureReporter instance. The tests then create a list of ObjectFailure instances and pass them to the report method of the FailureReporter instance. The test then checks that the failure records were written to the table by checking the rows written to the table using the rows_written_for method of the MockBackend instance.

Overall, this pull request adds new classes and methods for reporting and handling failures in the framework. It also updates several existing files to use the new failure reporting functionality. The changes include adding new instance variables, modifying existing methods to use the FailureReporter instance, and adding new tests

Resolves #406
Resolves #445

@larsgeorge-db larsgeorge-db requested a review from a team October 19, 2023 12:30
@larsgeorge-db larsgeorge-db marked this pull request as draft October 19, 2023 12:32
@larsgeorge-db larsgeorge-db requested a review from nfx October 20, 2023 06:25
@larsgeorge-db larsgeorge-db changed the title Adding persistence for partial job failures. Adding persistence for partial job failure information. Oct 21, 2023
@larsgeorge-db larsgeorge-db changed the title Adding persistence for partial job failure information. Adding persistence for partial job failure information Oct 21, 2023
@property
def query_key(self):
return f"{self.name}:query_id"
return f"{self.dashboard_ref}_{self.name}:query_id"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ":" as separator, to be consistent

self._state = json.load(self._ws.workspace.download(self._query_state))
to_remove = []
for k, v in self._state.items():
_, name = k.split(":")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may contain 3 elements now, account for it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait what? Since when? Where?

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 21, 2023

Codecov Report

Attention: Patch coverage is 81.54762% with 31 lines in your changes are missing coverage. Please review.

Project coverage is 80.30%. Comparing base (334c9b8) to head (3afbd43).
Report is 300 commits behind head on main.

❗ Current head 3afbd43 differs from pull request most recent head 2b10e2b. Consider uploading reports for the commit 2b10e2b to get more accurate results

Files Patch % Lines
src/databricks/labs/ucx/hive_metastore/tables.py 70.83% 6 Missing and 1 partial ⚠️
...rc/databricks/labs/ucx/workspace_access/manager.py 28.57% 5 Missing ⚠️
...rc/databricks/labs/ucx/workspace_access/secrets.py 66.66% 4 Missing and 1 partial ⚠️
src/databricks/labs/ucx/framework/tasks.py 55.55% 4 Missing ⚠️
src/databricks/labs/ucx/workspace_access/scim.py 77.77% 2 Missing and 2 partials ⚠️
src/databricks/labs/ucx/framework/failures.py 96.72% 2 Missing ⚠️
...rc/databricks/labs/ucx/workspace_access/generic.py 84.61% 2 Missing ⚠️
src/databricks/labs/ucx/workspace_access/redash.py 83.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #477      +/-   ##
==========================================
+ Coverage   80.25%   80.30%   +0.05%     
==========================================
  Files          31       32       +1     
  Lines        3206     3316     +110     
  Branches      620      629       +9     
==========================================
+ Hits         2573     2663      +90     
- Misses        486      508      +22     
+ Partials      147      145       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

# TODO: https://github.com/databrickslabs/ucx/issues/406
for _e in errors:
self._failure_reporter.report(ObjectFailure.make(_e))
self._failure_reporter.flush()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method flush() will be forgotten to get called

Suggested change
self._failure_reporter.flush()
self._failure_reporter.record(errors)

logger = logging.getLogger(__name__)


class ObjectFailureError(Exception):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ObjectFailureError(Exception):
class ObjectFailure(RuntimeWarning):

@dataclass
class ObjectFailure:
event_time: float = field(default_factory=time.time)
step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))
step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, None))

So that rows have NULL

step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))
task_name: str = field(default_factory=partial(os.environ.get, TASK_NAME_ENVKEY, "n/a"))
parent_run_id: str = field(default_factory=partial(os.environ.get, PARENT_RUN_ID_ENVKEY, "n/a"))
job_id: str = field(default_factory=partial(os.environ.get, JOB_ID_ENVKEY, "n/a"))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better design would be to inject failure reporter with pre-filled task/run ids and pass it down via constructor. It's an unclear data flow with environment variables that are not guaranteed to be set

job_id: str = field(default_factory=partial(os.environ.get, JOB_ID_ENVKEY, "n/a"))
object_type: str | None = None
object_id: str | None = None
error_info: str | None = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error_info: str | None = None
message: str

Message is always there

with self._events_lock:
self._buffer.append(failure)

def flush(self):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flush a list of failures at once, don't batch them

_buffer: ClassVar[list[ObjectFailure]] = []
_events_lock = threading.Lock()

def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str = "failures"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str = "failures"):
def __init__(self, backend: SqlBackend, schema: str):

Having too many things to configure is bad as well

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Nov 27, 2023

CLA assistant check
All committers have signed the CLA.

@nfx nfx added the tech debt chores and design flaws label Mar 23, 2024
@nfx
Copy link
Copy Markdown
Collaborator

nfx commented Mar 28, 2024

Closing this PR in favor of #1148

Modifying code places won't catch 100% errors and warnings, but #1148 will, as it's a fundamentally better approach with a better UC.

This PR has not seen a progress in 5 months and unlikely to complete

@nfx nfx closed this Mar 28, 2024
@nfx nfx deleted the fix/issue_406_persist_errors branch April 4, 2024 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tech debt chores and design flaws

Projects

None yet

3 participants