Adding persistence for partial job failure information by larsgeorge-db · Pull Request #477 · databrickslabs/ucx

larsgeorge-db · 2023-10-19T12:30:42Z

This pull request adds a new file failures.py which defines two classes, ObjectFailureError and ObjectFailure, and a class FailureReporter. The ObjectFailureError class is a custom exception that takes an object type, object ID, and a root cause exception as arguments. The ObjectFailure class is also a custom exception that takes the same arguments as ObjectFailureError, as well as several optional arguments for additional metadata. The FailureReporter class is used to report failures by persisting them in a table using a provided SqlBackend instance. The ObjectFailure instances are converted to strings and persisted using the save_table method of the SqlBackend instance.

In addition to the changes to failures.py, there are also changes to several existing files. In tasks.py, there are new global environment variables WORKFLOW_NAME_ENVKEY, TASK_NAME_ENVKEY, PARENT_RUN_ID_ENVKEY, and JOB_ID_ENVKEY added. These environment variables are used to set the task name, workflow name, parent run ID, and job ID when triggering a task. The trigger function is also updated to set these environment variables before running the task.

In grants.py, there are changes to the GrantsCrawler class. A new FailureReporter instance is created and stored as an instance variable. In the snapshot method, if there are any errors while crawling grants, they are passed to the report method of the FailureReporter instance.

In tables.py, there are changes to the TablesCrawler class. A new FailureReporter instance is created and stored as an instance variable. In the snapshot method, if there are any errors while crawling tables, they are passed to the report method of the FailureReporter instance. In the migrate_tables method, if there are any errors while migrating tables, they are passed to the report method of the FailureReporter instance.

In generic.py, there are changes to the _applier_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In redash.py, there are changes to the _applier_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In scim.py, there are changes to the _applier_task method. If there are any errors while applying properties, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the _applier_task method.

In secrets.py, there are changes to the get_apply_task method. If there are any errors while applying permissions, they are passed to the report method of the FailureReporter instance. The FailureReporter instance is passed as an argument to the get_apply_task method.

In test_failures.py, there are new integration tests added for testing the FailureReporter class. The tests create a FailureReporter instance and then use it to report a failure. The test then checks that the failure was persisted in the table by checking the rows written to the table.

In test_failures.py, there are new unit tests added for testing the FailureReporter class. The tests create a MockBackend instance and a FailureReporter instance. The tests then create a list of ObjectFailure instances and pass them to the report method of the FailureReporter instance. The test then checks that the failure records were written to the table by checking the rows written to the table using the rows_written_for method of the MockBackend instance.

Overall, this pull request adds new classes and methods for reporting and handling failures in the framework. It also updates several existing files to use the new failure reporting functionality. The changes include adding new instance variables, modifying existing methods to use the FailureReporter instance, and adding new tests

Resolves #406
Resolves #445

src/databricks/labs/ucx/framework/logger.py

src/databricks/labs/ucx/hive_metastore/tables.py

nfx · 2023-10-21T14:09:03Z

src/databricks/labs/ucx/framework/dashboards.py

    @property
    def query_key(self):
-        return f"{self.name}:query_id"
+        return f"{self.dashboard_ref}_{self.name}:query_id"


Use ":" as separator, to be consistent

nfx · 2023-10-21T14:11:27Z

src/databricks/labs/ucx/framework/dashboards.py

            self._state = json.load(self._ws.workspace.download(self._query_state))
            to_remove = []
            for k, v in self._state.items():
+                _, name = k.split(":")


This may contain 3 elements now, account for it

Wait what? Since when? Where?

src/databricks/labs/ucx/framework/failures.py

src/databricks/labs/ucx/framework/logger.py

codecov · 2023-10-21T16:56:58Z

Codecov Report

Attention: Patch coverage is 81.54762% with 31 lines in your changes are missing coverage. Please review.

Project coverage is 80.30%. Comparing base (334c9b8) to head (3afbd43).
Report is 300 commits behind head on main.

❗ Current head 3afbd43 differs from pull request most recent head 2b10e2b. Consider uploading reports for the commit 2b10e2b to get more accurate results

Files	Patch %	Lines
src/databricks/labs/ucx/hive_metastore/tables.py	70.83%	6 Missing and 1 partial ⚠️
...rc/databricks/labs/ucx/workspace_access/manager.py	28.57%	5 Missing ⚠️
...rc/databricks/labs/ucx/workspace_access/secrets.py	66.66%	4 Missing and 1 partial ⚠️
src/databricks/labs/ucx/framework/tasks.py	55.55%	4 Missing ⚠️
src/databricks/labs/ucx/workspace_access/scim.py	77.77%	2 Missing and 2 partials ⚠️
src/databricks/labs/ucx/framework/failures.py	96.72%	2 Missing ⚠️
...rc/databricks/labs/ucx/workspace_access/generic.py	84.61%	2 Missing ⚠️
src/databricks/labs/ucx/workspace_access/redash.py	83.33%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #477      +/-   ##
==========================================
+ Coverage   80.25%   80.30%   +0.05%     
==========================================
  Files          31       32       +1     
  Lines        3206     3316     +110     
  Branches      620      629       +9     
==========================================
+ Hits         2573     2663      +90     
- Misses        486      508      +22     
+ Partials      147      145       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

nfx · 2023-10-21T18:21:36Z

src/databricks/labs/ucx/hive_metastore/grants.py

-            # TODO: https://github.com/databrickslabs/ucx/issues/406
+            for _e in errors:
+                self._failure_reporter.report(ObjectFailure.make(_e))
+            self._failure_reporter.flush()


Method flush() will be forgotten to get called

Suggested change

self._failure_reporter.flush()

self._failure_reporter.record(errors)

nfx · 2023-10-21T18:24:31Z

src/databricks/labs/ucx/framework/failures.py

+logger = logging.getLogger(__name__)
+
+
+class ObjectFailureError(Exception):


Suggested change

class ObjectFailureError(Exception):

class ObjectFailure(RuntimeWarning):

nfx · 2023-10-21T18:25:45Z

src/databricks/labs/ucx/framework/failures.py

+@dataclass
+class ObjectFailure:
+    event_time: float = field(default_factory=time.time)
+    step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))


Suggested change

step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))

step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, None))

So that rows have NULL

nfx · 2023-10-21T18:28:48Z

src/databricks/labs/ucx/framework/failures.py

+    step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))
+    task_name: str = field(default_factory=partial(os.environ.get, TASK_NAME_ENVKEY, "n/a"))
+    parent_run_id: str = field(default_factory=partial(os.environ.get, PARENT_RUN_ID_ENVKEY, "n/a"))
+    job_id: str = field(default_factory=partial(os.environ.get, JOB_ID_ENVKEY, "n/a"))


Better design would be to inject failure reporter with pre-filled task/run ids and pass it down via constructor. It's an unclear data flow with environment variables that are not guaranteed to be set

nfx · 2023-10-21T18:29:27Z

src/databricks/labs/ucx/framework/failures.py

+    job_id: str = field(default_factory=partial(os.environ.get, JOB_ID_ENVKEY, "n/a"))
+    object_type: str | None = None
+    object_id: str | None = None
+    error_info: str | None = None


Suggested change

error_info: str | None = None

message: str

Message is always there

nfx · 2023-10-21T18:31:08Z

src/databricks/labs/ucx/framework/failures.py

+        with self._events_lock:
+            self._buffer.append(failure)
+
+    def flush(self):


Flush a list of failures at once, don't batch them

nfx · 2023-10-21T18:32:20Z

src/databricks/labs/ucx/framework/failures.py

+    _buffer: ClassVar[list[ObjectFailure]] = []
+    _events_lock = threading.Lock()
+
+    def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str = "failures"):


Suggested change

def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str = "failures"):

def __init__(self, backend: SqlBackend, schema: str):

Having too many things to configure is bad as well

CLAassistant · 2023-11-27T20:19:43Z

All committers have signed the CLA.

nfx · 2024-03-28T08:31:26Z

Closing this PR in favor of #1148

Modifying code places won't catch 100% errors and warnings, but #1148 will, as it's a fundamentally better approach with a better UC.

This PR has not seen a progress in 5 months and unlikely to complete

larsgeorge-db requested a review from a team October 19, 2023 12:30

larsgeorge-db had a problem deploying to account-admin October 19, 2023 12:30 — with GitHub Actions Failure

larsgeorge-db marked this pull request as draft October 19, 2023 12:32

nfx requested changes Oct 19, 2023

View reviewed changes

larsgeorge-db had a problem deploying to account-admin October 19, 2023 16:23 — with GitHub Actions Failure

larsgeorge-db had a problem deploying to account-admin October 19, 2023 16:26 — with GitHub Actions Failure

larsgeorge-db had a problem deploying to account-admin October 19, 2023 16:31 — with GitHub Actions Failure

larsgeorge-db had a problem deploying to account-admin October 19, 2023 19:41 — with GitHub Actions Failure

larsgeorge-db had a problem deploying to account-admin October 19, 2023 19:44 — with GitHub Actions Failure

larsgeorge-db mentioned this pull request Oct 20, 2023

Added more views to assessment dashboard #474

Merged

larsgeorge-db requested a review from nfx October 20, 2023 06:25

larsgeorge-db had a problem deploying to account-admin October 21, 2023 10:59 — with GitHub Actions Failure

larsgeorge-db changed the title ~~Adding persistence for partial job failures.~~ Adding persistence for partial job failure information. Oct 21, 2023

larsgeorge-db changed the title ~~Adding persistence for partial job failure information.~~ Adding persistence for partial job failure information Oct 21, 2023

larsgeorge-db had a problem deploying to account-admin October 21, 2023 11:06 — with GitHub Actions Failure

nfx requested changes Oct 21, 2023

View reviewed changes

larsgeorge-db had a problem deploying to account-admin October 21, 2023 16:55 — with GitHub Actions Failure

larsgeorge-db added 9 commits October 21, 2023 19:20

Adding persistence for partial job failures.

7664dd1

..

b42703a

..

1fe0e2f

..

366e66d

..

1bc25db

..

5fcb29d

..

9891c35

..

ddfebbb

..

3099cdd

larsgeorge-db force-pushed the fix/issue_406_persist_errors branch from 3fb2e31 to 3099cdd Compare October 21, 2023 17:20

larsgeorge-db had a problem deploying to account-admin October 21, 2023 17:20 — with GitHub Actions Failure

..

5c3feae

larsgeorge-db had a problem deploying to account-admin October 21, 2023 17:24 — with GitHub Actions Failure

..

3afbd43

larsgeorge-db had a problem deploying to account-admin October 21, 2023 17:25 — with GitHub Actions Failure

larsgeorge-db requested a review from nfx October 21, 2023 17:26

nfx requested changes Oct 21, 2023

View reviewed changes

Refactored to simple invocation with error list

2b10e2b

larsgeorge-db had a problem deploying to account-admin October 27, 2023 08:00 — with GitHub Actions Failure

nfx added the tech debt chores and design flaws label Mar 23, 2024

nfx closed this Mar 28, 2024

nfx deleted the fix/issue_406_persist_errors branch April 4, 2024 22:31

	self._failure_reporter.flush()
	self._failure_reporter.record(errors)

		logger = logging.getLogger(__name__)


		class ObjectFailureError(Exception):

	class ObjectFailureError(Exception):
	class ObjectFailure(RuntimeWarning):

	step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, "n/a"))
	step_name: str = field(default_factory=partial(os.environ.get, WORKFLOW_NAME_ENVKEY, None))

	def __init__(self, backend: SqlBackend, catalog: str, schema: str, table: str = "failures"):
	def __init__(self, backend: SqlBackend, schema: str):

Conversation

larsgeorge-db commented Oct 19, 2023 • edited by nfx Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Oct 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CLAassistant commented Nov 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nfx commented Mar 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

larsgeorge-db commented Oct 19, 2023 •

edited by nfx

Loading

codecov bot commented Oct 21, 2023 •

edited

Loading

CLAassistant commented Nov 27, 2023 •

edited

Loading

nfx commented Mar 28, 2024 •

edited

Loading