Skip to content

Added databricks labs ucx repair-run --step ... CLI command for repair run of any failed workflows, like assessment, migrate-groups etc.#724

Merged
nfx merged 14 commits intomainfrom
feature/retry_run_logic_cli
Dec 28, 2023
Merged

Added databricks labs ucx repair-run --step ... CLI command for repair run of any failed workflows, like assessment, migrate-groups etc.#724
nfx merged 14 commits intomainfrom
feature/retry_run_logic_cli

Conversation

@prajin-29
Copy link
Copy Markdown
Contributor

This may close #658

@prajin-29 prajin-29 requested review from a team and priyal-c December 22, 2023 02:28
@codecov
Copy link
Copy Markdown

codecov bot commented Dec 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (24d4acb) 79.45% compared to head (5bc0dc8) 79.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #724      +/-   ##
==========================================
+ Coverage   79.45%   79.58%   +0.13%     
==========================================
  Files          42       42              
  Lines        4517     4546      +29     
  Branches      845      849       +4     
==========================================
+ Hits         3589     3618      +29     
  Misses        710      710              
  Partials      218      218              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Copy Markdown
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • change the title of the pull request to be changelog-ready. See other commit messages on main branch for consistency examples

Comment thread src/databricks/labs/ucx/install.py Outdated

def repair_run(self, workflow):
try:
job_id_list = [job_id for step, job_id in self._state.jobs.items() if workflow == step]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one job id per step, use job_id = state.jobs.get(step), not the list comprehension

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed accordingly

Comment thread src/databricks/labs/ucx/install.py Outdated
if job_id_list:
job_id = job_id_list[0]
job_runs = list(self._ws.jobs.list_runs(job_id=job_id, limit=1))
state = job_runs[0].state
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: will crash when there are no runs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code. If there is no runs it will throw a warning message

Comment thread src/databricks/labs/ucx/install.py Outdated
job_runs = list(self._ws.jobs.list_runs(job_id=job_id, limit=1))
state = job_runs[0].state
if job_runs and state.result_state.value != "SUCCESS":
run_id = job_runs[0].run_id
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract first run into a variable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code now.

Comment thread src/databricks/labs/ucx/install.py Outdated
if job_runs and state.result_state.value != "SUCCESS":
run_id = job_runs[0].run_id
self._ws.jobs.repair_run(run_id=run_id, rerun_all_failed_tasks=True)
logger.info("Exception")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for testing.Removed now from the code

Comment thread src/databricks/labs/ucx/install.py Outdated
def repair_run(self, workflow):
try:
job_id_list = [job_id for step, job_id in self._state.jobs.items() if workflow == step]
if job_id_list:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you invert the logic a bit: has to be "if job id is not found, show a warning and return immediately". Don't delay that to an else statement. See contributing.md - I think i added explanation why

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inverted the logic.

@nfx
Copy link
Copy Markdown
Collaborator

nfx commented Dec 22, 2023

Can you also add an integration test?

Comment thread src/databricks/labs/ucx/install.py Outdated
job_runs = list(self._ws.jobs.list_runs(job_id=job_id, limit=1))
if not job_runs:
logger.warning(f"{workflow} job is not initialized yet. Can't trigger repair run now")
else:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need "else" here, just de-dent the next code block

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed according

Comment thread src/databricks/labs/ucx/install.py
Comment thread tests/integration/test_installation.py Outdated
@retried(on=[NotFound, InvalidParameterValue, OperationFailed], timeout=timedelta(minutes=10))
def test_repair_run_assessment_job(ws, new_installation, caplog):
install = new_installation()
install.run_workflow("assessment")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easier to "break" the destroy database workflow, where you can inject failure by removing the database before starting the job

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used destroy database workflow to test the repair run in Integration test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-12-28 at 2 52 59 PM

@nfx
Copy link
Copy Markdown
Collaborator

nfx commented Dec 27, 2023

Make pull request title changelog-friendly

@prajin-29 prajin-29 changed the title Adding cli command for repair run Added New CLI command to repair run a workflow Dec 28, 2023
Comment thread src/databricks/labs/ucx/install.py Outdated
if state.result_state.value != "SUCCESS":
run_id = latest_job_run.run_id
job_run_waiter = self._ws.jobs.repair_run(run_id=run_id, rerun_all_failed_tasks=True)
job_run_waiter.result()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would timeout in 20 minutes. What you should do instead is construct a url of this new run and open it in browser and return from cli

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed calling the result.Instead opening the run url in browser.

# Conflicts:
#	labs.yml
#	src/databricks/labs/ucx/cli.py
@prajin-29 prajin-29 changed the title Added New CLI command to repair run a workflow Added New CLI command to repair run any Failed workflows(assessment,migrate-groups etc.) Dec 28, 2023
@prajin-29 prajin-29 changed the title Added New CLI command to repair run any Failed workflows(assessment,migrate-groups etc.) Added databricks labs ucx repair-run --step ... to CLI commands for repair running any failed workflows(assessment,migrate-groups etc) Dec 28, 2023
@nfx nfx changed the title Added databricks labs ucx repair-run --step ... to CLI commands for repair running any failed workflows(assessment,migrate-groups etc) Added databricks labs ucx repair-run --step ... CLI command for repair run of any failed workflows, like assessment, migrate-groups etc. Dec 28, 2023
@prajin-29
Copy link
Copy Markdown
Contributor Author

@nfx All the comments has been implemented.Kindly review the same.

Copy link
Copy Markdown
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@nfx nfx merged commit 5ff8bc1 into main Dec 28, 2023
@nfx nfx deleted the feature/retry_run_logic_cli branch December 28, 2023 16:02
nfx added a commit that referenced this pull request Dec 28, 2023
* Added `databricks labs ucx repair-run --step ...` CLI command for repair run of any failed workflows, like `assessment`, `migrate-groups` etc. ([#724](#724)).
* Added `databricks labs ucx revert-migrated-table` command ([#729](#729)).
* Allow specifying a group list when group match options are used ([#725](#725)).
* Fixed installation issue when upgrading from an older version of the tool and improve logs ([#740](#740)).
* Renamed summary panel from Failure Summary to Assessment Summary ([#733](#733)).
* Retry internal error when getting permissions and update legacy table ACL documentation ([#728](#728)).
* Speedup installer execution ([#727](#727)).
@nfx nfx mentioned this pull request Dec 28, 2023
nfx added a commit that referenced this pull request Dec 28, 2023
* Added `databricks labs ucx repair-run --step ...` CLI command for
repair run of any failed workflows, like `assessment`, `migrate-groups`
etc. ([#724](#724)).
* Added `databricks labs ucx revert-migrated-table` command
([#729](#729)).
* Allow specifying a group list when group match options are used
([#725](#725)).
* Fixed installation issue when upgrading from an older version of the
tool and improve logs
([#740](#740)).
* Renamed summary panel from Failure Summary to Assessment Summary
([#733](#733)).
* Retry internal error when getting permissions and update legacy table
ACL documentation
([#728](#728)).
* Speedup installer execution
([#727](#727)).
HariGS-DB pushed a commit that referenced this pull request Jan 2, 2024
…air run of any failed workflows, like `assessment`, `migrate-groups` etc. (#724)
HariGS-DB pushed a commit that referenced this pull request Jan 2, 2024
* Added `databricks labs ucx repair-run --step ...` CLI command for
repair run of any failed workflows, like `assessment`, `migrate-groups`
etc. ([#724](#724)).
* Added `databricks labs ucx revert-migrated-table` command
([#729](#729)).
* Allow specifying a group list when group match options are used
([#725](#725)).
* Fixed installation issue when upgrading from an older version of the
tool and improve logs
([#740](#740)).
* Renamed summary panel from Failure Summary to Assessment Summary
([#733](#733)).
* Retry internal error when getting permissions and update legacy table
ACL documentation
([#728](#728)).
* Speedup installer execution
([#727](#727)).
FastLee pushed a commit that referenced this pull request Jan 19, 2024
…air run of any failed workflows, like `assessment`, `migrate-groups` etc. (#724)
FastLee pushed a commit that referenced this pull request Jan 19, 2024
* Added `databricks labs ucx repair-run --step ...` CLI command for
repair run of any failed workflows, like `assessment`, `migrate-groups`
etc. ([#724](#724)).
* Added `databricks labs ucx revert-migrated-table` command
([#729](#729)).
* Allow specifying a group list when group match options are used
([#725](#725)).
* Fixed installation issue when upgrading from an older version of the
tool and improve logs
([#740](#740)).
* Renamed summary panel from Failure Summary to Assessment Summary
([#733](#733)).
* Retry internal error when getting permissions and update legacy table
ACL documentation
([#728](#728)).
* Speedup installer execution
([#727](#727)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CLI command to repair run of the assessment job

2 participants