Fix injected failures not recoverable from on retry #95

Merged
gpauloski merged 4 commits into main from failures-fix
Jul 17, 2024

Conversation

@gpauloski
Contributor

Description

Previously, whether a task would fail was decided before submission, which meant executors with retry semantics could never recover: a task selected to fail would fail again on every retry. This changes the failure probability to be evaluated inside the task during execution.
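
To illustrate the difference, here is a minimal sketch of the two behaviors. It is not the code from this PR: `InjectedFailure`, `run_with_retries`, `make_static_task`, and `make_dynamic_task` are hypothetical names, and the retry loop is a stand-in for whatever the executor actually does.

```python
import random


class InjectedFailure(Exception):
    """Hypothetical error type raised by an injected failure."""


def run_with_retries(task, retries):
    """Call task() up to retries + 1 times; True if any attempt succeeds."""
    for _ in range(retries + 1):
        try:
            task()
            return True
        except InjectedFailure:
            continue
    return False


def make_static_task(failure_rate, rng):
    # Old behavior (assumed shape): the outcome is drawn once, before
    # submission, so every retry of this task sees the same result.
    will_fail = rng.random() < failure_rate

    def task():
        if will_fail:
            raise InjectedFailure()
        return 'ok'

    return task


def make_dynamic_task(failure_rate, rng):
    # New behavior (assumed shape): the outcome is drawn on every
    # execution, so a retry gets a fresh chance to succeed.
    def task():
        if rng.random() < failure_rate:
            raise InjectedFailure()
        return 'ok'

    return task


if __name__ == '__main__':
    rng = random.Random(0)
    static_ok = run_with_retries(make_static_task(0.25, rng), retries=3)
    dynamic_ok = run_with_retries(make_dynamic_task(0.25, rng), retries=3)
    print(static_ok, dynamic_ok)
```

In the static variant, a task drawn as failing exhausts every retry. In the dynamic variant, each attempt is an independent draw.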

Also adds/updates tests to work towards #65.

Fixes N/A

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (internal implementation changes)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update (no changes to the code)
  • CI change (changes to CI workflows, packages, templates, etc.)
  • Version changes (changes to the package or dependency versions)

Testing

Added unit tests and tested with this config.

[app]
name = "failures"
base = "cholesky"
failure_rate = 0.25
failure_type = "dependency"

[app.config]
matrix_size = 100
block_size = 50

[engine.executor]
name = "parsl-htex"
retries = 3

[engine.executor.htex]
max_workers_per_node = 8

Pull Request Checklist

Please confirm the PR meets the following requirements.

  • Relevant tags are added (breaking, bug, dependencies, documentation, enhancement, refactor).
  • Code changes pass pre-commit (e.g., ruff, mypy, etc.).
  • Tests have been added to show the fix is effective or that the new feature works.
  • New and existing unit tests pass locally with the changes.
  • Docs have been updated and reviewed if relevant.

The failures app had two errors: a bad import path for Engine after the
recent refactoring and calling close on the engine twice (once in the
_FailureInjectionEngine and once by the caller of
FailureInjectionApp.run()).

I also introduced an error in the original PR for this app. The failure
of a task was determined statically by the engine, rather than during
task execution. This means that a task that was selected to fail would
always fail even when retried by an executor. I used the following
config to validate the new changes work. There will occasionally be
parsl log messages indicating a task failed and is retried.

[app]
name = "failures"
base = "cholesky"
failure_rate = 0.25
failure_type = "import"

[app.config]
matrix_size = 100
block_size = 50

[engine.executor]
name = "parsl-htex"
retries = 3

[engine.executor.htex]
max_workers_per_node = 8
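
As a rough sanity check on what to expect from this config, the sketch below estimates how often a task would still fail after all retries. It assumes each attempt fails independently with probability `failure_rate` and that `retries = 3` allows up to four attempts in total; both are assumptions about the executor's behavior, and `prob_task_exhausts_retries` is a hypothetical helper, not part of the app.

```python
def prob_task_exhausts_retries(failure_rate: float, retries: int) -> float:
    """Probability that all retries + 1 independent attempts fail."""
    attempts = retries + 1  # assumed: one initial attempt plus the retries
    return failure_rate ** attempts


if __name__ == '__main__':
    # With failure_rate = 0.25 and retries = 3: 0.25 ** 4 == 0.00390625
    print(prob_task_exhausts_retries(0.25, 3))
```

Under those assumptions, roughly 0.4% of tasks would exhaust their retries, so most runs should show occasional retry messages while still completing.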

gpauloski added the bug label Jul 17, 2024
gpauloski merged commit ea7b7da into main Jul 17, 2024
gpauloski deleted the failures-fix branch July 17, 2024 18:10