[v2-11-test] Allow failure callbacks for stuck in queued TIs that fail #53038
potiuk merged 1 commit into apache:v2-11-test from
Conversation
tests are failing:
@kaxil all tests are passing now.
Yeah, let's get that one merged first
I see #53435 is merged
@eladkal I had to make some changes there. Let me see if the changes are needed here as well.
@karenbraganz -> did you check it?
@potiuk There are a few changes that I need to make. Working on the changes now.
I still need to run some tests to ensure my changes work as expected.
Just need to figure out why one of the checks is failing. I think it might have something to do with the rich-click package version being used. Everything else is completed.
I had to make changes to airflow/www/static/js/types/api-generated.ts because one of the CI hooks was failing. My PR is not related to the UI.
Going to close and re-open the PR to re-trigger tests.
@karenbraganz can you rebase and resolve conflicts?
1ffb9c0 to 875e831
I rebased your changes @karenbraganz -> @kaxil maybe you want to have a look before I merge it.
1a78efc to 9c67a68
…ailure callbacks

This commit addresses the handling of tasks that remain stuck in the queued state beyond the configured retry threshold. Previously, these tasks were marked as failed in the database but the executor was not properly notified, leading to inconsistent state between Airflow and the executor.

Changes made:
- Modified _maybe_requeue_stuck_ti() to accept executor parameter and call executor.fail() when tasks exceed requeue attempts, ensuring the executor is notified of task failures
- Added logic to retrieve the DAG and task object to check for on_failure_callback
- When a failure callback exists, create a TaskCallbackRequest and send it via executor to ensure failure callbacks are invoked for stuck queued tasks
- Updated tests to verify that executor.fail() is called and callbacks are sent appropriately
- Added test fixtures with mock_failure_callback to validate callback invocation
- Changed pyproject.toml Python version requirement (unrelated)

Why this matters: Tasks stuck in queued state can now be properly cleaned up at both the Airflow scheduler and executor levels, preventing resource leaks and ensuring failure callbacks are executed for proper error handling and alerting.

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
9c67a68 to abcedd7
Nicely green now... Merging.
…ailure callbacks (#53038)
closes: #51301
Issue #51301 reported that, in Airflow 2.10.5, failure callbacks do not run for task instances that get stuck in queued and then fail. The cause is a change introduced in PR #43520, which added logic to requeue tasks that get stuck in queued (up to two times by default) before failing them.
Previously, the executor's fail method was called when a task had to be failed after the maximum number of requeue attempts. PR #43520 replaced that call with the task instance's set_state method:
ti.set_state(TaskInstanceState.FAILED, session=session)
Without the executor's fail method being called, failure callbacks are not executed for such task instances. Therefore, I changed the code to call the executor's fail method instead.
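To illustrate the difference between the two failure paths, here is a minimal, self-contained sketch. It is not the scheduler code from this PR: FakeTaskInstance, FakeExecutor, handle_stuck_ti, the notify_executor flag, and MAX_REQUEUE_ATTEMPTS are hypothetical names made up for the example, and the real implementation works with Airflow's own task instance, executor, and TaskCallbackRequest objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: mirrors the "up to two times by default" requeue limit described above.
MAX_REQUEUE_ATTEMPTS = 2


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in for a task instance (not Airflow's TaskInstance)."""

    key: str
    state: str = "queued"
    requeue_attempts: int = 0
    on_failure_callback: Optional[Callable[[str], None]] = None


@dataclass
class FakeExecutor:
    """Hypothetical stand-in that records what the scheduler told the executor."""

    failed_keys: List[str] = field(default_factory=list)
    callbacks_sent: List[str] = field(default_factory=list)

    def fail(self, key: str) -> None:
        # Record that the executor was notified of the failure.
        self.failed_keys.append(key)

    def send_callback(self, key: str) -> None:
        # Record that a failure callback request was handed to the executor.
        self.callbacks_sent.append(key)


def handle_stuck_ti(ti: FakeTaskInstance, executor: FakeExecutor, notify_executor: bool) -> str:
    """Requeue a stuck task, or fail it once the requeue budget is spent."""
    if ti.requeue_attempts < MAX_REQUEUE_ATTEMPTS:
        ti.requeue_attempts += 1
        ti.state = "scheduled"
        return "requeued"

    # Out of requeue attempts: the task is marked failed in both variants.
    ti.state = "failed"
    if notify_executor:
        # Variant sketched after this PR: the executor also learns about the
        # failure, and a callback is sent only if the task defines one.
        executor.fail(ti.key)
        if ti.on_failure_callback is not None:
            executor.send_callback(ti.key)
    return "failed"
```

With notify_executor=False (only the state is set, as after PR #43520 in this sketch), the third call marks the task as failed but the executor records nothing. With notify_executor=True, the same call also records the failure and one callback request.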