Skip to content

[v2-11-test] Allow failure callbacks for stuck in queued TIs that fail#53038

Merged
potiuk merged 1 commit intoapache:v2-11-testfrom
karenbraganz:tqt-callbacks-2-11
Feb 13, 2026
Merged

[v2-11-test] Allow failure callbacks for stuck in queued TIs that fail#53038
potiuk merged 1 commit intoapache:v2-11-testfrom
karenbraganz:tqt-callbacks-2-11

Conversation

@karenbraganz
Copy link
Collaborator

@karenbraganz karenbraganz commented Jul 8, 2025

closes: #51301
In issues #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520. In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them.

Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@karenbraganz karenbraganz self-assigned this Jul 8, 2025
@kaxil
Copy link
Member

kaxil commented Jul 10, 2025

tests are failing:

FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_handle_stuck_queued_tasks_multiple_attempts - AssertionError: assert equals failed
  [            [           
    'queued',    'failed', 
    'queued',    'failed', 
  ]            ]
FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_handle_stuck_queued_tasks_reschedule_sensors - AssertionError: Expected 'fail' to not have been called. Called 2 times.
Calls: [call(TaskInstanceKey(dag_id='test_fail_stuck_queued_tasks_multiple_executors', task_id='op1', run_id='45e7135c-8e2c-4e07-8d35-4bea8c0232d6', try_number=0, map_index=-1)),
 call(TaskInstanceKey(dag_id='test_fail_stuck_queued_tasks_multiple_executors', task_id='op2', run_id='45e7135c-8e2c-4e07-8d35-4bea8c0232d6', try_number=0, map_index=-1))].

pytest introspection follows:

Args:
assert equals failed
  (                                ()                              
    TaskInstanceKey(dag_id='test_                                  
  fail_stuck_queued_tasks_multipl                                  
  e_executors', task_id='op2', ru                                  
  n_id='45e7135c-8e2c-4e07-8d35-4                                  
  bea8c0232d6', try_number=0, map                                  
  _index=-1),                                                      
  )
=========== 2 failed, 2035 passed, 959 skipped in 655.53s (0:10:55) ============

@eladkal eladkal added this to the Airflow 2.11.1 milestone Jul 11, 2025
@karenbraganz
Copy link
Collaborator Author

@kaxil all tests are passing now.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - providing that #53435 will also be merged :)

@kaxil
Copy link
Member

kaxil commented Jul 22, 2025

LGTM - providing that #53435 will also be merged :)

Yeah, let's get that one merged first

@karenbraganz karenbraganz changed the title [v2-11-test] Call executor fail method for stuck in queued tasks [v2-11-test] Allow failure callbacks for stuck in queued TIs that fail Aug 8, 2025
@eladkal
Copy link
Contributor

eladkal commented Sep 18, 2025

I see #53435 is merged

@karenbraganz
Copy link
Collaborator Author

@eladkal I had to make some changes there. Let me see if the changes are needed here as well.

@potiuk
Copy link
Member

potiuk commented Oct 13, 2025

@eladkal I had to make some changes there. Let me see if the changes are needed here as well.

@karenbraganz -> did you check it ?

@karenbraganz
Copy link
Collaborator Author

@potiuk There are a few changes that I need to make. Working on the changes now.

@karenbraganz
Copy link
Collaborator Author

I still need to run some tests to ensure my changes work as expected.

@karenbraganz
Copy link
Collaborator Author

Just need to figure out why one of the checks is failing. I think it might have something to do with the rich-click package version being used.

Everything else is completed.

@karenbraganz karenbraganz added the legacy ui Whether legacy UI change should be allowed in PR label Oct 27, 2025
@karenbraganz
Copy link
Collaborator Author

karenbraganz commented Oct 27, 2025

I had to make changes to airflow/www/static/js/types/api-generated.ts because one of the CI hooks was failing. My PR is not related to the UI.

@karenbraganz
Copy link
Collaborator Author

Going to close and re-open the PR to re-trigger tests.

@eladkal
Copy link
Contributor

eladkal commented Feb 9, 2026

@karenbraganz can you rebase and resolve conflicts?
cc @potiuk another candidate for 2.11.1

@potiuk potiuk force-pushed the tqt-callbacks-2-11 branch from 1ffb9c0 to 875e831 Compare February 11, 2026 14:35
@potiuk
Copy link
Member

potiuk commented Feb 11, 2026

I rebased your changes @karenbraganz -> @kaxil maybe you want to have a look before I merge it.

@potiuk potiuk linked an issue Feb 11, 2026 that may be closed by this pull request
2 tasks
@potiuk potiuk force-pushed the tqt-callbacks-2-11 branch 2 times, most recently from 1a78efc to 9c67a68 Compare February 13, 2026 11:28
…ailure callbacks

This commit addresses the handling of tasks that remain stuck in the queued state
beyond the configured retry threshold. Previously, these tasks were marked as failed
in the database but the executor was not properly notified, leading to inconsistent
state between Airflow and the executor.

Changes made:
- Modified _maybe_requeue_stuck_ti() to accept executor parameter and call executor.fail()
  when tasks exceed requeue attempts, ensuring the executor is notified of task failures
- Added logic to retrieve the DAG and task object to check for on_failure_callback
- When a failure callback exists, create a TaskCallbackRequest and send it via executor
  to ensure failure callbacks are invoked for stuck queued tasks
- Updated tests to verify that executor.fail() is called and callbacks are sent appropriately
- Added test fixtures with mock_failure_callback to validate callback invocation
- Changed pyproject.toml Python version requirement (unrelated)

Why this matters:
Tasks stuck in queued state can now be properly cleaned up at both the Airflow scheduler
and executor levels, preventing resource leaks and ensuring failure callbacks are executed
for proper error handling and alerting.

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@potiuk potiuk force-pushed the tqt-callbacks-2-11 branch from 9c67a68 to abcedd7 Compare February 13, 2026 13:55
@potiuk
Copy link
Member

potiuk commented Feb 13, 2026

Nicely green now... Merging.

@potiuk potiuk merged commit d5fbdd6 into apache:v2-11-test Feb 13, 2026
39 checks passed
potiuk pushed a commit that referenced this pull request Feb 16, 2026
…ailure callbacks (#53038)

This commit addresses the handling of tasks that remain stuck in the queued state
beyond the configured retry threshold. Previously, these tasks were marked as failed
in the database but the executor was not properly notified, leading to inconsistent
state between Airflow and the executor.

Changes made:
- Modified _maybe_requeue_stuck_ti() to accept executor parameter and call executor.fail()
  when tasks exceed requeue attempts, ensuring the executor is notified of task failures
- Added logic to retrieve the DAG and task object to check for on_failure_callback
- When a failure callback exists, create a TaskCallbackRequest and send it via executor
  to ensure failure callbacks are invoked for stuck queued tasks
- Updated tests to verify that executor.fail() is called and callbacks are sent appropriately
- Added test fixtures with mock_failure_callback to validate callback invocation
- Changed pyproject.toml Python version requirement (unrelated)

Why this matters:
Tasks stuck in queued state can now be properly cleaned up at both the Airflow scheduler
and executor levels, preventing resource leaks and ensuring failure callbacks are executed
for proper error handling and alerting.

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

legacy ui Whether legacy UI change should be allowed in PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue with task_queued_timeout Causing Silent Task Failures in Airflow 2.10.5

4 participants