Skip to content

Conversation

@ashb
Copy link
Member

@ashb ashb commented Sep 15, 2020

This can happen when a task is enqueued by one executor, and then that
scheduler dies/exits.

The default fallback behaviour is unchanged -- that queued tasks are
cleared and then and then later rescheduled.

But for Celery we can do better -- if we record the Celery-generated
task_id, we can then re-create the AsyncResult objects for orphaned
tasks at a later date.

However since Celery just reports all AsyncResult as "PENDING", even if
they aren't tasks currently in the broker queue, we need to apply a
timeout to "unblock" these tasks in case they never actually made it to
the Celery broker.

This all means that we can adopt tasks that have been enqueued another
CeleryExecutor if it dies, without having to clear the task and slow
down. This is especially useful as the task may have already started
running, and while clearing it would stop it, it's better if we don't
have to reset it!

Part of #9630


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added area:docs area:Scheduler including HA (high availability) scheduler labels Sep 15, 2020
@ashb ashb requested review from mik-laj, potiuk and turbaszek and removed request for mik-laj and turbaszek September 15, 2020 09:32
@ashb ashb added the AIP-15 label Sep 15, 2020
@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch from b58fac4 to c1a3f9f Compare September 15, 2020 09:33
@ashb ashb requested review from mik-laj and turbaszek September 15, 2020 09:40
@kaxil kaxil force-pushed the adopt-dont-reset-celery-tasks branch 4 times, most recently from 4f3cf22 to 627207a Compare September 15, 2020 12:17
@kaxil kaxil force-pushed the adopt-dont-reset-celery-tasks branch from 627207a to 2871b48 Compare September 15, 2020 13:24
@ashb ashb requested review from XD-DENG and houqp September 15, 2020 19:10
Copy link
Member

@XD-DENG XD-DENG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor suggestion

@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch from 2871b48 to 12debd5 Compare September 16, 2020 09:01
@ashb ashb requested a review from XD-DENG September 16, 2020 09:02
@ashb
Copy link
Member Author

ashb commented Sep 16, 2020

The doc tests are now failing with:

/opt/airflow/docs/_api/airflow/executors/base_executor/index.rst:198:more than one target found for cross-reference 'TaskInstance': airflow.models.TaskInstance, airflow.models.taskinstance.TaskInstance

But I didn't change the imports in that file :/

Oh I missed Kaxil's earlier fix in my rebase+force-push

@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch 2 times, most recently from bbb6f5f to 03a9290 Compare September 16, 2020 10:17
@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch from 03a9290 to 7d13295 Compare September 16, 2020 11:49
@ashb ashb requested a review from XD-DENG September 16, 2020 13:02
@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch 2 times, most recently from 64c2215 to 7c05b37 Compare September 16, 2020 13:57
@ashb ashb requested a review from turbaszek September 16, 2020 15:58
Copy link
Member

@turbaszek turbaszek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

Copy link
Member

@XD-DENG XD-DENG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more minor comment.

@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch from 3f4b2ea to f1967f6 Compare September 16, 2020 16:44
@ashb
Copy link
Member Author

ashb commented Sep 16, 2020

Right, final rebase done, will merge once tests are green.

(I wish Github had a button for that)

This can happen when a task is enqueued by one executor, and then that
scheduler dies/exits.

The default fallback behaviour is unchanged -- that queued tasks are
cleared and then and then later rescheduled.

But for Celery we can do better -- if we record the Celery-generated
task_id, we can then re-create the AsyncResult objects for orphaned
tasks at a later date.

However since Celery just reports all AsyncResult as "PENDING", even if
they aren't tasks currently in the broker queue, we need to apply a
timeout to "unblock" these tasks in case they never actually made it to
the Celery broker.

This all means that we can adopt tasks that have been enqueued another
CeleryExecutor if it dies, without having to clear the task and slow
down. This is especially useful as the task may have already started
running, and while clearing it would stop it, it's better if we don't
have to reset it!

Co-authored-by: Kaxil Naik <[email protected]>
@ashb ashb force-pushed the adopt-dont-reset-celery-tasks branch from f1967f6 to 092b9c0 Compare September 16, 2020 17:45
@kaxil kaxil merged commit 59dad1a into apache:master Sep 16, 2020
@ashb ashb deleted the adopt-dont-reset-celery-tasks branch December 7, 2020 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants