Skip to content

Conversation

@ashb
Copy link
Member

@ashb ashb commented Sep 4, 2020

Once HA mode for scheduler lands, we can no longer reset orphaned
task by looking at the tasks in (the memory of) the current executor.

This changes it to keep track of which (Scheduler)Job queued/scheduled a
TaskInstance (the new "queued_by_job_id" column stored against
TaskInstance table), and then we can use the existing heartbeat
mechanism for jobs to notice when a TI should be reset.

As part of this the existing implementation of
reset_state_for_orphaned_tasks has been moved out of BaseJob in to
BackfillJob -- as only this and SchedulerJob had these methods, and the
SchedulerJob version now operates differently

@ashb ashb added the area:Scheduler including HA (high availability) scheduler label Sep 4, 2020
@ashb ashb requested review from XD-DENG, kaxil and turbaszek September 4, 2020 17:48
@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch from c81a3bf to b5c067c Compare September 4, 2020 17:53
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor non-code suggestion. LGTM otherwise

XD-DENG
XD-DENG previously requested changes Sep 4, 2020
Copy link
Member

@XD-DENG XD-DENG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents. Let me know if I misunderstood anything please.

@ashb
Copy link
Member Author

ashb commented Sep 4, 2020

Looks like this breaks on SQLite --- will need to take a look.

@XD-DENG
Copy link
Member

XD-DENG commented Sep 4, 2020

Looks like this breaks on SQLite --- will need to take a look.

Should be due to SQLite dialect doesn’t support multiple-table criteria within UPDATE (log line https://github.com/apache/airflow/runs/1073216815#step:6:1491)

@ashb
Copy link
Member Author

ashb commented Sep 4, 2020

I need to extend the unit tests to check that it correctly tests the new functionality (not sure how I missed that) -- namely that TIs from a SchedulerJob that has "stalled"/stopped heartbeating are reset.

@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch from 4b05a14 to 61017d0 Compare September 4, 2020 21:26
@ashb ashb marked this pull request as draft September 4, 2020 21:26
@dimberman
Copy link
Contributor

@ashb lemme know if you'd like me to look into the SQLite issue as well

@ashb
Copy link
Member Author

ashb commented Sep 7, 2020

@ashb lemme know if you'd like me to look into the SQLite issue as well

Please. I suspect we will have to have an if sqlite: and not do it as a single query for that db.

The lastest push I tried using a CTE - that seems to have also broken mysql so switching back to a subquery is the first step

@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch from 54f808a to dea0e30 Compare September 9, 2020 09:29
@ashb ashb requested a review from XD-DENG September 9, 2020 14:35
@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch 3 times, most recently from c6e3b0e to 1862ce5 Compare September 9, 2020 16:23
@ashb ashb closed this Sep 9, 2020
@ashb ashb reopened this Sep 9, 2020
@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch 2 times, most recently from 7ba1561 to 5b811e8 Compare September 10, 2020 08:44
@ashb ashb marked this pull request as ready for review September 10, 2020 11:54
Once HA mode for scheduler lands, we can no longer reset orphaned
task by looking at the tasks in (the memory of) the current executor.

This changes it to keep track of which (Scheduler)Job queued/scheduled a
TaskInstance (the new "queued_by_job_id" column stored against
TaskInstance table), and then we can use the existing heartbeat
mechanism for jobs to notice when a TI should be reset.

As part of this the existing implementation of
`reset_state_for_orphaned_tasks` has been moved out of BaseJob in to
BackfillJob -- as only this and SchedulerJob had these methods, and the
SchedulerJob version now operates differently
@ashb ashb force-pushed the detect-orphaned-tasks-by-scheduler-job-id branch from 96b3647 to 83315a9 Compare September 10, 2020 14:36
@ashb ashb merged commit 63b6e53 into apache:master Sep 10, 2020
@ashb ashb deleted the detect-orphaned-tasks-by-scheduler-job-id branch September 10, 2020 16:01
ashb added a commit to astronomer/airflow that referenced this pull request Sep 11, 2020
Something slightly odd is happening here.

In apache#10729 it passed the mypy tests with this in place, but now if I make
a change to this file (or to scheduler_job.py, which imports DagRun) I
get an error _with_ this.
kaxil pushed a commit that referenced this pull request Sep 11, 2020
Something slightly odd is happening here.

In #10729 it passed the mypy tests with this in place, but now if I make
a change to this file (or to scheduler_job.py, which imports DagRun) I
get an error _with_ this.
@mik-laj mik-laj added the AIP-15 label Sep 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants