[AIRFLOW-435] Multiprocessing Scheduler is very slow #1761
Conversation
Thanks for the documentation and for the change. Just waiting for Travis and Landscape to pass. Can you run these in your fork and send me a link if they pass? I will +1 once they do, either in your fork or here.
I'll test it, hopefully today.
OK, I have just tested it. It has improved, but I also see that the delay in processing is applied between the scheduling of task instances. This means that a first instance is launched and then the scheduler waits ~3 minutes before scheduling the next task (with the old config values). I don't think that is the idea behind these config items. At the very least, tasks from a running DAG should be scheduled as soon as they are available for running; no extra delay should be added by the scheduler. What is the reasoning behind this configuration and delay, by the way? In addition, I observe that pools are often under-used (only 6 slots used while 8 are available and task instances are queued) and slots_used is not recorded in the UI view.
The idea behind the delay was that there wasn't a need to process DAGs so quickly: with higher thread counts, you could be processing a given DAG every few seconds. To reduce the log volume and potential load on the DB, the configurable delay was added. For the case that you're seeing, could you enable debug logging and send me the log contents?
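To make that trade-off concrete, here is a minimal sketch of such a per-file throttle. It is illustrative only: the constant name below is a stand-in for the configurable delay discussed above, not the actual Airflow scheduler code or option name.

```python
import time

# Illustrative sketch only: MIN_FILE_PROCESS_INTERVAL stands in for the
# configurable delay discussed above; the real scheduler logic differs.
MIN_FILE_PROCESS_INTERVAL = 180  # seconds between passes over the same DAG file

_last_processed = {}  # DAG file path -> timestamp of its last processing pass

def should_process(dag_file_path, now=None):
    """Return True if enough time has passed to re-parse this DAG file."""
    now = time.time() if now is None else now
    if now - _last_processed.get(dag_file_path, 0.0) >= MIN_FILE_PROCESS_INTERVAL:
        _last_processed[dag_file_path] = now
        return True
    return False
```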
OK, will do when I have the chance. I also noticed that the scheduler now quits by default after a fixed amount of time. I think this should not be the default, and the documentation and upgrade notes should be updated for it (I didn't check, but I assume they weren't). I was really caught by surprise by this.
airflow/configuration.py
This should not be the default, in my opinion. In addition, the comment is ambiguous: I read it as if the scheduler would restart itself, but that is not the case.
I am really not happy with this, as it is tied to the CeleryExecutor not behaving as it should. Now we have "num_runs" and "max_run_duration", both of which influence how long the scheduler will run. The old num_runs has changed behavior (not documented) and max_run_duration always applies, so all the startup scripts (systemd/upstart) are currently invalid. That is a bit too much regression for my taste.
I really don't understand the need for both, and I think we should probably revert this particular bit. I still really appreciate the work, but it needs some extra love and care. @plypaul
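For readers trying to follow the distinction, here is a rough sketch of why both settings bound how long the scheduler process lives, and why an external supervisor (systemd/upstart) has to restart it. This is illustrative only; the names num_runs and max_run_duration are taken from the comment above, not from the actual code.

```python
import time

def scheduler_loop(do_one_pass, num_runs=None, max_run_duration=None):
    """Illustrative only: the process exits when either limit is hit, so a
    supervisor (systemd/upstart) must restart it to keep scheduling."""
    start = time.time()
    runs = 0
    while True:
        if num_runs is not None and runs >= num_runs:
            break  # limit on the number of scheduling passes
        if max_run_duration is not None and time.time() - start >= max_run_duration:
            break  # wall-clock limit that always applies
        do_one_pass()
        runs += 1
```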
OK, I executed this DAG for integration tests. It exhibits the issue with slow scheduling when I turned on the DAG at 17:24 (see logs). There is then a gap before the next tasks are scheduled (notice the 3 minutes).
Further observations: the DAG run state is not being updated (max runs gets reached) and DAGs are completely skipped by processing. Probably process_task_instances needs to be called before create_dagrun. Are you not encountering this issue?
Change commit message to
I confirmed a fix for the "not processing" part by moving self._process_task_instances to the beginning of the function so that it is always executed. I now see that this issue was introduced in #1716. Several PRs try to address the issue introduced in #1716, but no solution has been derived yet.
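A minimal sketch of the reordering described here, assuming simplified stand-in methods rather than the actual SchedulerJob internals:

```python
class SchedulerSketch:
    """Illustrative stand-in, not the real SchedulerJob: it only shows the
    ordering fix of processing task instances before creating a DAG run."""

    def __init__(self, max_active_runs=16):
        self.max_active_runs = max_active_runs

    def _process_task_instances(self, dag_id, active_runs):
        # Placeholder for scheduling task instances of existing DAG runs.
        print("processing %d active run(s) of %s" % (len(active_runs), dag_id))

    def _create_dagrun(self, dag_id):
        # Placeholder for creating a new DAG run.
        print("creating a new DAG run for %s" % dag_id)

    def process_dag(self, dag_id, active_runs):
        # Always process existing runs first, so they are handled even when
        # no new DAG run can be created (e.g. max_active_runs reached).
        self._process_task_instances(dag_id, active_runs)
        if len(active_runs) < self.max_active_runs:
            self._create_dagrun(dag_id)
```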
Add additional comments for various scheduler configuration options.
Current coverage is 66.12% (diff: 100%)

```
@@            master   #1761   diff @@
======================================
  Files          127     127
  Lines         9750    9750
  Methods          0       0
  Messages         0       0
  Branches         0       0
======================================
  Hits          6447    6447
  Misses        3303    3303
  Partials         0       0
```
@plypaul any updates here? Can you have someone review and merge this at the earliest? Otherwise, I will need to close this due to inactivity.
Closing for inactivity. Please reopen when ready!
Dear Airflow Maintainers,
Please accept this PR that addresses the following issues:
One of the scheduler's default configuration values can make it look like the scheduler is not scheduling as fast as it should, so this change updates the default so that DAG definition files are processed as fast as possible.