-
Notifications
You must be signed in to change notification settings - Fork 16.3k
[AIRFLOW-3607] Optimize dep checking when depends on past set and concurrency limit #7503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…erformance improvement
Codecov Report
@@ Coverage Diff @@
## master #7503 +/- ##
==========================================
- Coverage 86.82% 86.54% -0.29%
==========================================
Files 891 891
Lines 42095 42091 -4
==========================================
- Hits 36549 36426 -123
- Misses 5546 5665 +119
Continue to review full report at Codecov.
|
|
WDYT @houqp ? |
|
@houqp What do you think? |
|
sorry, my github mention is overwhelmed, so I missed them. taking a look at this now. |
|
This change looks correct to me. In my original fix, I also added unit tests to prevent regressions and looks like those tests are still passing, so we should be good 👍
That said, @amichai07 are you sure your patch actually results in performance improvement? Looking at the code, |
|
@houqp In Q2 I want to add more tests that allow you to count the number of queries. |
|
@houqp it results in a performance improvement because it passes 'finished_tasks' as an argument in dep_context so no more db queries are made for getting those finished_tasks afterwards. |
|
@amichai07 ha, i see, thanks for the explanation. The performance regression test setup by @mik-laj looks great. You might want to add for your optimization to guard against regressions from future changes :) |
|
@ashb WDYT? |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
@mik-laj Is it worth adding one of your query-count style tests here do you think? |
|
We want to add the following test case:
|
[AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep (apache#4751) This decreases scheduler delay between tasks by about 20% for larger DAGs, sometimes more for larger or more complex DAGs. The delay between tasks can be a major issue, especially when we have dags with many subdags, figures out that the scheduling process spends plenty of time in dependency checking, we took the trigger rule dependency which calls the db for each task instance, we made it call the db just once for each dag_run [AIRFLOW-3607] fix scheduler bug related to concurrency and depends on past (apache#7402) commit 50efda5 introduced a bug that prevents scheduler from scheduling tasks with the following properties: * has depends on past set to True * has custom concurrency limit [AIRFLOW-3607] Optimize dep checking when depends on past set and concurrency limit (apache#7503)
This change fixes the bug identified by @houqp (described in #7402), but without losing the performance improvement which was introduced in refactor #4751.
Issue link: AIRFLOW-3607
[AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.