
Check max_execution_time in the pipeline and pulling executors#31636

Merged
KochetovNicolai merged 14 commits into ClickHouse:master from Algunenano:pull_timeout on Dec 6, 2021

Conversation


@Algunenano Algunenano commented Nov 22, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve the max_execution_time checks. Fixes some cases where timeout checks did not happen and a query could run for too long.

Detailed description / Documentation draft:

  • Adds checks in the pipelines (PipelineExecutor, PullingAsyncPipelineExecutor, PullingPipelineExecutor) to better respect the max_execution_time setting.
  • If timeout_overflow_mode is set to throw (the default), the PipelineExecutor is the most likely place to cancel the kind of query mentioned in #26554 (JOIN + GROUP BY doesn't respect timeouts or KILL requests).
  • If timeout_overflow_mode is set to break, both pulling executors will stop pulling data and finish the query with whatever data has already been output.
  • It also removes num_queries_increment from QueryStatus since it wasn't used.

The checks inside PipelineExecutor didn't seem excessive, so I don't expect them to affect performance, but let's see what the tests say. Is there a better way or place to control the timeouts? cc @KochetovNicolai

Closes #26554: the KILL part was implemented in #26675 for 21.9+, and this PR implements the timeout checks.

Closes #31657
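The interaction of the two overflow modes described above can be sketched roughly as follows. This is a minimal, hypothetical illustration (TimeLimit and OverflowMode are made-up names, not the actual ClickHouse classes):

```cpp
#include <chrono>
#include <stdexcept>

enum class OverflowMode { Throw, Break };

// Hypothetical sketch of a periodic time-limit check: in Throw mode a
// timeout raises an exception; in Break mode it returns false so the
// caller stops pulling data and returns whatever was already produced.
struct TimeLimit
{
    std::chrono::steady_clock::time_point start;
    std::chrono::milliseconds max_execution_time;
    OverflowMode mode;

    // Returns true if execution may continue.
    bool checkTimeLimit() const
    {
        if (std::chrono::steady_clock::now() - start < max_execution_time)
            return true;
        if (mode == OverflowMode::Throw)
            throw std::runtime_error("Timeout exceeded: max_execution_time");
        return false;
    }
};
```

An executor loop would call checkTimeLimit() on each iteration and either propagate the exception (throw mode) or finish the query early (break mode).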

@robot-clickhouse robot-clickhouse added the pr-improvement label (Pull request with some product improvements) Nov 22, 2021
@nikitamikhaylov nikitamikhaylov self-assigned this Nov 23, 2021

Algunenano commented Nov 23, 2021

Looking at the failures, I'm investigating the genuine ones:

  • QueryStatus crash: The problem comes from throwing in the constructor because of the checkTimeLimit call, which leaves the QueryStatus with a dead pointer to the PipelineExecutor. There is an assert in QueryStatus to verify that, on destruction, it holds no references to any executors (since they should all have finished and removed their references), and that was the cause of the crash. Fixed by moving the registration to the end of the constructor and removing the checkTimeLimit call there.
  • 00613_shard_distributed_max_execution_time: Flaky (some failures in master too). Reported in #31657, but note that it becomes flakier with this change, likely because the timeout is checked more often.
  • Performance: After fixing the crash and running those queries locally, I don't see any difference in performance between master and this last commit.
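The constructor-ordering hazard behind the QueryStatus crash can be sketched like this (Registry and Executor are illustrative names standing in for QueryStatus and PipelineExecutor, not the real ClickHouse types):

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

struct Executor;

// Stand-in for QueryStatus: tracks live executors by raw pointer.
struct Registry
{
    std::vector<Executor *> executors;
    void add(Executor * e) { executors.push_back(e); }
    void remove(Executor * e)
    {
        executors.erase(std::find(executors.begin(), executors.end(), e));
    }
};

struct Executor
{
    Registry & registry;

    Executor(Registry & reg, bool fail) : registry(reg)
    {
        // Anything that can throw (e.g. a time-limit check) happens first...
        if (fail)
            throw std::runtime_error("checkTimeLimit failed in constructor");
        // ...and registration is the very last step. If we registered before
        // throwing, the registry would keep a pointer to a half-built,
        // already-destroyed object: the dead pointer from the crash above.
        registry.add(this);
    }

    // Only fully constructed objects are destroyed, so add/remove stay paired.
    ~Executor() { registry.remove(this); }
};
```

Because C++ never runs the destructor of an object whose constructor threw, registering first and throwing afterwards leaves the registry holding a dangling pointer; registering last avoids the problem entirely.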

…ystatus

Otherwise the query status would keep a pointer to the executor which is dying at
that very moment
@Algunenano (Member Author)

Current failures:

I'll review this tomorrow, plus anything else that appears. Comments are welcome at any time.

@Algunenano (Member Author)

02008_materialize_column fails because it's broken in master (currently reverted and being reapplied in #31693).

if (!executor)
executor = std::make_shared<PipelineExecutor>(pipeline.processors, pipeline.process_list_element);

if (!executor->checkTimeLimit())
Member:

Note: if we throw exception here, executor is not stopped (until the moment of ~PullingPipelineExecutor)

Member Author:

I understand that's not an issue, right? If nobody catches the exception, the executor destructor will be called; if somebody does catch it, they will need to handle it, and eventually the executor will be cancelled.

Member:

It is still not ideal. E.g. the executor itself may continue working (e.g. in case of a bug), and we may wait for a long time in the destructor.
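The concern here is that an exception unwinds past a still-running executor, and nothing stops it until its destructor blocks. One way to bound that wait is to cancel explicitly before letting the exception propagate. A minimal sketch, with hypothetical names (not the actual PullingPipelineExecutor code):

```cpp
#include <stdexcept>

// Stand-in for a pipeline executor whose destructor would otherwise have
// to wait for in-flight work to drain.
struct Executor
{
    bool running = true;
    bool cancelled = false;

    void cancel() { cancelled = true; running = false; }

    ~Executor() { /* in the real code: block until `running` is false */ }
};

// Returns true while pulling may continue. If the time-limit check throws,
// the executor is cancelled before the exception continues unwinding, so
// the destructor does not have to wait on still-running processors.
bool pullWithTimeout(Executor & executor, bool time_limit_exceeded)
{
    try
    {
        if (time_limit_exceeded)
            throw std::runtime_error("Timeout exceeded");
        return true;
    }
    catch (...)
    {
        executor.cancel();
        throw;
    }
}
```

This keeps the "throw on timeout" semantics while guaranteeing the executor is already stopping by the time any caller (or the destructor) sees the exception.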

Comment on lines +49 to +51
auto settings = process_list_element->context.lock()->getSettings();
limits.max_execution_time = settings.max_execution_time;
overflow_mode = settings.timeout_overflow_mode;
Member:

I think it may be more reasonable to add the timeout check logic into process_list_element.

Member Author:

Makes sense to me. I'll add a checkTimeLimit() to process_list_element and use it from the PipelineExecutor.

@Algunenano (Member Author)

Hey @KochetovNicolai, does this need any changes or improvements? I want to build on top of this to be able to cancel scalar subqueries (#1576), but I'd rather keep things separate; otherwise I end up with large PRs that are hard to review.

Comment on lines +135 to +139
bool continuing = process_list_element->checkTimeLimit();
// We call cancel here so that all processors are notified and tasks are woken
// up, so that the "break" is faster and doesn't wait for long events
if (!continuing)
cancel();
Member:

Well, I don't like that we don't call cancel in case of an exception from process_list_element->checkTimeLimit().

What can we do:

  • add a softer check to process_list_element, something like bool isTimeLimitExceeded(), which never throws
  • PipelineExecutor::checkTimeLimit uses this isTimeLimitExceeded version and just cancels. So it's like a soft check.
  • In other places, like PipelineExecutor::finalizeExecution, we call the regular process_list_element->checkTimeLimit() and throw an exception if needed.

Member Author:

What do you think of having something like:

bool PipelineExecutor::checkTimeLimitSoft()
{
    try
    {
        return checkTimeLimit();
    }
    catch (...)
    {
        cancel();
        return false;
    }
}

Then I'd call checkTimeLimitSoft inside PullingPipelineExecutor and PullingAsyncPipelineExecutor and in the executeStepImpl loop, and checkTimeLimit at the start and end of the execution. I think that fits what you were saying, right?

Member:

Yes, the idea is the same.
But let's try to avoid a try/catch (...) that ignores the exception, if possible :)
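The try/catch-free variant being suggested can be sketched like this: a soft check that never throws, which the executor uses to cancel early, while the throwing check is reserved for finalization. Names are illustrative, not the actual ClickHouse API:

```cpp
#include <atomic>
#include <stdexcept>

// Stand-in for process_list_element with both check flavours.
struct ProcessListEntry
{
    std::atomic<bool> time_limit_exceeded{false};

    // Soft check: never throws, just reports.
    bool isTimeLimitExceeded() const { return time_limit_exceeded.load(); }

    // Hard check: throws on timeout; used at the start/end of execution.
    bool checkTimeLimit() const
    {
        if (time_limit_exceeded.load())
            throw std::runtime_error("Timeout exceeded");
        return true;
    }
};

struct Executor
{
    ProcessListEntry & entry;
    bool cancelled = false;

    void cancel() { cancelled = true; }

    // Soft check used inside the execution loop: cancels and returns false
    // on timeout. No try/catch needed, because nothing here throws.
    bool checkTimeLimitSoft()
    {
        if (entry.isTimeLimitExceeded())
        {
            cancel();
            return false;
        }
        return true;
    }
};
```

The loop calls checkTimeLimitSoft() and stops cleanly; the exception (when wanted) is produced only by the hard checkTimeLimit() at a well-defined point, so no exception ever needs to be swallowed.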

@KochetovNicolai (Member)

Sorry for not commenting for so long.

@KochetovNicolai (Member)

We just need to check the time limit on every iteration of executeImpl, but avoid calling clock_gettime too often.
Maybe add a timer_fd or something. But not in this PR.
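One common way to avoid sampling the clock on every iteration is to only consult it every N calls. A sketch of the idea (illustrative; the class name and N are assumptions, not ClickHouse's implementation):

```cpp
#include <chrono>

// Deadline check that amortizes the cost of clock_gettime by sampling the
// steady clock only once every `check_every` calls; the other calls just
// bump a counter and report "not expired yet".
class ThrottledDeadline
{
    std::chrono::steady_clock::time_point deadline;
    unsigned counter = 0;
    static constexpr unsigned check_every = 128;  // arbitrary example value

public:
    explicit ThrottledDeadline(std::chrono::milliseconds budget)
        : deadline(std::chrono::steady_clock::now() + budget) {}

    // Returns true once the deadline has passed. Detection may lag by up to
    // check_every - 1 calls: the price of skipping most clock reads.
    bool expired()
    {
        if (++counter % check_every != 0)
            return false;
        return std::chrono::steady_clock::now() >= deadline;
    }
};
```

A timer_fd (or a background thread flipping an atomic flag) trades this bounded lag for an exact wakeup, which is presumably why it is mentioned as the longer-term option.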

@KochetovNicolai KochetovNicolai merged commit 91c4c89 into ClickHouse:master Dec 6, 2021
@CurtizJ CurtizJ mentioned this pull request Feb 1, 2022

Development

Successfully merging this pull request may close these issues.

  • Flaky 00613_shard_distributed_max_execution_time
  • JOIN + GROUP BY doesn't respect timeouts or KILL requests

5 participants