
Check max_execution_time in the pipeline and pulling executors#31636

Merged
KochetovNicolai merged 14 commits into ClickHouse:master from Algunenano:pull_timeout on Dec 6, 2021

Conversation


@Algunenano Algunenano commented Nov 22, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve the max_execution_time checks. Fixes some cases where timeout checks did not happen and a query could run for too long.

Detailed description / Documentation draft:

  • Adds checks in the pipelines (PipelineExecutor, PullingAsyncPipelineExecutor, PullingPipelineExecutor) to better respect the max_execution_time setting.
  • If timeout_overflow_mode is set to throw (the default), the PipelineExecutor is the most likely place to cancel the kind of query mentioned in #26554 (JOIN + GROUP BY doesn't respect timeouts or KILL requests).
  • If timeout_overflow_mode is set to break, both pulling executors will stop pulling data and finish the query with whatever data has already been output.
  • It also removes num_queries_increment from QueryStatus since it wasn't used.

The checks inside PipelineExecutor didn't seem excessive, so I don't expect them to affect performance, but let's see what the tests say. Is there a better way or place to control the timeouts? cc @KochetovNicolai

Closes #26554: the KILL part was implemented in #26675 for 21.9+, and this PR implements the timeout checks.

Closes #31657
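The interaction of the two overflow modes described above can be sketched roughly as follows. This is a minimal, hypothetical illustration (TimeLimit and OverflowMode are made-up names, not the actual ClickHouse classes):

```cpp
#include <chrono>
#include <stdexcept>

enum class OverflowMode { Throw, Break };

// Hypothetical sketch of a periodic time-limit check: in Throw mode a
// timeout raises an exception; in Break mode it returns false so the
// caller stops pulling data and returns whatever was already produced.
struct TimeLimit
{
    std::chrono::steady_clock::time_point start;
    std::chrono::milliseconds max_execution_time;
    OverflowMode mode;

    // Returns true if execution may continue.
    bool checkTimeLimit() const
    {
        if (std::chrono::steady_clock::now() - start < max_execution_time)
            return true;
        if (mode == OverflowMode::Throw)
            throw std::runtime_error("Timeout exceeded: max_execution_time");
        return false;
    }
};
```

An executor loop would call checkTimeLimit() on each iteration and either propagate the exception (throw mode) or finish the query early (break mode).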

@robot-clickhouse robot-clickhouse added the pr-improvement label (Pull request with some product improvements) Nov 22, 2021
@nikitamikhaylov nikitamikhaylov self-assigned this Nov 23, 2021

Algunenano commented Nov 23, 2021

Looking at the failures, I'm investigating the genuine ones:

  • QueryStatus crash: The problem comes from throwing in the constructor because of the checkTimeLimit call, which leaves the QueryStatus with a dead pointer to the PipelineExecutor. There is an assert in QueryStatus to verify that, on destruction, it holds no references to any executors (since they should all have finished and removed their references), and that was the cause of the crash. Fixed by moving the registration to the end of the constructor and removing the checkTimeLimit call there.
  • 00613_shard_distributed_max_execution_time: Flaky (some failures in master too). Reported in #31657, but note that it becomes flakier with this change, likely because the timeout is checked more often.
  • Performance: After fixing the crash and running those queries locally, I don't see any difference in performance between master and this last commit.
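The constructor-ordering hazard behind the QueryStatus crash can be sketched like this (Registry and Executor are illustrative names standing in for QueryStatus and PipelineExecutor, not the real ClickHouse types):

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

struct Executor;

// Stand-in for QueryStatus: tracks live executors by raw pointer.
struct Registry
{
    std::vector<Executor *> executors;
    void add(Executor * e) { executors.push_back(e); }
    void remove(Executor * e)
    {
        executors.erase(std::find(executors.begin(), executors.end(), e));
    }
};

struct Executor
{
    Registry & registry;

    Executor(Registry & reg, bool fail) : registry(reg)
    {
        // Anything that can throw (e.g. a time-limit check) happens first...
        if (fail)
            throw std::runtime_error("checkTimeLimit failed in constructor");
        // ...and registration is the very last step. If we registered before
        // throwing, the registry would keep a pointer to a half-built,
        // already-destroyed object: the dead pointer from the crash above.
        registry.add(this);
    }

    // Only fully constructed objects are destroyed, so add/remove stay paired.
    ~Executor() { registry.remove(this); }
};
```

Because C++ never runs the destructor of an object whose constructor threw, registering first and throwing afterwards leaves the registry holding a dangling pointer; registering last avoids the problem entirely.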

…ystatus

Otherwise the query status would keep a pointer to the executor which is dying at
that very moment
@Algunenano (Member Author)

Current failures:

I'll review this tomorrow, plus anything else that appears. Comments are welcome at any time.

@Algunenano (Member Author)

02008_materialize_column fails because it's broken in master (currently reverted and being reapplied in #31693).

if (!executor)
executor = std::make_shared<PipelineExecutor>(pipeline.processors, pipeline.process_list_element);

if (!executor->checkTimeLimit())
Member:

Note: if we throw exception here, executor is not stopped (until the moment of ~PullingPipelineExecutor)

Member Author:

I understand that's not an issue, right? If nobody catches the exception, the executor destructor will be called; if somebody does catch it, they will need to handle it, and eventually the executor will be cancelled.

Member:

It is still not ideal. E.g. the executor itself may continue working (e.g. in case of a bug), and we may wait for a long time in the destructor.
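The concern here is that an exception unwinds past a still-running executor, and nothing stops it until its destructor blocks. One way to bound that wait is to cancel explicitly before letting the exception propagate. A minimal sketch, with hypothetical names (not the actual PullingPipelineExecutor code):

```cpp
#include <stdexcept>

// Stand-in for a pipeline executor whose destructor would otherwise have
// to wait for in-flight work to drain.
struct Executor
{
    bool running = true;
    bool cancelled = false;

    void cancel() { cancelled = true; running = false; }

    ~Executor() { /* in the real code: block until `running` is false */ }
};

// Returns true while pulling may continue. If the time-limit check throws,
// the executor is cancelled before the exception continues unwinding, so
// the destructor does not have to wait on still-running processors.
bool pullWithTimeout(Executor & executor, bool time_limit_exceeded)
{
    try
    {
        if (time_limit_exceeded)
            throw std::runtime_error("Timeout exceeded");
        return true;
    }
    catch (...)
    {
        executor.cancel();
        throw;
    }
}
```

This keeps the "throw on timeout" semantics while guaranteeing the executor is already stopping by the time any caller (or the destructor) sees the exception.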

Comment on lines +49 to +51
auto settings = process_list_element->context.lock()->getSettings();
limits.max_execution_time = settings.max_execution_time;
overflow_mode = settings.timeout_overflow_mode;
Member:

I think it may be more reasonable to add the timeout check logic into process_list_element.

Member Author:

Makes sense to me. I'll add a checkTimeLimit() to process_list_element and use it from the PipelineExecutor.

@Algunenano (Member Author)

Hey @KochetovNicolai, does this need any changes or improvements? I want to build on top of this to be able to cancel scalar subqueries (#1576), but I'd rather keep things separate; otherwise I end up with large PRs that are hard to review.

Comment on lines +135 to +139
bool continuing = process_list_element->checkTimeLimit();
// We call cancel here so that all processors are notified and tasks are woken
// up, so that the "break" is faster and doesn't wait for long events
if (!continuing)
cancel();
Member:

Well, I don't like that we don't call cancel in case of an exception from process_list_element->checkTimeLimit().

What can we do:

  • add a softer check to process_list_element, something like bool isTimeLimitExceeded(), which never throws
  • PipelineExecutor::checkTimeLimit uses this isTimeLimitExceeded version and just cancels. So it's like a soft check.
  • In other places, like PipelineExecutor::finalizeExecution, we call the regular process_list_element->checkTimeLimit() and throw an exception if needed.

Member Author:

What do you think of having something like:

bool PipelineExecutor::checkTimeLimitSoft()
{
    try
    {
        return checkTimeLimit();
    }
    catch (...)
    {
        cancel();
        return false;
    }
}

Then I'd call checkTimeLimitSoft inside PullingPipelineExecutor and PullingAsyncPipelineExecutor and in the executeStepImpl loop, and checkTimeLimit at the start and end of the execution. I think that fits what you were saying, right?

Member:

Yes, the idea is the same.
But let's try to avoid a try/catch (...) that ignores the exception, if possible :)
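The try/catch-free variant being suggested can be sketched like this: a soft check that never throws, which the executor uses to cancel early, while the throwing check is reserved for finalization. Names are illustrative, not the actual ClickHouse API:

```cpp
#include <atomic>
#include <stdexcept>

// Stand-in for process_list_element with both check flavours.
struct ProcessListEntry
{
    std::atomic<bool> time_limit_exceeded{false};

    // Soft check: never throws, just reports.
    bool isTimeLimitExceeded() const { return time_limit_exceeded.load(); }

    // Hard check: throws on timeout; used at the start/end of execution.
    bool checkTimeLimit() const
    {
        if (time_limit_exceeded.load())
            throw std::runtime_error("Timeout exceeded");
        return true;
    }
};

struct Executor
{
    ProcessListEntry & entry;
    bool cancelled = false;

    void cancel() { cancelled = true; }

    // Soft check used inside the execution loop: cancels and returns false
    // on timeout. No try/catch needed, because nothing here throws.
    bool checkTimeLimitSoft()
    {
        if (entry.isTimeLimitExceeded())
        {
            cancel();
            return false;
        }
        return true;
    }
};
```

The loop calls checkTimeLimitSoft() and stops cleanly; the exception (when wanted) is produced only by the hard checkTimeLimit() at a well-defined point, so no exception ever needs to be swallowed.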

@KochetovNicolai (Member)

Sorry for not commenting for so long.

@KochetovNicolai (Member)

We just need to check the time limit on every iteration of executeImpl, but avoid calling clock_gettime too often.
Maybe add a timer_fd or something. But not in this PR.
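One common way to avoid sampling the clock on every iteration is to only consult it every N calls. A sketch of the idea (illustrative; the class name and N are assumptions, not ClickHouse's implementation):

```cpp
#include <chrono>

// Deadline check that amortizes the cost of clock_gettime by sampling the
// steady clock only once every `check_every` calls; the other calls just
// bump a counter and report "not expired yet".
class ThrottledDeadline
{
    std::chrono::steady_clock::time_point deadline;
    unsigned counter = 0;
    static constexpr unsigned check_every = 128;  // arbitrary example value

public:
    explicit ThrottledDeadline(std::chrono::milliseconds budget)
        : deadline(std::chrono::steady_clock::now() + budget) {}

    // Returns true once the deadline has passed. Detection may lag by up to
    // check_every - 1 calls: the price of skipping most clock reads.
    bool expired()
    {
        if (++counter % check_every != 0)
            return false;
        return std::chrono::steady_clock::now() >= deadline;
    }
};
```

A timer_fd (or a background thread flipping an atomic flag) trades this bounded lag for an exact wakeup, which is presumably why it is mentioned as the longer-term option.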

@KochetovNicolai KochetovNicolai merged commit 91c4c89 into ClickHouse:master Dec 6, 2021
@CurtizJ CurtizJ mentioned this pull request Feb 1, 2022

Development

Successfully merging this pull request may close these issues.

  • Flaky 00613_shard_distributed_max_execution_time
  • JOIN + GROUP BY doesn't respect timeouts or KILL requests

5 participants