Skip to content

Conversation

@vrajat
Copy link
Contributor

@vrajat vrajat commented Jul 15, 2024

OOM Protection in Pinot depends on accounting of resource usage of threads that execute a query. Resource accounting is currently supported for Single-Stage queries only. This PR adds instrumentation for Multi-Stage query engine as well. OOM protection is automatically available as resource usage is accounted in same data structures checked by the OOM protection thread.

Tracing recognizes two types of threads: runners and workers. Runner submits worker tasks to an executor service. Runner setups a parent context with identification information such as query id. The id information in parent context is used by workers when reporting resource usage.

Query execution in broker and servers is tracked. In the broker, the runner is the same thread that executes MultiStageRequestBrokerHandler. In the server, the runner is the QueryServer. Workers are setup in OpChainSchedulerService.

The following operators have been instrumented. Instrumentation calls Tracing.sample() to sample resource usage after processing a block from the input operator.

  • AggregateOperator
  • HashJoinOperator
  • SetOperator
  • SortOperator
  • WindowAggregateOperator
    These operators where chosen because they allocate memory.

MailBoxSendOperator and MailboxReceiveOperator have also been instrumented to sample usage across other operators.

Tags: multi-stage, observability

@vrajat vrajat changed the title Instrument Multi-Stage Operator Execution to Track Resource Usage Initialize Resource Usage Accountant in Multi-Stage Operators Jul 15, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 63.63636% with 20 lines in your changes missing coverage. Please review.

Project coverage is 57.81%. Comparing base (59551e4) to head (a1fb062).
Report is 1815 commits behind head on master.

Files with missing lines Patch % Lines
.../main/java/org/apache/pinot/spi/trace/Tracing.java 0.00% 8 Missing ⚠️
...e/pinot/spi/accounting/ThreadExecutionContext.java 0.00% 4 Missing ⚠️
...accounting/CPUMemThreadLevelAccountingObjects.java 50.00% 3 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java 0.00% 2 Missing ⚠️
...not/query/runtime/operator/MultiStageOperator.java 50.00% 1 Missing and 1 partial ⚠️
...re/accounting/PerQueryCPUMemAccountantFactory.java 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13598      +/-   ##
============================================
- Coverage     61.75%   57.81%   -3.95%     
+ Complexity      207      197      -10     
============================================
  Files          2436     2587     +151     
  Lines        133233   142508    +9275     
  Branches      20636    21888    +1252     
============================================
+ Hits          82274    82386     +112     
- Misses        44911    53667    +8756     
- Partials       6048     6455     +407     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 57.77% <63.63%> (-3.94%) ⬇️
java-21 57.70% <63.63%> (-3.93%) ⬇️
skip-bytebuffers-false 57.80% <63.63%> (-3.94%) ⬇️
skip-bytebuffers-true 57.66% <63.63%> (+29.93%) ⬆️
temurin 57.81% <63.63%> (-3.95%) ⬇️
unittests 57.80% <63.63%> (-3.95%) ⬇️
unittests1 40.52% <66.03%> (-6.37%) ⬇️
unittests2 27.92% <0.00%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vrajat vrajat force-pushed the rv-v2-thread-usage branch from f8d844a to fb681c3 Compare July 16, 2024 05:17
@vrajat vrajat changed the title Initialize Resource Usage Accountant in Multi-Stage Operators Add OOM Protection Support for Multi-Stage Queries Jul 18, 2024
@vrajat vrajat force-pushed the rv-v2-thread-usage branch from 06d32bd to e4f421a Compare July 22, 2024 16:55
@vrajat vrajat marked this pull request as ready for review July 25, 2024 08:15
Copy link
Contributor

@jasperjiaguo jasperjiaguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still reviewing a few files will finish by this week

if ((size & size - 1) == 0 && size < _maxRowsInHashTable && size < Integer.MAX_VALUE / 2) { // is power of 2
hashCollection.ensureCapacity(Math.min(size << 1, _maxRowsInHashTable));
}
hashCollection.add(row);
Copy link
Contributor

@jasperjiaguo jasperjiaguo Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered the option to do sampleAndCheckInterruptionPeriodically() every X rows in this loop? Similar for some other operators. If the rightBlock size itself is large and the operation itself is mem-heavy (e.g. heap, hashmap add) the OOM may happen within a block if we do not sample at a finer granularity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look into that. I chose to check for every block for a couple of reasons.

  • The block size is ~10000 rows which is very similar to the constant - 8192.
  • It was easier to call this function after every block and not worry about specific if-else & try-catch in every implementation.

In this specific case too, the hash table is being built and sample/interruption will run every 10000 rows.

@@ -0,0 +1,246 @@
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add a test case of query being killed in MSE execution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the trheading model is different, I couldnt make those tests work. I'll give it a try again.

@vrajat vrajat force-pushed the rv-v2-thread-usage branch from 015d0bf to 214d5bb Compare August 21, 2024 05:43
@gortiz gortiz self-assigned this Sep 2, 2024
@vrajat vrajat force-pushed the rv-v2-thread-usage branch from 214d5bb to 59e3caf Compare September 3, 2024 04:23
public static PipelineBreakerResult executePipelineBreakers(OpChainSchedulerService scheduler,
MailboxService mailboxService, WorkerMetadata workerMetadata, StagePlan stagePlan,
Map<String, String> opChainMetadata, long requestId, long deadlineMs) {
Map<String, String> opChainMetadata, long requestId, long deadlineMs, ThreadExecutionContext parentContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parentContext should be nullable.

nit/suggestion: Can we have a executePipelineBreaker(OpChainSchedulerService scheduler,
MailboxService mailboxService, WorkerMetadata workerMetadata, StagePlan stagePlan,
Map<String, String> opChainMetadata, long requestId, long deadlineMs) that just calls this method with null as last argument? That would reduce this PR by a large margin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change. This was useful in PipelineBreakerTest only. In other usages, a variable is passed through and cannot be removed.

@vrajat
Copy link
Contributor Author

vrajat commented Sep 3, 2024

Test failure is unrelated. Looks like a flaky test

@gortiz gortiz merged commit d008709 into apache:master Sep 3, 2024
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes observability multi-stage Related to the multi-stage query engine labels Sep 3, 2024
rajagopr pushed a commit to rajagopr/pinot that referenced this pull request Sep 11, 2024
track cpu and memory usage in multi-stage queries if query resource usage tracking is enabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature multi-stage Related to the multi-stage query engine observability release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants