Skip to content

Conversation

@gortiz
Copy link
Contributor

@gortiz gortiz commented Feb 5, 2025

This extensive PR is focused on improving error handling on MSE, but partially in SSE and TSE. Error handling includes changes in the error responses sent to clients but also logs in servers, brokers and controllers.

Logging has been reduced to not logging stack traces when they are not useful. Logging is also improved to include the request-id in MSE and SSE queries and stage id in MSE queries. Instead of spreading the query context to all places where logs are created, this PR injects these values using MDC. This means that:

  1. Code doesn't have to change that much.
  2. However, to modify the MDC in some parts, I needed to add a try-catch. I recommend reviewing the PR with the hidden white chances.
  3. To log these properties, log4j2.xml needs to be changed. This PR includes a change in the log4j2.xml used for quickstars. Actual deployments will likely use their log4j2.xml files, and they will need to be modified to log request IDs and stage IDs. Please remember that this is optional. Logs are still as helpful as before, even if MDC properties are unused. Specifically, logs that already included the request ID or stage ID haven't been changed.

This PR tries to solve most problems in #14950. I was trying to split changes in different PRs, but given how each change affects the other, it was difficult.

It is difficult to compare logs in here (or to add tests that verify logs), so I've created this document showing the changes. Notice there is one tab per case of study

@gortiz gortiz force-pushed the improve_exceptions_broker branch from b9c2b52 to 6729d27 Compare February 5, 2025 10:54
@codecov-commenter
Copy link

codecov-commenter commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 7.03851% with 700 lines in your changes missing coverage. Please review.

Project coverage is 33.94%. Comparing base (59551e4) to head (e4a16b2).
Report is 1673 commits behind head on master.

Files with missing lines Patch % Lines
...va/org/apache/pinot/spi/trace/MdcRequestScope.java 0.00% 161 Missing ⚠️
...requesthandler/MultiStageBrokerRequestHandler.java 0.00% 95 Missing ⚠️
...t/controller/api/resources/PinotQueryResource.java 23.07% 58 Missing and 2 partials ⚠️
...che/pinot/core/transport/grpc/GrpcQueryServer.java 0.00% 41 Missing ⚠️
.../apache/pinot/core/query/executor/MdcExecutor.java 0.00% 33 Missing ⚠️
...not/query/mailbox/channel/MailboxStatusLogger.java 0.00% 31 Missing ⚠️
.../pinot/query/service/dispatch/QueryDispatcher.java 0.00% 25 Missing ⚠️
...not/query/runtime/operator/MultiStageOperator.java 0.00% 21 Missing ⚠️
...va/org/apache/pinot/spi/trace/LoggerConstants.java 0.00% 19 Missing ⚠️
...sthandler/BaseSingleStageBrokerRequestHandler.java 6.25% 15 Missing ⚠️
... and 32 more

❗ There is a different number of reports uploaded between BASE (59551e4) and HEAD (e4a16b2). Click for more details.

HEAD has 43 uploads less than BASE
Flag BASE (59551e4) HEAD (e4a16b2)
integration 7 0
integration2 3 0
temurin 12 3
java-21 7 2
skip-bytebuffers-true 3 1
skip-bytebuffers-false 7 2
unittests 5 3
unittests1 2 0
java-11 5 1
integration1 2 0
custom-integration1 2 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #14994       +/-   ##
=============================================
- Coverage     61.75%   33.94%   -27.82%     
- Complexity      207      680      +473     
=============================================
  Files          2436     2718      +282     
  Lines        133233   152591    +19358     
  Branches      20636    23541     +2905     
=============================================
- Hits          82274    51792    -30482     
- Misses        44911    96615    +51704     
+ Partials       6048     4184     -1864     
Flag Coverage Δ
custom-integration1 ?
integration ?
integration1 ?
integration2 ?
java-11 33.91% <7.03%> (-27.80%) ⬇️
java-21 33.93% <7.03%> (-27.69%) ⬇️
skip-bytebuffers-false 33.94% <7.03%> (-27.81%) ⬇️
skip-bytebuffers-true 33.92% <7.03%> (+6.19%) ⬆️
temurin 33.94% <7.03%> (-27.82%) ⬇️
unittests 33.94% <7.03%> (-27.81%) ⬇️
unittests1 ?
unittests2 33.94% <7.03%> (+6.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

* exception, this exception is designed to be thrown and caught by the query engine as any other standard Java
* exception.
*/
public class QException extends PinotRuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest naming this QueryException. Since you are cleaning up org.apache.pinot.common.exception.QueryException, there is a TODO there to rename that to QueryExceptionUtils, and you can do it together to avoid name conflict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to have an even longer PR, but if you think it is the best way, I'll try to do that as well

import java.util.Map;


public class MdcRequestScope implements RequestScope {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some javadoc. What does mdc stand for?

public float toFloat(Object value) {
// NOTE: No need to trim here because Float.valueOf() will trim the string
return Float.parseFloat(value.toString());
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is used in a lot of places and we want to ensure highest performance. It expects the input to be properly formatted, and it is not only used by the query engine. Can you do the try-catch on the caller side?

public void shutDown() {
_queryDispatcher.shutdown();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) revert

@gortiz gortiz marked this pull request as draft February 6, 2025 15:10
@gortiz
Copy link
Contributor Author

gortiz commented Feb 6, 2025

I'm convering the PR into a draft given I've found some stack traces leaking in the tests. My plan is to apply some extra changes. If it ends up being too difficult to review I'll try to split it once I reach the desired state

@gortiz
Copy link
Contributor Author

gortiz commented Mar 19, 2025

This have been split into several other PRs. Some of them have been merged and #15277 is at the time the only one open

@gortiz gortiz closed this Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants