Skip to content

Conversation

@walterddr
Copy link
Contributor

@walterddr walterddr commented Mar 13, 2023

seeing failure since merged of #10322

since failure is not so reproducible. targeting issues might be related to the in-mem vs. grpc mailbox usage.

@walterddr walterddr force-pushed the hotfix_disable_unstable_test branch from 612e961 to 5fa5da2 Compare March 13, 2023 23:35
@Jackie-Jiang
Copy link
Contributor

I feel this might catch a real bug, thus we should fix the bug instead of disabling the test

@walterddr walterddr marked this pull request as draft March 14, 2023 00:06
@somandal
Copy link
Contributor

@walterddr quick question, is this regarding the following failure:

Error:  Failures: 
Error:    QueryRunnerTest.testSqlWithExceptionMsgChecker:204 Exception should contain: For input string:! but found: Received error query execution result block: {1000=io.grpc.StatusRuntimeException: CANCELLED: client cancelled
java.lang.RuntimeException: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
	at org.apache.pinot.query.mailbox.channel.MailboxContentStreamObserver.createErrorContent(MailboxContentStreamObserver.java:163)
	at org.apache.pinot.query.mailbox.channel.MailboxContentStreamObserver.onError(MailboxContentStreamObserver.java:143)
	at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:288)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:378)
...
Caused by: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
	at io.grpc.Status.asRuntimeException(Status.java:530)
	at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:291)
	... 8 more} expected [true] but found [false]
[INFO] 
Error:  Tests run: 1020, Failures: 1, Errors: 0, Skipped: 0

My PR has run into this a couple of times, but when I run this test locally it passes

@walterddr
Copy link
Contributor Author

@walterddr quick question, is this regarding the following failure:

Error:  Failures: 
Error:    QueryRunnerTest.testSqlWithExceptionMsgChecker:204 Exception should contain: For input string:! but found: Received error query execution result block: {1000=io.grpc.StatusRuntimeException: CANCELLED: client cancelled
java.lang.RuntimeException: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
	at org.apache.pinot.query.mailbox.channel.MailboxContentStreamObserver.createErrorContent(MailboxContentStreamObserver.java:163)
	at org.apache.pinot.query.mailbox.channel.MailboxContentStreamObserver.onError(MailboxContentStreamObserver.java:143)
	at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:288)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:378)
...
Caused by: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
	at io.grpc.Status.asRuntimeException(Status.java:530)
	at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:291)
	... 8 more} expected [true] but found [false]
[INFO] 
Error:  Tests run: 1020, Failures: 1, Errors: 0, Skipped: 0

My PR has run into this a couple of times, but when I run this test locally it passes

Correct. i wasn't able to repro even running 100+ times in a loop locally. CC @ankitsultana it seems like the issue surfaced after #10322 but i am not 100% sure that's the issue.

@ankitsultana
Copy link
Contributor

Will take a look and get back. I have seen this error on my local as well once in the past (and for the least() query too).

@walterddr walterddr force-pushed the hotfix_disable_unstable_test branch 3 times, most recently from daad576 to 586c8c0 Compare March 14, 2023 22:30
@ankitsultana
Copy link
Contributor

Found the issue: if an OpChain runs into a error, then it tries to send the error-block to the receiver and then return.

If the OpChain returns a error-block, we issue a cancel. This has a race with the error-block sent upstream because sends are async.

Will raise a fix shortly. Thanks folks for reporting this.

@ankitsultana
Copy link
Contributor

Raised: #10425

And more follow-up items in: #10424

Main thing is the design doc which I have been punting for some time. I'll try to close it this week. I was planning to discuss all these scenarios in the same but couldn't get time to write it down.

@walterddr walterddr closed this Mar 15, 2023
@walterddr walterddr reopened this Mar 15, 2023
@walterddr walterddr force-pushed the hotfix_disable_unstable_test branch from 586c8c0 to 90e779b Compare March 15, 2023 22:55
@walterddr
Copy link
Contributor Author

repush another attempt for fixing the issue

@walterddr walterddr force-pushed the hotfix_disable_unstable_test branch from 90e779b to b489773 Compare March 15, 2023 23:35
@codecov-commenter
Copy link

codecov-commenter commented Mar 16, 2023

Codecov Report

Merging #10417 (0f4cfa8) into master (11b6bcd) will decrease coverage by 35.24%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #10417       +/-   ##
=============================================
- Coverage     63.23%   27.99%   -35.24%     
+ Complexity     5068       58     -5010     
=============================================
  Files          2028     2028               
  Lines        110632   110635        +3     
  Branches      16847    16847               
=============================================
- Hits          69955    30977    -38978     
- Misses        35505    76575    +41070     
+ Partials       5172     3083     -2089     
Flag Coverage Δ
integration1 24.55% <0.00%> (+0.16%) ⬆️
integration2 24.37% <0.00%> (+0.06%) ⬆️
unittests1 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...apache/pinot/query/mailbox/GrpcSendingMailbox.java 0.00% <0.00%> (-81.58%) ⬇️

... and 1170 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@walterddr
Copy link
Contributor Author

will close this PR for now. it showed the proper root cause and several fixes

@walterddr walterddr closed this Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants