-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Enforce max rows in join limit on joined rows with left input as well #13922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13922 +/- ##
============================================
- Coverage 61.75% 57.81% -3.94%
+ Complexity 207 197 -10
============================================
Files 2436 2586 +150
Lines 133233 142465 +9232
Branches 20636 21885 +1249
============================================
+ Hits 82274 82362 +88
- Misses 44911 53651 +8740
- Partials 6048 6452 +404
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
| private boolean incrementJoinedRowsAndCheckLimit() throws ProcessingException { | ||
| _currentJoinedRows++; | ||
| if (_currentJoinedRows > _maxRowsInJoin) { | ||
| if (_joinOverflowMode == JoinOverFlowMode.THROW) { | ||
| throwProcessingExceptionForJoinRowLimitExceeded("Cannot process join, reached number of rows limit: " | ||
| + _maxRowsInJoin); | ||
| } else { | ||
| // Skip over remaining blocks until we reach the end of stream since we already breached the rows limit. | ||
| logger().info("Terminating join operator early as the maximum number of rows limit was reached: {}", | ||
| _maxRowsInJoin); | ||
| earlyTerminateLeftInput(); | ||
| _statMap.merge(StatKey.MAX_ROWS_IN_JOIN_REACHED, true); | ||
| return true; | ||
| } | ||
| } | ||
|
|
||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't _currentJoinedRows always be equal to _rows.size()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're incrementing _currentJoinedRows for every matched row and checking against the max rows limit so that we can exit early as soon as the max rows limit is breached. Alternatively, we could just modify it once after processing all the rows from a block, but that would be less accurate. This alternative would be slightly more efficient since there would be less checks and increments but it could still result in a very large number of joined rows being emitted depending on the block size. I don't think the overhead from the integer increment and limit check should be too concerning given the complexity of the other existing operations in each iteration of the main join loops, WDYT?
maxRowsInJoinquery option (https://docs.pinot.apache.org/users/user-guide-query/query-options#supported-query-options) /max_rows_in_joinjoin hint (https://docs.pinot.apache.org/users/user-guide-query/multi-stage-query/operator-types/hash_join#max_rows_in_join) is only applied to the in-memory hash table built from the join's right input.CROSS JOINlike query can cause servers to crash.THROW), an exception is thrown and the query fails. If the join overflow mode isBREAK, then we early terminate the left input and return the joined rows.