-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Make column order deterministic in segment #10468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make column order deterministic in segment #10468
Conversation
7242a4c to
2c6042b
Compare
dddcf7d to
ce87ddb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test different column orders give same CRC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, different column order will give different CRC. This test is used to make sure the order is always the same
ce87ddb to
3aa78db
Compare
056ece1 to
d9b574c
Compare
Codecov Report
@@ Coverage Diff @@
## master #10468 +/- ##
============================================
+ Coverage 64.10% 68.80% +4.70%
- Complexity 6091 6116 +25
============================================
Files 2036 2090 +54
Lines 109727 112169 +2442
Branches 16782 17080 +298
============================================
+ Hits 70338 77180 +6842
+ Misses 34236 29505 -4731
- Partials 5153 5484 +331
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 401 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
3a29923 to
829cea0
Compare
829cea0 to
13457cc
Compare
| String getEndOffset(); | ||
|
|
||
| default Set<String> getAllColumns() { | ||
| default NavigableSet<String> getAllColumns() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we directly use TreeSet instead of NavigableSet if there is no other implementation expected .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an ordered set, but not a TreeSet because the key set of the TreeMap is not a TreeSet
| .isEqual(_primaryKeyColumns, that._primaryKeyColumns) && EqualityUtils | ||
| .isEqual(_hasJSONColumn, that._hasJSONColumn); | ||
| //@formatter:off | ||
| return EqualityUtils.isEqual(_schemaName, that._schemaName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why removed _fieldSpecMap and _hasJSONColumn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are derived fields (similarly we don't include _dimensionNames etc.). Ideally we should define them as transient, but java ser/de still need to serialize these fields.
xiangfu0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
|
A high level question, will this handle the scenario that during schema evolution, a new column is created with default value and the CRC is still not change? |
Piggy back another high level question: why adding a column needs to change the original segment :) |
|
Thank you for fixing this! Glad to know this PR got merged! |
In segment metadata and index map, store columns in alphabetical order so that the result is deterministic.
This is very important to ensure segment generated from the same data has the same CRC. When multiple servers consume from the same stream, it is critical that all segments generated has the same CRC, or server will need to download a new segment copy when restart due to CRC mismatch. This can cause significant longer server restart time.
Release Notes
Segment generated before/after this PR will have different CRC, so during the upgrade, we might get segments with different CRC from old and new consuming server. For the segment consumed during the upgrade, some downloads might be needed.