Support Predicate Pushdown for Parquet Lists (#2108)#2999
Support Predicate Pushdown for Parquet Lists (#2108)#2999tustvold merged 5 commits intoapache:masterfrom
Conversation
|
Will update with rep_levels support |
|
I will review this carefully tomorrow morning! |
|
|
||
| fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> { | ||
| fn read(&mut self, out: &mut Self::Slice, mut range: Range<usize>) -> Result<usize> { | ||
| let read_from_buffer = match self.buffer.is_empty() { |
|
|
||
| loop { | ||
| if self.buffer.is_empty() { | ||
| // Read SKIP_BUFFER_SIZE as we don't know how many to read |
|
|
||
| // Find end of record | ||
| while to_skip < self.buffer.len() && self.buffer[to_skip] != 0 { | ||
| to_skip += 1; |
| } | ||
|
|
||
| level_skip += to_skip; | ||
| if to_skip >= self.buffer.len() { |
There was a problem hiding this comment.
Yeah, the hope is that >= helps LLVM elide bound checks that follow. Not sure it makes any difference, just habit 😅
| let mut read = 0; | ||
| let mut decoded = vec![]; | ||
| let mut expected = vec![]; | ||
| while read < encoded.len() { |
| } | ||
| } | ||
|
|
||
| const SKIP_BUFFER_SIZE: usize = 1024; |
There was a problem hiding this comment.
@tustvold choose SKIP_BUFFER_SIZE: usize = 1024, 1024* i16 just 2Kb, which supported by AVX_256, so this is the best choice 🤔 (i am not familiar with this, could you give me some inforamtion😂)
There was a problem hiding this comment.
It's a somewhat arbitrary number, it tries to strike a balance between small enough to not be overly wasteful but large enough that we don't end up repeatedly reading too few values
|
Benchmark runs are scheduled for baseline = fc58036 and contender = e2c4199. e2c4199 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2108
Rationale for this change
This PR adds support for skipping repetition levels by adding a buffer to ColumnLevelDecoderImpl. This buffer is also used to speed up skipping definition levels, as it allows for vectorised decoding.
What changes are included in this PR?
Are there any user-facing changes?