Add integration test for scan rows with selection by Ted-Jiang · Pull Request #2158 · apache/arrow-rs

Ted-Jiang · 2022-07-25T06:58:14Z

Which issue does this PR close?

Closes #2106.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2022-07-25T07:25:51Z

Codecov Report

Merging #2158 (881752c) into master (1621c71) will increase coverage by 0.21%.
The diff coverage is 90.24%.

@@            Coverage Diff             @@
##           master    #2158      +/-   ##
==========================================
+ Coverage   82.85%   83.06%   +0.21%     
==========================================
  Files         237      237              
  Lines       61381    61620     +239     
==========================================
+ Hits        50856    51186     +330     
+ Misses      10525    10434      -91

Impacted Files	Coverage Δ
...et/src/arrow/array_reader/byte_array_dictionary.rs	`85.76% <0.00%> (-1.21%)`	⬇️
parquet/src/arrow/array_reader/null_array.rs	`70.96% <0.00%> (-10.52%)`	⬇️
...uet/src/arrow/array_reader/complex_object_array.rs	`94.02% <66.66%> (+0.82%)`	⬆️
parquet/src/arrow/array_reader/byte_array.rs	`86.20% <75.00%> (+0.41%)`	⬆️
parquet/src/arrow/array_reader/primitive_array.rs	`89.83% <75.00%> (+0.81%)`	⬆️
parquet/src/column/reader.rs	`68.79% <83.33%> (+5.92%)`	⬆️
parquet/src/arrow/arrow_reader.rs	`95.72% <97.08%> (+2.95%)`	⬆️
parquet/src/arrow/record_reader/mod.rs	`94.22% <100.00%> (+4.62%)`	⬆️
arrow/src/array/iterator.rs	`86.45% <0.00%> (-13.55%)`	⬇️
arrow/src/util/decimal.rs	`86.92% <0.00%> (-4.59%)`	⬇️
... and 24 more

Ted-Jiang · 2022-07-25T07:38:48Z

parquet/src/arrow/arrow_reader.rs

                    Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop


Fixes incorrect logic: the remaining records need to be read, not skipped.

Ted-Jiang · 2022-07-25T07:39:28Z

parquet/src/arrow/record_reader/mod.rs

+            None => {
+                // If we skip records before all read operation
+                // we need set `column_reader` by `set_page_reader`
+                if let Some(page_reader) = pages.next() {


Fixes skipping before any read operation: the column_reader needs to be set first.

Ted-Jiang · 2022-07-25T08:50:30Z

@tustvold @thinkharderdev PTAL😊

tustvold

Had a brief look, will review in more detail later (flying today)

parquet/src/arrow/arrow_reader.rs

parquet/src/arrow/record_reader/mod.rs

tustvold · 2022-07-25T14:44:11Z

parquet/src/arrow/record_reader/mod.rs

+                // we need set `column_reader` by `set_page_reader`
+                if let Some(page_reader) = pages.next() {
+                    self.set_page_reader(page_reader?)?;
+                    false


This is wrong, as it will now only mark end_of_column when it reaches the end of the file, instead of the end of a column chunk within a row group. This will break record delimiting for repeated fields.
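To illustrate why the end-of-column signal matters for repeated fields, here is a minimal sketch. It is not the parquet crate's implementation: `count_complete_records` is a hypothetical helper, and it assumes the buffer of repetition levels starts at a record boundary.

```rust
/// Counts the records known to be complete in a buffer of repetition levels.
/// Hypothetical helper for illustration only.
///
/// A repetition level of 0 marks the first value of a new record, so a record
/// is only known to be finished once the next 0 is seen or once the column
/// chunk has ended.
fn count_complete_records(rep_levels: &[i16], end_of_column: bool) -> usize {
    // Every 0 starts a record; each start except the last is closed by the
    // 0 that follows it.
    let starts = rep_levels.iter().filter(|&&level| level == 0).count();
    if starts == 0 {
        return 0;
    }
    if end_of_column {
        // The trailing record cannot grow any further.
        starts
    } else {
        // The trailing record may continue on the next page.
        starts - 1
    }
}
```

In this model, if the end-of-column flag were only raised at the end of the file, the trailing record of every earlier column chunk would never be reported as complete at the chunk boundary.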

@tustvold I moved it out to:

    fn skip_records(&mut self, num_records: usize) -> Result<usize> {
        if self.record_reader.column_reader().is_none() {
            // If we skip records before all read operation
            // we need set `column_reader` by `set_page_reader`
            if let Some(page_reader) = self.pages.next() {
                self.record_reader.set_page_reader(page_reader?)?;
            } else {
                return Ok(0);
            }
        }
        self.record_reader.skip_records(num_records)
    }

I think the column_reader is None only in this situation: skipping the first page without having read any records. Related: #2171. If we create it in the column chunk, we can remove this check.

tustvold · 2022-07-25T14:44:53Z

parquet/src/column/reader.rs

                // If page has less rows than the remaining records to
                // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {


Why is this necessary? There is already an outer while loop.

Because I first added the lines below:

    // because self.num_buffered_values == self.num_decoded_values means
    // we need reads a new page and set up the decoders for levels
    self.read_new_page()?;

If we still used `if`, we might read a needless page header.

This while loop should result in the same behaviour as the previous continue??

Oh... it's a useless loop.
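The point about the outer loop can be seen in a simplified model. This is a sketch under stated assumptions, not the code in `parquet/src/column/reader.rs`: pages are modelled only by their row counts, `skip_rows` is a hypothetical function, and a page whose row count equals the remaining count is treated as fully skippable.

```rust
use std::collections::VecDeque;

/// Skips up to `num_records` rows over pages described only by their row
/// counts (a hypothetical model of page metadata). Returns the rows skipped.
fn skip_rows(pages: &mut VecDeque<usize>, num_records: usize) -> usize {
    let mut remaining = num_records;
    // The outer loop re-checks the condition for every page, so a single
    // `if` followed by `continue` is enough; an inner `while` adds nothing.
    while remaining > 0 {
        let page_rows = match pages.front() {
            Some(&rows) => rows,
            None => break, // no pages left
        };
        if page_rows <= remaining {
            // The whole page is skipped without decoding it.
            pages.pop_front();
            remaining -= page_rows;
            continue;
        }
        // Partial page: consume only the rows still to be skipped.
        pages[0] = page_rows - remaining;
        remaining = 0;
    }
    num_records - remaining
}
```

Calling `skip_rows` with three 10-row pages and a request of 25 drops the first two pages whole and leaves 5 rows in the third.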

alamb · 2022-07-25T17:03:13Z

Thank you @Ted-Jiang -- the project to add page index and skipping is really coming along very nicely. It is a very nice piece of work.

tustvold · 2022-07-26T14:31:16Z

parquet/src/arrow/array_reader/byte_array.rs

    }

    fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        if self.record_reader.column_reader().is_none() {


This now behaves differently from next_batch which will potentially read from multiple column chunks for the same "batch". Can we extract this logic into a free function, similar to read_records, that performs the same loop?

This would also avoid duplicating this code in every ArrayReader
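A shared free function along these lines might look like the sketch below. Everything here is an assumption made for illustration: the `SkipSource` trait, its two methods, and the toy `Chunks` type are hypothetical and do not mirror the crate's `ArrayReader` or `RecordReader` interfaces.

```rust
/// Hypothetical minimal interface for this sketch; the real types differ.
trait SkipSource {
    /// Skips up to `n` records in the current column chunk.
    fn skip_in_chunk(&mut self, n: usize) -> Result<usize, String>;
    /// Advances to the next column chunk; `Ok(false)` if none remain.
    fn next_chunk(&mut self) -> Result<bool, String>;
}

/// Shared loop: keeps skipping across column chunks until `to_skip` records
/// have been skipped or the chunks run out.
fn skip_records<S: SkipSource>(source: &mut S, to_skip: usize) -> Result<usize, String> {
    let mut skipped = 0;
    while skipped < to_skip {
        skipped += source.skip_in_chunk(to_skip - skipped)?;
        // A short count means the current chunk is exhausted.
        if skipped < to_skip && !source.next_chunk()? {
            break;
        }
    }
    Ok(skipped)
}

/// Toy source: each entry is the number of records left in one chunk.
struct Chunks {
    remaining: Vec<usize>,
    current: usize,
}

impl SkipSource for Chunks {
    fn skip_in_chunk(&mut self, n: usize) -> Result<usize, String> {
        match self.remaining.get_mut(self.current) {
            Some(left) => {
                let skipped = n.min(*left);
                *left -= skipped;
                Ok(skipped)
            }
            None => Ok(0),
        }
    }

    fn next_chunk(&mut self) -> Result<bool, String> {
        if self.current + 1 < self.remaining.len() {
            self.current += 1;
            Ok(true)
        } else {
            Ok(false)
        }
    }
}
```

Keeping the loop in one place means every reader that implements the small interface gets the same cross-chunk behaviour for skipping as for reading.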

tustvold · 2022-07-26T14:32:34Z

parquet/src/arrow/arrow_reader.rs

+                        selection.push_front(RowSelection::select(remaining));
                        self.batch_size
                    }
+                    Some(_) => self.batch_size,


Suggested change

-                    Some(_) => self.batch_size,
+                    _ => self.batch_size,

And remove the None case below. If remaining == 0 then front.row_count == self.batch_size

Yes, more elegant 👍
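The batch-splitting logic discussed in this thread can be sketched as follows. This is a minimal illustration, not the code in `parquet/src/arrow/arrow_reader.rs`: the `RowSelection` struct and its fields are hypothetical stand-ins, `next_read_size` is a made-up helper, and the single fallback arm returning `front.row_count` is one possible way to fold the two cases, relying on the observation that a remainder of zero means the row count equals the batch size.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the selection type used in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowSelection {
    row_count: usize,
    skip: bool,
}

impl RowSelection {
    fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }

    fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

/// Pops the front selection and returns how many rows to read next.
/// Hypothetical helper; assumes the front selection is a `select`.
fn next_read_size(selection: &mut VecDeque<RowSelection>, batch_size: usize) -> Option<usize> {
    let front = selection.pop_front()?;
    debug_assert!(!front.skip);
    Some(match front.row_count.checked_sub(batch_size) {
        // More rows are selected than fit in one batch: the leftover rows
        // must still be read, so they go back on the queue as a `select`.
        Some(remaining) if remaining != 0 => {
            selection.push_front(RowSelection::select(remaining));
            batch_size
        }
        // Either row_count == batch_size or row_count < batch_size; in both
        // cases exactly `row_count` rows are read.
        _ => front.row_count,
    })
}
```

With a batch size of 10, a selection of 25 rows yields reads of 10, 10, and 5. Pushing the remainder back as a skip, as in the removed line, would drop the last 15 rows instead.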

tustvold · 2022-07-26T14:45:52Z

parquet/src/column/reader.rs

                }
+                // because self.num_buffered_values == self.num_decoded_values means
+                // we need reads a new page and set up the decoders for levels
+                self.read_new_page()?;


Perhaps we could check the return value of this, and short-circuit if it returns false?
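One way to read this suggestion is sketched below. The `SketchReader` type is hypothetical and heavily simplified; it only borrows the field and method names visible in the diff, and it assumes `read_new_page` reports with a boolean whether a page was loaded.

```rust
/// Hypothetical, simplified column reader used only for this sketch.
struct SketchReader {
    /// Row counts of the pages that have not been loaded yet.
    pending_pages: Vec<usize>,
    num_buffered_values: usize,
    num_decoded_values: usize,
}

impl SketchReader {
    /// Loads the next page; returns `Ok(false)` when no pages are left.
    fn read_new_page(&mut self) -> Result<bool, String> {
        if self.pending_pages.is_empty() {
            return Ok(false);
        }
        self.num_buffered_values = self.pending_pages.remove(0);
        self.num_decoded_values = 0;
        Ok(true)
    }

    /// Skips up to `num_records` rows and returns how many were skipped.
    fn skip_records(&mut self, num_records: usize) -> Result<usize, String> {
        let mut remaining = num_records;
        while remaining != 0 {
            if self.num_buffered_values == self.num_decoded_values {
                // The current page is exhausted: load another one, and
                // return early if the column has no more pages.
                if !self.read_new_page()? {
                    return Ok(num_records - remaining);
                }
            }
            let available = self.num_buffered_values - self.num_decoded_values;
            let to_skip = available.min(remaining);
            self.num_decoded_values += to_skip;
            remaining -= to_skip;
        }
        Ok(num_records - remaining)
    }
}
```

Returning early when no page is available lets the caller see a short count rather than having the loop continue against an empty reader.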

ursabot · 2022-07-27T09:31:46Z

Benchmark runs are scheduled for baseline = e96ae8a and contender = d10d962. d10d962 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Ted-Jiang · 2022-07-27T11:04:15Z

@tustvold thanks for your patient review 👍

tustvold · 2022-07-27T14:11:08Z

FYI I'm working on a follow up PR to address some stuff, e.g. get this integrated into the fuzz tests

Ted-Jiang added 5 commits July 25, 2022 10:07

add test

b701d22

fix some skip bug.

af8b54f

add it.

333288d

fix skip in head.

f277b24

refine test case

4b27280

github-actions bot added the parquet Changes to the parquet crate label Jul 25, 2022

fix fmt.

7153bee

fix clippy

25cb93d

tustvold reviewed Jul 25, 2022

tustvold reviewed Jul 25, 2022

fix comment.

881752c

tustvold reviewed Jul 26, 2022

fix comment

3f991d4

tustvold merged commit d10d962 into apache:master Jul 27, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this pull request Jul 27, 2022

Cleanup record skipping logic and tests (apache#2158)

2423211

tustvold mentioned this pull request Jul 27, 2022

Cleanup record skipping logic and tests (#2158) #2199

Merged

tustvold added a commit that referenced this pull request Jul 28, 2022

Cleanup record skipping logic and tests (#2158) (#2199)

cc96687
