Stub out Skip Records API (#1792) #1998
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1998      +/-   ##
==========================================
- Coverage   83.58%   83.42%   -0.16%
==========================================
  Files         222      222
  Lines       57522    57906     +384
==========================================
+ Hits        48078    48309     +231
- Misses       9444     9597     +153
```

Continue to review full report at Codecov.
cool! 👍 @tustvold Are you the Flash 😄! I will try to go through this and give you my opinion today.
parquet/src/arrow/arrow_reader.rs
Outdated
Could we add a total_row_count to check that this selection is valid (e.g. that it is contiguous)?
Is it actually an issue if it isn't, e.g. if I only want the first 100 rows?
Yes, got it. The check should happen on the user side.
parquet/src/column/page.rs
Outdated
👍 really need this abstraction!
parquet/src/arrow/arrow_reader.rs
Outdated
👍 Passing the mask here rather than to each column is more reasonable 😂
parquet/src/column/page.rs
Outdated
Is it the case that here we only need the offset index, without the min/max index? 🤔
Is this for the situation where a page has already been partially consumed by read_records, leaving some unread data in the buffer?
Sorry, I don't get this point. Why not directly call column_reader.skip_records(num_records)? Could you give me a hint?
RecordReader is a bit of an odd cookie, let me try to explain what it is doing.
In the absence of repetition levels, it can simply read batch size levels, and the corresponding number of values.
However, if repetition levels are present, it will likely need to read more than batch_size levels in order to read batch_size actual records (rows).
To achieve this it reads to its internal buffer and then splits off the data corresponding to batch_size rows, leaving the excess behind.
It is this excess of data that has been read to its buffers but not yielded to the caller yet, which we must consume here
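The level-counting behaviour described above can be sketched as follows. This is an illustrative helper, not the arrow-rs implementation: `levels_for_records` is an invented name showing why, with repetition levels present, yielding `batch_size` records can require consuming more than `batch_size` levels (a record starts wherever the repetition level is 0).

```rust
/// Hypothetical helper (not part of arrow-rs): count how many levels must
/// be consumed to yield `num_records` complete records. A repetition level
/// of 0 marks the start of a new record, so nested values inflate the
/// number of levels per record.
fn levels_for_records(rep_levels: &[i16], num_records: usize) -> usize {
    let mut records = 0;
    for (i, &rep) in rep_levels.iter().enumerate() {
        if rep == 0 {
            if records == num_records {
                // The next record starts here: everything before index `i`
                // belongs to the requested records.
                return i;
            }
            records += 1;
        }
    }
    // Fewer than `num_records` records are buffered: consume everything.
    rep_levels.len()
}

fn main() {
    // Two records [a, b, c] and [d]; the trailing levels begin a third record.
    let rep_levels = [0, 1, 1, 0, 0, 1];
    // Reading 2 records consumes 4 levels; the excess stays buffered.
    assert_eq!(levels_for_records(&rep_levels, 2), 4);
}
```

It is exactly that excess (levels read but not yet attributable to a yielded record) that the RecordReader keeps in its internal buffers.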
👍 Nice write-up! It saved me some time 😄!
So, I get it now. Some more specific details to ask about:
This is part of the skip: we need to read the repetition and definition levels to skip some records in the page (records which may or may not have already been read).
```rust
let (buffered_records, buffered_values) = self.count_records(num_records);
self.num_records += buffered_records;
self.num_values += buffered_values;
self.consume_def_levels();
self.consume_rep_levels();
self.consume_record_data();
self.consume_bitmap();
self.reset();
let remaining = num_records - buffered_records;
```
This is also part of the skip: when remaining > 0, I think the skip starts at a new page.
```rust
if remaining == 0 {
    return Ok(buffered_records);
}
let skipped = match self.column_reader.as_mut() {
    Some(column_reader) => column_reader.skip_records(remaining)?,
    None => 0,
};
```
> This is part of the skip: we need to read the repetition and definition levels to skip some records in the page (records which may or may not have already been read).
Yes, this is just to consume the data that has been read into the internal buffers of RecordReader, if any.
> This is also part of the skip: when remaining > 0, I think the skip starts at a new page.
Not necessarily. The only thing RecordReader needs to handle is skipping any data that has already been read from ColumnReader into its own buffers. It can then delegate to ColumnReader to skip the remaining rows, with no requirement that this happens at a page boundary: ColumnReader must be able to handle any case.
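That two-phase behaviour can be sketched minimally. The names here (`Skipper`, `downstream_skipped`) are invented for illustration: the struct stands in for RecordReader's buffer, and the counter stands in for delegation to ColumnReader.

```rust
/// Illustrative sketch only: `Skipper` stands in for RecordReader, and the
/// `downstream_skipped` counter stands in for delegation to ColumnReader.
struct Skipper {
    /// Records already decoded into local buffers but not yet yielded.
    buffered: usize,
    /// Records skipped by the underlying column reader on our behalf.
    downstream_skipped: usize,
}

impl Skipper {
    fn skip_records(&mut self, num_records: usize) -> usize {
        // Phase 1: discard records already sitting in the local buffer.
        let from_buffer = self.buffered.min(num_records);
        self.buffered -= from_buffer;

        // Phase 2: delegate the remainder downstream. There is no
        // requirement that this lands on a page boundary; the downstream
        // reader must handle skips starting and ending anywhere.
        let remaining = num_records - from_buffer;
        self.downstream_skipped += remaining;

        from_buffer + remaining
    }
}

fn main() {
    let mut s = Skipper { buffered: 3, downstream_skipped: 0 };
    // Skipping 5 records drains 3 from the buffer and delegates 2 downstream.
    assert_eq!(s.skip_records(5), 5);
    assert_eq!(s.downstream_skipped, 2);
}
```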
Co-authored-by: Yang Jiang <[email protected]>
alamb
left a comment
The API looks good to me -- I had some questions, and I think it would be nicer to return NotImplemented errors rather than panic in certain cases, but I think this PR could also be merged as is to unblock further dev work.
```rust
fn skip_values(&mut self, _num_values: usize) -> Result<usize> {
    todo!()
}
```
I think adding a ticket reference here like `unimplemented!("See https://github.com/apache/arrow-rs/.....")` would help future readers. Bonus points for returning `ArrowError::Unimplemented`. This comment applies to everything below as well.
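A sketch of the suggested shape, using a plain `String` error in place of a real `ArrowError` variant (the exact variant name is an assumption, and the issue reference is illustrative):

```rust
/// Sketch only: a stub that surfaces a "not yet implemented" error instead
/// of panicking via `todo!()`. A real implementation would use the crate's
/// error type rather than `String`.
fn skip_values(_num_values: usize) -> Result<usize, String> {
    Err("skip_values is not yet implemented, see apache/arrow-rs#1792".to_string())
}

fn main() {
    // Callers get a recoverable error instead of a panic.
    assert!(skip_values(10).is_err());
}
```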
parquet/src/arrow/arrow_reader.rs
Outdated
```rust
/// Scan rows from the parquet file according to the provided `selection`
///
/// TODO: Make public once row selection fully implemented

/// [`RowSelection`] allows selecting or skipping a provided number of rows
/// when scanning the parquet file
#[derive(Debug, Clone, Copy)]
pub(crate) struct RowSelection {
```
You have probably already thought about this, but I would expect that in certain scenarios non-contiguous rows/skips would be desired, like "fetch the first 100 rows, skip the next 200, and then fetch the remaining". Would this interface handle that case?
See `with_row_selection`, which takes a `Vec` to allow for this use-case.
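For illustration, here is how a `Vec` of selections could express that "fetch, skip, fetch" pattern. The `row_count`/`skip` field names mirror the `RowSelection` struct stubbed in this PR, but treat the exact shape as an assumption about the eventual API:

```rust
/// Mirrors the shape of the `RowSelection` stub in this PR; treat the
/// field names as assumptions about the eventual public API.
#[derive(Debug, Clone, Copy)]
struct RowSelection {
    row_count: usize,
    skip: bool,
}

/// Total number of rows a selection would actually yield to the caller.
fn selected_rows(selection: &[RowSelection]) -> usize {
    selection
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}

fn main() {
    // "Fetch the first 100 rows, skip the next 200, then fetch the
    // remaining 700" expressed as a Vec of selections.
    let selection = vec![
        RowSelection { row_count: 100, skip: false },
        RowSelection { row_count: 200, skip: true },
        RowSelection { row_count: 700, skip: false },
    ];
    assert_eq!(selected_rows(&selection), 800);
}
```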
```rust
fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
    todo!()
}
```
ditto returning "not yet implemented" would probably be nicer
```diff
-pub struct DefinitionLevelDecoder {
+pub struct DefinitionLevelBufferDecoder {
```
Is this rename a public API change as well? It does not appear in the docs.
No, it is crate local.
Which issue does this PR close?
Part of #1792
Rationale for this change
Stubs out an API for providing skip records functionality within parquet. I think this will work to support #1792, #1191 and potentially other functionality down the line.
Let me know what you think @Ted-Jiang @sunchao
What changes are included in this PR?
Stubs out APIs for adding row skipping logic to the parquet implementation
Are there any user-facing changes?
No 🎉