Add explicit column mask for selection in parquet: ProjectionMask (#1701)#1716
Add explicit column mask for selection in parquet: ProjectionMask (#1701)#1716tustvold merged 5 commits intoapache:masterfrom
ProjectionMask (#1701)#1716Conversation
284724e to
489d54a
Compare
parquet/src/schema/types.rs
Outdated
There was a problem hiding this comment.
This change to store the root index, as opposed to a copy of the root ptr makes it easier to convert from a root mask to a leaf mask
| /// | ||
| /// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]` | ||
| pub fn leaves( | ||
| schema: &SchemaDescriptor, |
There was a problem hiding this comment.
The mask could theoretically carry this along and use it as a sanity check, but I'm inclined to think if a user constructs a mask and then uses it on a different schema, it's not something we can reasonably be expected to handle sensibly
There was a problem hiding this comment.
As long as a useful error is produced, I agree this is fine behavior
| .header_as_schema() | ||
| .map(arrow::ipc::convert::fb_to_schema) | ||
| .ok_or(ArrowError("the message is not Arrow Schema".to_string())), | ||
| .ok_or(arrow_err!("the message is not Arrow Schema")), |
There was a problem hiding this comment.
Drive by cleanup to move to arrow_err! macro
There was a problem hiding this comment.
Using arrow_err is not obviously better to me: it is the same verbosity but now requires looking / knowing what the arrow_err! macro does
;0 But it is not worse either
There was a problem hiding this comment.
It's consistent with how errors are handled elsewhere in the crate, with arrow_err!, general_err!, etc...
Codecov Report
@@ Coverage Diff @@
## master #1716 +/- ##
==========================================
- Coverage 83.32% 83.31% -0.01%
==========================================
Files 195 196 +1
Lines 56023 55961 -62
==========================================
- Hits 46681 46625 -56
+ Misses 9342 9336 -6
Continue to review full report at Codecov.
|
|
As a happy side-effect this actually found and fixed a bug in the handling of nested projection pushdown in ParquetRecordBatchStream |
| Some(projection) => { | ||
| if let Some(col) = projection.iter().find(|x| **x >= num_columns) { | ||
| return Err(general_err!( | ||
| "column projection {} outside bounds of schema 0..{}", |
There was a problem hiding this comment.
This check was actually incorrect as it was checking against the arrow schema not the parquet schema. I think this demonstrates the footgun prone nature of the old API
ProjectionMask (#1701)
| /// | ||
| /// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]` | ||
| pub fn leaves( | ||
| schema: &SchemaDescriptor, |
There was a problem hiding this comment.
As long as a useful error is produced, I agree this is fine behavior
parquet/src/arrow/mod.rs
Outdated
| /// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project | ||
| #[derive(Debug, Clone)] | ||
| pub struct ProjectionMask { | ||
| /// A mask of |
There was a problem hiding this comment.
The comment seems to be truncated.
Also, since we have Bitmap and all the associated handling in Arrow, I wonder if it is worth using that (though a Vec<bool> is nice and simple
There was a problem hiding this comment.
Describing in the docstring that a mask of None means All is probably also a good idea as well as which schema a ProjectionMasks indexes refer to (I think Parquet)
There was a problem hiding this comment.
Also, since we have Bitmap and all the associated handling in Arrow, I wonder if it is worth using that (though a Vec is nice and simple
Let's stick with simple, and maybe if/when we promote this construct to arrow-rs we can switch to using Bitmap
|
|
||
| /// Create a [`ProjectionMask`] which selects only the specified leaf columns | ||
| /// | ||
| /// Note: repeated or out of order indices will not impact the final mask |
There was a problem hiding this comment.
Nice -- so the idea is that you enforce masks having in-order indicies by wrapping them in a ProjectionMask which enforces this invariant during construction 👍
| Self { mask: None } | ||
| } | ||
|
|
||
| /// Create a [`ProjectionMask`] which selects only the specified leaf columns |
There was a problem hiding this comment.
Can you please explain (or provide a link to something that explains) leaves and roots and what "order" they are in. I think it refers to the parquet schema (or maybe the arrow schema and types within Structs / LIsts / others nested types?)
| parquet_to_arrow_schema_by_columns( | ||
| parquet_schema, | ||
| 0..parquet_schema.columns().len(), | ||
| ProjectionMask::all(), |
| pub fn parquet_to_arrow_schema_by_columns( | ||
| parquet_schema: &SchemaDescriptor, | ||
| column_indices: T, | ||
| mask: ProjectionMask, |
There was a problem hiding this comment.
I wonder if this needs an owned mask or if it could be taken by reference
| mask: ProjectionMask, | |
| mask: &ProjectionMask, |
There was a problem hiding this comment.
Currently yes, it gets moved into the Visitor. Theoretically it could borrow and have lifetimes, but in most cases I suspect we have the mask by value anyway.
Edit: This might be an argument to move to arrow Bitmap, as that is internally refcounted... Future PR me thinks
| .header_as_schema() | ||
| .map(arrow::ipc::convert::fb_to_schema) | ||
| .ok_or(ArrowError("the message is not Arrow Schema".to_string())), | ||
| .ok_or(arrow_err!("the message is not Arrow Schema")), |
There was a problem hiding this comment.
Using arrow_err is not obviously better to me: it is the same verbosity but now requires looking / knowing what the arrow_err! macro does
;0 But it is not worse either
| // required int64 leaf5; | ||
|
|
||
| let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type)); | ||
| let mask = ProjectionMask::leaves(&parquet_schema, [3, 0, 4]); |
There was a problem hiding this comment.
this is much nicer to read / reason about
ProjectionMask (#1701)ProjectionMask (#1701)
Which issue does this PR close?
Closes #1701.
Closes #1653
Rationale for this change
The current API is confusing, surfacing errors at runtime, and is liable to accidental misuse - apache/datafusion#2453 and apache/datafusion#2543.
What changes are included in this PR?
This adds an explicit
ColumnMaskthat replaces the iterators of indices. This makes for a cleaner API that should hopefully make it harder to accidentally misuse.In particular it also adds the ability to construct a RecordReader based on a mask of root columns, as opposed to leaf columns. This is the core behind apache/datafusion#2543
Are there any user-facing changes?
Yes, this changes the arrow projection API