Support nested schema projection (#5148)#5149
Conversation
| /// ]); | ||
| /// assert_eq!(filtered, expected); | ||
| /// ``` | ||
| pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self { |
There was a problem hiding this comment.
Originally I had two methods, one which was called for all Fields, however, it wasn't clear to me how such an API could be used - as it would lack any context as to where in the schema a given field appeared. I therefore just stuck with filter_leaves which has immediate concrete utility for #5135. A case could be made for a fully-fledged visitor pattern, but I'm not aware of any immediate use-cases for this, and therefore what the requirements might be for such an API.
FYI @Jefffrey
There was a problem hiding this comment.
I wonder how we would support selecting a subfield that is itself a field?
For example:
{
a : {
b: {
c: 1,
},
d: 2
}
Could I use this API to support only selecting a.b?
{
a : {
b: {
c: 1,
},
}
Maybe we could if the parent FieldRef was also passed
/// calls filter(depth, parent_field, child_field)`
pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
...
}🤔
There was a problem hiding this comment.
You would select b by selecting all its children, if you call filter with intermediate fields you end up in a position of how much context do you provide, just the parent, what about the parent's parent - reasoning in terms of leaves is the only coherent mechanism I can think of
arrow-schema/src/fields.rs
Outdated
|
|
||
| /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children | ||
| /// | ||
| /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a |
There was a problem hiding this comment.
I'm sure this could be expressed better, but I struggled to come up with something 😅
alamb
left a comment
There was a problem hiding this comment.
Thank you @tustvold -- this seems like a good API to me -- I tried to suggest some comment improvements and had some questions about making is more general, but I think since it is better than what currently exists, this is a nice step forward
arrow-schema/src/fields.rs
Outdated
| .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b)) | ||
| } | ||
|
|
||
| /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children |
There was a problem hiding this comment.
Here is a proposal of how to better phrase what this is doing (as well as give a usecase that might not be obvious to the casual reader)
| /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children | |
| /// Returns a copy of this [`Fields`], with only those non nested (leaf) [`FieldRef`]s that | |
| /// pass a predicate. | |
| /// | |
| /// This can be used to select only a subset of fields in nested types such as | |
| /// [`DataType::Struct`]. | |
| /// Leaf [`FieldRef`]s `DataType`s have no nested `FieldRef`s`. For example | |
| /// a field with [`DatatType::Int32`] would be a leaf. |
There was a problem hiding this comment.
I incoporated this into 8d97950. I tweaked the wording a little as the depth-first part is critical to understanding what the leaf ordering means. I also added some comments to the doctest to highlight
There was a problem hiding this comment.
Thank you -- the comments are most helpful
arrow-schema/src/fields.rs
Outdated
| /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a | ||
| /// count of the number of previous calls to `filter` - i.e. the leaf's index. |
There was a problem hiding this comment.
| /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a | |
| /// count of the number of previous calls to `filter` - i.e. the leaf's index. | |
| /// Invokes `filter` with each leaf [`FieldRef`] and a | |
| /// count of the depth to the `field` (i.e. the number of previous calls to `filter`, and the leaf's index.) |
There was a problem hiding this comment.
This isn't correct, it is the count of leaves encountered, not the depth
| /// ]); | ||
| /// assert_eq!(filtered, expected); | ||
| /// ``` | ||
| pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self { |
There was a problem hiding this comment.
I wonder how we would support selecting a subfield that is itself a field?
For example:
{
a : {
b: {
c: 1,
},
d: 2
}
Could I use this API to support only selecting a.b?
{
a : {
b: {
c: 1,
},
}
Maybe we could if the parent FieldRef was also passed
/// calls filter(depth, parent_field, child_field)`
pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
...
}🤔
| Dictionary(k, v) => (Some(k.clone()), v.as_ref()), | ||
| d => (None, d), | ||
| }; | ||
| let d = match v { |
There was a problem hiding this comment.
Map and RunEndEncoded appear to have embedded fields too. It seems like they might also need to be handled 🤔
The usecase might be "I want only the values of the map? Or maybe Arrow can't express that concept (Apply the filter to only the values 🤔 )
There was a problem hiding this comment.
Map is there, I missed RunEndEncoded 👍
arrow-schema/src/fields.rs
Outdated
| use DataType::*; | ||
|
|
||
| let (k, v) = match f.data_type() { | ||
| Dictionary(k, v) => (Some(k.clone()), v.as_ref()), |
There was a problem hiding this comment.
this presumes the key itself doesn't have any leaves which I think is reasonable, but figured I would point it out.
|
Looks very nice @tustvold -- thank you |
|
One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)? If not, I can file a ticket tracking that feature request |
Yes, I responded to the comment as such? |
Sorry -- the github UI is confusing me. I see the response in #5149 (comment) now. Thanks
|

Which issue does this PR close?
Closes #5148.
Rationale for this change
See ticket
What changes are included in this PR?
Are there any user-facing changes?