GH-47380: [Python] Apply maps_as_pydicts to Nested MapScalar Values#47454
GH-47380: [Python] Apply maps_as_pydicts to Nested MapScalar Values#47454raulcd merged 5 commits intoapache:mainfrom
Conversation
|
|
|
|
1 similar comment
|
|
AlenkaF
left a comment
There was a problem hiding this comment.
Thanks for proposing this update!
Supporting map-to-dict conversion in nested cases is definitely useful. The overall change looks good to me — just one suggestion: consider simplifying the iterator used in the zip function, if possible.
python/pyarrow/scalar.pxi
Outdated
| for k, v in zip(self.values.field(self.type.key_field.name), self.values.field(self.type.item_field.name)): | ||
| key = k.as_py() |
There was a problem hiding this comment.
I think we can simplify this a bit:
| for k, v in zip(self.values.field(self.type.key_field.name), self.values.field(self.type.item_field.name)): | |
| key = k.as_py() | |
| for key, value in zip(self.keys(), self.values.field(self.type.item_field.name)): |
and maybe something for the second iterator?
There was a problem hiding this comment.
Thanks for the suggestion! I had indeed missed the existence of the keys method for simplifying this. Unfortunately I don't see any similar thing for directly iterating over the value fields inside the inner array, so I think that can't be simplified the same way.
52be573 to
f998961
Compare
f998961 to
60b963c
Compare
|
@github-actions crossbow submit -g python |
|
Revision: 60b963c Submitted crossbow builds: ursacomputing/crossbow @ actions-07e734916d |
| ty = pa.struct([ | ||
| pa.field('x', pa.map_(pa.string(), pa.int8())), | ||
| pa.field('y', pa.list_(pa.map_(pa.string(), pa.int8()))), | ||
| ]) |
There was a problem hiding this comment.
could we have both test example validations? We would validate the previous functionality is still working as it was before maintaining the old test.
There was a problem hiding this comment.
Sure, I restored the original test and tested this functionality in a new one.
There was a problem hiding this comment.
Sorry, it has taken some time to review. I've been quite busy.
Thanks for the fix and for the tests!
The change looks good but I have a nit question, I understand the recursive call for value.as_py but I am unsure I understand why what the issue with the previous iteration, maybe you could explain why the change is needed, potentially add a comment?
Thanks and sorry for taking some time.
| return list(self) | ||
| result_dict = {} | ||
| for key, value in self: | ||
| for key, value in zip(self.keys(), self.values.field(self.type.item_field.name)): |
There was a problem hiding this comment.
What was the issue with the previous iterator?
I am unsure I understand what was wrong with the previous iteration and what are we fixing here.
There was a problem hiding this comment.
Not the author of the MR, but the previous for key, value in self would result in a call of self.__iter__().
That thing is defined above this function, and does yield (k.as_py(), v.as_py()) directly. So it's hardcoded to the default maps_as_pydicts behaviour, which is incompatible to what we want here. The values, if they happen to be map types, basically would be yielded in the non-dict way (as the list of (key, value) tuples).
It's also not possible to adjust the __iter__() function because by definition it has to have no parameters, so it has to be opinionated in some sense about how to handle maps.
So in this case, we have to loop over the keys and values manually and then do the as_py() call on the value type with the correct maps_as_pydicts parameter ourselves.
There was a problem hiding this comment.
I see, that makes sense, thanks for the clarification, should we validate self.values is not None as we currently do on __iter__?
arr = self.values
if arr is None:
return
for k, v in zip(arr.field(self.type.key_field.name), arr.field(self.type.item_field.name)):
yield (k.as_py(), v.as_py())There was a problem hiding this comment.
I guess that would make sense, also adding a test for whether an empty map works could make sense? @jo-migo ?
There was a problem hiding this comment.
Sure, I have added an explicit check for self.values is None and corresponding test.
There was a problem hiding this comment.
As for the context, check out the PR description for a (hopefully) straightforward minimal example, but it's just like @jonded94 says. The problem with the current implementation is that round trips from python -> arrow -> python are broken at the moment for nested map columns ---
basically we want
{'x': {'a': {'1': 1}}} -> {'x': {'a': {'1': 1}}}
but the current behaviour with maps_as_pydicts is:
{'x': {'a': {'1': 1}}} -> {'x': {'a': [('1', 1)]}}
384420d to
985c315
Compare
|
@github-actions crossbow submit -g python |
raulcd
left a comment
There was a problem hiding this comment.
Looks good to me, I've triggered CI, I will wait for it to finish. I'll merge afterwards.
|
Revision: 45b5431 Submitted crossbow builds: ursacomputing/crossbow @ actions-dda8c557ef |
|
None of the CI failures are related to the change. |
|
After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit 0d32975. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 22 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…lues (apache#47454) ### Rationale for this change Currently, the `maps_as_pydicts` parameter to `MapScalar.as_py` does not work on nested maps. See below: ``` import pyarrow as pa t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))]) v = {"x": {"a": {"1": 1}}} s = pa.scalar(v, type=t) print(s.as_py(maps_as_pydicts="strict")) # {'x': {'a': [('1', 1)]}} ``` In this ^ case, I'd want to get the value: `{'x': {'a': {'1': 1}}}`, such that round trips would work as expected. ### What changes are included in this PR? Begin to apply the `maps_as_pydicts` to nested values in map types as well, update relevant test. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes, just a user-facing fix. * GitHub Issue: apache#47380 Lead-authored-by: Johanna <[email protected]> Co-authored-by: zzkv <[email protected]> Co-authored-by: Johanna <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
Rationale for this change
Currently, the
maps_as_pydictsparameter toMapScalar.as_pydoes not work on nested maps. See below:In this ^ case, I'd want to get the value:
{'x': {'a': {'1': 1}}}, such that round trips would work as expected.What changes are included in this PR?
Begin to apply the
maps_as_pydictsto nested values in map types as well, update relevant test.Are these changes tested?
Yes
Are there any user-facing changes?
Yes, just a user-facing fix.