Skip to content

GH-47380: [Python] Apply maps_as_pydicts to Nested MapScalar Values#47454

Merged
raulcd merged 5 commits intoapache:mainfrom
jo-migo:fix-nested-maps-as-pydicts
Oct 7, 2025
Merged

GH-47380: [Python] Apply maps_as_pydicts to Nested MapScalar Values#47454
raulcd merged 5 commits intoapache:mainfrom
jo-migo:fix-nested-maps-as-pydicts

Conversation

@jo-migo
Copy link
Contributor

@jo-migo jo-migo commented Aug 28, 2025

Rationale for this change

Currently, the maps_as_pydicts parameter to MapScalar.as_py does not work on nested maps. See below:

import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)
print(s.as_py(maps_as_pydicts="strict"))

# {'x': {'a': [('1', 1)]}}

In this ^ case, I'd want to get the value: {'x': {'a': {'1': 1}}}, such that round trips would work as expected.

What changes are included in this PR?

Begin to apply the maps_as_pydicts to nested values in map types as well, update relevant test.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, just a user-facing fix.

@jo-migo jo-migo requested review from AlenkaF, raulcd and rok as code owners August 28, 2025 16:32
@github-actions
Copy link

⚠️ GitHub issue #47380 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #47380 has been automatically assigned in GitHub to PR creator.

1 similar comment
@github-actions
Copy link

⚠️ GitHub issue #47380 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for proposing this update!

Supporting map-to-dict conversion in nested cases is definitely useful. The overall change looks good to me — just one suggestion: consider simplifying the iterator used in the zip function, if possible.

Comment on lines 1176 to 1177
for k, v in zip(self.values.field(self.type.key_field.name), self.values.field(self.type.item_field.name)):
key = k.as_py()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this a bit:

Suggested change
for k, v in zip(self.values.field(self.type.key_field.name), self.values.field(self.type.item_field.name)):
key = k.as_py()
for key, value in zip(self.keys(), self.values.field(self.type.item_field.name)):

and maybe something for the second iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I had indeed missed the existence of the keys method for simplifying this. Unfortunately I don't see any similar thing for directly iterating over the value fields inside the inner array, so I think that can't be simplified the same way.

@jo-migo jo-migo force-pushed the fix-nested-maps-as-pydicts branch from f998961 to 60b963c Compare September 10, 2025 13:12
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 10, 2025
Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Will wait for another review and then we can merge.

cc @raulcd @rok

@AlenkaF
Copy link
Member

AlenkaF commented Sep 11, 2025

@github-actions crossbow submit -g python

@github-actions
Copy link

Revision: 60b963c

Submitted crossbow builds: ursacomputing/crossbow @ actions-07e734916d

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

Comment on lines 960 to 963
ty = pa.struct([
pa.field('x', pa.map_(pa.string(), pa.int8())),
pa.field('y', pa.list_(pa.map_(pa.string(), pa.int8()))),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have both test example validations? We would validate the previous functionality is still working as it was before maintaining the old test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I restored the original test and tested this functionality in a new one.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 18, 2025
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Sep 22, 2025
@github-actions github-actions bot added the awaiting change review Awaiting change review label Sep 22, 2025
@jo-migo jo-migo requested a review from raulcd September 23, 2025 07:20
@jo-migo
Copy link
Contributor Author

jo-migo commented Sep 23, 2025

@raulcd @rok could I get another review please?

@jonded94
Copy link
Contributor

@raulcd @rok just another gentle request for a review? 🥺

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it has taken some time to review. I've been quite busy.
Thanks for the fix and for the tests!
The change looks good but I have a nit question, I understand the recursive call for value.as_py but I am unsure I understand why what the issue with the previous iteration, maybe you could explain why the change is needed, potentially add a comment?
Thanks and sorry for taking some time.

return list(self)
result_dict = {}
for key, value in self:
for key, value in zip(self.keys(), self.values.field(self.type.item_field.name)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the issue with the previous iterator?
I am unsure I understand what was wrong with the previous iteration and what are we fixing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the author of the MR, but the previous for key, value in self would result in a call of self.__iter__().

That thing is defined above this function, and does yield (k.as_py(), v.as_py()) directly. So it's hardcoded to the default maps_as_pydicts behaviour, which is incompatible to what we want here. The values, if they happen to be map types, basically would be yielded in the non-dict way (as the list of (key, value) tuples).

It's also not possible to adjust the __iter__() function because by definition it has to have no parameters, so it has to be opinionated in some sense about how to handle maps.

So in this case, we have to loop over the keys and values manually and then do the as_py() call on the value type with the correct maps_as_pydicts parameter ourselves.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense, thanks for the clarification, should we validate self.values is not None as we currently do on __iter__?

        arr = self.values
        if arr is None:
            return
        for k, v in zip(arr.field(self.type.key_field.name), arr.field(self.type.item_field.name)):
            yield (k.as_py(), v.as_py())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that would make sense, also adding a test for whether an empty map works could make sense? @jo-migo ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I have added an explicit check for self.values is None and corresponding test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the context, check out the PR description for a (hopefully) straightforward minimal example, but it's just like @jonded94 says. The problem with the current implementation is that round trips from python -> arrow -> python are broken at the moment for nested map columns ---

basically we want

{'x': {'a': {'1': 1}}} -> {'x': {'a': {'1': 1}}}

but the current behaviour with maps_as_pydicts is:

{'x': {'a': {'1': 1}}} -> {'x': {'a': [('1', 1)]}}

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 6, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 6, 2025
@jo-migo jo-migo force-pushed the fix-nested-maps-as-pydicts branch from 384420d to 985c315 Compare October 6, 2025 14:27
@raulcd
Copy link
Member

raulcd commented Oct 6, 2025

@github-actions crossbow submit -g python

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I've triggered CI, I will wait for it to finish. I'll merge afterwards.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 6, 2025
@github-actions
Copy link

github-actions bot commented Oct 6, 2025

Revision: 45b5431

Submitted crossbow builds: ursacomputing/crossbow @ actions-dda8c557ef

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@raulcd raulcd merged commit 0d32975 into apache:main Oct 7, 2025
16 checks passed
@raulcd raulcd removed the awaiting merge Awaiting merge label Oct 7, 2025
@raulcd
Copy link
Member

raulcd commented Oct 7, 2025

None of the CI failures are related to the change.

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit 0d32975.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 22 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Oct 15, 2025
…lues (apache#47454)

### Rationale for this change
Currently, the `maps_as_pydicts` parameter to `MapScalar.as_py` does not work on nested maps. See below:

```
import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)
print(s.as_py(maps_as_pydicts="strict"))

# {'x': {'a': [('1', 1)]}}
```

In this ^ case, I'd want to get the value: `{'x': {'a': {'1': 1}}}`, such that round trips would work as expected.

### What changes are included in this PR?
Begin to apply the `maps_as_pydicts` to nested values in map types as well, update relevant test.
 
### Are these changes tested?
Yes

### Are there any user-facing changes?
Yes, just a user-facing fix.

* GitHub Issue: apache#47380

Lead-authored-by: Johanna <[email protected]>
Co-authored-by: zzkv <[email protected]>
Co-authored-by: Johanna <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants