Arrow: geometry columns don't read correctly if Arrow extension types are registered #5834

@jorisvandenbossche


When an Arrow extension type is registered with the Arrow C++ library under a name that GDAL also uses in its Arrow files (e.g. "geoarrow.multipolygon"), reading the geometry column no longer works.

(Discovered by @paleolimbot, but I was able to construct a relatively minimal Python reproducer.)

Steps to reproduce the problem.

I am using the file at https://github.com/OSGeo/gdal/blob/master/autotest/ogr/data/arrow/from_paleolimbot_geoarrow/multipolygon-default.feather to reproduce this.

Reading this file works fine (I am using pyogrio built against gdal master, but this should work with the ogr python bindings as well):

In [1]: import pyogrio

In [2]: pyogrio.read_dataframe("../Downloads/multipolygon-default.feather")
Out[2]: 
   row_num                                           geometry
0        1  MULTIPOLYGON (((30.00000 10.00000, 40.00000 40...
1        2  MULTIPOLYGON (((30.00000 20.00000, 45.00000 40...
2        3  MULTIPOLYGON (((40.00000 40.00000, 20.00000 45...
3        4                                 MULTIPOLYGON EMPTY
4        5                                               None

However, after registering an extension type with the "geoarrow.multipolygon" name (which is used in the arrow file for the geometry column), it no longer reads correctly.

A minimal reproducer that creates and registers an Arrow extension type from Python using pyarrow:

import pyarrow as pa

_point_storage_type = pa.list_(pa.field("xy", pa.float64()), 2)
_polygon_storage_type = pa.list_(
    pa.field("rings", pa.list_(pa.field("vertices", _point_storage_type)))
)
_multipolygon_storage_type = pa.list_(pa.field("polygons", _polygon_storage_type))


class MultiPolygonGeometryType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, _multipolygon_storage_type, "geoarrow.multipolygon")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


multipolygon_type = MultiPolygonGeometryType()
pa.register_extension_type(multipolygon_type)

After running that in the interactive Python session, reading the file again fails (the geometry column is skipped):

In [4]: pyogrio.read_dataframe("../Downloads/multipolygon-default.feather")
Warning 1: Geometry column geometry has a type != list<polygons: list<rings: list<vertices: fixed_size_list<xy: double>[2]>>>: extension. Handling it as a regular field
Warning 1: Field geometry of unhandled type extension<geoarrow.multipolygon<MultiPolygonGeometryType>> ignored
Out[4]: 
   row_num
0        1
1        2
2        3
3        4
4        5

See https://arrow.apache.org/docs/dev/python/extending_types.html#defining-extension-types-user-defined-types for some more details about defining those user defined types from Python.
In general, the reason to "register" those types from e.g. Python or R is that it lets you customize how the type interacts with Python/R (e.g. conversion from/to more native Python or R objects). Example projects that are currently experimenting with registering such extension types are https://github.com/paleolimbot/geoarrow and https://github.com/jorisvandenbossche/python-geoarrow/

One of the things that happens once such a type is registered is that Arrow C++, when opening the Arrow file, checks each field for the 'ARROW:extension:name' metadata. If it is present and that name is registered, Arrow uses an arrow::ExtensionType instead of the nested list type and removes the field metadata.
So I suppose this is why GDAL now sees a type it does not recognize for a column that is supposed to be a geometry column (based on the "geo" metadata). One possible fix might be to check for such extension types (type.id() == arrow::Type::EXTENSION) in IsListOfPointType/IsPointType, and in that case assert that the name is correct and then continue working with the storage type.
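
To illustrate the idea behind that possible fix, here is a rough Python sketch of the "unwrap to the storage type" logic. It is not GDAL code: the helper name storage_type_of is made up for illustration, and it only assumes that extension types expose extension_name and storage_type attributes (as pyarrow extension types do), while plain types do not have extension_name.

```python
def storage_type_of(arrow_type, expected_name):
    """Return the storage type behind an extension type, or the type itself.

    Hypothetical helper: it relies only on the `extension_name` and
    `storage_type` attributes and treats anything without an
    `extension_name` as a plain (non-extension) type.
    """
    name = getattr(arrow_type, "extension_name", None)
    if name is None:
        # Not an extension type: nothing to unwrap.
        return arrow_type
    if name != expected_name:
        # The registered type does not match what the "geo" metadata implies.
        raise ValueError(
            f"unexpected extension type {name!r}, expected {expected_name!r}"
        )
    # Continue type checks against the nested list storage type.
    return arrow_type.storage_type
```

With the reproducer above, storage_type_of(multipolygon_type, "geoarrow.multipolygon") should give back _multipolygon_storage_type, which is the nested list type the geometry checks expect.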

Operating system

Ubuntu 20.04

GDAL version and provenance

GDAL built from master:

$ gdalinfo --version
GDAL 3.6.0dev-8ebc0d59bf-dirty, released 2022/05/31
