[C++][Parquet] Integer dictionary bitwidth preservation breaks multi-file read behaviour in pyarrow 20 #46629

@cjrh

Description

Describe the bug, including details regarding any error messages, version, and platform.

This issue was first posed as a question in #30302. I repeat the text here for convenience.

Overview

I have a directory of parquet files. For a specific categorical column, some parquet files use int8 dictionary indices and some use int16. In pyarrow 19.0.1, reading the directory as a dataset succeeds, but with pyarrow 20 it fails with the error below when loading data from the dataset directory.

Reader code (python)

Either

    import pandas as pd
    df = pd.read_parquet(
        path,
        engine="pyarrow",
    )

or

    import pyarrow.dataset as ds
    dataset = ds.dataset(path, format="parquet")
    table = dataset.to_table()
    df = table.to_pandas()

Traceback

File "/app/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1475, in read
  table = self._dataset.to_table(
          ^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3941, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 731 not in range: -128 to 127

Parquet Metadata

This shows the dictionary info of the parquet files in the directory:

>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> dataset = ds.dataset(path)
>>> for f in dataset.files:
...     sch = pq.read_schema(f)
...     print(f, sch.field('ExpStartDate').type)
... 
dataframes.parq/00eac90ef2f504223a74498405e060a48.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0641c30f725cd448bafc335d36cd01f6b.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0cb2799478dd54c738efe76fdc1875326.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0cff477be69be4ee093d98728d4f84452.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0d103de6323904e93aecf24589c12a370.parquet dictionary<values=string, indices=int8, ordered=0>

Is my issue related to the change in #30302 ? Is there a way to restore the previous behaviour of upcasting to int32 on read? Or what is the preferred workaround? It is going to be quite tedious to have to force all my writes to use int32, and especially for migrating huge volumes of historical data. For now we remain on pyarrow 19.0.1, but at some point we would like to upgrade.

Component(s)

C++, Parquet, Python
