[C++][Parquet] Preserve the bit width of the integer dictionary indices on roundtrip to Parquet? #30302

When converting from a pandas dataframe to a table, categorical variables are by default given an index type int8 (presumably because there are fewer than 128 categories) in the schema. When this is written to a parquet file, the schema changes such that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read from disk.

A minimal recreation of the issue is as follows:
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(df)
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()

pq.write_table(
    tbl,
    filesystem.open_output_stream(
        where,
        compression=None,
    ),
    version="2.0",
)

schema = tbl.schema

read_schema = pq.ParquetFile(
    filesystem.open_input_file(where),
).schema_arrow
```
By printing `schema` and `read_schema`, you can see the inconsistency.

I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.

**Environment**: NAME="CentOS Linux"
VERSION="7 (Core)"
**Reporter**: [Gavin](https://issues.apache.org/jira/browse/ARROW-14767)

<sub>**Note**: *This issue was originally created as [ARROW-14767](https://issues.apache.org/jira/browse/ARROW-14767). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
