Description
When converting a pandas dataframe to a pyarrow table, categorical columns are by default given a dictionary index type of int8 in the schema (presumably because there are fewer than 128 categories). When the table is written to a Parquet file and the schema is read back, the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read from disk.
A minimal recreation of the issue is as follows:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
"A": np.dtype("int8"),
"B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)
tbl = pa.Table.from_pandas(
df,
)
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()
pq.write_table(
tbl,
filesystem.open_output_stream(
where,
compression=None,
),
version="2.0",
)
schema = tbl.schema
read_schema = pq.ParquetFile(
filesystem.open_input_file(where),
).schema_arrow
By printing schema and read_schema, you can see the inconsistency.
I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.
Environment: NAME="CentOS Linux"
VERSION="7 (Core)"
Reporter: Gavin
Note: This issue was originally created as ARROW-14767. Please see the migration documentation for further details.