ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type (#7536)
bkietz wants to merge 4 commits into apache:master
Conversation
We could use just int32() dictionary indices and call it a day?
I think there's value in finding the smallest index type possible; we expect partition fields to have few unique values in most cases.
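For reference, a small sketch (my own illustration, not the actual Arrow implementation) of how the smallest possible index type could be chosen from the number of unique values:

```python
def smallest_index_type(num_uniques: int) -> str:
    """Pick the narrowest signed integer type whose range can index
    `num_uniques` dictionary values (hypothetical helper)."""
    for bits, name in ((8, "int8"), (16, "int16"), (32, "int32"), (64, "int64")):
        # a signed type of `bits` bits can index up to 2**(bits-1) values
        if num_uniques <= 2 ** (bits - 1):
            return name
    raise ValueError("too many unique values for a 64-bit index")

# e.g. a partition field with only a handful of values fits in int8
```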
Actually, on reflection: I'm not sure it's worthwhile to check the count of unique values at all. In any given batch a virtual column would be materialized with a single-item dictionary, so currently it seems preferable to remove […]
Currently for the ParquetDataset, it also simply uses int32 for the indices. Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, you create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?). To illustrate with a small dataset:

```python
In [1]: import pyarrow.dataset as ds

In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)

In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)

In [10]: fragment = list(dataset.get_fragments())[0]

In [11]: fragment.to_table(schema=dataset.schema)
Out[11]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int8, ordered=0>

# only A included
In [13]: fragment.to_table(schema=dataset.schema).column("part")
Out[13]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
[
  -- dictionary:
    [
      "A"
    ]
  -- indices:
    [
      0,
      0
    ]
]
```
```python
In [15]: import pyarrow.parquet as pq

In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")

In [19]: piece = dataset2.pieces[0]

In [25]: piece.read(partitions=dataset2.partitions)
Out[25]:
pyarrow.Table
dummy: int64
part: dictionary<values=string, indices=int32, ordered=0>

# both A and B included
In [26]: piece.read(partitions=dataset2.partitions).column("part")
Out[26]:
<pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0
    ]
]
```

I think for this to be valuable (e.g. in the context of dask, or for pandas when reading in only a part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that during the discovery in the Factory, we could keep track of all unique values of a partition field?)
@jorisvandenbossche okay, I'll extend the key value […]
I'm also of the opinion that we should stick with int32_t. That's what Parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictionaryType, etc. I suspect the short-term net effect of this will be uncovering index-type issues we have lying around.
Int32 indices are now used regardless of the dictionary size.
@bkietz thanks for the update ensuring all uniques are included as dictionary values! Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608. Further, a usability issue: this now creates partition expressions that use a dictionary type, which means that doing something like […]. It might also be an option to keep the […]
I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).
Opened https://issues.apache.org/jira/browse/ARROW-9345 to track this.
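A toy illustration (not Arrow's actual comparison kernel) of what comparing a dictionary-encoded column against a scalar of its effective value type amounts to: look each index up in the dictionary and compare the decoded value.

```python
def dict_equals_scalar(dictionary, indices, scalar):
    """Compare a dictionary-encoded column against a scalar of the
    *value* type by decoding indices first (illustrative sketch)."""
    return [dictionary[i] == scalar for i in indices]

# dict_equals_scalar(["A", "B"], [0, 1, 0], "A") -> [True, False, True]
```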
A maximum cardinality for the inferred dictionary can be specified.
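To illustrate the fallback behaviour such a maximum cardinality implies (a hypothetical sketch in plain Python, not the PR's C++ implementation): when the number of unique partition values exceeds the cap, inference could fall back to a plain string field instead of a dictionary, with a negative cap meaning "no limit".

```python
def infer_partition_field_type(values, max_dictionary_size=0):
    """Hypothetical sketch of cardinality-capped inference:
    a negative cap means 'no limit' (always infer dictionary);
    exceeding a non-negative cap falls back to plain string."""
    uniques = sorted(set(values))
    if max_dictionary_size < 0 or len(uniques) <= max_dictionary_size:
        return ("dictionary<string>", uniques)
    return ("string", None)

# infer_partition_field_type(["A", "B", "A"], max_dictionary_size=4)
# -> ("dictionary<string>", ["A", "B"])
```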