ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning #7608
jorisvandenbossche wants to merge 4 commits into apache:master
455f6dc to 9f0a90b
The approach here is to also determine field_names_ in HivePartitioningFactory after inspecting (for DirectoryPartitioningFactory, those field names are passed to the constructor), so that we can then trim the schema and have the dictionaries match the order of the schema.
However, thinking about it now: there might still be a problem if the user specified the full dataset schema, because then no inspection happens. So we might need to think of a better solution.
(I should also add some C++ tests)
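To make the approach above concrete, here is a minimal standalone sketch in plain C++ (no Arrow dependency). The class HiveFactorySketch and its methods Inspect, TrimSchema and Dictionaries are hypothetical names invented for illustration and are not the Arrow API. It assumes field names are recorded in first-seen order while inspecting "key=value" path segments, and that the schema is then trimmed to those names so the dictionaries can be emitted in schema order.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a Hive-style partitioning factory (not the Arrow class).
class HiveFactorySketch {
 public:
  // Parse "key=value" segments from each path, remember field names in
  // first-seen order, and collect the unique values seen for each field.
  void Inspect(const std::vector<std::string>& paths) {
    for (const auto& path : paths) {
      std::istringstream segments(path);
      std::string segment;
      while (std::getline(segments, segment, '/')) {
        const auto eq = segment.find('=');
        if (eq == std::string::npos) continue;  // not a partition segment
        const std::string name = segment.substr(0, eq);
        if (values_.find(name) == values_.end()) field_names_.push_back(name);
        values_[name].insert(segment.substr(eq + 1));
      }
    }
  }

  // Keep only the schema fields that were discovered as partition fields,
  // preserving the order of the schema.
  std::vector<std::string> TrimSchema(const std::vector<std::string>& schema) const {
    std::vector<std::string> trimmed;
    for (const auto& field : schema) {
      if (std::find(field_names_.begin(), field_names_.end(), field) !=
          field_names_.end()) {
        trimmed.push_back(field);
      }
    }
    return trimmed;
  }

  // Emit one dictionary (sorted unique values) per field of the trimmed
  // schema, in the same order as the trimmed schema.
  std::vector<std::vector<std::string>> Dictionaries(
      const std::vector<std::string>& trimmed) const {
    std::vector<std::vector<std::string>> out;
    for (const auto& field : trimmed) {
      const auto& vals = values_.at(field);
      out.emplace_back(vals.begin(), vals.end());
    }
    return out;
  }

  const std::vector<std::string>& field_names() const { return field_names_; }

 private:
  std::vector<std::string> field_names_;
  std::map<std::string, std::set<std::string>> values_;
};
```

With a schema of (value, month, year) and paths partitioned as year=.../month=..., the trimmed schema is (month, year) and the dictionaries come back in that same order, which is the property the fix is after.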
cpp/src/arrow/dataset/partition.cc
I should probably guard here against the case where field_names_ has not been updated yet (i.e. Finish is called without Inspect having been called), in which case it is still an empty vector?
Absolutely, the first line of this method should just call
auto field_names = FieldNames();
and replace occurrences of the private member.
There is no FieldNames() method on the PartitioningFactory itself. Only the impl has one, and it is not accessible here, which is why I added the field_names_ private member to store the names.
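One possible shape for the guard discussed in this thread, again as a standalone sketch rather than the actual patch: FactorySketch and PartitioningSketch are hypothetical types, and the fallback behaviour (use the caller's schema unchanged, with no dictionaries, when nothing was inspected) is an assumption about what a reasonable guard could do.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the partition fields plus a flag saying whether
// dictionaries were collected during inspection.
struct PartitioningSketch {
  std::vector<std::string> fields;
  bool has_dictionaries;
};

// Hypothetical factory that stores the discovered names in a private member.
class FactorySketch {
 public:
  // Record the partition field names discovered from the paths.
  void Inspect(std::vector<std::string> discovered) {
    field_names_ = std::move(discovered);
  }

  PartitioningSketch Finish(const std::vector<std::string>& schema) const {
    // Guard: Inspect was never called (or found nothing), so there is
    // nothing to trim against. Fall back to the schema as given.
    if (field_names_.empty()) {
      return {schema, false};
    }
    // Every discovered field must be present in the provided schema.
    for (const auto& name : field_names_) {
      if (std::find(schema.begin(), schema.end(), name) == schema.end()) {
        throw std::invalid_argument("partition field not in schema: " + name);
      }
    }
    // Drop schema fields that are not partition fields.
    return {field_names_, true};
  }

 private:
  std::vector<std::string> field_names_;
};
```

Calling Finish before Inspect then returns the full schema untouched instead of trimming it down to nothing, while the inspected path still validates that each discovered name exists in the schema.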
…coding for HivePartioning
9f0a90b to 81aecfa
Co-authored-by: Benjamin Kietzman <[email protected]>