ARROW-8088: [C++][Dataset] Support dictionary partition columns#6641
ARROW-8088: [C++][Dataset] Support dictionary partition columns#6641bkietz wants to merge 3 commits intoapache:masterfrom
Conversation
|
@bkietz I added a python test case as well |
|
Thanks @jorisvandenbossche |
fsaintjacques
left a comment
There was a problem hiding this comment.
LGTM, except for the Json representation, I think we should shoot for the most obvious which in this case is parsing primitive values with dict encoding.
cpp/src/arrow/ipc/json_simple.cc
Outdated
There was a problem hiding this comment.
I'd expect the ArrayFromJson to just works with primitive values, e.g. ArrayFromJson(dict(uf8(), int8()), R"["horse", "cat", null]"); should just work. We don't need to encode the physical representation, unless I'm missing something.
There was a problem hiding this comment.
We need to encode the indices as well to support comparison with sliced dictionary arrays. For example:
// indices ordered as if assembled with DictionaryBuilder
// from ["whiskey", "tango", "foxtrot"]
auto array_wtf = ArrayFromJSON(dictionary(int8(), utf8()), R"([
[0, ["whiskey", "tango", "foxtrot"]],
[1],
[2]
])");
// decoded values are ["tango", "foxtrot"], but indices *are not*
// as if assembled with DictionaryBuilder from ["tango", "foxtrot"]
auto array_tf = array_wtf->Slice(1);
assert(ArrayFromJSON(dictionary(int8(), utf8()), R"([
[1, ["whiskey", "tango", "foxtrot"]],
[2]
])")->Equals(array_tf));There was a problem hiding this comment.
I'm glancing the string example and I can't intuitively understand what it maps to.
auto dict_type = dictionary(int8(), utf8());
// Obvious with a glance
auto dict_array = ArrayFromJSON(dict_type, R"(["whiskey", "tango", "foxtrot"])");
// Require cognitive overhead and reading the decoder to understand what's happening.
auto dict_array = ArrayFromJSON(dict_type, R"([
[0, ["whiskey", "tango", "foxtrot"]],
[1],
[2]
])");When precise physical layout is needed, I recommend adding a family of methods:
DictArrayFromJSON(type, std::shared_ptr<Array> values, std::string indices_string);
DictArrayFromJSON(type, std::string values_string, std::string indices_string);I'm suggesting this because we want tests to be trivial and succinct to read. The end result is that 80% of tests which don't need exact layout will be easy to read, while the edge case will have the special method.
There was a problem hiding this comment.
Okay, I'll extract DictArrayFromJSON
There was a problem hiding this comment.
Note: the ArrayFromJSON behavior you describe is highly non trivial to implement, so I'll leave that for a follow up.
There was a problem hiding this comment.
a553390 to
a3f3aa0
Compare
This PR necessitated adding
Scalar::Parse,MakeArrayFromScalar, andArrayFromJSONcases for dictionary type.