ARROW-12644: [C++][Python][R][Dataset] URL-decode path segments in partitioning#10264
ARROW-12644: [C++][Python][R][Dataset] URL-decode path segments in partitioning#10264lidavidm wants to merge 10 commits intoapache:masterfrom
Conversation
7652fb5 to
6962fe6
Compare
|
Ah, and so it turns out there is indeed a very good reason to url-escape names even on local file systems: Windows doesn't like characters like |
cpp/src/arrow/dataset/partition.cc
Outdated
There was a problem hiding this comment.
I'm curious: is there any reason we don't utf8-validate here?
There was a problem hiding this comment.
Just because we didn't before, but perhaps we should.
There was a problem hiding this comment.
It would seem more consistent. I don't see any reason to make a difference between the two cases.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
I'm afraid at some point, we'll have to deal with other encodings (e.g. hex, or some system-specific oddity). Perhaps make this an enum instead, so that it's open-ended?
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Just FYI, I think KeyValuePartitioningOptions options = {} can be used as a shorter spelling.
cpp/src/arrow/dataset/partition.cc
Outdated
There was a problem hiding this comment.
Is there a reason you don't use the same spelling as below, i.e. segment.substr(name_end + 1)?
There was a problem hiding this comment.
Just to avoid a copy, but this can be done as string_view(segment).substr(...) instead.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Since this is a boolean option, I'd rather make it keyword-only, i.e. def __init__(..., *, bint url_decode_segments=True).
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Nit, but we use Uri in most of the code base (including uri.h :-)).
|
I added UTF-8 validation and renamed URL to URI. |
|
It seems there are CI failures on Windows. |
bkietz
left a comment
There was a problem hiding this comment.
LGTM, thanks for cleaning this up
Now by default, directory/hive partitioning will URL-decode potential partition values before trying to parse them, since systems like Spark apparently may URL-encode the values in some cases. Note for Hive partitioning, this applies only to the value, not to the key itself. This behavior can be toggled.