ARROW-16302: [C++] Null values in partitioning field for FilenamePartitioning#12977
ARROW-16302: [C++] Null values in partitioning field for FilenamePartitioning#12977sanjibansg wants to merge 24 commits intoapache:masterfrom
Conversation
cpp/src/arrow/dataset/discovery.cc
Outdated
There was a problem hiding this comment.
I wonder if we should make it the responsibility of the partitioning implementation to strip the filename, instead of hardcoding an exception. (After all, what if the user wants to define a custom filename-based partitioning scheme?) Or we could pass both the path and the filename separately to partitioning->Parse.
There was a problem hiding this comment.
So, it's like passing the entire info.path() in the Parse() method and then various partitioning implementations will do it in the way they want? I think we can then strip the prefix/filename in the ParseKeys method of each of the partitioning modes.
There was a problem hiding this comment.
Yeah, I think we can still strip the prefix, but we can let the partitioning handle the rest.
And in that case it might make sense to split the remainder of the path and the filename for the partitioning implementation too and just pass both as arguments.
There was a problem hiding this comment.
(We should still strip the prefix since presumably we don't want partitioning to depend on the prefix itself.)
There was a problem hiding this comment.
We may pass both the filename and the path(other than the prefix) to the Parse() method, but the filename is only required for FilenamePartitioning I believe, and the path will be required by the Directory & Hive Partitioning.
There was a problem hiding this comment.
Sure. From the point of view of datasets however it doesn't matter that some implementations only need some of the info.
CC @westonpace for thoughts
There was a problem hiding this comment.
With the latest change, I modified the StripPrefixAndFilename() method to return a PartitionPathFormat object which will contain both the directory and filename prefix and then passing that to the Parse() method which now expects both the directory and filename-prefix.
We can modify the Parse() method as well to accept an object of PartitionPathFormat that way it will be symmetrical to the Format() method. But then, we need to implement similar changes to PyArrow, and I believe then we have to define an object of PartitionPathFormat first to use the partitioning.parse() method in PyArrow.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Why do we need the default parameter values?
There was a problem hiding this comment.
Though yes, it would be better if we could pass const PartitionPathFormat& instead. FWIW I don't think we have to expose it to Python. Or we can just make a namedtuple on the Python side, it doesn't have to be a C++ class wrapper.
There was a problem hiding this comment.
- Removed the default parameter
- Modified the Parse method to use a
PartitionPathFormatobject as an argument. As for the PyArrow interface, modified theparse()method to accept just the two strings (directory & prefix) and then uses a cppclass object to form the PartitionPathFormat object which is then passed into the internalParse()method. Is this a good approach?
There was a problem hiding this comment.
Why aren't we passing both components?
There was a problem hiding this comment.
(Also, can we cover this with a test?)
There was a problem hiding this comment.
Now that we are using the PartitionPathFormat as the argument, the Parse() method will work accordingly as per the partitioning mode. Any particular test you want here?
There was a problem hiding this comment.
Making sure that a filename partitioning works properly in this path, basically, since before it seems like it would have failed since the filename was being omitted.
There was a problem hiding this comment.
I have added a round-trip test in PyArrow to check whether the partitions are read correctly. Do we need any tests other than that?
There was a problem hiding this comment.
Hmm, don't we still want the filename in the path in case the partitioning factory is a filename PartitioningFactory?
There was a problem hiding this comment.
(Can we cover this with a test?)
There was a problem hiding this comment.
This still seems off, but I can't figure out how to hit this case.
There was a problem hiding this comment.
I think we can do the same changes in the Inspect() method which currently accepts a path. Instead of passing a vector of strings, we can then pass a vector of PartitionPathFormat object, and then the Inspect methods of individual partitioning modes will use either the directory or the filename accordingly?
There was a problem hiding this comment.
That probably makes the most sense, but we might want to split out that refactoring separately, and also make sure that we can hit this path in a unit test in the first place (I was trying this morning but couldn't) as I don't want to expand the scope of this PR too much.
There was a problem hiding this comment.
The Inspect() methods now accept a PartitionPathFormat object as an argument with the latest commit. I have modified the tests and SplitFilenameAndPrefix() methods to return forms of PartitionPathFormat object as required. The Inspect() methods then extracts the directory or prefix accordingly whichever is required.
There was a problem hiding this comment.
With the changes in the Inspect() method, I think the R build is failing, I am trying to investigate on fixing it, but not very sure about the R implementation.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Hmm, not for this PR but I would expect FunctionPartitioning to be "as powerful as" any other partitioning. But I don't think it's used much anyways.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
nit but do both of these need a default? I can see prefix having a default because it's a new argument
There was a problem hiding this comment.
Also IMO the parameter should be named "filename", and/or we should have a docstring
There was a problem hiding this comment.
Previously, when the parse() method only accepted a path, we used to pass the filename there, as FilenamePartitioning only needs the filename and not the directory. But, now that it is expecting both a directory and filename prefix, so I believe a directory will not be required for FilenamePartitioning, thus using an empty string as default.
There was a problem hiding this comment.
Renamed prefix to filename in PartitionPathFormat
There was a problem hiding this comment.
This still seems off, but I can't figure out how to hit this case.
|
Thanks. @westonpace any comments here? |
|
@github-actions autotune |
westonpace
left a comment
There was a problem hiding this comment.
I was looking at the R change and it boils down to how we want to expose this to R. However, I don't think we want to change the interface to PartitioningFactory. It makes sense that Partitioning might change. We could justify it because users might create custom subclasses and we want to make it as easy as possible.
However, I think PartitioningFactory::Inspect should continue to take in std::vector<std::string>
…ngFactory__Inspect
4517671 to
6133121
Compare
8a6cc9c to
2ae2ea3
Compare
westonpace
left a comment
There was a problem hiding this comment.
This seems correct. One last question I think for me.
|
Benchmark runs are scheduled for baseline = f3e09b9 and contender = adb5b00. adb5b00 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR fixes the issue of the partitioning field having null values while using FilenamePartitioning.
For FilenamePartitioning, we should only remove the prefix and thus should not use
StripPrefixAndFilename(), which will remove the filename too along with the prefix.