ARROW-15121: [C++] Implement max recursion on GcsFileSystem#11969
ARROW-15121: [C++] Implement max recursion on GcsFileSystem#11969coryan wants to merge 3 commits intoapache:masterfrom coryan:ARROW-15121-gcsfs-implement-max-recursion
Conversation
Recursive listing without limit is about as expensive as only listing the top-level directories in GCS. Therefore, it is just *more* efficient to filter the results on the client-side, as this requires fewer request than listing only only 0, 1, or N levels in a directory hierarchy. I also improved the tests to verify no objects with similar prefixes are returned, for example, when listing objects starting with 'aaa' we do not want 'aaab', but we want 'aaa/b'.
pitrou
left a comment
There was a problem hiding this comment.
Thank you! Just two minor comments.
| return result; | ||
| } | ||
|
|
||
| std::size_t Depth(arrow::util::string_view path) { |
There was a problem hiding this comment.
Nit, but as per the coding guidelines (which are copied over from the Google C++ style guide :-)), we generally use explicit-width types such as int32_t.
| arrow::fs::AssertFileInfo(fs.get(), PreexistingBucketPath(), FileType::Directory); | ||
| arrow::fs::AssertFileInfo(fs.get(), PreexistingObjectPath(), FileType::File); | ||
| for (auto const& info : hierarchy.contents) { | ||
| if (!fs::internal::IsAncestorOf(hierarchy.base_dir, info.path())) continue; |
There was a problem hiding this comment.
You could actually check that this entry still exists instead of ignoring it.
(same comment below too)
| arrow::fs::AssertFileInfo(fs.get(), PreexistingBucketPath(), FileType::Directory); | ||
| arrow::fs::AssertFileInfo(fs.get(), PreexistingObjectPath(), FileType::File); | ||
| for (auto const& info : hierarchy.contents) { | ||
| if (!fs::internal::IsAncestorOf(hierarchy.base_dir, info.path())) continue; |
There was a problem hiding this comment.
Can you do the same change as well here? (check the expected type)
There was a problem hiding this comment.
Fixed, sorry about that.
|
Benchmark runs are scheduled for baseline = e7c2ead and contender = 61745be. 61745be is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Recursive listing without limit is about as expensive as only listing
the top-level directories in GCS. Therefore, it is just more efficient
to filter the results on the client-side, as this requires fewer request
than listing only only 0, 1, or N levels in a directory hierarchy.
I also improved the tests to verify no objects with similar prefixes are
returned, for example, when listing objects starting with 'aaa' we do
not want 'aaab', but we want 'aaa/b'.