ARROW-15114: [C++] GcsFileSystem uses metadata for directory markers#11996
ARROW-15114: [C++] GcsFileSystem uses metadata for directory markers#11996coryan wants to merge 5 commits intoapache:masterfrom coryan:ARROW-15114-gcsfs-use-metadata-for-directory-markers
Conversation
|
The appveyor failures look unrelated. |
|
Rebased to resolve conflicts. |
|
The failure on |
|
Ping |
|
@emkornfield can you take a look? |
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| const google::cloud::Status& status) { | ||
| using ::google::cloud::StatusCode; | ||
| auto canonical = internal::EnsureTrailingSlash(path.full_path); | ||
| static bool IsDirectory(gcs::ObjectMetadata const& o) { |
There was a problem hiding this comment.
| static bool IsDirectory(gcs::ObjectMetadata const& o) { | |
| static bool IsDirectory(const gcs::ObjectMetadata& o) { |
nit
emkornfield
left a comment
There was a problem hiding this comment.
It seems like there might be a lot more branches add here then tested in the changes, not sure how critical these are if they are covered by the generic file system tests?
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| "arrow/gcsfs", "directory"))) | ||
| .status(); | ||
| const auto canonical = internal::RemoveTrailingSlash(name).to_string(); | ||
| auto object = client_.InsertObject( |
There was a problem hiding this comment.
nit, I think it would be clearer to spell out type.
| if (status.code() == GcsCode::kAlreadyExists) { | ||
| break; | ||
| if (dir.empty()) { | ||
| // We could not find any of the parent directories in the bucket, the last step is |
There was a problem hiding this comment.
is this consistent with what the s3 connector does. bucket creation seems pretty heavy weight to maybe do it accidentally
There was a problem hiding this comment.
Seems like this is what you do for s3 too:
arrow/cpp/src/arrow/filesystem/s3fs.cc
Lines 2305 to 2311 in b325ef7
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (dir.empty()) { | ||
| // We could not find any of the parent directories in the bucket, the last step is | ||
| // to find out if the bucket exists, and if necessary, create it | ||
| auto b = client_.GetBucketMetadata(bucket); |
There was a problem hiding this comment.
spelling out type here would be more readable I think.
There was a problem hiding this comment.
Done. Come join the almost-always-auto world, it is nice here 😀
There was a problem hiding this comment.
its not that it isn't tempting, just not in the style guide used for this project.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| Status Move(const GcsPath& src, const GcsPath& dest) { | ||
| if (src.full_path.empty() || src.object.empty() || | ||
| src.object.back() == internal::kSep) { | ||
| if (src == dest) return {}; |
There was a problem hiding this comment.
nit: return Status::OK() is more idiomatic for the code base
There was a problem hiding this comment.
Fixed here and elsewhere.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (!info.IsFile()) { | ||
| return Status::IOError("Only files can be opened as input streams"); | ||
| if (info.IsDirectory()) { | ||
| return Status::IOError("Cannot open a directory as an input stream"); |
There was a problem hiding this comment.
maybe include the path requested here?
There was a problem hiding this comment.
Fixed. I also searched for any other places where I was not returning the affected path with the IOError.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| Result<std::shared_ptr<io::RandomAccessFile>> GcsFileSystem::OpenInputFile( | ||
| const FileInfo& info) { | ||
| if (info.IsDirectory()) { | ||
| return Status::IOError("Cannot open a directory as an input stream"); |
There was a problem hiding this comment.
same comment above about providing the path?
|
Looks like all failures are unrelated. I'll merge this on Monday if there are no further comments. |
pitrou
left a comment
There was a problem hiding this comment.
My main concern here is: does this reduce interoperability with filesystem "hierarchies" created by other GCS clients?
| // To find out if the directory exists we need to perform an additional query. | ||
| ARROW_ASSIGN_OR_RAISE(auto directory, GetFileInfo(p)); | ||
| if (directory.IsDirectory()) return result; | ||
| return Status::IOError("No such file or directory '", select.base_dir, "'"); |
There was a problem hiding this comment.
If the error message accurate? If base_dir is an existing file, do we still get this message?
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (!status.ok()) { | ||
| return status; | ||
| } | ||
| for (auto d = missing_parents.rbegin(); d != missing_parents.rend(); ++d) { |
There was a problem hiding this comment.
For the record, is it useful to iterate these in reverse order? Is it to avoid creating a/b if a exists as a file?
There was a problem hiding this comment.
Yes. Added a comment to explain that.
There was a problem hiding this comment.
Is the comment a bit off? It seems missing_parents is created in reverse depth order (from child to parent), so iterating it in reverse order visits the ancestors first and the descendents last?
| [](FileInfo const& info) { | ||
| if (!info.IsDirectory()) return info; | ||
| return Dir(internal::RemoveTrailingSlash(info.path()).to_string()); | ||
| }); |
There was a problem hiding this comment.
This is non-trivial enough to be factored out in a separate function, IMHO.
This is how I understand the problem:
That is, both solutions have downsides. If you want things to be more compatible with the GCS UI we can certainly do that. I think the really interesting case is working when there are no markers at all, which we can get to work in both cases. |
Great, I have no problem with this approach then. |
|
Benchmark runs are scheduled for baseline = aa59d17 and contender = 2164b0b. 2164b0b is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.