ARROW-14893: [C++] Allow creating GCS filesystem from URI #12475
emkornfield wants to merge 8 commits into apache:master
Conversation
This is mostly adapted from S3. One place it differs is that it doesn't allow a username and password. The only thing I could see supporting in the future is taking the username as a user to impersonate, but we can see if there is demand for that feature. I also added two useful features present in S3:
- An option for setting a default location to create buckets in.
- Default key-value metadata to use for object creation if none is specified.

Also fixed a typo in an s3fs error message.
CC @coryan
coryan left a comment

LGTM, thanks for all the fixes.
pitrou left a comment

Thanks @emkornfield. A couple of minor comments and questions.
}
return std::make_shared<LocalFileSystem>(options, io_context);
}
if (scheme == "gs") {
Can/should we also accept "gcs" for compatibility with https://gcsfs.readthedocs.io/en/latest/#integration ?
Seems reasonable to me.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
- "https"};
+ "https",
+ /*default_bucket_location=*/{},
+ /*default_metadata=*/nullptr};
Instead of spelling out all the default parameter values explicitly, it would probably be more readable and maintainable to set the non-default attributes individually, e.g.:

GcsOptions options{};
options.credentials = std::make_shared<GcsCredentials>(google::cloud::MakeInsecureCredentials());
options.scheme = "http";
return options;
cpp/src/arrow/filesystem/gcsfs.h
Outdated
std::string endpoint_override;
std::string scheme;
/// \brief Location to use for creating buckets.
What kind of value can a location have? Is it a "region" like in S3?
There are locations that are not regions:
Probably not of much interest for analytics workloads, but who knows.
I've always thought of multi-regions as magic regions instead of a different concept. Unfortunately, people do use multi-regions for analytics workloads and experience the associated pain points.
Co-authored-by: Antoine Pitrou <[email protected]>
This reverts commit 30bd844.
pitrou left a comment

+1, thanks for the update @emkornfield
Benchmark runs are scheduled for baseline = 53e1296 and contender = 8244568. 8244568 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Is any additional work required to make this accessible from pyarrow /
This change hasn't been released yet. I think with this change datasets should work. I'll hopefully have a PR up this week for pyarrow wrappers.