Add GCS Bucket Mirrors #26382

Merged: alalazo merged 5 commits into spack:develop from douglasjacobsen:gcs_cache on Oct 22, 2021

Conversation

@douglasjacobsen (Contributor)

This pull request contains changes to support Google Cloud Storage buckets as mirrors, meant for hosting Spack build-caches. This feature is beneficial for folks that are running infrastructure on Google Cloud Platform. On public cloud systems, resources are ephemeral and in many cases, installing compilers, MPI flavors, and user packages from scratch takes up considerable time.

Giving users the ability to host a Spack mirror that can store build caches in GCS buckets offers a clean solution for reducing application rebuilds for Google Cloud infrastructure.

@spackbot-app (bot) commented Sep 30, 2021

Hi @douglasjacobsen! I noticed that the following package(s) don't yet have maintainers:

  • lua

Are you interested in adopting any of these package(s)? If so, simply add the following to the package class:

    maintainers = ['douglasjacobsen']

If not, could you contact the developers of this package and see if they are interested? You can quickly see who has worked on a package with spack blame:

$ spack blame lua

Thank you for your help! Please don't add maintainers without their consent.

You don't have to be a Spack expert or package developer to be a "maintainer"; it just gives us a list of users willing to review PRs or debug issues relating to this package. A package can have multiple maintainers; just add a list of GitHub handles for anyone who wants to volunteer.

@douglasjacobsen (Contributor Author)

This PR supersedes #24422

@tgamblin tgamblin self-assigned this Sep 30, 2021
@scottwittenburg (Contributor) left a comment

This looks pretty good to me, but since it changes some things with respect to how S3 is handled, it's probably a good idea for @opadron to take a look.

@scottwittenburg (Contributor) left a comment

I just noticed some cleanup jobs failing in gitlab pipelines, to do with the recursive argument to remove_url(). Here's a link to a failing job, though you can find it from the gitlab check on your PR as well.

https://gitlab.spack.io/spack/spack/-/jobs/1080302

@opadron (Member) commented Sep 30, 2021

Why do we need this? Isn't GCS S3-compatible?

@douglasjacobsen (Contributor Author)

This allows us to use google-cloud-storage's Python API without having a dependency on boto3.

However, users could choose to use the s3 backend, and configure it to point to GCS resources by changing the environment variables instead of using this native GCS backend.

@scottwittenburg (Contributor)

Why do we need this? Isn't GCS S3-compatible?

Looks like it's adding a new url scheme, so I assumed GCS could be addressed with both url protocols.

@opadron (Member) commented Sep 30, 2021

This allows us to use google-cloud-storage's python API without having a dependence on boto3.

OK, but doing that requires a dependency on google-cloud-storage's Python API. I'm not yet convinced this is worth adding on the basis of reducing dependencies, because it doesn't reduce them.

Is there some kind of feature or capability that GCS provides that S3 does not? And are we currently taking, or planning to take, advantage of those capabilities? Otherwise, I'm having a really hard time seeing why these changes are needed.

Maybe my reading of the above quote is off. Are you perhaps saying that using boto3 with GCS would also require google-cloud-storage's Python API?

@scottwittenburg (Contributor)

I just noticed some cleanup jobs failing in gitlab pipelines, to do with the recursive argument to remove_url(). Here's a link to a failing job, though you can find it from the gitlab check on your PR as well.

I think the cleanup job uses mirror.destroy, which invokes the remove_url() method. Maybe a quick grep for remove_url in the repo would find any other call sites.

But @douglasjacobsen I'm curious if you saw some documentation indicating that we no longer need the special handling for recursive removal of a url? I would be happy to be enlightened if you could share a link. Thanks!

@douglasjacobsen (Contributor Author)

Yes, using the gs:// URL scheme would require using the google-cloud-storage API rather than the boto3 API. Using boto3 with GCS does not require the google-cloud-storage API.

These changes allow users who are already using the Google Cloud SDK to access GCS buckets with their SDK credentials and the default endpoint, similar to how the default endpoint for s3:// is Amazon's S3. So the goal of this PR is primarily to lower the barrier to entry for people using Google tools to access GCS buckets, and to provide an implementation that mirrors what the s3 backend currently does (relying on Amazon tools and libraries).

@douglasjacobsen (Contributor Author)

@scottwittenburg: Sorry, this is a bug. I'm going to work on fixing it ASAP. I think the version we started working on didn't have the recursive destruction of these URLs, and when I rebased, these changes didn't translate properly.

For the GCS backend, we could use force to try to make the API do a recursive destruction; however, that fails if there are more than 256 objects in the bucket. So it's likely safer to iterate, similar to how the s3 backend works.
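To illustrate the iterative approach being described, here is a minimal sketch; it is not the PR's actual code. `FakeBucket` and `FakeBlob` are in-memory stand-ins for `google.cloud.storage`'s `Bucket` and `Blob`, whose `list_blobs(prefix=...)` and `delete()` calls have the same shape.

```python
class FakeBlob:
    """In-memory stand-in for google.cloud.storage.Blob."""
    def __init__(self, bucket, name):
        self._bucket = bucket
        self.name = name

    def delete(self):
        del self._bucket.blobs[self.name]


class FakeBucket:
    """In-memory stand-in for google.cloud.storage.Bucket."""
    def __init__(self, names):
        self.blobs = {n: FakeBlob(self, n) for n in names}

    def list_blobs(self, prefix=""):
        # Snapshot the matches so deleting while iterating is safe.
        return [b for n, b in sorted(self.blobs.items()) if n.startswith(prefix)]


def destroy_prefix(bucket, prefix):
    """Delete every object under ``prefix`` one at a time, avoiding any
    bulk/"force" deletion limits."""
    for blob in bucket.list_blobs(prefix=prefix):
        blob.delete()


bucket = FakeBucket(["cache/a", "cache/b", "other/c"])
destroy_prefix(bucket, "cache/")
print(sorted(bucket.blobs))  # ['other/c']
```

With the real library, the loop body stays the same; only the bucket object comes from an authenticated `google.cloud.storage.Client` instead.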

@douglasjacobsen (Contributor Author)

We should probably resolve the discussion about whether or not this is useful before I put more time into fixing the PR though. :)

@opadron (Member) left a comment

I'm quite confused by these proposed changes. They concern me because I worry that they might be unnecessarily adding new code paths to access GCS, which, as far as I understand, is already accessible via an s3:// URL and boto.

Besides the above concern, I notice that a number of the proposed changes modify how s3:// URLs are handled, as well as other minor details of how Spack operates, without a clear reason. The PR author themselves notes that one of these changes is unrelated and should be removed. Perhaps this is supposed to be a WIP and not yet ready for review?

@douglasjacobsen if I am in error, please accept my apologies. Help us to understand your intent and what we can do to help.

@opadron (Member) commented Sep 30, 2021

@douglasjacobsen so, I just noticed your comments from the last 15 minutes or so. Thanks for helping to clear up some of my questions.

I think a big part of what you're going for is a good developer experience for users of GCS tooling, is that right? I can see how it would be more convenient to just use existing tooling, but I think we should balance that desire against the cost of maintaining both gcs and s3 URL handling versus just s3 URLs. What does the experience look like if someone using the Google Cloud SDK needs to install boto and access GCS through our s3 URL handling? Are there a lot of extra steps the user would need to take? Other barriers to use? These are perhaps some things we should consider.

@douglasjacobsen (Contributor Author)

@opadron Thanks for the discussion. And I should probably apologize for the changes that are irrelevant. This should not change the s3 backend, and it should not modify how spack works (i.e. the recursive removal is an accident).

This is a branch that was started a long time ago (#24422) and I've been working with the original author to fix some of the issues we had with it. The unrelated changes that are in the PR are simply an oversight on my part when I rebased the branch to fix some of the merge conflicts.

Regarding user experience, yes, that's exactly what I'm going for. The steps a user needs to go through to access a GCS bucket with the existing s3 backend are something like the following (which, honestly, are not very well documented, and it takes a lot for a naive user to figure out):

  1. Install boto3 (obvious if you're using an s3 bucket, but seems awkward if you're already using google tooling)
  2. Set the S3_ENDPOINT_URL env-var to https://storage.googleapis.com/download/storage/v1 (not very well documented anywhere)
  3. Configure Boto to access GCS resources (something like this: https://cloud.google.com/storage/docs/boto-gsutil)
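For concreteness, steps 2 and 3 above might look like the following shell sketch. The endpoint URL is the one quoted in step 2; the HMAC credential variables are an assumption based on GCS's S3-interoperability mode, and all values are placeholders.

```shell
# Step 2: point Spack's existing s3:// handling at GCS
export S3_ENDPOINT_URL="https://storage.googleapis.com/download/storage/v1"

# Step 3 (assumption): GCS HMAC interoperability credentials so boto3 can
# authenticate against the bucket; placeholder values only
export AWS_ACCESS_KEY_ID="GOOG1EXAMPLEHMACKEY"
export AWS_SECRET_ACCESS_KEY="example-hmac-secret"
```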

Whereas with this PR, the steps are similar to the user experience with an s3:// hosted cache/mirror.

  1. Install google-cloud-storage
  2. Authenticate with gcloud auth application-default login

Then you can read from gs:// based caches and mirrors.
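After those two steps, a mirror pointing at a bucket could be configured like any other; a sketch of a mirrors.yaml entry (the bucket name is hypothetical) might look like:

```yaml
mirrors:
  gcs_cache: gs://my-spack-cache
```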

@opadron (Member) commented Sep 30, 2021

I agree that a better user experience could be a compelling argument in favor of changes like these.

However, the best trade-off between this interest and maintainability is not clear. It'd be great if we could have other Spack devs and users weigh in, especially if they're also GCS users.

@tgamblin @adamjstewart @alalazo @gartung @scheibelp: would love to hear your opinion.

Also, @eugeneswalker, I believe you are a user of GCS, is that right? What do you think about having code in Spack for handling gs:// urls?

@tgamblin (Member) commented Sep 30, 2021

@opadron: we really want Spack to be usable by the different cloud providers using their own tooling, and this is a pretty small PR as far as they go. I don't think it's a huge maintenance burden to support this, and it's supported in a very similar way to how we currently support boto3. I think if we don't merge this, we're biasing toward one particular cloud provider over another, which is not what we want.

If merging stuff like this enables folks at GCP to deliver Spack efficiently to their users, it ultimately helps the project and gets us a broader user base, which is what we want.

So I'd say please evaluate this considering the use of the GCP native libs as a bonus (the same way we consider boto support a bonus for AWS).

If there is a good way to support all of the object store interfaces smoothly without the native libraries, we could talk about that, but I think at this point we want buildcaches to be just as usable on whatever cloud you happen to be on.

@gartung (Member) commented Oct 1, 2021

I have no comment since we don't use this feature.

@opadron (Member) commented Oct 1, 2021

Thanks, I think @tgamblin made some good points for why we want native support for GCS. @douglasjacobsen, I think we're in agreement that this PR is useful and worth putting more time into. Let us know if you have any more questions.

@douglasjacobsen douglasjacobsen force-pushed the gcs_cache branch 3 times, most recently from 99d04d7 to 146bbda Compare October 4, 2021 14:31
@douglasjacobsen (Contributor Author)

@tgamblin, @opadron, and @scottwittenburg: OK, it seems like I've fixed most (if not all) of the CI checks. There's one still pending, but let me know if you see any other PR changes that you'd like me to make. I think I resolved all of the previous issues already.

This commit updates the GCS support to be more in line with how the S3 implementation works.
@spackbot-app spackbot-app bot added the tests General test capability(ies) label Oct 5, 2021
@opadron (Member) left a comment

This is looking really good! Additional feedback below.

@douglasjacobsen (Contributor Author)

@opadron I think I fixed all of the issues you spotted; let me know if you see any new ones, though.

opadron previously approved these changes Oct 6, 2021
@douglasjacobsen (Contributor Author)

@scottwittenburg I think I've fixed all of the issues you spotted as well. Feel free to let me know if there are any more modifications you'd like to see though.

scottwittenburg previously approved these changes Oct 7, 2021
@scottwittenburg (Contributor) left a comment

Looks good to me @douglasjacobsen, thanks!

@fluidnumerics-joe (Contributor)

Just wanted to throw in my vote of support for this feature. I've been working with @douglasjacobsen's contributions on my own fork on Google Cloud and have been able to host binary caches on GCS that considerably reduce build times for Google Compute Engine VM images.

@alalazo (Member) left a comment

Thanks for this PR! I started having a look at the code; let me know what you think of the comments.



@pytest.mark.parametrize('_fetch_method', ['curl', 'urllib'])
def test_gcsfetchstrategy_sans_url(_fetch_method):
@alalazo (Member):

Maybe:

Suggested change:
- def test_gcsfetchstrategy_sans_url(_fetch_method):
+ def test_gcsfetchstrategy_without_url(_fetch_method):

? But in any case 🎉 🇫🇷 🎉

@douglasjacobsen (Contributor Author):

👍🏻

import spack.util.web as web_util


def gcs_open(req, *args, **kwargs):
@alalazo (Member):

Where are the *args and **kwargs used? I am wondering why it is necessary to have them in the function signature.

@alalazo (Member):

It might be also good to add a brief docstring here.

@douglasjacobsen (Contributor Author):

I only added *args and **kwargs to keep the signature the same as the other functions that are used as function pointers here: https://github.com/douglasjacobsen/spack/blob/b9370ba0d7e54f7de37ef4d5f680479aeb741d11/lib/spack/spack/util/web.py#L552

Would you like me to change it so I don't have them in the signature? I'll work on the docstring in the background.
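The constraint being discussed — functions stored in a common dispatch table must share one signature, even if a given handler ignores parts of it — can be sketched like this. All names here are hypothetical illustrations, not Spack's actual code.

```python
def file_open(req, *args, **kwargs):
    # Stand-in handler that ignores the extra arguments entirely.
    return ("file", req)


def gcs_open(req, *args, **kwargs):
    # Stand-in for the GCS handler; a real one would return a blob stream.
    return ("gcs", req)


# Scheme -> handler table; every entry must be callable the same way.
HANDLERS = {"file": file_open, "gs": gcs_open}


def open_url(scheme, req, timeout=10):
    # The caller invokes every handler identically, whether or not that
    # handler actually uses `timeout` — hence *args/**kwargs above.
    return HANDLERS[scheme](req, timeout=timeout)


print(open_url("gs", "gs://bucket/spec.yaml"))  # ('gcs', 'gs://bucket/spec.yaml')
```

Without the catch-all parameters, any handler lacking a `timeout` keyword would raise `TypeError` when called through the shared table.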

@douglasjacobsen (Contributor Author) commented Oct 19, 2021

@alalazo: Let me know if these changes look good.

Thanks for the review!

@alalazo (Member) left a comment

This LGTM. There's only a minor comment on a thing I overlooked on first review.

Addressing review comments

Co-authored-by: Massimiliano Culpo <[email protected]>
@douglasjacobsen (Contributor Author)

No problem. Thanks @alalazo

For some reason, style errors came up in the last CI check, but I don't see them on my local machine. We'll see if they crop up again after it runs this time.

@alalazo alalazo merged commit d1d0021 into spack:develop Oct 22, 2021
@alalazo (Member) commented Oct 22, 2021

Thanks!

@douglasjacobsen douglasjacobsen deleted the gcs_cache branch October 22, 2021 14:47