Add GCS Bucket Mirrors #26382

Merged: alalazo merged 5 commits into spack:develop from douglasjacobsen:gcs_cache on Oct 22, 2021

Conversation

@douglasjacobsen (Contributor)

This pull request contains changes to support Google Cloud Storage buckets as mirrors, meant for hosting Spack build-caches. This feature is beneficial for folks that are running infrastructure on Google Cloud Platform. On public cloud systems, resources are ephemeral and in many cases, installing compilers, MPI flavors, and user packages from scratch takes up considerable time.

Giving users the ability to host a Spack mirror that can store build caches in GCS buckets offers a clean solution for reducing application rebuilds for Google Cloud infrastructure.

@spackbot-app (bot) commented Sep 30, 2021

Hi @douglasjacobsen! I noticed that the following package(s) don't yet have maintainers:

  • lua

Are you interested in adopting any of these package(s)? If so, simply add the following to the package class:

    maintainers = ['douglasjacobsen']

If not, could you contact the developers of this package and see if they are interested? You can quickly see who has worked on a package with spack blame:

$ spack blame lua

Thank you for your help! Please don't add maintainers without their consent.

You don't have to be a Spack expert or package developer to be a "maintainer"; it just gives us a list of users willing to review PRs or debug issues relating to this package. A package can have multiple maintainers; just add a list of GitHub handles for anyone who wants to volunteer.

@douglasjacobsen (Contributor Author)

This PR supersedes #24422

@tgamblin tgamblin self-assigned this Sep 30, 2021
@scottwittenburg (Contributor) left a comment

This looks pretty good to me, but since it changes some things with respect to how S3 is handled, it's probably a good idea for @opadron to take a look.

@scottwittenburg (Contributor) left a comment

I just noticed some cleanup jobs failing in gitlab pipelines, to do with the recursive argument to remove_url(). Here's a link to a failing job, though you can find it from the gitlab check on your PR as well.

https://gitlab.spack.io/spack/spack/-/jobs/1080302

@opadron (Member) commented Sep 30, 2021

Why do we need this? Isn't GCS S3-compatible?

@douglasjacobsen (Contributor Author)

This allows us to use google-cloud-storage's Python API without having a dependency on boto3.

However, users could choose to use the s3 backend, and configure it to point to GCS resources by changing the environment variables instead of using this native GCS backend.

@scottwittenburg (Contributor)

Why do we need this? Isn't GCS S3-compatible?

Looks like it's adding a new url scheme, so I assumed GCS could be addressed with both url protocols.

@opadron (Member) commented Sep 30, 2021

This allows us to use google-cloud-storage's python API without having a dependence on boto3.

OK, but doing that requires a dependency on google-cloud-storage's Python API. I'm not yet convinced this is worth adding on the basis of reducing dependencies, because it doesn't reduce them.

Is there some kind of feature or capability that GCS provides that S3 does not? And are we currently taking, or planning to take, advantage of those capabilities? Otherwise, I'm having a really hard time seeing why these changes are needed.

Maybe my reading of the above quote is off. Are you perhaps saying that using boto3 with GCS would also require google-cloud-storage's Python API?

@scottwittenburg (Contributor)

I just noticed some cleanup jobs failing in gitlab pipelines, to do with the recursive argument to remove_url(). Here's a link to a failing job, though you can find it from the gitlab check on your PR as well.

I think the cleanup job uses mirror.destroy, which invokes the remove_url() method. Maybe a quick grep for remove_url in the repo would find any other call sites.

But @douglasjacobsen I'm curious if you saw some documentation indicating that we no longer need the special handling for recursive removal of a url? I would be happy to be enlightened if you could share a link. Thanks!

@douglasjacobsen (Contributor Author)

Yes, using the gs:// URL scheme would require using the google-cloud-storage API rather than the boto3 API. Using boto3 with GCS does not require the google-cloud-storage API.

These changes allow users who are already using the Google Cloud SDK to access GCS buckets with their SDK credentials and the default endpoint, similar to how the default endpoint for s3:// is Amazon's S3. So the goal of this PR is primarily to lower the barrier to entry for people using Google tools to access GCS buckets, and to provide an implementation that mirrors what the s3 backend currently does (relying on Amazon tools and libraries).

@douglasjacobsen (Contributor Author)

@scottwittenburg: Sorry, this is a bug. I'm going to work on fixing it ASAP. I think the version we started working on didn't have the recursive destruction of these URLs, and when I rebased, these changes didn't translate properly.

For the GCS backend, we could use force to try to make the API do a recursive destruction; however, that fails if there are more than 256 objects in the bucket. So it's likely safer to iterate, similar to how the s3 backend works.
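To illustrate the iterative approach being described, here is a minimal sketch; it is not the PR's actual code. `FakeBucket` and `FakeBlob` are in-memory stand-ins for `google.cloud.storage`'s `Bucket` and `Blob`, whose `list_blobs(prefix=...)` and `delete()` calls have the same shape.

```python
class FakeBlob:
    """In-memory stand-in for google.cloud.storage.Blob."""
    def __init__(self, bucket, name):
        self._bucket = bucket
        self.name = name

    def delete(self):
        del self._bucket.blobs[self.name]


class FakeBucket:
    """In-memory stand-in for google.cloud.storage.Bucket."""
    def __init__(self, names):
        self.blobs = {n: FakeBlob(self, n) for n in names}

    def list_blobs(self, prefix=""):
        # Snapshot the matches so deleting while iterating is safe.
        return [b for n, b in sorted(self.blobs.items()) if n.startswith(prefix)]


def destroy_prefix(bucket, prefix):
    """Delete every object under ``prefix`` one at a time, avoiding any
    bulk/"force" deletion limits."""
    for blob in bucket.list_blobs(prefix=prefix):
        blob.delete()


bucket = FakeBucket(["cache/a", "cache/b", "other/c"])
destroy_prefix(bucket, "cache/")
print(sorted(bucket.blobs))  # ['other/c']
```

With the real library, the loop body stays the same; only the bucket object comes from an authenticated `google.cloud.storage.Client` instead.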

@douglasjacobsen (Contributor Author)

We should probably resolve the discussion about whether or not this is useful before I put more time into fixing the PR though. :)

@opadron (Member) left a comment

I'm quite confused by these proposed changes. They concern me because I worry that they might be unnecessarily adding new code paths to access GCS, which, as far as I understand, is already accessible via an s3:// URL and boto.

Besides the above concern, I notice that a number of the proposed changes modify how s3:// URLs are handled, as well as other minor details of how Spack operates, without a clear reason. The PR author themselves notes that one of these changes is unrelated and should be removed. Perhaps this is supposed to be a WIP and not yet ready for review?

@douglasjacobsen if I am in error, please accept my apologies. Help us to understand your intent and what we can do to help.

@opadron (Member) commented Sep 30, 2021

@douglasjacobsen so, I just noticed your comments from the last 15 minutes or so. Thanks for helping to clear up some of my questions.

I think a big part of what you're going for is a good developer experience for users of GCS tooling, is that right? I can see how it would be more convenient to just use existing tooling, but I think we should balance that desire against the cost of maintaining both gcs and s3 URL handling versus just s3 URLs. What does the experience look like if someone using the Google Cloud SDK needs to install boto and access GCS through our s3 URL handling? Are there a lot of extra steps the user would need to take? Other barriers to use? These are perhaps some things we should consider.

@douglasjacobsen (Contributor Author)

@opadron Thanks for the discussion. And I should probably apologize for the changes that are irrelevant. This should not change the s3 backend, and it should not modify how spack works (i.e. the recursive removal is an accident).

This is a branch that was started a long time ago (#24422) and I've been working with the original author to fix some of the issues we had with it. The unrelated changes that are in the PR are simply an oversight on my part when I rebased the branch to fix some of the merge conflicts.

Regarding user experience, yes, that's exactly what I'm going for. The steps a user needs to go through to access a GCS bucket with the existing s3 backend are something like the following (which, honestly, are not very well documented, and it takes a lot for a naive user to figure out):

  1. Install boto3 (obvious if you're using an s3 bucket, but seems awkward if you're already using google tooling)
  2. Set the S3_ENDPOINT_URL env-var to https://storage.googleapis.com/download/storage/v1 (not very well documented anywhere)
  3. Configure Boto to access GCS resources (something like this: https://cloud.google.com/storage/docs/boto-gsutil)
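For concreteness, steps 2 and 3 above might look like the following shell sketch. The endpoint URL is the one quoted in step 2; the HMAC credential variables are an assumption based on GCS's S3-interoperability mode, and all values are placeholders.

```shell
# Step 2: point Spack's existing s3:// handling at GCS
export S3_ENDPOINT_URL="https://storage.googleapis.com/download/storage/v1"

# Step 3 (assumption): GCS HMAC interoperability credentials so boto3 can
# authenticate against the bucket; placeholder values only
export AWS_ACCESS_KEY_ID="GOOG1EXAMPLEHMACKEY"
export AWS_SECRET_ACCESS_KEY="example-hmac-secret"
```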

Whereas with this PR, the steps are similar to the user experience with an s3:// hosted cache/mirror.

  1. Install google-cloud-storage
  2. Authenticate with gcloud auth application-default login

Then you can read from gs:// based caches and mirrors.
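After those two steps, a mirror pointing at a bucket could be configured like any other; a sketch of a mirrors.yaml entry (the bucket name is hypothetical) might look like:

```yaml
mirrors:
  gcs_cache: gs://my-spack-cache
```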

@opadron (Member) commented Sep 30, 2021

I agree that a better user experience could be a compelling argument in favor of changes like these.

However, the best trade-off between this interest and maintainability is not clear. It'd be great if we could have other Spack devs and users weigh in, especially if they're also GCS users.

@tgamblin @adamjstewart @alalazo @gartung @scheibelp: would love to hear your opinion.

Also, @eugeneswalker, I believe you are a user of GCS, is that right? What do you think about having code in Spack for handling gs:// urls?

@tgamblin (Member) commented Sep 30, 2021

@opadron: we really want Spack to be usable by the different cloud providers using their own tooling, and this is a pretty small PR as far as they go. I don't think it's a huge maintenance burden to support this, and it's supported in a very similar way to how we currently support boto3. I think if we don't merge this, we're biasing toward one particular cloud provider over another, which is not what we want.

If merging stuff like this enables folks at GCP to deliver Spack efficiently to their users, it ultimately helps the project and gets us a broader user base, which is what we want.

So I'd say please evaluate this considering the use of the GCP native libs as a bonus (the same way we consider boto support a bonus for AWS).

If there is a good way to support all of the object store interfaces smoothly without the native libraries, we could talk about that, but I think at this point we want buildcaches to be just as usable on whatever cloud you happen to be on.

@gartung (Member) commented Oct 1, 2021

I have no comment since we don't use this feature.

@opadron (Member) commented Oct 1, 2021

Thanks, I think @tgamblin made some good points for why we want native support for GCS. @douglasjacobsen, I think we're in agreement that this PR is useful and worth putting more time into. Let us know if you have any more questions.

@douglasjacobsen douglasjacobsen force-pushed the gcs_cache branch 3 times, most recently from 99d04d7 to 146bbda Compare October 4, 2021 14:31
@douglasjacobsen (Contributor Author)

@tgamblin, @opadron, and @scottwittenburg: OK, it seems like I've fixed most (if not all) of the CI checks. There's one still pending, but let me know if you see any other PR changes that you'd like me to make. I think I resolved all of the previous issues already.

This commit updates the GCS support to be more in line with how the S3 implementation works.
@spackbot-app spackbot-app bot added the tests General test capability(ies) label Oct 5, 2021
@opadron (Member) left a comment

This is looking really good! Additional feedback below.

@douglasjacobsen (Contributor Author)

@opadron I think I fixed all of the issues you spotted; let me know if you see any new ones, though.

opadron previously approved these changes Oct 6, 2021
@douglasjacobsen (Contributor Author)

@scottwittenburg I think I've fixed all of the issues you spotted as well. Feel free to let me know if there are any more modifications you'd like to see though.

scottwittenburg previously approved these changes Oct 7, 2021
@scottwittenburg (Contributor) left a comment

Looks good to me @douglasjacobsen, thanks!

@fluidnumerics-joe (Contributor)

Just wanted to throw in my vote of support for this feature. I've been working with @douglasjacobsen's contributions on my own fork on Google Cloud and have been able to host binary caches on GCS that considerably reduce build times for Google Compute Engine VM images.

@alalazo (Member) left a comment

Thanks for this PR! I started having a look at the code; let me know what you think of the comments.



@pytest.mark.parametrize('_fetch_method', ['curl', 'urllib'])
def test_gcsfetchstrategy_sans_url(_fetch_method):
@alalazo (Member):

Maybe:

Suggested change:
- def test_gcsfetchstrategy_sans_url(_fetch_method):
+ def test_gcsfetchstrategy_without_url(_fetch_method):

? But in any case 🎉 🇫🇷 🎉

@douglasjacobsen (Contributor Author):

👍🏻

import spack.util.web as web_util


def gcs_open(req, *args, **kwargs):
@alalazo (Member):

Where are the *args and **kwargs used? I am wondering why it is necessary to have them in the function signature.

@alalazo (Member):

It might be also good to add a brief docstring here.

@douglasjacobsen (Contributor Author):

I only added *args and **kwargs to keep the signature the same as the other functions that are used as function pointers here: https://github.com/douglasjacobsen/spack/blob/b9370ba0d7e54f7de37ef4d5f680479aeb741d11/lib/spack/spack/util/web.py#L552

Would you like me to change it so I don't have them in the signature? I'll work on the docstring in the background.
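The constraint being discussed — functions stored in a common dispatch table must share one signature, even if a given handler ignores parts of it — can be sketched like this. All names here are hypothetical illustrations, not Spack's actual code.

```python
def file_open(req, *args, **kwargs):
    # Stand-in handler that ignores the extra arguments entirely.
    return ("file", req)


def gcs_open(req, *args, **kwargs):
    # Stand-in for the GCS handler; a real one would return a blob stream.
    return ("gcs", req)


# Scheme -> handler table; every entry must be callable the same way.
HANDLERS = {"file": file_open, "gs": gcs_open}


def open_url(scheme, req, timeout=10):
    # The caller invokes every handler identically, whether or not that
    # handler actually uses `timeout` — hence *args/**kwargs above.
    return HANDLERS[scheme](req, timeout=timeout)


print(open_url("gs", "gs://bucket/spec.yaml"))  # ('gcs', 'gs://bucket/spec.yaml')
```

Without the catch-all parameters, any handler lacking a `timeout` keyword would raise `TypeError` when called through the shared table.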

@douglasjacobsen (Contributor Author) commented Oct 19, 2021

@alalazo: Let me know if these changes look good.

Thanks for the review!

@alalazo (Member) left a comment

This LGTM. There's only a minor comment on a thing I overlooked on first review.

Addressing review comments

Co-authored-by: Massimiliano Culpo <[email protected]>
@douglasjacobsen (Contributor Author)

No problem. Thanks @alalazo

For some reason, style errors came up in the last CI check, but I don't see them on my local machine. We'll see if they crop up again after it runs this time.

@alalazo alalazo merged commit d1d0021 into spack:develop Oct 22, 2021
@alalazo (Member) commented Oct 22, 2021

Thanks!

@douglasjacobsen douglasjacobsen deleted the gcs_cache branch October 22, 2021 14:47