
S3 upload and download #11117

Merged: tgamblin merged 45 commits into spack:develop from opadron:s3-upload-and-download on Oct 22, 2019

Conversation

@opadron
Member

@opadron opadron commented Apr 5, 2019

This extends Spack so that it can fetch sources and binaries from, push sources and binaries to, and index the contents of mirrors hosted on an S3 bucket.

Rough to-do list:

  • Extend mirrors configuration to add support for file://, and s3:// URLs.
  • Ensure all fetching, pushing, and indexing operations work for file:// URLs.
  • Implement S3 source fetching
  • Implement S3 binary mirror indexing
  • Implement S3 binary package fetching
  • Implement S3 source pushing
  • Implement S3 binary package pushing

Member Author

@opadron opadron left a comment


One of the challenges I'm struggling with is engineering good APIs for these additions, ones that will fit well with the rest of the Spack codebase. I'm hoping some more experienced Spack hackers can help guide me.

url, digest, expand=expand, extension=extension))
# fetchers.insert(
# 0, fs.URLFetchStrategy(
# url, digest, expand=expand, extension=extension))
Member Author

@opadron opadron Apr 5, 2019


Here, the URL scheme is used to determine which fetch strategy to employ.

Question: In testing, I noticed that if one of the strategies throws, the whole fetch operation fails. I understand that what is preferred is that every URL should be tried (with their associated fetch strategies), and only if they all fail to download the source should the fetch operation fail. What do others think? If this behavior is, indeed, preferred, where would be the best places to make the necessary changes?
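To make the preferred behavior concrete, here is a minimal sketch of a try-each-strategy loop. The `fetchers` callables and the `FetchError` class are hypothetical stand-ins, not Spack's actual APIs:

```python
class FetchError(Exception):
    """Raised only after every candidate strategy has failed (stand-in)."""


def fetch_with_fallback(fetchers):
    """Try each fetch strategy in order; fail only if all of them fail."""
    errors = []
    for fetcher in fetchers:
        try:
            return fetcher()
        except Exception as e:
            errors.append(e)  # remember why this strategy failed, keep going
    raise FetchError("all fetch strategies failed: %r" % errors)
```

With this shape, a single throwing strategy no longer aborts the whole fetch; the operation fails only once every URL has been tried.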

This comment was marked as resolved.

@opadron force-pushed the s3-upload-and-download branch from bdfe302 to 1c5614b on May 24, 2019 17:43
@opadron force-pushed the s3-upload-and-download branch from 9572c33 to 1183971 on June 11, 2019 19:09
Member

@tgamblin tgamblin left a comment


@opadron: I haven't had a chance to go over all the questions here, but I added a bunch of review comments. In general, the high level plan looks good, and the refactors are good. However:

  1. A lot of the mirror code is repetitive and overly complex. I recommend introducing a Mirror class to take care of that. Currently too much of the code deals directly with YAML instead of with a Mirror abstraction.
  2. Boxes are checked on the PR, but I don't see where the commands now have push or pull options. Where is that implemented?
  3. We discussed making /var/spack/cache the "default" mirror where build caches would be created and where things would be created if no --directory argument is specified for spack mirror create and spack buildcache create. Is that planned?
  4. There is a lot of stuff in here for parsing URLs and for things like noopify, and I am not sure how it factors in or why some of it is needed. Maybe we can talk about that at an upcoming meeting.
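A minimal sketch of the kind of `Mirror` abstraction suggested in point 1, assuming a `mirrors.yaml` entry is either a plain URL string or a dict with `fetch`/`push` keys. All names here are hypothetical illustrations, not Spack's final API:

```python
class Mirror:
    """Wrap one mirror's config entry so callers never touch raw YAML dicts."""

    def __init__(self, fetch_url, push_url=None, name=None):
        self.name = name
        self._fetch_url = fetch_url
        self._push_url = push_url

    @property
    def fetch_url(self):
        return self._fetch_url

    @property
    def push_url(self):
        # fall back to the fetch URL when no separate push URL is configured
        return self._push_url or self._fetch_url

    @classmethod
    def from_yaml(cls, name, entry):
        # an entry is either a plain URL string or a dict with fetch/push keys
        if isinstance(entry, str):
            return cls(entry, name=name)
        return cls(entry['fetch'], entry.get('push'), name=name)
```

Callers then ask a `Mirror` for its fetch or push URL instead of re-implementing the string-vs-dict handling at every site.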

@opadron
Member Author

opadron commented Sep 3, 2019

@tgamblin I've rebased and added two new commits. Adding the Mirror and MirrorCollection classes went a long way towards cleaning up a lot of the more minor points you had brought up. Have a look and let me know if you think I'm on the right track.

@opadron force-pushed the s3-upload-and-download branch from 715ff99 to 14758e1 on September 9, 2019 15:36
@opadron force-pushed the s3-upload-and-download branch from 14758e1 to 3b00298 on September 19, 2019 17:13
@opadron force-pushed the s3-upload-and-download branch from 3b00298 to aab456b on October 2, 2019 16:23
</html>
'''

BUILD_CACHE_INDEX_ENTRY_TEMPLATE = ' <li><a href="{path}">{path}</a>'
Contributor

@eugeneswalker eugeneswalker Oct 7, 2019


Shouldn't we close out the <li> tag in BUILD_CACHE_INDEX_ENTRY_TEMPLATE?
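For reference, a version of the template with the closing tag added would presumably look something like this (a sketch of the suggested fix, not necessarily the exact change that landed):

```python
# same template as the snippet above, with the closing </li> added
BUILD_CACHE_INDEX_ENTRY_TEMPLATE = ' <li><a href="{path}">{path}</a></li>'
```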

Member Author


Good catch!


verify_ssl = spack.config.get('config:verify_ssl')

if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_path):
Contributor


Suggested change
if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_path):
if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_url):

Believe this should be changed, currently experiencing a Warning: 'str' object has no attribute 'scheme'.
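The warning suggests `uses_ssl` received a raw string where it expected a parsed URL. A hedged sketch of a helper that tolerates both input types, which would avoid the `'str' object has no attribute 'scheme'` warning (this is an illustration, not the actual Spack code):

```python
from urllib.parse import urlparse, ParseResult


def uses_ssl(url):
    """Accept either a raw URL string or an already-parsed result."""
    parsed = url if isinstance(url, ParseResult) else urlparse(url)
    # treat s3:// like https://, since S3 transfers use TLS by default
    return parsed.scheme in ('https', 's3')
```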

Member Author


Thanks for that, @paulbry!



def create_s3_session(url):
parsed_url = urllib_parse.urlparse(
Contributor

@scottwittenburg scottwittenburg Oct 10, 2019


I'm hitting an issue here when I attempt to use util/web.py to push_to_url() a file. Below is some output illustrating the problem, but basically, it seems that the url argument here may sometimes have already been parsed as a URL.

scott@beast:/data/scott/projects/spack$ export AWS_ACCESS_KEY_ID=minio
scott@beast:/data/scott/projects/spack$ export AWS_SECRET_ACCESS_KEY=minio123
scott@beast:/data/scott/projects/spack$ export AWS_ENDPOINT_URL=http://localhost:8083
scott@beast:/data/scott/projects/spack$ source share/spack/setup-env.sh 
scott@beast:/data/scott/projects/spack$ spack python
Spack version 0.12.1
Python 3.6.8, Linux x86_64
>>> import spack.util.web as web_util
>>> local_url = 'file:///tmp/junk.txt'
>>> remote_url = 's3://junkbox/stuff/junk.txt'
>>> web_util.push_to_url(local_url, remote_url)
Traceback (most recent call last):
  File "/usr/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/data/scott/projects/spack/lib/spack/spack/util/web.py", line 254, in push_to_url
    s3 = s3_util.create_s3_session(remote_url)
  File "/data/scott/projects/spack/lib/spack/spack/util/s3.py", line 17, in create_s3_session
    allow_fragments=False)
  File "/usr/lib/python3.6/urllib/parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/usr/lib/python3.6/urllib/parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
>>> 
now exiting InteractiveConsole...
scott@beast:/data/scott/projects/spack$ spack python
Spack version 0.12.1
Python 3.6.8, Linux x86_64
>>> import spack.util.web as web_util
>>> local_url = 'file:///tmp/junk.txt'
>>> remote_url = 's3://junkbox/stuff/junk.txt'
>>> web_util.push_to_url(local_url, remote_url)
> /data/scott/projects/spack/lib/spack/spack/util/s3.py(16)create_s3_session()
-> parsed_url = urllib_parse.urlparse(
(Pdb) l
 11  	
 12  	
 13  	def create_s3_session(url):
 14  	    import pdb
 15  	    pdb.set_trace()
 16  ->	    parsed_url = urllib_parse.urlparse(
 17  	        url,
 18  	        scheme='file',
 19  	        allow_fragments=False)
 20  	
 21  	    if parsed_url.scheme != 's3':
(Pdb) p url
ParseResult(scheme='s3', netloc='junkbox', path='/stuff/junk.txt', params='', query='', fragment='')
(Pdb)

Contributor


The following patch fixed it for me and I was able to push the local file into the S3 bucket:

scott@beast:/data/scott/projects/spack$ git diff
diff --git a/lib/spack/spack/util/s3.py b/lib/spack/spack/util/s3.py
index 2d0a34a29..095b057b9 100644
--- a/lib/spack/spack/util/s3.py
+++ b/lib/spack/spack/util/s3.py
@@ -11,10 +11,12 @@ import spack
 
 
 def create_s3_session(url):
-    parsed_url = urllib_parse.urlparse(
-        url,
-        scheme='file',
-        allow_fragments=False)
+    parsed_url = url
+    if not isinstance(parsed_url, urllib_parse.ParseResult):
+        parsed_url = urllib_parse.urlparse(
+            url,
+            scheme='file',
+            allow_fragments=False)
 
     if parsed_url.scheme != 's3':
         raise ValueError(
scott@beast:/data/scott/projects/spack$

Contributor


To be clear, I'm not suggesting that is necessarily the right solution, but hopefully it makes clear what could be going wrong.

Member

@gartung gartung left a comment


It is amazing to see how far my simple idea for downloading the spec files from an index.html has evolved.

@opadron force-pushed the s3-upload-and-download branch from 10bf996 to aa8f5cb on October 14, 2019 18:20
@tgamblin
Member

@opadron: this is still failing CI.

Omar Padron added 3 commits October 18, 2019 12:04
The urlparse wrapper is added back in.  I found
that leaving it out and trying to stick with
urllib's parse created too many problems, first
with duplicated logic at almost every call site,
and more importantly when trying to correctly
work with relative file paths.  The added logic
doesn't just reduce boilerplate, but cleanly
handles file paths without breaking our URL
abstraction.  Ultimately, I think adding the
wrapper makes the code simpler.

This reverts commit bc31930.
@opadron changed the title from "[WIP] S3 upload and download" to "S3 upload and download" on Oct 19, 2019
@tgamblin
Member

@scheibelp: I think this is going to have a few conflicts with #12940 -- can you take a look at this one and make a recommendation about which we should merge first? I think it may be easier to put this in first, then adapt #12940 around it, but I could be wrong.

@scottwittenburg
Contributor

@opadron @tgamblin I just wanted to chime in here that after the latest commits, this branch is working for me. I'm using it mainly in the context of build pipelines where it's working for pushing buildcaches to mirrors, installing dependencies from them, and downloading buildcache files from the remote binary mirror to the local artifacts buildcache.

FYI @aashish24

@scheibelp
Member

I think it may be easier to put this in first, then adapt #12940 around it

That works for me. My only thought on which one to merge first is that merging the bigger one first might save some time (and this one is bigger). That being said if review for this takes a while I wouldn't consider it problematic to merge #12940 first (but there are still review comments to address there).

@aashish24

thanks @scottwittenburg. @opadron, what do you think? If all looks good, I would think it is better to merge this one, since it is long-standing, @scheibelp, and it may help with the release.

@opadron
Member Author

opadron commented Oct 21, 2019

Yeah, sorry if I wasn't clear, but in the absence of any further requests for changes, this should be ready for merge.

Member

@tgamblin tgamblin left a comment


@opadron: Given that this is allowing @scottwittenburg to make progress on pipelines, and given that we need the initial S3 support, I think we can merge this. It is not on the critical path for other core parts of Spack, and it at least gets some things working.

That said, for nearly every other Spack PR, we expect there to be tests. This PR has 46% coverage and reduces overall coverage on the project by 0.5%. We need to get this code covered with tests ASAP after it is merged. I also put some comments below; those should be addressed in a near-term follow-up PR.

@scheibelp: I tried to merge and took a look at the potential conflicts between this and #12940, and there do not seem to be very many. There are some that are longer regions of code, but the sub-diffs are short. I am hoping this will not take too much time to resolve.

shutil.move(index_html_path_tmp, index_html_path)
Creates (or replaces) the "index.html" page at the location given in
cache_prefix. This page contains a link for each binary package (*.yaml)
and signing key (*.key) under cache_prefix.
Member


Is it a signing key or a public key for the cache? The signing key really shouldn't be in the cache.



def push_to_url(local_path, remote_path, **kwargs):
keep_original = kwargs.get('keep_original', True)
Member


Make this an explicit kwarg (keep_original=True instead of **kwargs) -- only use **kwargs if you have to, as you get no argument name checking when you use it like this.
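To illustrate the difference, a small sketch (the function names and bodies are placeholders, not the real signatures):

```python
def push_to_url_kwargs(local_path, remote_path, **kwargs):
    # a typo like keep_orginal=False is silently swallowed here
    return kwargs.get('keep_original', True)


def push_to_url_explicit(local_path, remote_path, keep_original=True):
    # the same typo raises TypeError at the call site instead
    return keep_original
```

Calling the `**kwargs` version with the misspelled `keep_orginal=False` quietly falls back to the default, while the explicit version rejects the unknown argument outright.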

if scheme == 'file':
path = spack.util.path.canonicalize_path(netloc + path)
while path.startswith('//'):
path = path[1:]
Member


possibly a matter of taste, but I find path = re.sub(r'^/+', '/', path) to be clearer.

'https://mirror.spack.io/build_cache/my-package'
"""
base_url = parse(base_url)
resolve_href = kwargs.get('resolve_href', False)
Member


use an explicitly named arg (resolve_href=...), not **kwargs.

the relative URL.

If resolve_href is False (default), then the URL path components are joined
as in os.path.join().
Member


This makes me worry that there is too much complexity in this function, especially since there are no tests. The function is only used once with resolve_href=True, which makes me wonder why we didn't just use os.path.join everywhere else and keep a special function for the one odd place. That said, the call sites are cleaner now (they weren't using os.path.join before), so let's keep it as it is for now.


"Server": set(('Server', 'server'))
}

Member


This whole dictionary is unnecessarily complicated. See below.

- content Length

... and any other header name, such as "Content-encoding", would not be
altered, regardless of spelling.
Member


why just these?

- contentlength
- content_Length
- contentLength
- content Length
Member


Instead of the confusing dictionary way up above, why not:

_canonical_headers = [
    "Accept-ranges",
    "Content-length",
    "Content-type",
    "Date",
    "Last-modified",
    "Server",
]

# Annoying Python 2.6 compatible dict constructor
_lookup = dict((h.lower().replace('-', ''), h) for h in _canonical_headers)

def canonicalize(header_name):
    key = re.sub(r'^([A-Za-z]+)[ _-]([A-Za-z]+)$', r'\1\2', header_name).lower()
    return _lookup.get(key)    # None if not present

Just lookup the lowercase, concatenated version and return the standard version instead of creating a massive, hard-to-read cross-product dictionary.

Or if you want something even simpler that doesn't require any fancy lookups:

def get_header(headers, header_name):
    def canonical(name):
        return re.sub(r'[ _-]', '', name).lower()

    return next((v for h, v in headers.items()
                 if canonical(h) == canonical(header_name)), None)

...
# get any of the matching strings you care about
content_type = get_header(resp.headers, 'Content-type')

that of their corresponding item in headers, or if the keys of multiple
items in headers map to the same key after being standardized.

In all other cases headers is returned unaltered.
Member


AFAICT, this function is used exactly once, in _s3_open. How many of these cases are relevant for that use case? I think this function does too much -- and it doesn't say what issue it's really trying to solve.

Unless _s3_open really needs all this, I'd much rather get rid of this whole thing and use one of the simpler approaches above for _s3_open so that we do not have to maintain all this code along with tests for every possible code path in here.

if not self.archive_file:
raise NoArchiveFileError("Cannot call archive() before fetching.")

shutil.copyfile(self.archive_file, destination)
Member

@scheibelp scheibelp Oct 23, 2019


@opadron I think the following is causing #13404:

  • The archive file is determined in part by the basename of the URL, in this case libuuid-1.0.3.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Flibuuid%2F&ts=1433881396&use_mirror=iweb
  • Originally the shutil.copyfile command happily used this as the source name of the file (which was weird, but it worked)
  • Now the web.push_to_url function is used, and it modifies the supplied local path with url_util.parse(local_path); the formatted filename, stripped of the extra ?r=http%3A%2F%2Fsourceforge.net..., does not exist (EDIT: added explanation for clarity)

I think that

  • web.push_to_url should not format the input path (the local file is what it is) i.e. remove the url_util.parse(local_path)
  • arguably the archive filename should be formatted better (but technically that is not required) which is to say that the archive name should be based on url_util.parse of the basename of the URL

Would just removing url_util.parse(local_path) cause problems that you know of?

tgamblin pushed a commit that referenced this pull request Oct 23, 2019
…ed to it (#13408)

fd58c98 formats the `Stage`'s `archive_path` in `Stage.archive` (as part of `web.push_to_url`). This is not needed, and if the formatted path differs from the original (for example, if the archive file name contains a URL query suffix), then the copy fails.

This removes the formatting that occurs in `web.push_to_url`.

We should figure out a way to handle bad cases like this *and* to have nicer filenames for downloaded files.  One option that would work in this particular case would be to also pass `-J` / `--remote-header-name` to `curl`.  We'll need to do follow-up work to determine if we can use `-J` everywhere.

See also: #11117 (comment)
@opadron deleted the s3-upload-and-download branch on October 28, 2019 15:41
jrmadsen pushed a commit to jrmadsen/spack that referenced this pull request Oct 30, 2019
This extends Spack so that it can fetch sources and binaries from, push sources and binaries to, and index the contents of mirrors hosted on an S3 bucket.

High level to-do list:

- [x] Extend mirrors configuration to add support for `file://`, and `s3://` URLs.
- [x] Ensure all fetching, pushing, and indexing operations work for `file://` URLs.
- [x] Implement S3 source fetching
- [x] Implement S3 binary mirror indexing
- [x] Implement S3 binary package fetching
- [x] Implement S3 source pushing
- [x] Implement S3 binary package pushing

Important details:

* refactor URL handling to handle S3 URLs and mirror URLs more gracefully.
  - updated parse() to accept already-parsed URL objects.  an equivalent object
    is returned with any extra s3-related attributes intact.  Objects created with
    urllib can also be passed, and the additional s3 handling logic will still be applied.

* update mirror schema/parsing (mirror can have separate fetch/push URLs)
* implement s3_fetch_strategy/several utility changes
* provide more feature-complete S3 fetching
* update buildcache create command to support S3

* Move the core logic for reading data from S3 out of the S3 fetch strategy and into
  the S3 URL handler.  The S3 fetch strategy now calls into `read_from_url()`.  Since
  `read_from_url()` can now handle S3 URLs, the S3 fetch strategy is redundant.  It's
  not clear whether the ideal design is to have S3 fetching functionality in a fetch
  strategy, directly implemented in `read_from_url()`, or both.

* expanded what can be passed to `spack buildcache` via the -d flag: In addition
  to a directory on the local filesystem, the name of a configured mirror can be
  passed, or a push URL can be passed directly.
jrmadsen pushed a commit to jrmadsen/spack that referenced this pull request Oct 30, 2019
opadron pushed a commit to opadron/spack that referenced this pull request Nov 4, 2019
opadron pushed a commit to opadron/spack that referenced this pull request Nov 5, 2019
opadron pushed a commit to opadron/spack that referenced this pull request Nov 14, 2019
opadron pushed a commit that referenced this pull request Dec 9, 2019
* fix docstring in generate_package_index() referring to "public" keys as "signing" keys

* use explicit kwargs in push_to_url()

* simplify url_util.parse() per tgamblin's suggestion

* replace standardize_header_names() with the much simpler get_header()

* add some basic tests

* update s3_fetch tests

* update S3 list code to strip leading slashes from prefix

* correct minor warning regression introduced in #11117

* add more tests

* flake8 fixes

* add capsys fixture to mirror_crud test

* add get_header() tests

* use get_header() in more places

* incorporate review comments