Conversation
opadron
left a comment
One of the challenges I'm struggling with is doing a good job of engineering good APIs for these additions; ones that will fit well with the rest of the Spack codebase. I'm hoping some more experienced Spack hackers can help guide me.
url, digest, expand=expand, extension=extension))
# fetchers.insert(
#     0, fs.URLFetchStrategy(
#         url, digest, expand=expand, extension=extension))
Here, the URL scheme-based function is used to determine what fetch strategy to employ.
Question: In testing, I noticed that if one of the strategies throws, the whole fetch operation fails. I understand that what is preferred is that every URL should be tried (with their associated fetch strategies), and only if they all fail to download the source should the fetch operation fail. What do others think? If this behavior is, indeed, preferred, where would be the best places to make the necessary changes?
tgamblin
left a comment
@opadron: I haven't had a chance to go over all the questions here, but I added a bunch of review comments. In general, the high level plan looks good, and the refactors are good. However:
- A lot of the mirror code is repetitive and overly complex. I recommend introducing a `Mirror` class to take care of that. Currently, too much of the code deals directly with YAML instead of with a `Mirror` abstraction.
- Boxes are checked on the PR, but I don't see where the commands now have push or pull options. Where is that implemented?
- We discussed making `/var/spack/cache` the "default" mirror where build caches would be created and where things would be created if no `--directory` argument is specified for `spack mirror create` and `spack buildcache create`. Is that planned?
- There is a lot of stuff in here for parsing URLs and for things like `noopify`, and I am not sure how it factors in or why some of it is needed. Maybe we can talk about that at an upcoming meeting.
@tgamblin I've rebased and added two new commits. Adding the
    </html>
    '''

BUILD_CACHE_INDEX_ENTRY_TEMPLATE = '  <li><a href="{path}">{path}</a>'
Shouldn't we close out the <li> tag in BUILD_CACHE_INDEX_ENTRY_TEMPLATE?
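For illustration, a minimal sketch of the closed-tag variant (assuming the intent is simply a well-formed `index.html`; browsers tolerate the open `<li>`, but closing it is cleaner):

```python
# Sketch: the same template with the closing </li> tag added.
BUILD_CACHE_INDEX_ENTRY_TEMPLATE = '  <li><a href="{path}">{path}</a></li>'

# Formatting an entry for a hypothetical spec file name:
entry = BUILD_CACHE_INDEX_ENTRY_TEMPLATE.format(path='my-package.spec.yaml')
print(entry)
```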
lib/spack/spack/util/web.py
Outdated

verify_ssl = spack.config.get('config:verify_ssl')

if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_path):
- if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_path):
+ if __UNABLE_TO_VERIFY_SSL and verify_ssl and uses_ssl(remote_url):
I believe this should be changed; I'm currently seeing `Warning: 'str' object has no attribute 'scheme'`.
lib/spack/spack/util/s3.py
Outdated

def create_s3_session(url):
    parsed_url = urllib_parse.urlparse(
I'm hitting an issue here when I attempt to use util/web.py to push_to_url() a file. Below is some output illustrating the problem; basically, it seems like the url argument here may sometimes have already been parsed into a ParseResult.
scott@beast:/data/scott/projects/spack$ export AWS_ACCESS_KEY_ID=minio
scott@beast:/data/scott/projects/spack$ export AWS_SECRET_ACCESS_KEY=minio123
scott@beast:/data/scott/projects/spack$ export AWS_ENDPOINT_URL=http://localhost:8083
scott@beast:/data/scott/projects/spack$ source share/spack/setup-env.sh
scott@beast:/data/scott/projects/spack$ spack python
Spack version 0.12.1
Python 3.6.8, Linux x86_64
>>> import spack.util.web as web_util
>>> local_url = 'file:///tmp/junk.txt'
>>> remote_url = 's3://junkbox/stuff/junk.txt'
>>> web_util.push_to_url(local_url, remote_url)
Traceback (most recent call last):
File "/usr/lib/python3.6/code.py", line 91, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
File "/data/scott/projects/spack/lib/spack/spack/util/web.py", line 254, in push_to_url
s3 = s3_util.create_s3_session(remote_url)
File "/data/scott/projects/spack/lib/spack/spack/util/s3.py", line 17, in create_s3_session
allow_fragments=False)
File "/usr/lib/python3.6/urllib/parse.py", line 367, in urlparse
url, scheme, _coerce_result = _coerce_args(url, scheme)
File "/usr/lib/python3.6/urllib/parse.py", line 120, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
>>>
now exiting InteractiveConsole...
scott@beast:/data/scott/projects/spack$ spack python
Spack version 0.12.1
Python 3.6.8, Linux x86_64
>>> import spack.util.web as web_util
>>> local_url = 'file:///tmp/junk.txt'
>>> remote_url = 's3://junkbox/stuff/junk.txt'
>>> web_util.push_to_url(local_url, remote_url)
> /data/scott/projects/spack/lib/spack/spack/util/s3.py(16)create_s3_session()
-> parsed_url = urllib_parse.urlparse(
(Pdb) l
11
12
13 def create_s3_session(url):
14 import pdb
15 pdb.set_trace()
16 -> parsed_url = urllib_parse.urlparse(
17 url,
18 scheme='file',
19 allow_fragments=False)
20
21 if parsed_url.scheme != 's3':
(Pdb) p url
ParseResult(scheme='s3', netloc='junkbox', path='/stuff/junk.txt', params='', query='', fragment='')
(Pdb)
The following patch fixed it for me and I was able to push the local file into the S3 bucket:
scott@beast:/data/scott/projects/spack$ git diff
diff --git a/lib/spack/spack/util/s3.py b/lib/spack/spack/util/s3.py
index 2d0a34a29..095b057b9 100644
--- a/lib/spack/spack/util/s3.py
+++ b/lib/spack/spack/util/s3.py
@@ -11,10 +11,12 @@ import spack
def create_s3_session(url):
- parsed_url = urllib_parse.urlparse(
- url,
- scheme='file',
- allow_fragments=False)
+ parsed_url = url
+ if not isinstance(parsed_url, urllib_parse.ParseResult):
+ parsed_url = urllib_parse.urlparse(
+ url,
+ scheme='file',
+ allow_fragments=False)
if parsed_url.scheme != 's3':
raise ValueError(
scott@beast:/data/scott/projects/spack$
To be clear, I'm not suggesting that is necessarily the right solution, but hopefully it makes clear what could be going wrong.
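The behavior can be reproduced with plain urllib; a minimal sketch of an idempotent parse helper along the lines of the patch above (the name `parse_url` is illustrative, not Spack's actual function):

```python
from urllib import parse as urllib_parse


def parse_url(url):
    """Parse ``url`` if it is a string; pass it through unchanged if it
    has already been parsed (the check the patch above adds inline)."""
    if isinstance(url, urllib_parse.ParseResult):
        return url
    return urllib_parse.urlparse(url, scheme='file', allow_fragments=False)


once = parse_url('s3://junkbox/stuff/junk.txt')
twice = parse_url(once)  # safe: already-parsed input is returned as-is
print(twice.scheme, twice.netloc, twice.path)
```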
gartung
left a comment
It is amazing to see how far my simple idea for downloading the spec files from an index.html has evolved.
@opadron: this is still failing CI.
The urlparse wrapper is added back in. I found that leaving it out and trying to stick with urllib's parse created too many problems, first with duplicated logic at almost every call site, and more importantly when trying to correctly work with relative file paths. The added logic doesn't just reduce boilerplate, but cleanly handles file paths without breaking our URL abstraction. Ultimately, I think adding the wrapper makes the code simpler. This reverts commit bc31930.
@scheibelp: I think this is going to have a few conflicts with #12940 -- can you take a look at this one and make a recommendation about which we should merge first? I think it may be easier to put this in first, then adapt #12940 around it, but I could be wrong.

@opadron @tgamblin I just wanted to chime in here that after the latest commits, this branch is working for me. I'm using it mainly in the context of build pipelines, where it's working for pushing buildcaches to mirrors, installing dependencies from them, and downloading buildcache files from the remote binary mirror to the local artifacts buildcache. FYI @aashish24

That works for me. My only thought on which one to merge first is that merging the bigger one first might save some time (and this one is bigger). That being said, if review for this takes a while, I wouldn't consider it problematic to merge #12940 first (but there are still review comments to address there).
Thanks @scottwittenburg. @opadron, what do you think? If all looks good, I would think it is better to merge this one, since it is long-standing @scheibelp and may help with the release.

Yeah, sorry if I wasn't clear, but in the absence of any further requests for changes, this should be ready for merge.
@opadron: Given that this is allowing @scottwittenburg to make progress on pipelines, and given that we need the initial S3 support, I think we can merge this. It is not on the critical path for other core parts of Spack, and it at least gets some things working.
That said, for nearly every other Spack PR, we expect there to be tests. This PR has 46% coverage and reduces overall coverage on the project by 0.5%. We need to get this code covered with tests ASAP after it is merged. I also put some comments below -- those should also be addressed in a near-term follow-up PR.
@scheibelp: I tried to merge and took a look at the potential conflicts between this and #12940, and there do not seem to be very many. There are some that are longer regions of code, but the sub-diffs are short. I am hoping this will not take too much time to resolve.
shutil.move(index_html_path_tmp, index_html_path)

Creates (or replaces) the "index.html" page at the location given in
cache_prefix. This page contains a link for each binary package (*.yaml)
and signing key (*.key) under cache_prefix.
Is it a signing key or a public key for the cache? The signing key really shouldn't be in the cache.
def push_to_url(local_path, remote_path, **kwargs):
    keep_original = kwargs.get('keep_original', True)
Make this an explicit kwarg (keep_original=True instead of **kwargs) -- only use **kwargs if you have to, as you get no argument name checking when you use it like this.
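A sketch of the suggested signature change (the body here is a stub for illustration, not Spack's implementation):

```python
def push_to_url(local_path, remote_path, keep_original=True):
    # With an explicit keyword argument, a typo such as keep_orignal=False
    # raises TypeError immediately instead of being silently swallowed the
    # way it would be with **kwargs plus kwargs.get(...).
    return (local_path, remote_path, keep_original)


print(push_to_url('file:///tmp/junk.txt', 's3://bucket/junk.txt',
                  keep_original=False))
```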
if scheme == 'file':
    path = spack.util.path.canonicalize_path(netloc + path)
    while path.startswith('//'):
        path = path[1:]
possibly a matter of taste, but I find path = re.sub(r'^/+', '/', path) to be clearer.
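For what it's worth, the two forms agree on collapsing any run of leading slashes down to one:

```python
import re


def strip_loop(path):
    # the loop from the snippet above
    while path.startswith('//'):
        path = path[1:]
    return path


def strip_re(path):
    # the suggested one-liner
    return re.sub(r'^/+', '/', path)


for p in ['//tmp/junk', '///a/b/c', '/ok']:
    assert strip_loop(p) == strip_re(p)
    print(strip_re(p))
```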
'https://mirror.spack.io/build_cache/my-package'
"""
base_url = parse(base_url)
resolve_href = kwargs.get('resolve_href', False)
use an explicitly named arg (resolve_href=...), not **kwargs.
the relative URL.

If resolve_href is False (default), then the URL path components are joined
as in os.path.join().
This makes me worry that there is too much complexity in this function (especially since there are no tests). This function is only used once with resolve_href=True, which kind of makes me wonder why we didn't just use os.path.join in all the other places and have a special function for the one weird place. That said, the places where this is used are cleaner now (they weren't using os.path.join before) so let's keep it like it is for now.
"Server": set(('Server', 'server'))
}
This whole dictionary is unnecessarily complicated. See below.
- content Length

... and any other header name, such as "Content-encoding", would not be
altered, regardless of spelling.

- contentlength
- content_Length
- contentLength
- content Length
Instead of the confusing dictionary way up above, why not:

_canonical_headers = [
    "Accept-ranges",
    "Content-length",
    "Content-type",
    "Date",
    "Last-modified",
    "Server",
]

# Annoying Python 2.6 compatible dict constructor
_lookup = dict((h.lower().replace('-', ''), h) for h in _canonical_headers)

def canonicalize(header_name):
    key = re.sub(r'^([A-Za-z]+)[ _-]([A-Za-z]+)$', r'\1\2', header_name).lower()
    return _lookup.get(key)  # None if not present

Just look up the lowercase, concatenated version and return the standard version instead of creating a massive, hard-to-read cross-product dictionary.
Or if you want something even simpler that doesn't require any fancy lookups:
def get_header(headers, header_name):
    def canonical(name):
        return re.sub(r'[ _-]', '', name).lower()
    return next((v for h, v in headers.items()
                 if canonical(h) == canonical(header_name)), None)

...

# get any of the matching strings you care about
content_type = get_header(resp.headers, 'Content-type')

that of their corresponding item in headers, or if the keys of multiple
items in headers map to the same key after being standardized.

In all other cases headers is returned unaltered.
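A self-contained check of the get_header approach (note that the inner canonical helper must operate on its own argument so that each candidate header key is normalized, not just the queried name):

```python
import re


def get_header(headers, header_name):
    def canonical(name):
        # strip spaces, underscores, and hyphens, then lowercase, so that
        # "Content-length", "content_length", and "contentLength" all match
        return re.sub(r'[ _-]', '', name).lower()
    return next((v for h, v in headers.items()
                 if canonical(h) == canonical(header_name)), None)


headers = {'Content-type': 'text/plain', 'content_length': '42'}
print(get_header(headers, 'Content-Type'))
print(get_header(headers, 'contentlength'))
print(get_header(headers, 'X-Missing'))
```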
There was a problem hiding this comment.
AFAICT, this function is used exactly once, in _s3_open. How many of these cases are relevant for that use case? I think this function does too much -- and it doesn't say what issue it's really trying to solve.
Unless _s3_open really needs all this, I'd much rather get rid of this whole thing and use one of the simpler approaches above for _s3_open so that we do not have to maintain all this code along with tests for every possible code path in here.
if not self.archive_file:
    raise NoArchiveFileError("Cannot call archive() before fetching.")

shutil.copyfile(self.archive_file, destination)
@opadron I think the following is causing #13404:

- The archive file is determined in part by the basename of the URL, in this case `libuuid-1.0.3.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Flibuuid%2F&ts=1433881396&use_mirror=iweb`
- Originally the `shutil.copyfile` command happily used this as the source name of the file (which was weird, but it worked)
- Now the `web.push_to_url` function is used, and it modifies the supplied local path with `url_util.parse(local_path)`; the formatted filename, stripped of the extra `?r=http%3A%2F%2Fsourceforge.net...`, does not exist (EDIT: added explanation for clarity)

I think that:

- `web.push_to_url` should not format the input path (the local file is what it is), i.e. remove the `url_util.parse(local_path)` call
- arguably, the archive filename should be formatted better (though technically that is not required), which is to say that the archive name should be based on `url_util.parse` of the basename of the URL

Would just removing `url_util.parse(local_path)` cause problems that you know of?
…ed to it (#13408) fd58c98 formats the `Stage`'s `archive_path` in `Stage.archive` (as part of `web.push_to_url`). This is not needed, and if the formatted path differs from the original (for example, if the archive file name contains a URL query suffix), then the copy fails. This removes the formatting that occurs in `web.push_to_url`. We should figure out a way to handle bad cases like this *and* to have nicer filenames for downloaded files. One option that would work in this particular case would be to also pass `-J` / `--remote-header-name` to `curl`. We'll need to do follow-up work to determine if we can use `-J` everywhere. See also: #11117 (comment)
This extends Spack functionality so that it can fetch sources and binaries from, push sources and binaries to, and index the contents of mirrors hosted in an S3 bucket.
High level to-do list:
- [x] Extend mirrors configuration to add support for `file://`, and `s3://` URLs.
- [x] Ensure all fetching, pushing, and indexing operations work for `file://` URLs.
- [x] Implement S3 source fetching
- [x] Implement S3 binary mirror indexing
- [x] Implement S3 binary package fetching
- [x] Implement S3 source pushing
- [x] Implement S3 binary package pushing
Important details:
* refactor URL handling to handle S3 URLs and mirror URLs more gracefully.
- updated parse() to accept already-parsed URL objects. an equivalent object
is returned with any extra s3-related attributes intact. Objects created with
urllib can also be passed, and the additional s3 handling logic will still be applied.
* update mirror schema/parsing (mirror can have separate fetch/push URLs)
* implement s3_fetch_strategy/several utility changes
* provide more feature-complete S3 fetching
* update buildcache create command to support S3
* Move the core logic for reading data from S3 out of the s3 fetch strategy and into
  the s3 URL handler. The s3 fetch strategy now calls into `read_from_url()`. Since
  `read_from_url()` can now handle S3 URLs, the S3 fetch strategy is redundant. It's
  not clear whether the ideal design is to have S3 fetching functionality in a fetch
  strategy, directly implemented in `read_from_url()`, or both.
* expanded what can be passed to `spack buildcache` via the -d flag: In addition
to a directory on the local filesystem, the name of a configured mirror can be
passed, or a push URL can be passed directly.
* fix docstring in generate_package_index() referring to "public" keys as "signing" keys
* use explicit kwargs in push_to_url()
* simplify url_util.parse() per tgamblin's suggestion
* replace standardize_header_names() with the much simpler get_header()
* add some basic tests
* update s3_fetch tests
* update S3 list code to strip leading slashes from prefix
* correct minor warning regression introduced in #11117
* add more tests
* flake8 fixes
* add capsys fixture to mirror_crud test
* add get_header() tests
* use get_header() in more places
* incorporate review comments