fetch_strategy: add support for wget #6967
Conversation
-            '-j',  # junk cookies
-            '-H',  # specify required License Agreement cookie
-            'Cookie: oraclelicense=accept-securebackup-cookie']
+        headers = ['Cookie: oraclelicense=accept-securebackup-cookie']
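A sketch of how a generic `headers` parameter can be rendered into each backend's flag syntax (curl's `-H`, wget's `--header=`); the helper names here are illustrative, not Spack's actual API:

```python
def curl_header_args(headers):
    """Render raw header strings as curl -H arguments."""
    args = []
    for header in headers:
        args.extend(['-H', header])
    return args


def wget_header_args(headers):
    """Render the same headers in wget's --header=... syntax."""
    return ['--header={0}'.format(h) for h in headers]


cookie = 'Cookie: oraclelicense=accept-securebackup-cookie'
curl_args = curl_header_args([cookie])  # ['-H', 'Cookie: oraclelicense=...']
wget_args = wget_header_args([cookie])
```

Keeping the headers as plain strings and translating at the call site is what lets one parameter serve both tools.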
Does this still fetch after this change? I figured the License Agreement flag is required.
It works for me. Which flag do you mean? The cookie header still gets set like before the change.
Force-pushed from ba0cf4d to 31c6566
@adamjstewart: Anything else I can do to get this merged? I have rebased it onto the current develop branch and verified that it works (jdk still fetches etc.).
adamjstewart left a comment:
Everything looks good to me. I don't have as much experience in the fetcher as I do elsewhere, so I'd like at least one other person to review this before merging. @tgamblin?
Question: would it be better and more robust if we just wrote our own fetch logic in urllib2 here? I ask because I am working on a multi-archive parallel fetcher and need a way to programmatically get progress information from the different downloads. That is a pain, I believe, with both curl and wget. But I’m hesitant to switch from them because curl, at least, does some magic under the hood with headers (user agent, etc.) to make them more amenable to modern sites. It also isn’t clear to me whether curl or Python (urllib2) is more likely to have SSL trust set up correctly. I’ve seen curl work when urllib doesn’t and vice versa. Any opinions?
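One way a Python-based fetcher could expose per-download progress programmatically is a chunked copy with a callback; this is a minimal sketch of the idea, with illustrative names rather than an existing Spack API:

```python
import io


def copy_with_progress(src, dst, total=None, chunk_size=64 * 1024,
                       callback=None):
    """Copy a readable stream to dst in chunks, reporting bytes copied.

    ``callback(done, total)`` is invoked after every chunk, which is
    enough to drive a per-download progress bar from Python instead of
    scraping curl/wget output.
    """
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if callback:
            callback(done, total)
    return done


# Simulated download: the callback sees cumulative progress.
src = io.BytesIO(b'x' * 150000)
dst = io.BytesIO()
seen = []
copied = copy_with_progress(src, dst, total=150000,
                            callback=lambda done, total: seen.append(done))
```

Any urllib2/requests/urllib3 response object that supports `read()` could be passed as `src`, so the progress logic stays independent of the HTTP backend.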
Some minimal operating system images only have wget installed (for example, Ubuntu). This adds support for wget to URLFetchStrategy and falls back to it if curl is not found. This also removes support for the curl_options parameter and replaces it with a more generic headers parameter.
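The fallback described above could look roughly like this; a sketch using Python 3's `shutil.which` (Spack's real detection in `URLFetchStrategy` differs in detail, and the lookup is injectable here only to keep the sketch testable):

```python
import shutil


def choose_fetch_tool(which=shutil.which):
    """Pick a command-line fetch tool, preferring curl over wget.

    Returns (tool_name, path) for the first tool found on PATH.
    """
    for tool in ('curl', 'wget'):
        path = which(tool)
        if path:
            return tool, path
    raise RuntimeError('neither curl nor wget found on PATH')
```

On a minimal Ubuntu image where only wget is installed, this selects wget; on most other systems it keeps the existing curl behavior.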
Force-pushed from 31c6566 to 3c87ca1
Sorry for not getting back to this earlier. I have now pushed a WIP commit that implements a fetcher using Requests (https://github.com/requests/requests). According to several sources, this is preferred over urllib2 etc. It is still missing error handling and could be cleaned up a bit. I just wanted to get feedback on whether this is something that we want to pursue. To try it, simply change to and it will fall back to the Requests fetcher.
Force-pushed from 3c87ca1 to 6c2f822
The only thing about requests is that it’s another dependency and I’d rather not vendor more if we don’t have to. Is requests more likely to work than urllib2? If not, I’m not too keen on including it when the standard library already has a decent alternative. Is there a functional advantage to requests? I know the API is great. The main problems people face with spack are:
I’d like a fetcher that doesn’t add a dependency, but that can fall back between these three options depending on which one works. The only problem with this is that we have to implement basically three fetch methods in one, and I want to make a fetcher that can fetch multiple things at once (like docker pull). I have the multi-status bars for that implemented (from a month or two ago), but I haven’t submitted a PR yet, as I was trying to decide how to hook them up to the fetcher. Using only the built-in

It might be worth implementing something to do a better job of finding the system certs and making Python aware of them (if that is possible), and figuring out how best to do proxies. Then we could just do urllib2. Or, if proxy setup is considerably easier with requests, that may make it a worthwhile option. What do you think?
I just gave urllib2 a try and it feels very clunky to use. SSL support is problematic (no verification at all before Python 2.7.9, and you need to use internal functions to force it off in some cases). If we want to throw out curl/wget and only use a Python module (which makes sense in my opinion), I do not think that urllib2 is the best option. It just feels like it could break at any time. Requests seems to have everything we would want: https://github.com/requests/requests#feature-support (SSL verification, proxy support, connection pooling, etc.)
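For reference, here is what "forcing verification off" looks like with the modern stdlib (Python 2.7.9+/3.4+ verify certificates by default); the sketch uses Python 3's `urllib.request`, urllib2's successor:

```python
import ssl
import urllib.request  # urllib2's Python 3 successor

# Build an explicit context that skips certificate checks entirely.
# check_hostname must be disabled before setting CERT_NONE.
insecure = ssl.create_default_context()
insecure.check_hostname = False
insecure.verify_mode = ssl.CERT_NONE

# urllib.request.urlopen(url, context=insecure) would then accept any
# certificate -- which is exactly why this should be a last resort.
```

The awkward part is that older interpreters needed private helpers (like the internal unverified-context constructor) to get the same effect, which is the clunkiness complained about above.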
Does requests do its own SSL and look on the system for certs in standard places? That would let us ditch the curl dependency altogether. One other thing we’d need that I assume it supports: resuming prior downloads. We support that in the current fetcher. How big is requests?
More on this: psf/requests#2966. What else do we need to bundle if we use requests? Looks like certifi and maybe certitude. I hope not too much else. I also worry about how much more reliable this will be than system curl, and that taking on requests means we’ll need to keep certs updated in spack (if we use certifi). That could be a pain. Thoughts?
Looks like we could make Requests use the system certificates by passing a path to
Apparently none of the Python modules support resumption out of the box but it should be pretty easy to do by just sending the Range header when necessary.
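The Range-header approach sketched out above is small; a hedged illustration (the function name is made up, and real resume logic would also have to check that the server honored the range via a 206 response):

```python
import os
import urllib.request


def resume_request(url, partial_path):
    """Build a request that resumes a partial download via Range.

    If a partial file already exists on disk, ask the server for only
    the bytes from the current file size onward.
    """
    req = urllib.request.Request(url)
    if os.path.exists(partial_path):
        offset = os.path.getsize(partial_path)
        if offset:
            req.add_header('Range', 'bytes={0}-'.format(offset))
    return req
```

The server replies with `206 Partial Content` when it supports ranges; a fetcher would append to the partial file in that case and restart from scratch on a plain `200`.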
Roughly 5,000 lines. The dependencies add quite a few lines, however. urllib3 alone is almost 10,000 lines.
The only hard requirements seem to be chardet, idna, urllib3, and certifi (but certifi can be dropped if we do something like Fedora).
I agree that vendoring so many dependencies is not optimal. I just took a closer look at urllib3, which seems to do most of the heavy lifting and has no dependencies. I guess we could use urllib3 directly and make it use the system certificates (which will probably require some work to get right on all distributions).
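Locating the system trust store is mostly a matter of probing well-known paths; a sketch of that distribution-specific work (the candidate list is illustrative and incomplete):

```python
import os

# Common system CA bundle locations; paths vary by distribution, which
# is the "work to get right on all distributions" mentioned above.
CA_BUNDLE_CANDIDATES = [
    '/etc/ssl/certs/ca-certificates.crt',  # Debian/Ubuntu
    '/etc/pki/tls/certs/ca-bundle.crt',    # RHEL/CentOS/Fedora
    '/etc/ssl/ca-bundle.pem',              # openSUSE
    '/etc/ssl/cert.pem',                   # macOS and some BSDs
]


def find_ca_bundle(candidates=CA_BUNDLE_CANDIDATES):
    """Return the first existing CA bundle path, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

The result could then be handed to a urllib3 pool (its `PoolManager` accepts a `ca_certs` path) so verification runs against the system store instead of a vendored certifi.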
Python 2 seems to be problematic with urllib3 as well: https://github.com/urllib3/urllib3/blob/master/docs/user-guide.rst#certificate-verification-in-python-2
@tgamblin Would you be willing to merge the wget support in its current form? If so, I would extract it into a separate PR and keep this one for playing around with native Python fetching.
@tgamblin is this something we still want or should we close this PR? It's been inactive for quite some time.
I just went through my PRs yesterday and was planning to come back to this one today. I would strip it down to only the wget support, which might be nice to have for minimal environments (especially Debian/Ubuntu, which only ship wget by default). But if we don't want to support two fetchers, I'll just close it. |
Some minimal operating system images only have wget installed (for example, Ubuntu). This adds support for wget to URLFetchStrategy and falls back to it if curl is not found.
This also removes support for the `curl_options` parameter and replaces it with a more generic `headers` parameter. (I have also replaced the curl arguments with their long versions to make the code easier to understand.)