
fetch_strategy: add support for wget #6967

Closed
michaelkuhn wants to merge 2 commits into spack:develop from michaelkuhn:wget-fetcher

Conversation

@michaelkuhn
Member

Some minimal operating system images only have wget installed (for example, Ubuntu). This adds support for wget to URLFetchStrategy and falls back to it if curl is not found.

This also removes support for the curl_options parameter and replaces it with a more generic headers parameter.

(I have also replaced the curl arguments with their long versions to make the code easier to understand.)

Before (curl_options):

    '-j',  # junk cookies
    '-H',  # specify required License Agreement cookie
    'Cookie: oraclelicense=accept-securebackup-cookie']

After (headers):

    headers = ['Cookie: oraclelicense=accept-securebackup-cookie']
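Since both curl and wget accept repeated `--header` options, a generic `headers` parameter can be translated into command-line arguments the same way for either tool. A minimal sketch (hypothetical helper, not the actual PR code):

```python
def header_args(headers):
    """Translate a list of HTTP header lines into command-line
    arguments. Both curl and wget accept repeated --header options,
    so the same mapping works for either fetcher."""
    args = []
    for header in headers:
        args.extend(['--header', header])
    return args

# e.g. the License Agreement cookie from the jdk package:
headers = ['Cookie: oraclelicense=accept-securebackup-cookie']
print(header_args(headers))
```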
Member


Does this still fetch after this change? I figured the License Agreement flag is required.

Member Author


It works for me. Which flag do you mean? The cookie header still gets set like before the change.

@michaelkuhn
Member Author

@adamjstewart: Anything else I can do to get this merged? I have rebased it to the current develop branch and verified that it works (jdk still fetches etc.).

Member

@adamjstewart left a comment


Everything looks good to me. I don't have as much experience in the fetcher as I do elsewhere, so I'd like at least one other person to review this before merging. @tgamblin?

@tgamblin
Member

Question: would it be better and more robust if we just wrote our own fetch logic in urllib2 here? I ask because I am working on a multi-archive parallel fetcher and need a way to programmatically get progress information from the different downloads.

That is a pain, I believe, with both curl and wget. But I’m hesitant to switch from them because curl, at least, does some magic under the hood with headers (user agent, etc.) to make them more amenable to modern sites. It also isn’t clear to me whether curl or Python (urllib2) is more likely to have SSL trust set up correctly. I’ve seen curl work when urllib doesn’t and vice versa.

Any opinions?

@michaelkuhn
Member Author

Any opinions?

Sorry for not getting back to this earlier. I have now pushed a WIP commit that implements a fetcher using Requests (https://github.com/requests/requests). According to several sources, this is preferred over urllib2 etc.

It is still missing error handling and could be cleaned up a bit. I just wanted to get feedback whether this is something that we want to pursue.

To try it, simply change

self._fetcher = which('curl', 'wget')

to

self._fetcher = which('curl2', 'wget2')

and it will fall back to the Requests fetcher.
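A Requests-based fetch boils down to streaming the response body to disk in chunks. A minimal sketch of that idea (this is not the WIP commit itself; `fetch_with_requests` is a hypothetical name):

```python
import requests  # third-party; would need to be vendored in Spack


def fetch_with_requests(url, dest, verify_ssl=True):
    """Stream a URL to a local file in chunks. Sketch only: the
    real fetcher would also need error handling, progress
    reporting, and resume support."""
    with requests.get(url, stream=True, verify=verify_ssl) as response:
        response.raise_for_status()
        with open(dest, 'wb') as f:
            for chunk in response.iter_content(chunk_size=65536):
                f.write(chunk)
    return dest
```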

@tgamblin
Member

tgamblin commented Jul 11, 2018

The only thing about requests is that it’s another dependency and I’d rather not vendor more if we don’t have to. Is requests more likely to work than urllib2? If not, I’m not too keen on including it when the standard library already has a decent alternative. Is there a functional advantage to requests? I know the API is great.

The main problems people face with spack are:

  1. Either curl or wget isn’t installed
  2. Either of these doesn’t have ssl set up properly
  3. Python’s ssl isn’t set up properly, so we need to use curl or wget instead of urllib2.

I’d like a fetcher that doesn’t add a dependency, but that can fall back between these three options depending on which one works.

The only problem with this is that we have to implement basically three fetch methods in one, and I want to make a fetcher that can fetch multiple things at once (like docker pull). I have the multi-status bars for that implemented (from a month or two ago), but I haven’t submitted a PR yet, as I was trying to decide how to hook them up to the fetcher. Using only the built-in urllib2 would simplify the implementation considerably (no parsing curl/wget output), but there is still the problem that a lot of python installations don’t have SSL configured right, and the issue that some sites use curl options to configure spack to work with web proxies.

It might be worth implementing something to do a better job of finding the system certs and making Python aware of them (if that is possible), and figuring out how best to do proxies. Then we could just do urllib2. Or, if proxy setup is considerably easier with requests, that may make it a worthwhile option.

What do you think?

@michaelkuhn
Member Author

I just gave urllib2 a try and it feels very clunky to use. SSL support is problematic (no verification at all before Python 2.7.9, and you need to use internal functions to force it off in case verify_ssl is disabled), there are different imports for Python 2 and 3, and in general the supported features seem to depend a lot on the actual Python version. I have not tried to use a proxy with urllib2, but the interface seems similarly clunky.

If we want to throw out curl/wget and only use a Python module (which makes sense in my opinion), I do not think that urllib2 is the best option. It just feels like it could break at any time.

Requests seems to have everything we would want: https://github.com/requests/requests#feature-support (SSL verification, proxy support, connection pooling etc.)

@tgamblin
Member

Does requests do its own ssl and look on the system for certs in standard places? That would let us ditch the curl dependency altogether.

One other thing we’d need that I assume it supports: resuming prior downloads. We support that in the current fetcher.

How big is requests?

@tgamblin
Member

More on this: psf/requests#2966

What else do we need to bundle if we use requests? Looks like certifi and maybe certitude. I hope not too much else. I also worry about how much more reliable this will be than system curl, and that taking requests on means we’ll need to keep certs updated in spack (if we use certifi). That could be a pain.

Thoughts?

@michaelkuhn
Member Author

Does requests do its own ssl and look on the system for certs in standard places? That would let us ditch the curl dependency altogether.

Looks like we could make Requests use the system certificates by passing a path to verify or by overriding requests.certs (should be pretty similar to what I have done in #6396). Fedora is doing something like this in its Requests package: https://src.fedoraproject.org/rpms/python-requests/blob/master/f/patch-requests-certs.py-to-use-the-system-CA-bundle.patch
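The Fedora approach amounts to preferring a system CA bundle over the one certifi ships. A sketch of that idea (the candidate paths are assumptions and vary by distribution):

```python
import os

# Common system CA bundle locations (assumption: varies by distro)
SYSTEM_CA_BUNDLES = [
    '/etc/ssl/certs/ca-certificates.crt',  # Debian/Ubuntu
    '/etc/pki/tls/certs/ca-bundle.crt',    # Fedora/RHEL
    '/etc/ssl/ca-bundle.pem',              # openSUSE
]


def system_ca_bundle():
    """Return the first system CA bundle found, or None to fall
    back to whatever certifi ships."""
    for path in SYSTEM_CA_BUNDLES:
        if os.path.isfile(path):
            return path
    return None

# Usage with Requests: requests.get(url, verify=system_ca_bundle() or True)
```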

One other thing we’d need that I assume it supports: resuming prior downloads. We support that in the current fetcher.

Apparently none of the Python modules support resumption out of the box but it should be pretty easy to do by just sending the Range header when necessary.

How big is requests?

Roughly 5,000 lines. The dependencies add quite a few lines, however. urllib3 alone is almost 10,000 lines.

What else do we need to bundle if we use requests? Looks like certifi and maybe certitude. I hope not too much else. I also worry about how much more reliable this will be than system curl, and that taking requests on means we’ll need to keep certs updated in spack (if we use cerifi). That could be a pain.

The only hard requirements seem to be chardet, idna, urllib3 and certifi (but certifi can be dropped if we do something like Fedora does).

Thoughts?

I agree that vendoring so many dependencies is not optimal. I just took a closer look at urllib3, which seems to do most of the heavy lifting and has no dependencies.

I guess we could use urllib3 directly and make it use the system certificates (which will probably require some work to get right on all distributions).

@michaelkuhn
Member Author

@tgamblin Would you be willing to merge the wget support in its current form? If so, I would extract it into a separate PR and keep this one for playing around with native Python fetching.

@adamjstewart
Member

@tgamblin is this something we still want or should we close this PR? It's been inactive for quite some time.

@michaelkuhn
Member Author

michaelkuhn commented Aug 21, 2020

I just went through my PRs yesterday and was planning to come back to this one today. I would strip it down to only the wget support, which might be nice to have for minimal environments (especially Debian/Ubuntu, which only ship wget by default). But if we don't want to support two fetchers, I'll just close it.

