
fetch_strategy: add support for wget #6967

Closed
michaelkuhn wants to merge 2 commits into spack:develop from michaelkuhn:wget-fetcher

Conversation

@michaelkuhn
Member

Some minimal operating system images only have wget installed (for example, Ubuntu). This adds support for wget to URLFetchStrategy and falls back to it if curl is not found.

This also removes support for the curl_options parameter and replaces it with a more generic headers parameter.

(I have also replaced the curl arguments with their long versions to make the code easier to understand.)

Before (curl_options):

    '-j',  # junk cookies
    '-H',  # specify required License Agreement cookie
    'Cookie: oraclelicense=accept-securebackup-cookie']

After (headers):

    headers = ['Cookie: oraclelicense=accept-securebackup-cookie']
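Since both curl and wget accept repeated `--header` options, a generic `headers` parameter can be translated into command-line arguments the same way for either tool. A minimal sketch (hypothetical helper, not the actual PR code):

```python
def header_args(headers):
    """Translate a list of HTTP header lines into command-line
    arguments. Both curl and wget accept repeated --header options,
    so the same mapping works for either fetcher."""
    args = []
    for header in headers:
        args.extend(['--header', header])
    return args

# e.g. the License Agreement cookie from the jdk package:
headers = ['Cookie: oraclelicense=accept-securebackup-cookie']
print(header_args(headers))
```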
Member


Does this still fetch after this change? I figured the License Agreement flag is required.

Member Author


It works for me. Which flag do you mean? The cookie header still gets set like before the change.

@michaelkuhn
Member Author

@adamjstewart: Anything else I can do to get this merged? I have rebased it to the current develop branch and verified that it works (jdk still fetches etc.).

Member

@adamjstewart left a comment


Everything looks good to me. I don't have as much experience in the fetcher as I do elsewhere, so I'd like at least one other person to review this before merging. @tgamblin?

@tgamblin
Member

Question: would it be better and more robust if we just wrote our own fetch logic in urllib2 here? I ask because I am working on a multi-archive parallel fetcher and need a way to programmatically get progress information from the different downloads.

That is a pain, I believe, with both curl and wget. But I’m hesitant to switch from them because curl, at least, does some magic under the hood with headers (user agent, etc.) to make them more amenable to modern sites. It also isn’t clear to me whether curl or Python (urllib2) is more likely to have SSL trust set up correctly. I’ve seen curl work when urllib doesn’t and vice versa.

Any opinions?

@michaelkuhn
Member Author

Any opinions?

Sorry for not getting back to this earlier. I have now pushed a WIP commit that implements a fetcher using Requests (https://github.com/requests/requests). According to several sources, this is preferred over urllib2 etc.

It is still missing error handling and could be cleaned up a bit. I just wanted to get feedback whether this is something that we want to pursue.

To try it, simply change

self._fetcher = which('curl', 'wget')

to

self._fetcher = which('curl2', 'wget2')

and it will fall back to the Requests fetcher.
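A Requests-based fetch boils down to streaming the response body to disk in chunks. A minimal sketch of that idea (this is not the WIP commit itself; `fetch_with_requests` is a hypothetical name):

```python
import requests  # third-party; would need to be vendored in Spack


def fetch_with_requests(url, dest, verify_ssl=True):
    """Stream a URL to a local file in chunks. Sketch only: the
    real fetcher would also need error handling, progress
    reporting, and resume support."""
    with requests.get(url, stream=True, verify=verify_ssl) as response:
        response.raise_for_status()
        with open(dest, 'wb') as f:
            for chunk in response.iter_content(chunk_size=65536):
                f.write(chunk)
    return dest
```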

@tgamblin
Member

tgamblin commented Jul 11, 2018

The only thing about requests is that it’s another dependency and I’d rather not vendor more if we don’t have to. Is requests more likely to work than urllib2? If not, I’m not too keen on including it when the standard library already has a decent alternative. Is there a functional advantage to requests? I know the API is great.

The main problems people face with spack are:

  1. Either curl or wget isn’t installed
  2. Either of these doesn’t have ssl set up properly
  3. Python’s ssl isn’t set up properly, so we need to use curl or wget instead of urllib2.

I’d like a fetcher that doesn’t add a dependency, but that can fall back between these three options depending on which one works.

The only problem with this is that we have to implement basically three fetch methods in one, and I want to make a fetcher that can fetch multiple things at once (like docker pull). I have the multi-status bars for that implemented (from a month or two ago), but I haven’t submitted a PR yet, as I was trying to decide how to hook them up to the fetcher. Using only the built-in urllib2 would simplify the implementation considerably (no parsing curl/wget output), but there is still the problem that a lot of python installations don’t have SSL configured right, and the issue that some sites use curl options to configure spack to work with web proxies.

It might be worth implementing something to do a better job of finding the system certs and making Python aware of them (if that is possible), and figuring out how best to do proxies. Then we could just do urllib2. Or, if proxy setup is considerably easier with requests, that may make it a worthwhile option.

What do you think?

@michaelkuhn
Member Author

I just gave urllib2 a try and it feels very clunky to use. SSL support is problematic (no verification at all before Python 2.7.9, and you need to use internal functions to force it off in case verify_ssl is disabled), there are different imports for Python 2 and 3, and in general the supported features seem to depend a lot on the actual Python version. I have not tried to use a proxy with urllib2, but the interface seems similarly clunky.

If we want to throw out curl/wget and only use a Python module (which makes sense in my opinion), I do not think that urllib2 is the best option. It just feels like it could break at any time.

Requests seems to have everything we would want: https://github.com/requests/requests#feature-support (SSL verification, proxy support, connection pooling etc.)

@tgamblin
Member

Does requests do its own ssl and look on the system for certs in standard places? That would let us ditch the curl dependency altogether.

One other thing we’d need that I assume it supports: resuming prior downloads. We support that in the current fetcher.

How big is requests?

@tgamblin
Member

More on this: psf/requests#2966

What else do we need to bundle if we use requests? Looks like certifi and maybe certitude. I hope not too much else. I also worry about how much more reliable this will be than system curl, and that taking requests on means we’ll need to keep certs updated in spack (if we use certifi). That could be a pain.

Thoughts?

@michaelkuhn
Member Author

Does requests do its own ssl and look on the system for certs in standard places? That would let us ditch the curl dependency altogether.

Looks like we could make Requests use the system certificates by passing a path to verify or by overriding requests.certs (should be pretty similar to what I have done in #6396). Fedora is doing something like this in its Requests package: https://src.fedoraproject.org/rpms/python-requests/blob/master/f/patch-requests-certs.py-to-use-the-system-CA-bundle.patch
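The Fedora approach amounts to preferring a system CA bundle over the one certifi ships. A sketch of that idea (the candidate paths are assumptions and vary by distribution):

```python
import os

# Common system CA bundle locations (assumption: varies by distro)
SYSTEM_CA_BUNDLES = [
    '/etc/ssl/certs/ca-certificates.crt',  # Debian/Ubuntu
    '/etc/pki/tls/certs/ca-bundle.crt',    # Fedora/RHEL
    '/etc/ssl/ca-bundle.pem',              # openSUSE
]


def system_ca_bundle():
    """Return the first system CA bundle found, or None to fall
    back to whatever certifi ships."""
    for path in SYSTEM_CA_BUNDLES:
        if os.path.isfile(path):
            return path
    return None

# Usage with Requests: requests.get(url, verify=system_ca_bundle() or True)
```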

One other thing we’d need that I assume it supports: resuming prior downloads. We support that in the current fetcher.

Apparently none of the Python modules support resumption out of the box but it should be pretty easy to do by just sending the Range header when necessary.

How big is requests?

Roughly 5,000 lines. The dependencies add quite a few lines, however. urllib3 alone is almost 10,000 lines.

What else do we need to bundle if we use requests? Looks like certifi and maybe certitude. I hope not too much else. I also worry about how much more reliable this will be than system curl, and that taking requests on means we’ll need to keep certs updated in spack (if we use cerifi). That could be a pain.

The only hard requirements seem to be chardet, idna, urllib3 and certifi (but certifi can be dropped if we do something like Fedora does).

Thoughts?

I agree that vendoring so many dependencies is not optimal. I just took a closer look at urllib3, which seems to do most of the heavy lifting and has no dependencies.

I guess we could use urllib3 directly and make it use the system certificates (which will probably require some work to get right on all distributions).

@michaelkuhn
Member Author

@tgamblin Would you be willing to merge the wget support in its current form? If so, I would extract it into a separate PR and keep this one for playing around with native Python fetching.

@adamjstewart
Member

@tgamblin is this something we still want or should we close this PR? It's been inactive for quite some time.

@michaelkuhn
Member Author

michaelkuhn commented Aug 21, 2020

I just went through my PRs yesterday and was planning to come back to this one today. I would strip it down to only the wget support, which might be nice to have for minimal environments (especially Debian/Ubuntu, which only ship wget by default). But if we don't want to support two fetchers, I'll just close it.

