Add a PyPiFetchStrategy to properly download Python packages#2718

Closed
adamjstewart wants to merge 1 commit into spack:develop from adamjstewart:features/pypi

Conversation

@adamjstewart
Member

Resolves #2281.

This PR is pretty hacky at the moment, so it will need a lot of work before it is properly integrated. But it is a totally working proof of concept:

$ spack fetch py-scipy
==> Fetching https://pypi.python.org/packages/22/41/b1538a75309ae4913cdbbdc8d1cc54cae6d37981d2759532c1aa37a41121/scipy-0.18.1.tar.gz
######################################################################## 100.0%

In the end, I'm hoping to support the following modes:

1. pypi passed to the version directive

Packages should be able to have multiple versions, some that download from git and some that download from PyPi. The following should work:

url = 'url'

version('1.2.3', 'hash')
version('1.2.4', 'hash', pypi='name')
version('1.2.5', 'hash', url='url')
version('1.2.6', 'hash', git='url', branch='branch')

2. pypi declared at the package level

If you have ten versions, you don't want to add pypi='name' to every single version directive. You should be able to declare it once at the package level.

3. packages named py-* without a URL?

If a URL isn't provided and the package starts with py-, why not call PyPiFetchStrategy?

4. packages that subclass PythonPackage without a URL?

Same as above. See #2709
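The four modes above could share a single dispatch point. A minimal sketch of how the version() kwargs might be routed (the function name and return values are hypothetical, not Spack's actual API):

```python
def fetcher_kind(url=None, git=None, pypi=None, **kwargs):
    """Pick a fetch strategy name from version() keyword arguments.

    Illustration only: real Spack resolves this inside
    fetch_strategy.for_package_version().
    """
    if git:
        return 'GitFetchStrategy'
    if pypi:
        return 'PyPiFetchStrategy'
    if url:
        return 'URLFetchStrategy'
    # No per-version source: fall back to the package-level url/pypi attribute
    return 'default'

print(fetcher_kind(pypi='scipy'))  # PyPiFetchStrategy
```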

if checksum:
    kwargs['md5'] = checksum

kwargs['version'] = ver
Member Author

Wasn't sure how to access version in a FetchStrategy.

Member

Hmm... after 20 minutes of poking around the code, FetchStrategy is mysterious to me. I haven't been able to answer basic questions like:

  1. Where are different FetchStrategys instantiated? I can see URLFetchStrategy in from_url(). But a grep for GitFetchStrategy yields nothing obvious. Is the code being super-clever somewhere and turning the string "git" into the class GitFetchStrategy?

  2. What is the API for FetchStrategy? Clearly, the package name and version should be available to fetch strategies in a clean way. (As for package name... both the Spack package name and the download package name should be available. They are not always the same.)

Member Author

It's still mysterious to me as well. I believe this function determines which FetchStrategy is used, although I'm not sure where it is instantiated.

Member

Good detective work! Can you please add the word FetchStrategy to the comments on that function, explaining this? That way when someone greps for FetchStrategy they will find it.

for release in releases:
    if release['packagetype'] == 'sdist' or \
            release['python_version'] == 'source':
        return release['url']
Member Author

This shouldn't need to be duplicated. I'll have to see why it is still calling url_for_version.

Member

The FetchStrategy stuff is very opaque to me. I don't know when or how Spack decides which FetchStrategy to instantiate. Maybe that's why.


# The xmlrpclib module has been renamed to xmlrpc.client in Python 3
# Spack does not yet support Python 3, but there's no reason not to be
# future-proof
Member

@citibeth Jan 3, 2017

Would it be better to use six here, rather than rolling our own Python2-vs-Python3 compatibility? six is all in one file and MIT licensed, really easy (and legal) to just include with Spack. With six, we just need:

import six.moves.xmlrpc_client as xmlrpclib

But... where I work, we are not allowed to rename packages on import (for good reason). So I'd prefer:

from six.moves import xmlrpc_client

Member Author

Adding another external package seems like overkill for this. I'd rather just drop the Python 3 support.


def __init__(self, pypi=None, digest=None, **kwargs):
    # If pypi or digest are provided in the kwargs, then prefer
    # those values.
Member

If pypi is provided in **kwargs, then Python would raise an exception and __init__() would never get called. So no need for the checking code...

# those values.
self.pypi = kwargs.get('pypi', None)
if not self.pypi:
    self.pypi = pypi
Member

These three lines should just be: self.pypi = pypi.

Member Author

I know something is fishy here, but that's how URLFetchStrategy did things, so I copied it just to be safe. Honestly, all of the FetchStrategy logic needs to be reworked. As you've noticed, it's very opaque and not well documented.

Member

The first thing I noticed when I looked at Spack code was that it did not use normal Python kwarg declarations. I asked @tgamblin; he told me there was no good reason for this, and that we can start using normal Python. I recommend we do that; when you read a program and you come across something unusual, you ask "why". If there's no good reason for it, we're all better off without that unusual thing in place.

if not self.digest:
    self.digest = digest

self.expand_archive = kwargs.get('expand', True)
Member

Please use standard Python for this. You should just add expand=True to the arg list. Then you just need self.expand_archive = expand here. Also.. please consider calling the variable self.expand, so you don't have spurious changes of variable name.
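The suggestion above, sketched with plain keyword defaults (the base class and remaining kwargs are elided; this is illustration only, not the PR's actual class):

```python
class PyPiFetchStrategy:
    """Illustrative only: base class and remaining kwargs are elided."""

    def __init__(self, pypi=None, digest=None, expand=True):
        # Normal Python defaults replace the kwargs.get(...) lookups
        self.pypi = pypi
        self.digest = digest
        self.expand = expand  # same name as the parameter, no spurious rename

f = PyPiFetchStrategy(pypi='scipy', digest='abc123')
print(f.expand)  # True
```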

Member Author

I just copied this line over from URLFetchStrategy.

Member

Maybe we can fix URLFetchStrategy too. But even if we don't, no need to propagate this stuff.


self.digest = kwargs.get('md5', None)
if not self.digest:
    self.digest = digest
Member

Same idea here.

super(PyPiFetchStrategy, self).__init__(
    url=self.url, digest=self.digest, **kwargs)

def query_for_url(self):
Member

Please see main comment section on this function...

method.__name__)


class PyPiFetchError(FetchError):
Member

This class is only instantiated once. Unless we explicitly check for it elsewhere, we don't really need this. Just put raise FetchError(...) where you have raise PyPiFetchError(...) above.

# If we have no idea, try to substitute the version.
return spack.url.substitute_version(
    self.nearest_url(version), self.url_version(version))
if hasattr(cls, 'url') or self.version_urls():
Member

Please don't repeat the PyPI database code here. Must be factored...

return spack.url.substitute_version(
    self.nearest_url(version), self.url_version(version))
elif hasattr(cls, 'pypi') or hasattr(self, 'pypi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
Member

So the UI is to do:

class MyPackage(Package):
    url='http://...'

class MyPyPIPackage(Package):
    pypi='packagename'

Why not consider this alternative:

class MyPackage(Package):
    url='pypi://packagename'

I think we should at least consider the pros and cons of these two different UIs.

@citibeth
Member

citibeth commented Jan 3, 2017

My big comment is... please create a new file called lib/spack/spack/pypi.py with the contents below (for Python2; actually please update slightly using six). And then use it as appropriate. This addresses many issues:

  1. It factors out the PyPI code in a way that is independent of FetchStrategy, url_for_version(), etc. Cleaner, avoids having to repeat the code twice, more flexible, etc.

  2. If the PR consisted ONLY of this new file, users could still make use of this new PyPI fetching pretty easily. For example:

from spack import pypi
class PyNumpy(Package):
    ...
    def url_for_version(self, version):
        return pypi.get_url('numpy', version)
  3. Having looked at FetchStrategy and friends, I conclude it is a mess. If this were my PR, I would not be comfortable hacking deep into that mess (without first maybe trying to clean it up, document it, etc.). And it seems to serve no purpose. Even with just pypi.py added, things are already pretty simple. They can be made even simpler by amending the default url_for_version() method as you did; now it would look like:
def url_for_version(self, version):
    ...
        if hasattr(cls, 'url') or self.version_urls():
            return spack.url.substitute_version(
                self.nearest_url(version), self.url_version(version))
        elif hasattr(cls, 'pypi') or hasattr(self, 'pypi'):
            return pypi.get_url(getattr(self, 'pypi'), version)

So... I'm wondering what messing with FetchStrategy is doing for us at this point; or if it's even having any effect at all in this PR so far.

  4. I think the logic in digging through the PyPI database is wrong: we only want source files. Just because a file ends in .tar.gz doesn't mean we want it. Also, I think that the python_version=source thing looks like a hack, and I would ignore it. Finally... Spack has been set up to look for .tar.gz files and that has served us just fine. No need to root around for any other file extensions. Hence the simpler PyPI-digger below. If it actually fails on any package, I think we can revisit the logic at a later point. Until then, I think that simpler is better. And... no need to worry about the checksums, or to try to match them. As @tgamblin said, we do our own checksums; we can't let students grade their own homework.

Here is the Spack source file pypi.py:

from __future__ import print_function
import xmlrpclib


def get_url(name, version, extension='.tar.gz'):
    """Returns the URL needed to download a package from PyPI.
    name:
        Name of the package in PyPI
    version: (str)
        Version of the package to download
    extension:
        Filename extension desired

    Raises KeyError if the desired PyPI URL cannot be located.

    See: https://wiki.python.org/moin/PyPIXmlRpc
    """

    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
    releases = client.release_urls(name, version)

    good_releases = []
    for release in releases:
        if release['packagetype'] != 'sdist':
            continue
        if not release['filename'].endswith(extension):
            continue

        good_releases.append(release)

    if len(good_releases) == 0:
        raise KeyError('No PyPI URLs found for ({0}, {1}, {2})'.format(
            repr(name), repr(version), repr(extension)))
    if len(good_releases) > 1:
        raise KeyError('More than one PyPI URL found for ({0}, {1}, {2})'.format(
            repr(name), repr(version), repr(extension)))

    return good_releases[0]['url']
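The sdist-plus-extension filter in get_url() can be exercised offline against a stubbed release list (the data and helper name below are made up for illustration; the real function queries the XML-RPC API):

```python
def pick_sdist_url(releases, extension='.tar.gz'):
    """Same filtering rule as get_url() above, minus the network call."""
    good = [r for r in releases
            if r['packagetype'] == 'sdist'
            and r['filename'].endswith(extension)]
    if len(good) != 1:
        raise KeyError('expected exactly one match, found {0}'.format(len(good)))
    return good[0]['url']

stub = [
    {'packagetype': 'bdist_wheel', 'filename': 'demo-1.0-py2-none-any.whl',
     'url': 'https://example.invalid/demo-1.0-py2-none-any.whl'},
    {'packagetype': 'sdist', 'filename': 'demo-1.0.tar.gz',
     'url': 'https://example.invalid/demo-1.0.tar.gz'},
]
print(pick_sdist_url(stub))  # https://example.invalid/demo-1.0.tar.gz
```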

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

Honestly, after seeing how bad FetchStrategy is, I'm wondering if we shouldn't just add a url_for_version to PythonPackage that handles this mess. Then we at least wouldn't have to duplicate.

@citibeth
Member

citibeth commented Jan 3, 2017

Honestly, after seeing how bad FetchStrategy is, I'm wondering if we shouldn't just add a url_for_version to PythonPackage that handles this mess. Then we at least wouldn't have to duplicate.

I'd be happy to throw up a PR with that approach (cutting-n-pasting heavily from this PR), see what we think of it.

@tgamblin
Member

tgamblin commented Jan 3, 2017

How will it handle mirrors properly? Honestly I don't think that is the right approach. If you want some documentation for FetchStrategy, I could help with that.

@tgamblin
Member

tgamblin commented Jan 3, 2017

Also, here and here are where FetchStrategies are instantiated.

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

As for @citibeth's proposed get_url method, I like mine better 😃

no need to worry about the checksums, or to try to match them. As @tgamblin said, we do our own checksums, we can't let students grade their own homework.

All I'm doing is looking for a download that matches the package name, version, and checksum that were used when the version was added to the package. I think this is pretty safe and foolproof.

Spack has been set up to look for .tar.gz files and that has served us just fine.

This is in no way the case. There is no logic in Spack that prefers .tar.gz files. We use whatever extension is provided in the URL. In this case, since there is no URL, we have no way of knowing what extension was used when the package was created.

Of course, if PyPi is guaranteed to have a .tar.gz file for every package, then we could try to enforce only using that extension. But imagine this scenario. A user looks at a package like:

class PyNumpy(PythonPackage):
    pypi = 'numpy'
    version('1.2.3', 'hash')

They want to add a new version, so they download 1.2.4.zip, run md5sum on it, and add it to the package.py. Now, when they try to install it, they get an error message saying that the checksum doesn't match. How do we prevent that?

Unfortunately, PyPi does not have a .tar.gz for every package. See 3to2 and pyDatalog for examples. It isn't clear to me how users would say that they want to use a .zip file, and I don't see any advantage to that over just using whatever checksum matches.

@adamjstewart
Member Author

@tgamblin Would you be willing to refactor fetch_strategy.py and add documentation before I get started on this PR? I agree that kwargs is opaque and I have no way of knowing what to expect. I also don't know what is contained in self, and where to find version.

Another complaint I have is that FetchStrategy is only used during fetching, when there are many other cases that it would be useful (spack versions, spack checksum). Imagine these commands querying PyPi or GitHub or Subversion or Mercurial to get a list of available versions. At the moment, this simply isn't possible. We rely on url_for_version too much. In spack create, if I detect that the build system is Python and pypi appears in the URL, I want access to PyPiFetchStrategy to find available versions.

By the way, on the note of checksumming, why do we trust Git but not PyPi?

P.S. Just noticed that spack info now relies on having internet access for packages like this. Will this be a deal breaker?

@citibeth
Member

citibeth commented Jan 3, 2017

Spack has been set up to look for .tar.gz files and that has served us just fine.

This is in no way the case. There is no logic in Spack that prefers .tar.gz files. We use whatever extension is provided in the URL. In this case, since there is no URL, we have no way of knowing what extension was used when the package was created.

OK, I was not clear... what I meant is, we've been served just fine by URLs with .tar.gz (or whatever) hardcoded into them. When we try to fetch that URL, either the requested .tar.gz file is found, or the fetch fails. If there's an appropriate .zip or .tar.xz file on the server as well, our request will still fail. We don't root around the FTP server to try to find some other extension, we just let it fail. I am questioning why we need to do anything more complicated in this case, since the old way has served us just fine.

The function I provided allows for the desired extension to be specified. That provides the direct analog to way we already do things with URLs.

Of course, if PyPi is guaranteed to have a .tar.gz file for every package, then we could try to enforce only using that extension. But imagine this scenario. A user looks at a package like:

class PyNumpy(PythonPackage):
    pypi = 'numpy'
    version('1.2.3', 'hash')

They want to add a new version, so they download 1.2.4.zip, run md5sum on it, and add it to the package.py. Now, when they try to install it, they get an error message saying that the checksum doesn't match. How do we prevent that?

The same problem could happen today: if I download zlib-1.2.4.zip from the zlib website but put zlib-1.2.4.tar.gz in the URL, things will break. I don't think this is a serious problem.

But... I think you are right, users should not be left guessing which extension will be downloaded. Therefore, the get_url() function should not specify any default for the extension parameter, it should look like:

def get_url(name, version, extension):
    ...

Unfortunately, PyPi does not have a .tar.gz for every package. See [3to2](https://pypi.python.org/pypi/3to2) and pyDatalog for examples. It isn't clear to me how users would say that they want to use a .zip file

There are many ways this could happen. Using the low-tech approach (which it looks like we probably won't use because it messes up mirrors):

class PyDatalog(PythonPackage):
    def url_for_version(self, version):
        return pypi.get_url('datalog', version, '.zip')

Or a more advanced riff, based on the UI you've suggested above:

class PyDatalog(PythonPackage):
    pypi = ('datalog', '.zip')

Or even:

class PyDatalog(PythonPackage):
    pypi = 'datalog.zip'

In this case, Spack would just turn 'datalog.zip' into ('datalog', '.zip') via os.path.splitext().

Or maybe (because it's similar to our other stuff and might look most familiar to users):

class PyDatalog(PythonPackage):
    pypi = 'datalog-1.2.0.zip'

In this case, Spack would turn datalog-1.2.0.zip into ('datalog', '.zip') by parsing out and throwing away the version number.
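The parsing described above might look like this (the helper name and the compound-extension handling are assumptions, not code from the PR):

```python
import os
import re

def parse_pypi_attr(value):
    """Turn 'datalog.zip' or 'datalog-1.2.0.zip' into ('datalog', '.zip')."""
    name, ext = os.path.splitext(value)
    if name.endswith('.tar'):   # keep compound extensions like .tar.gz whole
        name, ext = name[:-len('.tar')], '.tar' + ext
    name = re.sub(r'-\d[\w.]*$', '', name)   # drop a trailing '-<version>'
    return name, ext

print(parse_pypi_attr('datalog-1.2.0.zip'))  # ('datalog', '.zip')
```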

and I don't see any advantage to that over just using whatever checksum matches.

Advantages:

  1. Don't use a complicated "smart" algorithm when a simple "dumb" algorithm will work. Smart algorithms can sometimes do surprising things, and can be harder to maintain and debug.
  2. We never needed these smarts in the past. If we KISS here, then we also keep things consistent with all other fetch strategies, which don't root around the remote repository for more than one thing: they either succeed or they fail.
  3. The low-level subroutine get_url() requires fewer arguments this way (no checksum needed). More disentanglement is almost always a good design choice. For example, my get_url() can be tested completely outside of Spack (that's what I did).
  4. It is clear to the user, and doesn't give preference to one extension over another (the original code gives preference to .tar.gz and .zip, but other things are possible as well).

The above discussion is about preference. But there's also a simple bug in the original that should be addressed: if nothing matches the given MD5 digest and there's a .tar.gz file providing something other than source, then that file will be returned.

@citibeth
Member

citibeth commented Jan 3, 2017

Putting implementation details aside and thinking about UI... I'm coming to believe the best approach is to "create" a pypi:// URL type, and hide the FetchStrategy, etc. mechanics under the hood. Then the ONLY thing that will have to change for users is a new URL. For example, instead of:

class Py3to2(Package):
    ...
    url      = "https://pypi.python.org/packages/source/3/3to2/3to2-1.1.1.zip"

we do:

class Py3to2(Package):
    ...
    url      = "pypi://3to2-1.1.1.zip"

or even:

class Py3to2(Package):
    ...
    url      = "pypi://3to2.zip"

I think this will feel simplest and cleanest for users. @davydden do you have an opinion here?

Thinking about implementation just a bit... this kind of is still a URL fetch strategy because --- at the end of the day --- a URL is still constructed and fetched.

@citibeth
citibeth commented Jan 3, 2017 via email

@tgamblin
Member

tgamblin commented Jan 3, 2017

My impression was that url_for_version would make an XML RPC call to the PyPI API. If it doesn't, and if the full URL can be constructed statically then we're mostly good. Spack just needs to be able to construct URLs a priori to function properly with a mirror.

One consequence of this that can cause issues is if the remote site uses a forwarding URL that does not include the file extension. The path to an archive in a mirror is pkgname-version.extension. The extension must stay the same so we know (easily) how to expand the file. We could expand and recompress with a known compression algorithm, but that makes it hard to know the checksum of what's in the mirror, so we just preserve the archive as-is. The compromise is that the packager is expected to supply an extension= arg to version() for projects that use extensionless forwarding URLs -- look at mfem.

If you can meet all those criteria then I think your plan will work. It is some of the oldest spack code but it actually works and even picks up downloads where they left off, so it will get you some useful functionality if you can do it this way.

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

adamjstewart commented Jan 3, 2017

We're discussing multiple issues here, let's please not confuse them.

How to Query the PyPI Database

My algorithm, 99.9% of the time, unless a hash changes, simply selects the download that matches the checksum. I find this much less likely to break.

That's never been the way we download things before. Up until now, we download things based on the name and extension. If we're going to change that, I believe we should discuss it a bit:

  1. Why just for PyPI packages? Or should we root around other download directories, looking for files that match our hash as well? Such a strategy would be possible on the GNU repos, for example.

  2. If it's just going to be for PyPI, that introduces inconsistency in the system. Is the inconsistency worth the benefits?

Your algorithm also breaks if multiple files are found that match the extension.

It explicitly fails in that case, by design. I'm hoping there will not be more than one listed file marked sdist (source distribution) with the same extension, listed under one version in the PyPI database. But you can never deduce things about other peoples' databases. Therefore, if it turns out that my assumption is invalid, we would want to know about it and examine the situation further. We don't want to just sweep it under the rug, unless we've examined such cases and decided that sweeping is the right thing to do here.

Your implementation requires packages to look something like:

The function I'm advocating for works under-the-hood, and makes no requirements on how packages will look.

How Packages Should Look

First of all, let's be clear about my most preferred way for packages to look (see above):

class Py3to2(Package):
    ...
    url      = "pypi://3to2.zip"

I don't like the fact that your implementation includes a fake URL. It will be confusing to users who are used to urls being, well, URLs.

I think this is worth a discussion. However, this would be by no means the first system to use not-so-common protocol specifiers in URLs. In the past I've seen smb://, afp://, scp://, git://, jdbc:sqlserver://, jdbc:oracle://, etc. Even the "original" URL spec included ftp://, http://, file:// and even gopher:// --- which was big for a year or two, just before the WWW killed it off in 1994. The idea was always that URLs are universal; the first part of the URL specifies the protocol. I conclude that using pypi:// to access things on the PyPI website is entirely in the spirit of what URLs were invented for, and are used for today.
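Parsing such a URL needs nothing exotic; the standard library already splits off a custom scheme. A hypothetical dispatch sketch (not Spack's actual code path):

```python
from urllib.parse import urlparse

def fetcher_for(url):
    """Route pypi:// URLs to a PyPI fetcher, everything else to plain URL fetch."""
    parts = urlparse(url)
    if parts.scheme == 'pypi':
        return ('PyPiFetchStrategy', parts.netloc)
    return ('URLFetchStrategy', url)

print(fetcher_for('pypi://3to2-1.1.1.zip'))  # ('PyPiFetchStrategy', '3to2-1.1.1.zip')
```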

If we really don't like custom protocols in URLs, an alternative I suggested would be something like this:

class Py3to2(Package):
    ...
    pypi = ('3to2', 'zip')

Auto-Defaults

A package is fetched based on a package name, version and file extension. Version is non-debatable. So the question is, should there be auto-defaults for the package name and the file extension? And if so, how should Spack determine those defaults?

Let's step back for a minute and think of other large repositories we download from --- GNU, Fedora, etc. Except for the hard-to-guess URLs, they share many relevant properties with PyPI:

  1. They provide many different downloads for each package/version --- including src/binary variants as well as different compression formats.
  2. They provide checksums for their downloads.
  3. They can be queried.

All of the automagic suggested in this PR could also be implemented for these other repositories. We could automagically guess remote package names (or encourage our package names to be the same as the remote ones). We could automagically determine filename extensions based on checksum. Yes, we could do all that automagic. But we don't. And we shouldn't. Because URLs work just fine, even though you have to manually specify the name of the package in the remote repo you are looking for --- and the extension as well.

I fail to see how your implementation is simpler or more intuitive.

Things are simpler and more intuitive when there is less automagic, and when more things are explicit. If every package states what it will download from PyPI, then things are crystal clear to the user.

Auto-Defaults for Extension

and will not need an extension.

I am not in favor of this. It's not that hard to type .tar.gz or .zip in your package.py file. And it makes things clear; one fewer thing that users have to know when things fail mysteriously. I fail to see the benefit of auto-deducing file extensions at runtime. (On the other hand, I DO see the benefit of this auto-deducing at spack create time).

Auto-Defaults for Name

My implementation will not need the name of the package (unless it differs from self.name[3:])

I'm also not in favor of this automagic; although I'm less inclined against it than with extensions. We have quite a number of packages that are named differently in Spack vs. PyPI. Maybe we shouldn't, and maybe we should try to come closer to PyPI. But that's not where we are right now.

Again... what's the benefit here?

@tgamblin
Member

tgamblin commented Jan 3, 2017

@citibeth:

Is this a problem with PyPI? We're dealing with one site here.

I actually don't think the second part is an issue for PyPI, but I included the annoying corner case for completeness.

Edit: I might be wrong about this.

@adamjstewart
Member Author

@tgamblin I'm not sure if I understand the problem with mirrors either. The download tarballs include the extension in their name. Won't CacheURLFetchStrategy be called regardless of whether it was initially downloaded via URLFetchStrategy or PyPiFetchStrategy?

@tgamblin
Member

tgamblin commented Jan 3, 2017

@adamjstewart: yes, but I see a problem here -- you don't know the download extension until you talk to PyPI right? That would prevent us from knowing the file's name in advance.

Spack searches several places for your tarball:

  1. It uses a CacheURLFetchStrategy to look in var/spack/cache. That one is special because it doesn't copy the whole tarball into the stage if it doesn't have to -- it will symlink it there from your local download cache. The local cache is a mirror, but we know it is local so we can do this.
  2. It uses a URLFetchStrategy per mirror listed in mirrors.yaml. These use curl to grab tarballs from a remote URL.
  3. It asks fetch_strategy.for_package_version() to instantiate a fetch strategy for a particular package version. It does that by inspecting saved kwargs from a version() directive. It also has the Package object when it does that, so it could potentially look at other metadata.

Mirrors are specified only by their root, so the first two methods require Spack to construct a path for the archive within the mirror root. Currently, that is $root/$pkgname/$pkgname-$version.$extension. So, to fetch from a mirror:

  1. You need to know the package's name
  2. You need to know its version
  3. You need to know the file extension
  4. The checksum in the mirror needs to be the same as the checksum you use when fetching from the authoritative source.

3 and 4 are related -- we preserve the extension because we must keep the original file intact, or its checksum will change. Preserving the extension tells us how it was compressed so we can expand it after downloading.

If we can't figure out a way to get PyPI working, we may want to change (3). i.e., we might want to be able to save a file with a deterministic name but have Spack figure out how it was compressed after the fact (using file or something). I kind of don't like it b/c it makes the directory layout of a mirror more obtuse. I need to think about that one.

@adamjstewart: does that make sense?
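The `$root/$pkgname/$pkgname-$version.$extension` layout described above can be sketched as a small helper; `mirror_archive_path` is a hypothetical function for illustration, not Spack's actual implementation:

```python
def mirror_archive_path(root, pkg_name, version, extension):
    """Build the path where a mirror stores a package's archive.

    Follows the layout $root/$pkgname/$pkgname-$version.$extension,
    which is why the name, version, and extension must all be known
    statically, before any network request is made.
    """
    filename = "{0}-{1}.{2}".format(pkg_name, version, extension)
    return "{0}/{1}/{2}".format(root, pkg_name, filename)
```

With this layout, a mirror lookup needs exactly the four pieces of information listed above, which is the crux of the problem for PyPI's hash-prefixed download URLs.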

@adamjstewart

@tgamblin Makes perfect sense. Thanks for the detailed explanation!

@citibeth

citibeth commented Jan 3, 2017

If we can't figure out a way to get PyPI working, we may want to change (3). i.e., we might want to be able to save a file with a deterministic name but have Spack figure out how it was compressed after the fact (using file or something).

This is all way too complicated. We can just be explicit about which extension we are looking for on PyPI, just like we do with every other remote repo that we use. And if PyPI doesn't have that extension, it should fail, just like every other remote repo.

@tgamblin

tgamblin commented Jan 3, 2017

This is all way too complicated. We can just be explicit about which extension we are looking for on PyPI, just like we do with every other remote repo that we use. And if PyPI doesn't have that extension, it should fail, just like every other remote repo.

@citibeth: Yes. Maybe it would be simpler to just provide the PyPI filename to version() along with the hash and package name, e.g.:

    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5', pypi='roundup', filename='roundup-1.1.2.tar.gz')

That would give you the information you need, and it might be easier for a user to construct.

Can someone fill me in on why using the regular old PyPI URL doesn't work? In the example linked from #2281, http://pypi.python.org/packages/source/r/roundup/roundup-1.1.2.tar.gz is the URL. Seems like that isn't so hard to handle with good old url=... don't we just want special querying for PyPI?

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

This may be better factored, then:

    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5', pypi='roundup', extension='tar.gz')

I think I like that better anyway; specifying the whole filename just seems like such a hack. Maybe .tar.gz could be the default and we could avoid it in most cases. Now the mirror URL can be constructed statically, and the PyPIFetchStrategy will need to negotiate for the long URL when doing a fetch directly from PyPI.

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

@citibeth: what does the URL mean in your example? The user would have to know how to construct an entirely new URL that no other tool understands.

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

@citibeth it bothers me that we're making a URL for PyPI. If PyPI wants to make URLs they can make their own URL and we can use it. But if it's in the URL field, it should have the same semantics as other URLs -- i.e. it should be fetchable by cURL. That one's not.

For PyPI we really only need the package name and version, and for mirrors we need the extension. All of those are provided explicitly, without parsing, via the kwargs to version(), and that's the protocol all the other fetchers use. They can be moved to the package level, so you could do this:

class Roundup(Package):
    pypi = 'roundup'
    extension = 'tar.gz'
    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5')
    version('1.1.1', '7c395da56412e263d7600fa7f0afa2e5')

But otherwise why invent a new way to specify something when we already have a perfectly good parameter system?

@adamjstewart

But there is an issue with that. I think we should be free to choose our own checksum algorithm for Spack. Specifically, I would like to switch to use sha256 in most places sometime in the next year, and if PyPI stays with md5 or something else, you won't be able to look up by hash anymore. I'd rather not tie my hashing scheme to PyPI's.

Can't really argue with that logic. Although that certainly doesn't preclude us from checking for a version that matches a checksum. If it exists, use it. If not, fall back to the previously discussed methods.

class Roundup(Package):
     pypi = 'roundup'
     extension = 'tar.gz'
     version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5')
     version('1.1.1', '7c395da56412e263d7600fa7f0afa2e5')

That seems like a good compromise to me. Can we default to pypi = self.name[3:] and extension = '.tar.gz'? These could be put in the PythonPackage class and/or assumed for packages that start with py- and don't specify a URL.

One thing this solution doesn't solve is that a lot of commands call url_for_version, including spack info. This won't work on clusters without an internet connection. I guess we could tell those commands not to print out the URL for each version if we have to. But it is a nice touch. Maybe another command could be added to print out versions and corresponding URLs.

Either way, I think fetch_strategy.py is too poorly documented for me to get started. The fact that I had to duplicate my code in multiple places means it isn't very well designed. And I would like to start using these super-fancy fetch strategies for other commands like spack versions, spack checksum, and spack create.

I could hack Spack to get the Roundup class above working, but it would be nothing more than a hack. I would rather wait until fetch strategies are revamped. Thoughts?
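The pypi = self.name[3:] default suggested above amounts to stripping the conventional py- prefix. A minimal standalone sketch (hypothetical helper, not part of Spack):

```python
def default_pypi_name(spack_name):
    """Guess a PyPI project name from a Spack package name by
    stripping the conventional 'py-' prefix.

    Packages whose PyPI name differs from their Spack name would
    still need to set `pypi` explicitly in the package class.
    """
    if spack_name.startswith('py-'):
        return spack_name[3:]
    return spack_name
```

Usage: `default_pypi_name('py-scipy')` gives `'scipy'`, while a non-prefixed name like `'roundup'` passes through unchanged.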

@adamjstewart

Specifying the extension isn't that much of a hack I guess. We do it elsewhere for packages that don't have an extension (mfem, ncl).

@tgamblin

tgamblin commented Jan 3, 2017

@adamjstewart:

Can we default to pypi = self.name[3:] and extension = '.tar.gz'?

You need at least the pypi attribute in there to indicate that this should be fetched from PyPI. If the default for pypi is set in the PythonPackage class, that should be ok, but it has to have pypi someplace.

One thing this solution doesn't solve is that a lot of commands call url_for_version, including spack info. This won't work on clusters without an internet connection.

I think this should probably not rely on url_for_version, but override fetch() so that it does the negotiation. I believe those commands call it indirectly anyway, and they print something else out for packages that use, say, a git fetch strategy.

I would rather wait until fetch strategies are revamped. Thoughts?

I actually don't think this takes as much code as you think. I haven't reviewed the PR but i can take a look sometime this week, after 0.10 goes out (really. really really). I can then try to figure out what needs documenting and/or refactoring.

@adamjstewart

You need at least the pypi attribute in there to indicate that this should be fetched from PyPI. If the default for pypi is set in the PythonPackage class, that should be ok, but it has to have pypi someplace.

That's true for the current implementation, but I don't think it has to be. This could be negotiated. But I'm fine with getting things working the way you propose for now and making them more auto-magic down the road if we can agree on that.

I actually don't think this takes as much code as you think. I haven't reviewed the PR but i can take a look sometime this week, after 0.10 goes out (really. really really). I can then try to figure out what needs documenting and/or refactoring.

I'm fine with that. There's really no rush on this PR. We have a pretty easy workaround for the problem (specifying url in the version directive).

@citibeth

citibeth commented Jan 3, 2017 via email

@adamjstewart

adamjstewart commented Jan 3, 2017

@citibeth I agree that you shouldn't be able to specify multiple fetching schemes at the class level. As for how to enforce that, I'll leave it up to @tgamblin's refactoring.

But you ought to be able to do something like this:

class Numpy(Package):
    url = 'url'

    version('1.2.3', 'hash')
    version('1.2.4', 'hash', pypi='name', extension='.tar.gz')
    version('1.2.5', 'hash', url='url')
    version('1.2.6', 'hash', git='url', branch='branch')

In this case, the package-default is to use a URL. But a version-specific fetching scheme overrides this. Version fetching schemes have a higher priority than package fetching schemes, just like they do now if you specify multiple URLs. Except, wait, wtf is going on in #2725? Looks like this broke recently.
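The precedence described here -- version-level fetch keywords overriding package-level defaults -- could be resolved roughly like this (hypothetical resolver for illustration, not Spack's actual code):

```python
def resolve_fetch_kwargs(package_attrs, version_kwargs):
    """Pick the fetch scheme for one version.

    If the version() directive names a scheme (url=, git=, pypi=, ...),
    its keywords win outright; otherwise package-level attributes
    supply the defaults.
    """
    schemes = ('url', 'git', 'pypi')
    # A scheme named at the version level takes full priority.
    if any(key in version_kwargs for key in schemes):
        return dict(version_kwargs)
    # Otherwise fall back to the package-level defaults.
    resolved = {k: v for k, v in package_attrs.items() if k in schemes}
    resolved.update(version_kwargs)
    return resolved
```

So a package-level `url` would apply to `version('1.2.3', 'hash')`, while `version('1.2.4', 'hash', pypi='name', extension='.tar.gz')` would ignore it entirely.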

@adamjstewart adamjstewart mentioned this pull request Feb 9, 2017
@alalazo alalazo changed the title [WIP] Add a PyPiFetchStrategy to properly download Python packages Add a PyPiFetchStrategy to properly download Python packages Nov 21, 2017
@alalazo alalazo self-assigned this Nov 21, 2017
@adamjstewart

Abandoning this PR as it is obsolete. If you follow https://wiki.python.org/moin/PyPIXmlRpc you'll find that:

The XMLRPC interface for PyPI is considered legacy and should not be used.
Use the Simple and JSON APIs.

Documentation on the modern JSON APIs can be found at https://warehouse.readthedocs.io/api-reference/json/. I couldn't figure out how to use it, but if someone shows me how I can try to revive this PR.
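For anyone picking this up: the JSON API serves per-release metadata at https://pypi.org/pypi/&lt;name&gt;/&lt;version&gt;/json, and the 'urls' list in the response identifies the sdist. A sketch of extracting it; the sample response below is abbreviated and hypothetical:

```python
def sdist_info(release_json):
    """Pull the source-distribution URL and checksums out of a PyPI
    JSON API response.

    Each entry in 'urls' describes one downloadable file; the entry
    with packagetype == 'sdist' is the source tarball needed for a
    from-source build.
    """
    for entry in release_json.get('urls', []):
        if entry.get('packagetype') == 'sdist':
            return entry['url'], entry.get('digests', {})
    return None, {}

# Abbreviated, hypothetical sample of the API's response structure:
sample = {
    'urls': [
        {'packagetype': 'bdist_wheel', 'url': 'https://.../scipy.whl'},
        {'packagetype': 'sdist',
         'url': 'https://.../scipy-0.18.1.tar.gz',
         'digests': {'sha256': 'abc123'}},
    ]
}
url, digests = sdist_info(sample)
```

This sidesteps the mirror-naming problem discussed earlier, since the returned digests include sha256 alongside md5.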
