Add a PyPiFetchStrategy to properly download Python packages#2718

Closed
adamjstewart wants to merge 1 commit into spack:develop from adamjstewart:features/pypi

Conversation

@adamjstewart
Member

Resolves #2281.

This PR is pretty hacky at the moment, so it will need a lot of work before it is properly integrated. But it is a totally working proof of concept:

$ spack fetch py-scipy
==> Fetching https://pypi.python.org/packages/22/41/b1538a75309ae4913cdbbdc8d1cc54cae6d37981d2759532c1aa37a41121/scipy-0.18.1.tar.gz
######################################################################## 100.0%

In the end, I'm hoping to support the following modes:

1. pypi passed to the version directive

Packages should be able to have multiple versions, some that download from git and some that download from PyPi. The following should work:

url = 'url'

version('1.2.3', 'hash')
version('1.2.4', 'hash', pypi='name')
version('1.2.5', 'hash', url='url')
version('1.2.6', 'hash', git='url', branch='branch')

2. pypi declared at the package level

If you have ten versions, you don't want to add pypi='name' to every single version directive. You should be able to declare it once at the package level.

3. packages named py-* without a URL?

If a URL isn't provided and the package starts with py-, why not call PyPiFetchStrategy?

4. packages that subclass PythonPackage without a URL?

Same as above. See #2709
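The four modes above could share a single dispatch point. A minimal sketch of how the version() kwargs might be routed (the function name and return values are hypothetical, not Spack's actual API):

```python
def fetcher_kind(url=None, git=None, pypi=None, **kwargs):
    """Pick a fetch strategy name from version() keyword arguments.

    Illustration only: real Spack resolves this inside
    fetch_strategy.for_package_version().
    """
    if git:
        return 'GitFetchStrategy'
    if pypi:
        return 'PyPiFetchStrategy'
    if url:
        return 'URLFetchStrategy'
    # No per-version source: fall back to the package-level url/pypi attribute
    return 'default'

print(fetcher_kind(pypi='scipy'))  # PyPiFetchStrategy
```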

if checksum:
    kwargs['md5'] = checksum

kwargs['version'] = ver
Member Author

Wasn't sure how to access version in a FetchStrategy.

Member

Hmm... after 20 minutes of poking around the code, FetchStrategy is mysterious to me. I haven't been able to answer basic questions like:

  1. Where are different FetchStrategys instantiated? I can see URLFetchStrategy in from_url(). But a grep for GitFetchStrategy yields nothing obvious. Is the code being super-clever somewhere and turning the string "git" into the class GitFetchStrategy?

  2. What is the API for FetchStrategy? Clearly, the package name and version should be available to fetch strategies in a clean way. (As for package name... both the Spack package name and the download package name should be available. They are not always the same.)

Member Author

It's still mysterious to me as well. I believe this function determines which FetchStrategy is used, although I'm not sure where it is instantiated.

Member

Good detective work! Can you please add the word FetchStrategy to the comments on that function, explaining this? That way when someone greps for FetchStrategy they will find it.

for release in releases:
    if release['packagetype'] == 'sdist' or \
            release['python_version'] == 'source':
        return release['url']
Member Author

This shouldn't need to be duplicated. I'll have to see why it is still calling url_for_version.

Member

The FetchStrategy stuff is very opaque to me. I don't know when or how Spack decides which FetchStrategy to instantiate. Maybe that's why.


# The xmlrpclib module has been renamed to xmlrpc.client in Python 3
# Spack does not yet support Python 3, but there's no reason not to be
# future-proof
Member

@citibeth Jan 3, 2017

Would it be better to use six here, rather than rolling our own Python2-vs-Python3 compatibility? six is all in one file and MIT licensed, really easy (and legal) to just include with Spack. With six, we just need:

import six.moves.xmlrpc_client as xmlrpclib

But... where I work, we are not allowed to rename packages on import (for good reason). So I'd prefer:

from six.moves import xmlrpc_client

Member Author

Adding another external package seems like overkill for this. I'd rather just drop the Python 3 support.


def __init__(self, pypi=None, digest=None, **kwargs):
    # If pypi or digest are provided in the kwargs, then prefer
    # those values.
Member

If pypi is provided in **kwargs, then Python would raise an exception and __init__() would never get called. So no need for the checking code...

# those values.
self.pypi = kwargs.get('pypi', None)
if not self.pypi:
    self.pypi = pypi
Member

These three lines should just be: self.pypi = pypi.

Member Author

I know something is fishy here, but that's how URLFetchStrategy did things, so I copied it just to be safe. Honestly, all of the FetchStrategy logic needs to be reworked. As you've noticed, it's very opaque and not well documented.

Member

The first thing I noticed when I looked at Spack code was that it did not use normal Python kwarg declarations. I asked @tgamblin; he told me there was no good reason for this, and that we can start using normal Python. I recommend we do that; when you read a program and you come across something unusual, you ask "why". If there's no good reason for it, we're all better off without that unusual thing in place.

if not self.digest:
    self.digest = digest

self.expand_archive = kwargs.get('expand', True)
Member

Please use standard Python for this. You should just add expand=True to the arg list. Then you just need self.expand_archive = expand here. Also.. please consider calling the variable self.expand, so you don't have spurious changes of variable name.
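The suggestion above, sketched with plain keyword defaults (the base class and remaining kwargs are elided; this is illustration only, not the PR's actual class):

```python
class PyPiFetchStrategy:
    """Illustrative only: base class and remaining kwargs are elided."""

    def __init__(self, pypi=None, digest=None, expand=True):
        # Normal Python defaults replace the kwargs.get(...) lookups
        self.pypi = pypi
        self.digest = digest
        self.expand = expand  # same name as the parameter, no spurious rename

f = PyPiFetchStrategy(pypi='scipy', digest='abc123')
print(f.expand)  # True
```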

Member Author

I just copied this line over from URLFetchStrategy.

Member

Maybe we can fix URLFetchStrategy too. But even if we don't, no need to propagate this stuff.


self.digest = kwargs.get('md5', None)
if not self.digest:
    self.digest = digest
Member

Same idea here.

super(PyPiFetchStrategy, self).__init__(
    url=self.url, digest=self.digest, **kwargs)

def query_for_url(self):
Member

Please see main comment section on this function...

method.__name__)


class PyPiFetchError(FetchError):
Member

This class is only instantiated once. Unless we explicitly check for it elsewhere, we don't really need this. Just put raise FetchError(...) where you have raise PyPiFetchError(...) above.

# If we have no idea, try to substitute the version.
return spack.url.substitute_version(
    self.nearest_url(version), self.url_version(version))
if hasattr(cls, 'url') or self.version_urls():
Member

Please don't repeat the PyPI database code here. Must be factored...

return spack.url.substitute_version(
    self.nearest_url(version), self.url_version(version))
elif hasattr(cls, 'pypi') or hasattr(self, 'pypi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
Member

So the UI is to do:

class MyPackage(Package):
    url='http://...'

class MyPyPIPackage(Package):
    pypi='packagename'

Why not consider this alternative:

class MyPackage(Package):
    url='pypi://packagename'

I think we should at least consider the pros and cons of these two different UIs.

@citibeth
Member

citibeth commented Jan 3, 2017

My big comment is... please create a new file called lib/spack/spack/pypi.py with the contents below (for Python2; actually please update slightly using six). And then use it as appropriate. This addresses many issues:

  1. It factors out the PyPI code in a way that is independent of FetchStrategy, url_for_version(), etc. Cleaner, avoids having to repeat the code twice, more flexible, etc.

  2. If the PR consisted ONLY of this new file, users could still make use of this new PyPI fetching pretty easily. For example:

from spack import pypi
class PyNumpy(Package):
    ...
    def url_for_version(self, version):
        return pypi.get_url('numpy', version)
  3. Having looked at FetchStrategy and friends, I conclude it is a mess. If this were my PR, I would not be comfortable hacking deep into that mess (without first maybe trying to clean it up, document it, etc.). And it seems to serve no purpose. Even with just pypi.py added, things are already pretty simple. They can be made even simpler by amending the default url_for_version() method as you did; now it would look like:
def url_for_version(self, version):
    ...
        if hasattr(cls, 'url') or self.version_urls():
            return spack.url.substitute_version(
                self.nearest_url(version), self.url_version(version))
        elif hasattr(cls, 'pypi') or hasattr(self, 'pypi'):
            return pypi.get_url(getattr(self, 'pypi'), version)

So... I'm wondering what messing with FetchStrategy is doing for us at this point; or if it's even having any effect at all in this PR so far.

  4. I think the logic in digging through the PyPI database is wrong: we only want source files. Just because a file ends in .tar.gz doesn't mean we want it. Also, I think that the python_version=source thing looks like a hack, and I would ignore it. Finally... Spack has been set up to look for .tar.gz files and that has served us just fine. No need to root around for any other file extensions. Hence the simpler PyPI-digger below. If it actually fails on any package, I think we can revisit the logic at a later point. Until then, I think that simpler is better. And... no need to worry about the checksums, or to try to match them. As @tgamblin said, we do our own checksums; we can't let students grade their own homework.

Here is the Spack source file pypi.py:

from __future__ import print_function
import xmlrpclib


def get_url(name, version, extension='.tar.gz'):
    """Returns the URL needed to download a package from PyPI.
    name:
        Name of the package in PyPI
    version: (str)
        Version of the package to download
    extension:
        Filename extension desired

    Raises KeyError if the desired PyPI URL cannot be located.

    See: https://wiki.python.org/moin/PyPIXmlRpc
    """

    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
    releases = client.release_urls(name, version)

    good_releases = []
    for release in releases:
        if release['packagetype'] != 'sdist':
            continue
        if not release['filename'].endswith(extension):
            continue

        good_releases.append(release)

    if len(good_releases) == 0:
        raise KeyError('No PyPI URLs found for ({0}, {1}, {2})'.format(
            repr(name), repr(version), repr(extension)))
    if len(good_releases) > 1:
        raise KeyError('More than one PyPI URL found for ({0}, {1}, {2})'.format(
            repr(name), repr(version), repr(extension)))

    return good_releases[0]['url']
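The sdist-plus-extension filter in get_url() can be exercised offline against a stubbed release list (the data and helper name below are made up for illustration; the real function queries the XML-RPC API):

```python
def pick_sdist_url(releases, extension='.tar.gz'):
    """Same filtering rule as get_url() above, minus the network call."""
    good = [r for r in releases
            if r['packagetype'] == 'sdist'
            and r['filename'].endswith(extension)]
    if len(good) != 1:
        raise KeyError('expected exactly one match, found {0}'.format(len(good)))
    return good[0]['url']

stub = [
    {'packagetype': 'bdist_wheel', 'filename': 'demo-1.0-py2-none-any.whl',
     'url': 'https://example.invalid/demo-1.0-py2-none-any.whl'},
    {'packagetype': 'sdist', 'filename': 'demo-1.0.tar.gz',
     'url': 'https://example.invalid/demo-1.0.tar.gz'},
]
print(pick_sdist_url(stub))  # https://example.invalid/demo-1.0.tar.gz
```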

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

Honestly, after seeing how bad FetchStrategy is, I'm wondering if we shouldn't just add a url_for_version to PythonPackage that handles this mess. Then we at least wouldn't have to duplicate.

@citibeth
Member

citibeth commented Jan 3, 2017

Honestly, after seeing how bad FetchStrategy is, I'm wondering if we shouldn't just add a url_for_version to PythonPackage that handles this mess. Then we at least wouldn't have to duplicate.

I'd be happy to throw up a PR with that approach (cutting-n-pasting heavily from this PR), see what we think of it.

@tgamblin
Member

tgamblin commented Jan 3, 2017

How will it handle mirrors properly? Honestly I don't think that is the right approach. If you want some documentation for FetchStrategy, I could help with that.

@tgamblin
Member

tgamblin commented Jan 3, 2017

Also, here and here are where FetchStrategies are instantiated.

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

As for @citibeth's proposed get_url method, I like mine better 😃

no need to worry about the checksums, or to try to match them. As @tgamblin said, we do our own checksums, we can't let students grade their own homework.

All I'm doing is looking for a download that matches the package name, version, and checksum that were used when the version was added to the package. I think this is pretty safe and foolproof.

Spack has been set up to look for .tar.gz files and that has served us just fine.

This is in no way the case. There is no logic in Spack that prefers .tar.gz files. We use whatever extension is provided in the URL. In this case, since there is no URL, we have no way of knowing what extension was used when the package was created.

Of course, if PyPi is guaranteed to have a .tar.gz file for every package, then we could try to enforce only using that extension. But imagine this scenario. A user looks at a package like:

class PyNumpy(PythonPackage):
    pypi = 'numpy'
    version('1.2.3', 'hash')

They want to add a new version, so they download 1.2.4.zip, run md5sum on it, and add it to the package.py. Now, when they try to install it, they get an error message saying that the checksum doesn't match. How do we prevent that?

Unfortunately, PyPi does not have a .tar.gz for every package. See 3to2 and pyDatalog for examples. It isn't clear to me how users would say that they want to use a .zip file, and I don't see any advantage to that over just using whatever checksum matches.

@adamjstewart
Member Author

@tgamblin Would you be willing to refactor fetch_strategy.py and add documentation before I get started on this PR? I agree that kwargs is opaque and I have no way of knowing what to expect. I also don't know what is contained in self, and where to find version.

Another complaint I have is that FetchStrategy is only used during fetching, when there are many other cases that it would be useful (spack versions, spack checksum). Imagine these commands querying PyPi or GitHub or Subversion or Mercurial to get a list of available versions. At the moment, this simply isn't possible. We rely on url_for_version too much. In spack create, if I detect that the build system is Python and pypi appears in the URL, I want access to PyPiFetchStrategy to find available versions.

By the way, on the note of checksumming, why do we trust Git but not PyPi?

P.S. Just noticed that spack info now relies on having internet access for packages like this. Will this be a deal breaker?

@citibeth
Member

citibeth commented Jan 3, 2017

Spack has been set up to look for .tar.gz files and that has served us just fine.

This is in no way the case. There is no logic in Spack that prefers .tar.gz files. We use whatever extension is provided in the URL. In this case, since there is no URL, we have no way of knowing what extension was used when the package was created.

OK, I was not clear... what I meant is, we've been served just fine by URLs with .tar.gz (or whatever) hardcoded into them. When we try to fetch that URL, either the requested .tar.gz file is found, or the fetch fails. If there's an appropriate .zip or .tar.xz file on the server as well, our request will still fail. We don't root around the FTP server to try to find some other extension, we just let it fail. I am questioning why we need to do anything more complicated in this case, since the old way has served us just fine.

The function I provided allows for the desired extension to be specified. That provides the direct analog to way we already do things with URLs.

Of course, if PyPi is guaranteed to have a .tar.gz file for every package, then we could try to enforce only using that extension. But imagine this scenario. A user looks at a package like:

class PyNumpy(PythonPackage):
    pypi = 'numpy'
    version('1.2.3', 'hash')

They want to add a new version, so they download 1.2.4.zip, run md5sum on it, and add it to the package.py. Now, when they try to install it, they get an error message saying that the checksum doesn't match. How do we prevent that?

The same problem could happen today: if I download zlib-1.2.4.zip from the zlib website but put zlib-1.2.4.tar.gz in the URL, things will break. I don't think this is a serious problem.

But... I think you are right, users should not be left guessing which extension will be downloaded. Therefore, the get_url() function should not specify any default for the extension parameter, it should look like:

def get_url(name, version, extension):
    ...

Unfortunately, PyPi does not have a .tar.gz for every package. See [3to2](https://pypi.python.org/pypi/3to2) and pyDatalog for examples. It isn't clear to me how users would say that they want to use a .zip file

There are many ways this could happen. Using the low-tech approach (which it looks like we probably won't use because it messes up mirrors):

class PyDatalog(PythonPackage):
    def url_for_version(self, version):
        return pypi.get_url('datalog', version, '.zip')

Or a more advanced riff, based on the UI you've suggested above:

class PyDatalog(PythonPackage):
    pypi = ('datalog', '.zip')

Or even:

class PyDatalog(PythonPackage):
    pypi = 'datalog.zip'

In this case, Spack would just turn 'datalog.zip' into ('datalog', '.zip') via os.path.splitext().

Or maybe (because it's similar to our other stuff and might look most familiar to users):

class PyDatalog(PythonPackage):
    pypi = 'datalog-1.2.0.zip'

In this case, Spack would turn datalog-1.2.0.zip into ('datalog', '.zip') by parsing out and throwing away the version number.
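The parsing described above might look like this (the helper name and the compound-extension handling are assumptions, not code from the PR):

```python
import os
import re

def parse_pypi_attr(value):
    """Turn 'datalog.zip' or 'datalog-1.2.0.zip' into ('datalog', '.zip')."""
    name, ext = os.path.splitext(value)
    if name.endswith('.tar'):   # keep compound extensions like .tar.gz whole
        name, ext = name[:-len('.tar')], '.tar' + ext
    name = re.sub(r'-\d[\w.]*$', '', name)   # drop a trailing '-<version>'
    return name, ext

print(parse_pypi_attr('datalog-1.2.0.zip'))  # ('datalog', '.zip')
```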

and I don't see any advantage to that over just using whatever checksum matches.

Advantages:

  1. Don't use a complicated "smart" algorithm when a simple "dumb" algorithm will work. Smart algorithms can sometimes do surprising things, and can be harder to maintain and debug.
  2. We never needed these smarts in the past. If we KISS here, then we also keep things consistent with all other fetch strategies, which don't root around the remote repository for more than one thing: they either succeed or they fail.
  3. The low-level subroutine get_url() requires fewer arguments this way (no checksum needed). More disentanglement is almost always a good design choice. For example, my get_url() can be tested completely outside of Spack (that's what I did).
  4. It is clear to the user, and doesn't give preference to one extension over another (the original code gives preference to .tar.gz and .zip, but other things are possible as well).

The above discussion is about preference. But there's also a simple bug in the original that should be addressed: if nothing matches the given MD5 digest and there's a .tar.gz file providing something other than source, then that file will be returned.

@citibeth
Member

citibeth commented Jan 3, 2017

Putting implementation details aside and thinking about UI... I'm coming to believe the best approach is to "create" a pypi:// URL type, and hide the FetchStrategy, etc. mechanics under the hood. Then the ONLY thing that will have to change for users is a new URL. For example, instead of:

class Py3to2(Package):
    ...
    url      = "https://pypi.python.org/packages/source/3/3to2/3to2-1.1.1.zip"

we do:

class Py3to2(Package):
    ...
    url      = "pypi://3to2-1.1.1.zip"

or even:

class Py3to2(Package):
    ...
    url      = "pypi://3to2.zip"

I think this will feel simplest and cleanest for users. @davydden do you have an opinion here?

Thinking about implementation just a bit... this kind of is still a URL fetch strategy because --- at the end of the day --- a URL is still constructed and fetched.

@citibeth
citibeth commented Jan 3, 2017 via email

@tgamblin
Member

tgamblin commented Jan 3, 2017

My impression was that url_for_version would make an XML RPC call to the PyPI API. If it doesn't, and if the full URL can be constructed statically then we're mostly good. Spack just needs to be able to construct URLs a priori to function properly with a mirror.

One consequence of this that can cause issues is if the remote site uses a forwarding URL that does not include the file extension. The path to an archive in a mirror is pkgname-version.extension. The extension must stay the same so we know (easily) how to expand the file. We could expand and recompress with a known compression algorithm, but that makes it hard to know the checksum of what's in the mirror, so we just preserve the archive as-is. The compromise is that the packager is expected to supply an extension= arg to version() for projects that use extensionless forwarding URLs -- look at mfem.

If you can meet all those criteria then I think your plan will work. It is some of the oldest spack code but it actually works and even picks up downloads where they left off, so it will get you some useful functionality if you can do it this way.

@citibeth
citibeth commented Jan 3, 2017 via email

@adamjstewart
Member Author

adamjstewart commented Jan 3, 2017

We're discussing multiple issues here, let's please not confuse them.

How to Query the PyPI Database

My algorithm, 99.9% of the time, unless a hash changes, simply selects the download that matches the checksum. I find this much less likely to break.

That's never been the way we download things before. Up until now, we download things based on the name and extension. If we're going to change that, I believe we should discuss it a bit:

  1. Why just for PyPI packages? Or should we root around other download directories, looking for files that match our hash as well? Such a strategy would be possible on the GNU repos, for example.

  2. If it's just going to be for PyPI, that introduces inconsistency in the system. Is the inconsistency worth the benefits?

Your algorithm also breaks if multiple files are found that match the extension.

It explicitly fails in that case, by design. I'm hoping there will not be more than one listed file marked sdist (source distribution) with the same extension, listed under one version in the PyPI database. But you can never deduce things about other peoples' databases. Therefore, if it turns out that my assumption is invalid, we would want to know about it and examine the situation further. We don't want to just sweep it under the rug, unless we've examined such cases and decided that sweeping is the right thing to do here.

Your implementation requires packages to look something like:

The function I'm advocating for works under-the-hood, and makes no requirements on how packages will look.

How Packages Should Look

First of all, let's be clear about my most preferred way for packages to look (see above):

class Py3to2(Package):
    ...
    url      = "pypi://3to2.zip"

I don't like the fact that your implementation includes a fake URL. It will be confusing to users who are used to urls being, well, URLs.

I think this is worth a discussion. However, this would be by no means the first system to use not-so-common protocol specifiers in URLs. In the past I've seen smb://, afp://, scp://, git://, jdbc:sqlserver://, jdbc:oracle://, etc. Even the "original" URL spec included ftp://, http://, file:// and even gopher:// --- which was big for a year or two, just before the WWW killed it off in 1994. The idea was always that URLs are universal; the first part of the URL specifies the protocol. I conclude that using pypi:// to access things on the PyPI website is entirely in the spirit of what URLs were invented for, and are used for today.
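Parsing such a URL needs nothing exotic; the standard library already splits off a custom scheme. A hypothetical dispatch sketch (not Spack's actual code path):

```python
from urllib.parse import urlparse

def fetcher_for(url):
    """Route pypi:// URLs to a PyPI fetcher, everything else to plain URL fetch."""
    parts = urlparse(url)
    if parts.scheme == 'pypi':
        return ('PyPiFetchStrategy', parts.netloc)
    return ('URLFetchStrategy', url)

print(fetcher_for('pypi://3to2-1.1.1.zip'))  # ('PyPiFetchStrategy', '3to2-1.1.1.zip')
```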

If we really don't like custom protocols in URLs, an alternative I suggested would be something like this:

class Py3to2(Package):
    ...
    pypi = ('3to2', 'zip')

Auto-Defaults

A package is fetched based on a package name, version and file extension. Version is non-debatable. So the question is, should there be auto-defaults for the package name and the file extension? And if so, how should Spack determine those defaults?

Let's step back for a minute and think of other large repositories we download from --- GNU, Fedora, etc. Except for the hard-to-guess URLs, they share many relevant properties with PyPI:

  1. They provide many different downloads for each package/version --- including src/binary variants as well as different compression formats.
  2. They provide checksums for their downloads.
  3. They can be queried.

All of the automagic suggested in this PR could also be implemented for these other repositories. We could automagically guess remote package names (or encourage our package names to be the same as the remote ones). We could automagically determine filename extensions based on checksum. Yes, we could do all that automagic. But we don't. And we shouldn't. Because URLs work just fine, even though you have to manually specify the name of the package in the remote repo you are looking for --- and the extension as well.

I fail to see how your implementation is simpler or more intuitive.

Things are simpler and more intuitive when there is less automagic, and when more things are explicit. If every package states what it will download from PyPI, then things are crystal clear to the user.

Auto-Defaults for Extension

and will not need an extension.

I am not in favor of this. It's not that hard to type .tar.gz or .zip in your package.py file. And it makes things clear; one fewer thing that users have to know when things fail mysteriously. I fail to see the benefit of auto-deducing file extensions at runtime. (On the other hand, I DO see the benefit of this auto-deducing at spack create time).

Auto-Defaults for Name

My implementation will not need the name of the package (unless it differs from self.name[3:])

I'm also not in favor of this automagic; although I'm less inclined against it than with extensions. We have quite a number of packages that are named differently in Spack vs. PyPI. Maybe we shouldn't, and maybe we should try to come closer to PyPI. But that's not where we are right now.

Again... what's the benefit here?

@tgamblin
Member

tgamblin commented Jan 3, 2017

@citibeth:

Is this a problem with PyPI? We're dealing with one site here.

I actually don't think the second part is an issue for PyPI, but I included the annoying corner case for completeness.

Edit: I might be wrong about this.

@adamjstewart
Member Author

@tgamblin I'm not sure if I understand the problem with mirrors either. The download tarballs include the extension in their name. Won't CacheURLFetchStrategy be called regardless of whether it was initially downloaded via URLFetchStrategy or PyPiFetchStrategy?

@tgamblin
Member

tgamblin commented Jan 3, 2017

@adamjstewart: yes, but I see a problem here -- you don't know the download extension until you talk to PyPI right? That would prevent us from knowing the file's name in advance.

Spack searches several places for your tarball:

  1. It uses a CacheURLFetchStrategy to look in var/spack/cache. That one is special because it doesn't copy the whole tarball into the stage if it doesn't have to -- it will symlink it there from your local download cache. The local cache is a mirror, but we know it is local so we can do this.
  2. It uses a URLFetchStrategy per mirror listed in mirrors.yaml. These use curl to grab tarballs from a remote URL.
  3. It asks fetch_strategy.for_package_version() to instantiate a fetch strategy for a particular package version. It does that by inspecting saved kwargs from a version() directive. It also has the Package object when it does that, so it could potentially look at other metadata.

Mirrors are specified only by their root, so the first two methods require Spack to construct a path for the archive within the mirror root. Currently, that is $root/$pkgname/$pkgname-$version.$extension. So, to fetch from a mirror:

  1. You need to know the package's name
  2. You need to know its version
  3. You need to know the file extension
  4. The checksum in the mirror needs to be the same as the checksum you use when fetching from the authoritative source.

3 and 4 are related -- we preserve the extension because we must keep the original file intact, or its checksum will change. Preserving the extension tells us how it was compressed so we can expand it after downloading.

If we can't figure out a way to get PyPI working, we may want to change (3). i.e., we might want to be able to save a file with a deterministic name but have Spack figure out how it was compressed after the fact (using file or something). I kind of don't like it b/c it makes the directory layout of a mirror more obtuse. I need to think about that one.

@adamjstewart: does that make sense?
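The `$root/$pkgname/$pkgname-$version.$extension` layout described above can be sketched as a small helper; `mirror_archive_path` is a hypothetical function for illustration, not Spack's actual implementation:

```python
def mirror_archive_path(root, pkg_name, version, extension):
    """Build the path where a mirror stores a package's archive.

    Follows the layout $root/$pkgname/$pkgname-$version.$extension,
    which is why the name, version, and extension must all be known
    statically, before any network request is made.
    """
    filename = "{0}-{1}.{2}".format(pkg_name, version, extension)
    return "{0}/{1}/{2}".format(root, pkg_name, filename)
```

With this layout, a mirror lookup needs exactly the four pieces of information listed above, which is the crux of the problem for PyPI's hash-prefixed download URLs.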

@adamjstewart

@tgamblin Makes perfect sense. Thanks for the detailed explanation!

@citibeth

citibeth commented Jan 3, 2017

If we can't figure out a way to get PyPI working, we may want to change (3). i.e., we might want to be able to save a file with a deterministic name but have Spack figure out how it was compressed after the fact (using file or something).

This is all way too complicated. We can just be explicit about which extension we are looking for on PyPI, just like we do with every other remote repo that we use. And if PyPI doesn't have that extension, it should fail, just like every other remote repo.

@tgamblin

tgamblin commented Jan 3, 2017

This is all way too complicated. We can just be explicit about which extension we are looking for on PyPI, just like we do with every other remote repo that we use. And if PyPI doesn't have that extension, it should fail, just like every other remote repo.

@citibeth: Yes. Maybe it would be simpler to just provide the PyPI filename to version() along with the hash and package name, e.g.:

    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5', pypi='roundup', filename='roundup-1.1.2.tar.gz')

That would give you the information you need, and it might be easier for a user to construct.

Can someone fill me in on why using the regular old PyPI URL doesn't work? In the example linked from #2281, http://pypi.python.org/packages/source/r/roundup/roundup-1.1.2.tar.gz is the URL. Seems like that isn't so hard to handle with good old url=... don't we just want special querying for PyPI?

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

This may be better factored, then:

    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5', pypi='roundup', extension='tar.gz')

I think I like that better anyway; specifying the whole filename just seems like such a hack. Maybe .tar.gz could be the default and we could avoid it in most cases. Now the mirror URL can be constructed statically, and the PyPIFetchStrategy will need to negotiate for the long URL when doing a fetch directly from PyPI.

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

@citibeth: what does the URL mean in your example? The user would have to know how to construct an entirely new URL that no other tool understands.

@citibeth

citibeth commented Jan 3, 2017 via email

@tgamblin

tgamblin commented Jan 3, 2017

@citibeth it bothers me that we're making a URL for PyPI. If PyPI wants to make URLs they can make their own URL and we can use it. But if it's in the URL field, it should have the same semantics as other URLs -- i.e. it should be fetchable by cURL. That one's not.

For PyPI we really only need the package name and version, and for mirrors we need the extension. All of those are provided explicitly, without parsing, via the kwargs to version(), and that's the protocol all the other fetchers use. They can be moved to the package level, so you could do this:

class Roundup(Package):
    pypi = 'roundup'
    extension = 'tar.gz'
    version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5')
    version('1.1.1', '7c395da56412e263d7600fa7f0afa2e5')

But otherwise why invent a new way to specify something when we already have a perfectly good parameter system?

@adamjstewart

But there is an issue with that. I think we should be free to choose our own checksum algorithm for Spack. Specifically, I would like to switch to use sha256 in most places sometime in the next year, and if PyPI stays with md5 or something else, you won't be able to look up by hash anymore. I'd rather not tie my hashing scheme to PyPI's.

Can't really argue with that logic. Although that certainly doesn't preclude us from checking for a version that matches a checksum. If it exists, use it. If not, fall back to the previously discussed methods.

class Roundup(Package):
     pypi = 'roundup'
     extension = 'tar.gz'
     version('1.1.2', '7c395da56412e263d7600fa7f0afa2e5')
     version('1.1.1', '7c395da56412e263d7600fa7f0afa2e5')

That seems like a good compromise to me. Can we default to pypi = self.name[3:] and extension = '.tar.gz'? These could be put in the PythonPackage class and/or assumed for packages that start with py- and don't specify a URL.

One thing this solution doesn't solve is that a lot of commands call url_for_version, including spack info. This won't work on clusters without an internet connection. I guess we could tell those commands not to print out the URL for each version if we have to. But it is a nice touch. Maybe another command could be added to print out versions and corresponding URLs.

Either way, I think fetch_strategy.py is too poorly documented for me to get started. The fact that I had to duplicate my code in multiple places means it isn't very well designed. And I would like to start using these super-fancy fetch strategies for other commands like spack versions, spack checksum, and spack create.

I could hack Spack to get the Roundup class above working, but it would be nothing more than a hack. I would rather wait until fetch strategies are revamped. Thoughts?
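The pypi = self.name[3:] default suggested above amounts to stripping the conventional py- prefix. A minimal standalone sketch (hypothetical helper, not part of Spack):

```python
def default_pypi_name(spack_name):
    """Guess a PyPI project name from a Spack package name by
    stripping the conventional 'py-' prefix.

    Packages whose PyPI name differs from their Spack name would
    still need to set `pypi` explicitly in the package class.
    """
    if spack_name.startswith('py-'):
        return spack_name[3:]
    return spack_name
```

Usage: `default_pypi_name('py-scipy')` gives `'scipy'`, while a non-prefixed name like `'roundup'` passes through unchanged.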

@adamjstewart

Specifying the extension isn't that much of a hack I guess. We do it elsewhere for packages that don't have an extension (mfem, ncl).

@tgamblin

tgamblin commented Jan 3, 2017

@adamjstewart:

Can we default to pypi = self.name[3:] and extension = '.tar.gz'?

You need at least the pypi attribute in there to indicate that this should be fetched from PyPI. If the default for pypi is set in the PythonPackage class, that should be ok, but it has to have pypi someplace.

One thing this solution doesn't solve is that a lot of commands call url_for_version, including spack info. This won't work on clusters without an internet connection.

I think this should probably not rely on url_for_version, but override fetch() so that it does the negotiation. I believe those commands call it indirectly anyway, and they print something else out for packages that use, say, a git fetch strategy.

I would rather wait until fetch strategies are revamped. Thoughts?

I actually don't think this takes as much code as you think. I haven't reviewed the PR but i can take a look sometime this week, after 0.10 goes out (really. really really). I can then try to figure out what needs documenting and/or refactoring.

@adamjstewart

You need at least the pypi attribute in there to indicate that this should be fetched from PyPI. If the default for pypi is set in the PythonPackage class, that should be ok, but it has to have pypi someplace.

That's true for the current implementation, but I don't think it has to be. This could be negotiated. But I'm fine with getting things working the way you propose for now and making them more auto-magic down the road if we can agree on that.

I actually don't think this takes as much code as you think. I haven't reviewed the PR but i can take a look sometime this week, after 0.10 goes out (really. really really). I can then try to figure out what needs documenting and/or refactoring.

I'm fine with that. There's really no rush on this PR. We have a pretty easy workaround for the problem (specifying url in the version directive).

@citibeth

citibeth commented Jan 3, 2017 via email

@adamjstewart

adamjstewart commented Jan 3, 2017

@citibeth I agree that you shouldn't be able to specify multiple fetching schemes at the class level. As for how to enforce that, I'll leave it up to @tgamblin's refactoring.

But you ought to be able to do something like this:

class Numpy(Package):
    url = 'url'

    version('1.2.3', 'hash')
    version('1.2.4', 'hash', pypi='name', extension='.tar.gz')
    version('1.2.5', 'hash', url='url')
    version('1.2.6', 'hash', git='url', branch='branch')

In this case, the package-default is to use a URL. But a version-specific fetching scheme overrides this. Version fetching schemes have a higher priority than package fetching schemes, just like they do now if you specify multiple URLs. Except, wait, wtf is going on in #2725? Looks like this broke recently.
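The precedence described here -- version-level fetch keywords overriding package-level defaults -- could be resolved roughly like this (hypothetical resolver for illustration, not Spack's actual code):

```python
def resolve_fetch_kwargs(package_attrs, version_kwargs):
    """Pick the fetch scheme for one version.

    If the version() directive names a scheme (url=, git=, pypi=, ...),
    its keywords win outright; otherwise package-level attributes
    supply the defaults.
    """
    schemes = ('url', 'git', 'pypi')
    # A scheme named at the version level takes full priority.
    if any(key in version_kwargs for key in schemes):
        return dict(version_kwargs)
    # Otherwise fall back to the package-level defaults.
    resolved = {k: v for k, v in package_attrs.items() if k in schemes}
    resolved.update(version_kwargs)
    return resolved
```

So a package-level `url` would apply to `version('1.2.3', 'hash')`, while `version('1.2.4', 'hash', pypi='name', extension='.tar.gz')` would ignore it entirely.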

@adamjstewart adamjstewart mentioned this pull request Feb 9, 2017
@alalazo alalazo changed the title [WIP] Add a PyPiFetchStrategy to properly download Python packages Add a PyPiFetchStrategy to properly download Python packages Nov 21, 2017
@alalazo alalazo self-assigned this Nov 21, 2017
@adamjstewart

Abandoning this PR as it is obsolete. If you follow https://wiki.python.org/moin/PyPIXmlRpc you'll find that:

The XMLRPC interface for PyPI is considered legacy and should not be used.
Use the Simple and JSON APIs.

Documentation on the modern JSON APIs can be found at https://warehouse.readthedocs.io/api-reference/json/. I couldn't figure out how to use it, but if someone shows me how I can try to revive this PR.
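For anyone picking this up: the JSON API serves per-release metadata at https://pypi.org/pypi/&lt;name&gt;/&lt;version&gt;/json, and the 'urls' list in the response identifies the sdist. A sketch of extracting it; the sample response below is abbreviated and hypothetical:

```python
def sdist_info(release_json):
    """Pull the source-distribution URL and checksums out of a PyPI
    JSON API response.

    Each entry in 'urls' describes one downloadable file; the entry
    with packagetype == 'sdist' is the source tarball needed for a
    from-source build.
    """
    for entry in release_json.get('urls', []):
        if entry.get('packagetype') == 'sdist':
            return entry['url'], entry.get('digests', {})
    return None, {}

# Abbreviated, hypothetical sample of the API's response structure:
sample = {
    'urls': [
        {'packagetype': 'bdist_wheel', 'url': 'https://.../scipy.whl'},
        {'packagetype': 'sdist',
         'url': 'https://.../scipy-0.18.1.tar.gz',
         'digests': {'sha256': 'abc123'}},
    ]
}
url, digests = sdist_info(sample)
```

This sidesteps the mirror-naming problem discussed earlier, since the returned digests include sha256 alongside md5.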
