Python: extracting package information from PyPI (WIP) #11587

@FRidh

Introduction

As with some of the other languages that are available in Nix, we would like to have the entire PyPI in Nix as well. Unfortunately, as explained during NixCon, Python packaging is a bit of a mess, with the biggest issue being that dependency information is not readily available.

I would like to discuss here how we could extract as much information from PyPI as possible, and how we would make that information available in Nix to build packages.

See also Python to-do list

What's available?

PyPI API

PyPI has two APIs: JSON and XML-RPC.

With the APIs we can directly and reliably extract:

  • name
  • version
  • description
  • available source archives with md5 (unfortunately not sha256)

With a bit of regex applied to the license field we can come up with a guess of the license as well.
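As a rough sketch of what this extraction could look like, the Python snippet below takes an already-parsed response from the JSON API and pulls out the fields listed above, plus a regex-based license guess. The key names (`info`, `urls`, `packagetype`, `md5_digest`) are assumptions about the JSON layout and should be checked against a real response. The function names and the license labels are hypothetical and not a verified mapping onto the licenses known to nixpkgs.

```python
import re

# Ordered (pattern, label) pairs; the first match wins, so LGPL must come
# before GPL. The labels are illustrative guesses only.
LICENSE_PATTERNS = [
    (re.compile(r"\bMIT\b", re.IGNORECASE), "mit"),
    (re.compile(r"\bApache\b", re.IGNORECASE), "apache"),
    (re.compile(r"\bLGPL", re.IGNORECASE), "lgpl"),
    (re.compile(r"\bGPL", re.IGNORECASE), "gpl"),
    (re.compile(r"\bBSD\b", re.IGNORECASE), "bsd"),
]


def guess_license(text):
    """Return a license label guessed from free-form text, or None."""
    for pattern, label in LICENSE_PATTERNS:
        if pattern.search(text or ""):
            return label
    return None


def extract_metadata(doc):
    """Extract the fields of interest from a parsed JSON API response.

    `doc` is assumed to have an "info" mapping and a "urls" list.
    """
    info = doc["info"]
    # Keep only source archives, together with their md5 digest.
    sources = [
        {"url": entry["url"], "md5": entry.get("md5_digest")}
        for entry in doc.get("urls", [])
        if entry.get("packagetype") == "sdist"
    ]
    return {
        "name": info["name"],
        "version": info["version"],
        "description": info.get("summary"),
        "license": guess_license(info.get("license")),
        "sources": sources,
    }
```

Because the license field is free-form text, a guess of `None` is expected fairly often and would have to be filled in by hand or from the source archive.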

source archive

If we download the source archives, we can

  • determine sha256
  • come up with a much better guess of the license
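Determining the sha256 of a downloaded archive is straightforward with Python's standard `hashlib`. The sketch below reads in chunks so that large archives do not have to fit in memory. The function names are hypothetical, and the download step itself is left out.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Return the hex sha256 digest of a binary file-like object."""
    digest = hashlib.sha256()
    # Read fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Return the hex sha256 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


def sha256_of_bytes(data):
    """Convenience wrapper for data that is already in memory."""
    return sha256_of_stream(io.BytesIO(data))
```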

pypi2nix

With pypi2nix we should be able to

  • extract build inputs, although I don't think we can distinguish between buildInputs and propagatedBuildInputs
  • come up with a much better guess of the license

How to implement in nixpkgs?

Eventually, we want a script that automatically generates Nix expressions that are as good as possible. Still, it is likely that many derivations will need manual correction, especially when extension types are included.

So, what would this look like in nixpkgs? Let's have a look at how other such collections in nixpkgs are implemented and maintained.

How are other languages/collections implemented?

Haskell

R

Perl

For Perl there is a tool that creates a Nix expression for a package from CPAN (so similar to pypi2nix). However, the index of packages is still maintained entirely manually.

KDE / Plasma 5

A Nix expression is generated with an attribute set that contains, per package, the version along with the source URL and sha256 hash, which are obtained by downloading all files. A function, plasmaPackage, is used to generate derivations based on these sets.
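To make the idea of a generated attribute set concrete, here is a minimal Python sketch that renders such a set as Nix source text. The exact layout that the KDE/Plasma tooling emits is not shown here, so the field names (`version`, `url`, `sha256`) and the function names are assumptions for illustration only.

```python
def nix_string(value):
    """Quote a Python string as a Nix double-quoted string literal."""
    escaped = (
        value.replace("\\", "\\\\")   # backslashes first
        .replace('"', '\\"')          # then double quotes
        .replace("${", "\\${")        # and antiquotation starts
    )
    return '"' + escaped + '"'


def render_sources(packages):
    """Render {name: {version, url, sha256}} as a Nix attribute set."""
    lines = ["{"]
    # Sort names so regenerating the file gives a stable diff.
    for name in sorted(packages):
        package = packages[name]
        lines.append("  %s = {" % nix_string(name))
        for key in ("version", "url", "sha256"):
            lines.append("    %s = %s;" % (key, nix_string(package[key])))
        lines.append("  };")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Attribute names are emitted as quoted strings so that package names containing dashes or dots remain valid.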

Other related issues

Other issues to consider are

Proposal

At the time of writing there are 70835 packages on PyPI. This will include a significant number of broken and outdated packages as well. I think we only want to build derivations for those that are supposed to work.

We might want to consider two different stages here:

  1. when we use PyPI, and possibly (or optionally) download the source files as well;
  2. when we also use pypi2nix.

Stage 1

In stage 1 we use just PyPI and optionally also the source archives. In this stage we only want derivations for those packages that are actually supposed to work. We could therefore have one single file (JSON, or a Nix set) with all the data that is collected from PyPI and possibly from the source archives. In python-packages.nix we then use buildPythonPackage, like we do now, in conjunction with the raw source data. For convenience, we might want a helper function buildPyPIPackage (like plasmaPackage for KDE/Plasma) that calls buildPythonPackage, and then use override to modify/correct the derivation.

Files:

  • autogenerated index pypi_sources.json containing just sets.
  • curated pypi_curated.nix which contains derivations based on the entries in pypi_sources.json.
  • python-packages.nix which contains all Python derivations, that is, all packages in pypi_curated.nix as well as those from other sources.

To have files of manageable size we could split the packages on initial letter of their names (again, see #11567 (comment)).
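Splitting the index on the initial letter could be sketched as follows. The grouping key (lowercased first character, with a single `_` bucket for names that do not start with a letter) and the function names are assumptions, not something settled in this proposal.

```python
import json
from collections import defaultdict


def split_by_initial(records):
    """Group {name: record} by the lowercased first character of the name."""
    groups = defaultdict(dict)
    for name, record in records.items():
        first = name[0]
        # Names starting with a digit or symbol share one bucket (assumption).
        key = first.lower() if first.isalpha() else "_"
        groups[key][name] = record
    return dict(groups)


def serialize(group):
    """Serialize one group deterministically so regenerated files diff cleanly."""
    return json.dumps(group, indent=2, sort_keys=True) + "\n"
```

Each group could then be written to its own file, for example one JSON file per letter, with sorted keys keeping the diffs between regenerations small.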

Stage 2

In this stage we use pypi2nix, or at least Python wheels, to extract relevant information. This extra info would then be fed back into pypi_sources.json.
We would also have a pypi_generated.nix that would contain automatically generated, working derivations, built from pypi_sources.json.

To be able to build a wheel, the actual inputs need to be available. Obviously this can be problematic: numpy will most definitely not just work, yet many packages depend on it. Therefore, for this process, we already need a (minimal) curated pypi_curated.nix.
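The bootstrapping problem can be illustrated with a small fixed-point computation: starting from the curated set, repeatedly mark packages as generatable once all of their inputs are available. This is only a sketch. It assumes the dependency names are already known, which is exactly the information that is hard to obtain, and the function and argument names are hypothetical.

```python
def generatable_packages(requires, curated):
    """Split packages into those that can be generated and those blocked.

    requires: dict mapping package name -> set of required package names
    curated:  set of package names that already have curated derivations
    """
    available = set(curated)
    pending = {
        name: set(deps)
        for name, deps in requires.items()
        if name not in available
    }
    generated = []
    progress = True
    while progress:
        progress = False
        # sorted() copies the keys, so deleting from `pending` is safe here.
        for name in sorted(pending):
            if pending[name] <= available:
                available.add(name)
                generated.append(name)
                del pending[name]
                progress = True
    return generated, sorted(pending)
```

With numpy in the curated set, a package that only needs numpy becomes generatable right away, while a package with an unknown or broken input stays in the blocked list.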

Files:

  • the above mentioned files
  • pypi_generated.nix with automatically generated and working derivations

More to follow...
