Introduction
As with some of the other languages that are available in Nix we would like to have the entire PyPI in Nix as well. Unfortunately, as explained during NixCon, Python packaging is a bit of a mess, with the biggest issue being that dependencies are not available.
I would like to discuss here how we could extract as much information from PyPI as possible, and how we would make that information available in Nix to build packages.
See also Python to-do list
What's available?
PyPI api
PyPI has two APIs, JSON and XML-RPC.
With these APIs we can directly and reliably extract:
- name
- version
- description
- available source archives with md5 (unfortunately not sha256)
With a bit of regex applied to the license field we can come up with a guess of the license as well.
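As a rough sketch of what this extraction could look like, the following Python pulls those fields out of an already decoded JSON API response. The payload layout (an info object plus a urls list whose entries carry packagetype, url and md5_digest) is assumed from the JSON API's responses. The function names, the license patterns and the sample record are made up for illustration, and the license result is only a coarse guess.

```python
import re

# Hypothetical patterns; a real mapping to Nix license attributes needs many more.
# LGPL is listed before GPL so the more specific label is tried first.
LICENSE_PATTERNS = [
    (re.compile(r"\bMIT\b", re.I), "MIT"),
    (re.compile(r"\bBSD\b", re.I), "BSD"),
    (re.compile(r"\bApache\b", re.I), "Apache"),
    (re.compile(r"\bLGPL", re.I), "LGPL"),
    (re.compile(r"\bGPL", re.I), "GPL"),
]


def guess_license(text):
    """Return a coarse license label guessed from free-form text, or None."""
    for pattern, label in LICENSE_PATTERNS:
        if pattern.search(text or ""):
            return label
    return None


def summarize(payload):
    """Collect the fields of interest from one decoded JSON API response."""
    info = payload["info"]
    # Keep only source archives; wheels and other built files are skipped.
    sources = [
        {"url": entry["url"], "md5": entry["md5_digest"]}
        for entry in payload.get("urls", [])
        if entry.get("packagetype") == "sdist"
    ]
    return {
        "name": info["name"],
        "version": info["version"],
        "description": info.get("summary"),
        "license": guess_license(info.get("license")),
        "sources": sources,
    }


# Made-up record in the assumed layout, so the sketch runs without network access.
sample = {
    "info": {
        "name": "example",
        "version": "1.0",
        "summary": "An example package",
        "license": "MIT License",
    },
    "urls": [
        {
            "packagetype": "sdist",
            "url": "https://example.invalid/example-1.0.tar.gz",
            "md5_digest": "0" * 32,
        },
        {
            "packagetype": "bdist_wheel",
            "url": "https://example.invalid/example-1.0-py3-none-any.whl",
            "md5_digest": "1" * 32,
        },
    ],
}

record = summarize(sample)
```

Packages with an empty or unusual license field simply get None here, which is one reason the result would still need manual review.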
source archive
If we download the source archives, we can:
- determine sha256
- come up with a much better guess of the license
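Determining the hash of a downloaded archive is straightforward. A minimal sketch using Python's hashlib, where the function name and chunk size are arbitrary choices for illustration:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal sha256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is returned in hexadecimal form; any conversion to another encoding would be a separate step.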
pypi2nix
With pypi2nix we should be able to
- extract build inputs, although I don't think we can distinguish between buildInputs and propagatedBuildInputs
- come up with a much better guess of the license
How to implement in nixpkgs?
Eventually, we want to have a script that automatically generates Nix expressions that are as good as possible. Still, it is likely that many derivations will need to be corrected manually, especially when extension types are included.
So what would this look like in nixpkgs? Let's have a look at how other such collections in nixpkgs are implemented and maintained.
How are other languages/collections implemented?
Haskell
R
Perl
For Perl there is a tool that creates a Nix expression for a package from CPAN (so similar to pypi2nix). However, the index of packages is still maintained entirely manually.
KDE / Plasma 5
A nix expression is generated with an attribute set that contains per package the version, along with source url and sha256 hash, which are obtained by downloading all files. A function, plasmaPackage, is used to generate derivations based on these sets.
Other related issues
Other issues to consider are:
- size of files. While it is likely not a problem if an autogenerated file is huge, files that need to be modified by us, e.g. to override what is autogenerated, should have a manageable size. See also discussion in Python: Move packages from all-packages.nix to python-packages.nix #11567 (comment)
Proposal
At the time of writing there are 70835 packages on PyPI. This will include a significant number of broken and outdated packages as well. I think we only want to build derivations for those that are supposed to work.
We might want to consider two different stages here:
- when we use PyPI, and optionally download the source files as well;
- when we also use pypi2nix.
Stage 1
In stage 1 we use just PyPI and optionally also the source archives. In this stage we want to have derivations only for those packages that are actually supposed to work. We could therefore have one single file (JSON, or a Nix set) with all the data that is collected from PyPI and possibly from the source archive. In python-packages.nix we then use buildPythonPackage, like we do now, in conjunction with the raw source data. For convenience, we might want to have a helper function buildPyPIPackage (like plasmaPackage for KDE/Plasma) that calls buildPythonPackage, and then use override to modify or correct the derivation.
Files:
- autogenerated index pypi_sources.json containing just sets.
- curated pypi_curated.nix which contains derivations based on the entries in pypi_sources.json.
- python-packages.nix which contains all Python derivations, that is the whole set of packages that is in pypi_curated.nix as well as those from a different source.
To keep the files at a manageable size we could split the packages by the initial letter of their names (again, see #11567 (comment)).
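One way such a split could work, sketched in Python. The grouping by lowercased first character and the per-letter file naming scheme are assumptions made for illustration, not something decided in this proposal.

```python
from collections import defaultdict


def split_by_initial(packages):
    """Group a {name: record} mapping by the lowercased first character of each name."""
    groups = defaultdict(dict)
    for name, record in packages.items():
        groups[name[0].lower()][name] = record
    return dict(groups)


def shard_filename(initial):
    """Hypothetical naming scheme for the per-letter index files."""
    return "pypi_sources_{}.json".format(initial)


# Made-up records; only the package names matter for the grouping.
index = {
    "numpy": {"version": "1.0"},
    "Nose": {"version": "1.0"},
    "scipy": {"version": "1.0"},
}
shards = split_by_initial(index)
```

Each group could then be written to its own file, so that no single hand-edited file grows beyond a reviewable size.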
Stage 2
In this stage we use pypi2nix, or at least Python wheels, to extract relevant information. This extra info would then be fed back into pypi_sources.json.
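To illustrate what a wheel can provide, here is a minimal Python sketch that reads the declared requirements from a wheel. It assumes the wheel is a zip archive containing a .dist-info/METADATA file in email-header format with Requires-Dist entries; the function name is made up, and this is not how pypi2nix itself works.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_file):
    """Return the Requires-Dist entries declared in a wheel's METADATA file.

    wheel_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(wheel_file) as archive:
        # The metadata lives in <name>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = archive.read(metadata_name).decode("utf-8")
    message = Parser().parsestr(text)
    # get_all returns None when the header is absent, so fall back to an empty list.
    return message.get_all("Requires-Dist") or []
```

These entries describe what a package needs at install time, so on their own they would not settle which inputs belong in buildInputs and which in propagatedBuildInputs.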
We would also have a pypi_generated.nix that would contain automatically generated, working derivations, built from pypi_sources.json.
To be able to build a wheel, the actual inputs need to be available. Obviously this can be problematic: numpy will most definitely not just work, yet many packages depend on it. Therefore, for this process, we already need a (minimal) curated pypi_curated.nix.
Files:
- the above mentioned files
- pypi_generated.nix with automatically generated and working derivations