
WIP: Install python applications into virtualenvs #8364

Closed
hartzell wants to merge 2 commits into spack:develop from hartzell:python-virtualenvs

Conversation


@hartzell hartzell commented Jun 4, 2018

[I hit the green "submit" button too quickly and had to edit this comment to add, well, all of it...]

I'd like to explore adding the option of installing Python applications into Python virtualenvs, making their Python dependencies "build" only and giving them some of the robustness that rpath's bring to compiled applications (no need to depend on environment variables to find what they need at runtime).

I've mentioned this on the Spack google group and gotten some feedback from @healther and @citibeth, but most of what they discussed involved ways to set up the environment. I believe that installing into virtualenvs is orthogonal to the work they pointed me at.

I was recently exposed to Homebrew's use of virtualenvs when I submitted a Formula for bumpversion.

This PR/branch is a hack to demonstrate how it might behave. It is not a final implementation.

With that said, in a clone with this branch checked out and nothing else installed, one can (tested on CentOS 7):

spack install py-virtualenv
spack activate py-virtualenv
spack install py-flake8
spack install httpie
spack install bumpversion

The final 3 installations don't complete happily, but they finish the important POC bits. In each prefix there will be a libexec directory (e.g. .../spack-virtualenv/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/py-flake8-3.5.0-yqduo7ftn2ucmnamrn3lrlwkdsx7d4a7/libexec/) that has a bin subdir that contains the application. E.g.

/home/hartzell/tmp/spack-virtualenv/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/py-flake8-3.5.0-yqduo7ftn2ucmnamrn3lrlwkdsx7d4a7/libexec/bin/flake8

That flake8 will run with no special environment settings; in fact, adding that bin directory to the PATH is enough to run spack flake8 (and then fix the uglies...).

A final implementation (stealing from Homebrew) would link the applications into prefix/bin, keeping the libexec dir private.
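A minimal sketch of that linking step (the helper name and the assumption that the virtualenv lives at `prefix/libexec` are mine, borrowed from the layout above; a real implementation would use Spack's own filesystem utilities):

```python
import os

def link_app_scripts(prefix, names):
    # Hypothetical helper: expose selected scripts from the private
    # libexec virtualenv in prefix/bin, leaving the virtualenv itself
    # untouched.
    bindir = os.path.join(prefix, "bin")
    os.makedirs(bindir, exist_ok=True)
    for name in names:
        src = os.path.join(prefix, "libexec", "bin", name)
        os.symlink(src, os.path.join(bindir, name))
```

Users would then only ever see `prefix/bin/flake8`; the virtualenv's other scripts (`pip`, `activate`, ...) stay private.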

I had trouble finding a pleasant way to pass the fact that a virtualenv was being used from the top level item (e.g. py-flake8) into the other layers.

One clean idea would be for the top level app to depend on Python and specify a variant to Python that included the name of the virtualenv. I'm not sure how to get that to concretize cleanly with all of the dependencies, which are simply depending on Python. Perhaps a virtual dependency?

Alternatively, setting a boolean in the top-level application and figuring out how to pass the info down through its do_install and into the layers below seems to be the best bet. Perhaps there's a way to adjust the specs or to pass along the extra info. It's tempting to do something with the Python package's setup_dependent_package, but I couldn't figure it out nicely.

I would love feedback on how this might happen.

I think that python packages being installed into virtualenvs should have their prefixes adjusted to point into the top-level app's prefix (so that they don't clash with any real installs), and they should be recorded in the db as part of a virtualenv install (or perhaps not at all).

In conclusion: if this works, then applications can be self-contained and more reliable (avoiding e.g. the py-flake8 module loading issue). The resulting packages will play nicely with Environments and etc.... It's complementary to Spack's other methods for handling add-on packages (environment modification, activation, views) and would still use Spack's package definitions, preserving reproducibility and etc...

Feedback?

hartzell added 2 commits June 4, 2018 01:11
- able to install httpie (and its prereqs), bumpversion and py-flake8
  into virtualenvs rooted at their prefix + /libexec.

- their prefix + /.spack dir isn't being created, so the final
  log writing step's commented out.

I couldn't find a good way to pass the fact that a virtual environment
was/should be used to all of the moving bits of machinery and for the
sake of prototyping I just stuffed it into the
environment (`VENV_PATH`).

Wheee....

citibeth commented Jun 4, 2018 via email


hartzell commented Jun 4, 2018

[edits: typos and clarification]

This will allow you to install as many py-virtualenv instances as you
like, with different stuff in each one of them.

I'm not sure what that would achieve. We only need one copy of the virtualenv library. If Guido were perfect, he would have included it in the Python core back in the day, but he wasn't and it isn't, so we need to install and activate it (or module load it, or ...) ourselves. Kind of like how I wish Larry had included cpanm in the Perl core when he released Perl 5.

For Python3, the equivalent library is in the core (venv), although I gather that there are people who think virtualenv is still the right way to do it there (I'll try to avoid the Perl guy's snide remark here...).

But how we install the virtual-environment making tool is an implementation detail.

Each application should have its own virtualenv tucked inside its prefix (which is how Homebrew does it). That virtualenv is created by virtualenv (installed via our py-virtualenv package) or venv for python@3:. All of the dependencies end up installed within it. The scripts for the application itself (e.g. flake8 or http or bumpversion) would also get linked into their corresponding prefix/bin dirs.

It would be nice if e.g. the httpie application could say depends_on('python', +venv) (or depends_on('python', venv=prefix)) and have that tell python to automagically set up the virtual environment and promulgate it (via its setup_dependent_package hook) to the other Python-using packages in the spec dependency graph.

The others would continue to blithely use depends_on('python'); they'd install into the virtualenv because the Python executable they're given to use is /path/to/httpie-prefix/libexec/bin/python.

But, I'm afraid that there's some sort of concretization incompatibility with having different "flavors" of python dependencies (one +venv, the others not) within the same spec.


healther commented Jun 5, 2018

I didn't have a chance to look at the implementation yet, so some of these questions might clear up once I read through it.

These are some thoughts I had while reading through your proposal, in no particular order:

  • Currently, we don't distinguish between python libraries (like py-numpy) and python executables (like py-flake8), how would a user know which one it is?
  • What about packages that provide both library functionality (i.e. usage as from xxx import y) and cli capabilities (i.e. usage as $ xxx)?
  • Wouldn't this need to reinstall all dependencies inside the virtualenv?

Regarding the concretisation issue: I don't think that should stop us from developing a new idea (if it is useful). @scheibelp and @tgamblin should probably chime in and say if this conflicts with the new concretiser™️

On a more general note:

I mentioned this in #8360 but it also fits here. I don't think that our current implementation of the way we expose installed packages to users really fits spack. spack's core is the DAG, and intuitively I always expected a spack load call to give me an environment exactly like what I would get if I apt installed (or equivalent) everything in the DAG manually on the system.
This is relevant for this issue in particular because, in python, the module approach necessarily fails (at least in our current state) for particular edge cases (namely multiple installs of the same namespace package). Environments are not yet the solution to this problem, as they (as far as I know) don't yet support views.
In fact, the only way I know of for spack to reproduce "exactly what I would get from my system" (i.e. including only one additional .../bin on my path) is with a view.

Now, I don't know enough about virtualenvs, but they sound a bit like views for python packages (if you don't end up needing to reinstall dependencies). I'm not against this; in fact, as long as it would provide a usable flake8 binary, I'm all for it! But I think that the problem you are addressing here is more fundamental than only python packages.


hartzell commented Jun 6, 2018

I'm pulling some quotes out of order from @healther's comments so that I can use them to support the narrative. Thanks for the thoughts!

I didn't have a chance to look at the implementation yet, so some of these questions might clear up once I read through it.

Don't dig too deeply into the implementation; it's a hack that I pulled off with blunt instruments to convince myself that the idea might be workable.

The idea, now that might be golden. Or not....

Now, I don't know enough about virtualenvs, but they sound a bit like views for python packages (if you don't end up needing to reinstall dependencies).

The virtualenv docs give a nice introduction, starting with:

virtualenv is a tool to create isolated Python environments.

Imagine back in the simple olden days: there was one installation of python, things were installed with/into it, and everyone saw the same everything.

If you wanted isolation, you might install another copy of python somewhere else, and when you used its .../bin/python to install things, they'd be installed inside its directory tree. You could repeat this pattern ad nauseam. Back in the days of small disks (19" 300MB Fujitsu Eagles...), this would be a crazy expensive waste of disk space. These days that's chicken feed. But still, it could get out of hand.

Python virtual environments are a middle ground. They share nearly all of a single Python installation but when one virtualenv is "activated", anything installed by that virtual environment's python tool chain ends up inside that virtual environment. If you have two environments that use pygments, you'll have two copies installed. As a side effect of the way things work, any scripts installed with that Python end up in its .../bin and use the packages installed in that tree. No need to set PYTHONPATH or ...
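A quick way to see that isolation with Python 3's built-in `venv` (the path is arbitrary; `--without-pip` just skips the bundled-pip step):

```console
$ python3 -m venv --without-pip /tmp/demo-env
$ /tmp/demo-env/bin/python -c 'import sys; print(sys.prefix != sys.base_prefix)'
True
```

Inside the environment `sys.prefix` points at `/tmp/demo-env`, so installs and scripts land there, while `sys.base_prefix` still points at the shared interpreter.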

The implementation is an elegant hack, the video Reverse-engineering Ian Bicking's brain does a marvelous job of explaining how it works (worked, Python3 is different?).

These are some thoughts I had while reading through your proposal, in no particular order:

  • Currently, we don't distinguish between python libraries (like py-numpy) and python executables (like py-flake8), how would a user know which one it is?
  • What about packages that provide both library functionality (i.e. usage as from xxx import y) and cli capabilities (i.e. usage as $ xxx)?
  • Wouldn't this need to reinstall all dependencies inside the virtualenv?

While I'd like to Fix All The Things(tm), I haven't figured out how. This proposal offers one thing: a reliable way to install python applications.

The "application-ness" of a Spack python package would be declared by the package author/maintainer, although there might be a variant knob (below). The easiest thing to do would be to make it either-or. Packages that provide libraries aren't applications. Packages that provide executables are applications.

Bumpversion is an application. It uses a Python package or two and is composed of its own Python bits. It doesn't offer the end user anything beyond the "binary". It probably would have been easier if it were in C/Go/..., but...

I'm not entirely sure what to do with packages that offer applications and libraries. In the model that I'm proposing, peeking inside the virtualenv would be against the rules, so no peeking!

As part of the ongoing Go dependency management/vendoring discussion, it's become clear that it's bad design for something to be both a library and an application. Their vendoring headache is similar to what we're hitting here.

But fixing the Python ecosystem might take some time, so....

  • One idea would be to install the package twice, once to get the application and once to get the libraries (disk is cheap). As long as the versions/specs line up (including underlying C libraries and ...), that could be [made to be] safe.

    I'm not sure if/when that would work and when it would fail.

  • Or, perhaps those packages would have to use the existing mechanism.

I don't think that Homebrew installs python libraries. If you're building a Python project that uses libraries, they'll leave you at the mercy of the Python (Perl, R, ...) ecosystem. They'll install "programs" though, and track the libraries they require as "resources" that are check-summed and etc... But that won't work for us without changing Spack's modus operandi. Now that you mention it though....

And to your final point, yes, this ends up reinstalling things.

  • My ~/.emacs.d is 478MB.
  • It looks like a Spack Python installation is about 100MB.
  • It looks like the Spack Pygments is about 7MB.

Sharing leads to all kinds of complications (again, Go has a proverb: "A little copying is better than a little dependency.")

I'd happily have a dozen extra packages installed to have the flake8 command just work. It's all in the DAG and Spack can juggle it for me.

That said, if implementing this required a big, ugly hairball that complicated everything living in the same sub-directory, I don't think it would be worth it. But if it can leverage existing work in the concretizer and string-valued variants, then it's just a nice bit of isolated machinery (he says, hopefully).


healther commented Jun 6, 2018

I'd happily have a dozen extra packages installed to have the flake8 command just work. It's all in the DAG and Spack can juggle it for me.

emphatic +100 for that. Even if it ends up only being an option for the "only applications" packages I would vote for adding this (though this isn't really democracy^^)

I still think views would be the ultimate solution™️ because they wouldn't require redundant installations, though that likely only is a problem for a limited subset of spack users. So this is a larger discussion and shouldn't take over this issue.

There is one problem that I see right now, and that is what happens when I want to load flake8 AND other python libraries/binaries: which python will then be used?

I don't think that Homebrew installs python libraries

Just for reference:

$ brew search numpy
==> Searching local taps...
numpy
==> Searching taps on GitHub...
==> Searching blacklisted, migrated and deleted formulae...

Not sure how sane using this ends up being in the end though ;) (hint: I don't use it)


hartzell commented Jun 6, 2018

I still think views would be the ultimate solution™️ because they wouldn't require redundant installations [...] and shouldn't take over this issue.

How do views handle applications that want different versions of a python library? I wonder how often that occurs?

There is one problem that I see right now and that is what happens when I want to load flake8 AND other python libraries/binaries, which python will then be used?

The shebang line of the flake8 script points at the python within its virtual environment, the shebang line of bumpversion points at the python within its virtual environment, and so on. There's no interaction. Everything's nicely isolated.

They'd also be safe w.r.t. activated packages (which appear in the sub-directories of the Spack python).

I'm not sure how much harm one could do with PYTHONPATH though. It would be great if one could tell Python (Perl, R) to just ignore it (I've actually used Perl's SITECUSTOMIZE to manipulate PERL5LIB...).
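For what it's worth, CPython already has such a knob: the `-E` option tells the interpreter to ignore `PYTHON*` environment variables, `PYTHONPATH` included, though every script's shebang would have to carry it:

```console
$ PYTHONPATH=/bogus python3 -c 'import sys; print("/bogus" in sys.path)'
True
$ PYTHONPATH=/bogus python3 -E -c 'import sys; print("/bogus" in sys.path)'
False
```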

I don't think that Homebrew installs python libraries

Well, learn something new every day. It looks like that ends up getting installed into the site packages directory of the python being used, so every user sees it and there's no opting out. They don't seem to install it into the Cellar as a separate entity (a la Spack).


citibeth commented Jun 6, 2018 via email


healther commented Jun 6, 2018

I'm happy with loading a bunch of modules --- which is pretty equivalent to views.

No, they aren't: loading multiple py-backports-* packages doesn't work! The problem will always occur if something performs a two-step search, first trying to match a folder/filename and then a file/functionname.
Yes, this probably indicates some design flaw in the "something"; nevertheless, there is one example where it breaks (i.e. py-flake8).

How do views handle applications that want to different versions of a python library? I wonder how often that occurs?

They don't! What they essentially recreate is the "there is one site-packages/bin/lib/...-folder in the environment" situation that you have at the system level (or in a virtual environment, for that matter). The initial reason for introducing them was actually that tab completion took forever when we had >150 bins on $PATH on a very slow nfs drive.
Essentially, as long as you don't have things with explicitly conflicting dependencies (like py-numpy@:1.4 and py-numpy@1.5:) that should end up in the same environment, a view gives you a situation very similar to system installs, with the option of just exchanging views to get another environment.
The advantage of views is that you don't need to reinstall everything and can reuse installations; i.e., the python binary for flake8 and for bumpversion could be the same, and you would end up with two symlinks.

The one big problem that remains is crossing boundaries, i.e. wanting to install things with spack that rely on binaries from the system (e.g. py-matplotlib from spack and py-numpy from the system). Or, in the other direction, combining spack activate with spack view would also potentially break stuff.

I'm not sure how much harm one could do with PYTHONPATH though. It would be great it one could tell Python (Perl, R) to just ignore it (I've actually used Perl's SITECUSTOMIZE to manipulate PERL5LIB...).

That would be the other solution, i.e. manipulating each package system's search pattern manually in order to work around the "multiple bin/lib/..." problem. Although, I completely failed to understand how python's searching works...


hartzell commented Jun 7, 2018

I have an idea for using Python's site customization machinery to implement "rpath-ing for python packages" that could be an alternative to using virtualenvs (probably with a different set of nasty smells, but we'll see). I'll prototype something and throw it out for feedback.

hartzell pushed a commit to hartzell/spack that referenced this pull request Jun 10, 2018
TL;DR, with this PR, in a clean tree, you should be able to install a
Python package and run its script from its `prefix.bin` without
setting any environment variables or ...  E.g.

```console
spack install py-flake8
(module purge; /home/hartzell/tmp/spack-rpath.py/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/py-flake8-3.5.0-4gkbvq2u3si6jxsmlapdeolds4wgwzdx/bin/flake8 --help)
```

---

This is an alternative approach to solving the problems addressed in
PR spack#8364.  Issue spack#8343 (a flake8 failure) is an example of the problem
in real life.

The key bits to this approach are Spack's DAG, Python's
`site.addsitedir` function and Python's `sitecustomize.py` file.

The implementation is a proof of concept hack, don't get too hung up
on the code itself.

The problem, in a nutshell, is adding the various directories into
which we've installed an application's python prerequisites onto its
`sys.path`.  The current approach is to either:

- `activate` the prereq's, which links them into the Python tree,
  which is on `sys.path` by default; or
- add them via `PYTHONPATH` (using modulefiles or ...).

The problem with the first approach is that only one version can be
activated at a time and everyone using that Spack tree is stuck with
it.

The problem with the second approach is that directories added via
`PYTHONPATH` are second class citizens, the directories themselves are
searched **BUT** the `*.pth` files they contain are not processed.
Lesser problems with this approach include `PYTHONPATH`'s global
nature, its availability for finger poking and the complexity of
juggling the modulefiles (e.g. recursively loading prerequisites).

This solution parallels what `rpath` does for shared libraries, fixing
from whence an application loads its libraries *at build time*, not at
run time.  There are two components:

1. When Python packages are installed, they install a file containing
   the paths to all of their python
   prerequisites (`.spack-rpaths`) within their `prefix`.

2. The Python package installs a `sitecustomize.py` script, which
   Python runs very early in the interpreter's startup.

   The `sitecustomize.py` code checks for a `.spack-rpaths` file.  If
   it finds one it uses `site.addsitedir` to add the directories it
   contains to `sys.path`.

   Directories that are added to `sys.path` via `site.addsitedir` *do*
   process `*.pth` files, so the magic they contain is invoked as
   expected.

Potentially sticky bits include:

- The biggest roadblock to this approach is that `sitecustomize.py` is
  processed *so* early that `sys.argv` has not been created yet, so
  discovering the path to the directory in which the script lives is
  magical.  This prototype grabs it from `/proc/self/cmdline`.  I've
  included an alternate solution that is either really elegant or
  too-cute-by-half (or both...), see the comments in
  `sitecustomize.py` for the gory details.
- Dealing with deployments that use the system python.  They might
  need to install our `sitecustomize.py`, they *might* be able to
  leverage *usercustomize*, or they might need to use one of the other
  techniques.
- I haven't played with Python3 yet.
- I suspect that a sufficiently determined user could break things by
  setting `PYTHONPATH`.

Beyond that, there's a bit of engineering to be done.

Something similar might be workable for Perl using its `sitecustomize`
support.  Perhaps R and ... too.
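A minimal sketch of the `sitecustomize.py` logic in steps 1 and 2 (the `.spack-rpaths` name is from the message above; the helper name, and passing in the script directory explicitly rather than divining it from `/proc/self/cmdline`, are simplifications):

```python
import os
import site

def add_spack_rpaths(script_dir):
    # If script_dir holds a .spack-rpaths file, add each directory it
    # lists to sys.path via site.addsitedir, which (unlike PYTHONPATH)
    # also processes any *.pth files those directories contain.
    rpath_file = os.path.join(script_dir, ".spack-rpaths")
    if not os.path.isfile(rpath_file):
        return
    with open(rpath_file) as f:
        for line in f:
            path = line.strip()
            if path:
                site.addsitedir(path)
```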
hartzell pushed a commit to hartzell/spack that referenced this pull request Jun 19, 2018
@adamjstewart

@hartzell is this PR still a WIP? Trying to close old stale PRs.

@cosmicexplorer

#20430 describes another attempt to solve this issue.

@adamjstewart

Since this PR is old and conflicts, and I never got a response from @hartzell, I'm going to close it. Feel free to reopen if it's something you still want to work on.

@hartzell

@adamjstewart -- Thanks for the mention. I'm stuck in a place where contributing is difficult. Closing this is appropriate. #20430 seems interesting....
