
PythonPackage: add RPATHs, parallel builds #20765

Closed
adamjstewart wants to merge 2 commits into spack:develop from adamjstewart:features/build-ext

Conversation

@adamjstewart
Member

This is a reboot of #4545

Some Python packages depend on non-Python dependencies. For example, py-pillow depends on several imaging libraries. When building these Python packages, Python uses whatever compiler it was built with. Since we filter the compiler wrappers out of the Python installation, this means that we are not using Spack's compiler wrappers when installing Python packages, and hence we are not RPATHing dependencies. Even worse, some packages like py-torch will use a mix of CC and the compiler Python was built with, meaning that some parts of the build can find dependencies while others can't.

Previously, we were building with:

$ python setup.py build
$ python setup.py install --prefix=...

Now, we build with:

$ python setup.py build_py
$ python setup.py build_ext --parallel=... --include-dirs=... --library-dirs=... --rpath=...
$ python setup.py build_clib
$ python setup.py build_scripts
$ python setup.py install --prefix=...
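
A minimal sketch of how the `build_ext` flags above could be assembled from a list of dependency prefixes. The function name, the `include`/`lib` prefix layout, and the `jobs` parameter are illustrative assumptions, not Spack's actual implementation:

```python
def build_ext_args(prefixes, jobs):
    """Compose setup.py build_ext flags for the given dependency prefixes."""
    include_dirs = [p + "/include" for p in prefixes]
    library_dirs = [p + "/lib" for p in prefixes]
    return [
        "--parallel={}".format(jobs),
        "--include-dirs={}".format(":".join(include_dirs)),
        "--library-dirs={}".format(":".join(library_dirs)),
        # RPATH the same directories we link against so the extension
        # modules can find their non-Python dependencies at runtime.
        "--rpath={}".format(":".join(library_dirs)),
    ]
```

The key point is that `--library-dirs` and `--rpath` receive the same directories, so everything linked at build time is also findable at load time.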

@robertu94 @mjwoods @trws @skosukhin

@adamjstewart
Member Author

adamjstewart commented Jan 9, 2021

Problems I've encountered so far:

  1. Some packages can't be built in parallel

    This seems to be a problem particularly due to the Python 3.8 + macOS multiprocessing fiasco. Packages affected include cython ([BUG] python setup.py build_ext -j4 fails on macOS with Python 3.8+ cython/cython#3973) and possibly pandas (Parallel build_ext might fail due to file collisions pandas-dev/pandas#30873)

  2. install phase reruns build phases

    When you run python setup.py install, it actually re-runs python setup.py build. For most packages this is a no-op, since the build has already finished. But I've encountered several packages whose setup.py is hacked in ways that force build to re-run, and the re-run happens without the flags we added, undoing our work and making the build take twice as long. py-cdat-lite is one example. I haven't tried this yet, but I wonder if replacing install with install_(lib|headers|scripts|data) would prevent these steps from re-running. Another possibility is to write the config flags into setup.cfg instead of supplying them on the command line.

  3. --rpath may not even make a difference?

    With this PR, when I build py-cartopy, the libraries are correctly RPATHed as evidenced by otool -L, but the module still fails to import unless (DY)?LD_LIBRARY_PATH is set. Cartopy uses ctypes.util.find_library to locate libraries, which uses the env var instead of the RPATH.

I'm hoping that most of these problems are package-specific and infrequent enough that we can fix those libraries by upstreaming patches. Definitely needs more testing to see what other issues this could cause.
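
For problem 2, the setup.cfg idea could look roughly like the following hedged sketch: persist the same build_ext options to the package's setup.cfg so that any re-run of `build` during `install` picks them up. `write_build_ext_cfg` is a hypothetical helper, not part of Spack:

```python
import configparser

def write_build_ext_cfg(path, jobs, include_dirs, library_dirs, rpath):
    """Persist build_ext options so later setup.py invocations reuse them."""
    cfg = configparser.ConfigParser()
    cfg["build_ext"] = {
        "parallel": str(jobs),
        "include_dirs": ":".join(include_dirs),
        "library_dirs": ":".join(library_dirs),
        "rpath": ":".join(rpath),
    }
    with open(path, "w") as f:
        cfg.write(f)
```

Because distutils reads setup.cfg on every invocation, the flags would apply even when `install` re-triggers `build` internally.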

@robertu94
Contributor

robertu94 commented Jan 9, 2021

@adamjstewart great work!

This might affect the Cartopy example: if you look at the implementation of find_library, Python doesn't directly ask the loader which libraries to load; instead it uses custom logic per OS. On macOS, it first searches the "override path", which comes from DYLD_FRAMEWORK_PATH and DYLD_LIBRARY_PATH. Next it searches for libraries whose names carry the special prefix @executable_path/ and looks for them at a path relative to the executable. Finally it searches DYLD_FALLBACK_FRAMEWORK_PATH, DYLD_FALLBACK_LIBRARY_PATH, then a specific list of fallback directories: "~/Library/Frameworks", "/Library/Frameworks", "/Network/Library/Frameworks", "/System/Library/Frameworks", "~/lib", "/usr/local/lib", "/lib", "/usr/lib".

On Linux, it asks /sbin/ldconfig, then the first gcc found in PATH, and finally ld.
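
The upshot: ctypes.util.find_library only consults those OS search paths and environment variables; it never reads the RPATH baked into a compiled extension module. A quick demonstration:

```python
# find_library searches OS-specific paths and env vars such as
# LD_LIBRARY_PATH / DYLD_LIBRARY_PATH; it never inspects the RPATH
# recorded in any nearby extension module.
import ctypes.util

missing = ctypes.util.find_library("no-such-library-xyz")
print(missing)  # None: not on any search path, RPATHs notwithstanding
```

This is consistent with the Cartopy observation above: the module's RPATH is correct, yet find_library still needs (DY)?LD_LIBRARY_PATH.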

@scheibelp
Member

This seems to be a problem particularly due to the Python 3.8 + macOS multiprocessing fiasco.

"The Python 3.8 + macOS multiprocessing fiasco" makes me think of #14102, which was resolved by #18205. Are you seeing other issues with creating processes on Mac?

It sounds like the issue is with spack install -j x ... for x > 1, i.e. that individual package builds cannot use parallel jobs. If you point out some examples I could check and see if this is isolated to Mac.

@scheibelp scheibelp self-assigned this Jan 11, 2021
@adamjstewart
Member Author

@scheibelp cython/cython#3973 is the main multiprocessing problem I've encountered so far. Basically, Cython implements its own multiprocessing, so when you do -j4, a subprocess tries to spawn its own subprocess.

So this isn't a Spack bug. The only remaining Spack + Python 3.8 + macOS bug that I know of is #20025.

@robertu94
Contributor

@adamjstewart using this branch, py-cython, py-pandas, and py-numpy all build successfully in parallel on my Linux box.

@adamjstewart
Member Author

At this point I'm pretty confident in the parallel build logic. Cython is the only build failure I've seen so far, but we can add a workaround for that. What I'm less confident in is the RPATH logic. If that doesn't actually work, there's no point in adding this additional complexity. We can still use --parallel for the current build/install phases without changing the phases. I'm not sure how much more time I plan to devote to this PR, but it would be good to know whether the RPATH logic actually helps or not, at least for some packages.

@robertu94
Contributor

@adamjstewart how can we test the RPATH logic? I'm happy to help with testing if I know how. I'd never used RPATHs before Spack, so I don't know how to test them at all.

@adamjstewart
Member Author

@robertu94 In my mind, there are three different kinds of tests.

  1. The first test is to see whether or not the flags I'm using in build_ext are sufficient to build the package. First, make sure you remove any custom hacks I've added to packages in the past. You can see where I did this for py-matplotlib. Then, make sure the package actually builds successfully. So far py-pillow doesn't build at all without those custom hacks, so the RPATH logic doesn't seem to help with that package.
  2. The second test is whether or not the installed libraries can be imported successfully. This can be done at build-time using spack install --test=root py-matplotlib. You can also run import tests post-installation using spack test run py-matplotlib. These two sets of tests have different env vars set, so it's important to test both. For example, py-cartopy builds fine, but the import tests fail for some reason.
  3. The third test is whether or not any shared/dynamic libs can find their dependencies. On Linux, you can use ldd -r /path/to/libfoo.so, and on macOS you can use otool -L /path/to/libfoo.dylib. On macOS, you may see libraries named with .dylib or with .so, on Linux it's usually only .so.
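
The third check can also be automated: attempting to dlopen every shared object directly surfaces unresolved dependencies as OSError, much like ldd -r. A hedged sketch (unresolved_libs is a hypothetical helper; the install prefix is a placeholder):

```python
import ctypes
import pathlib

def unresolved_libs(prefix):
    """Return the shared objects under `prefix` that fail to load,
    along with the loader's error message."""
    failures = []
    for lib in pathlib.Path(prefix).rglob("*.so"):
        try:
            ctypes.CDLL(str(lib))  # raises OSError on missing deps
        except OSError as err:
            failures.append((str(lib), str(err)))
    return failures
```

Running this over a package's install prefix with a clean environment (no LD_LIBRARY_PATH set) would show whether the RPATHs alone are sufficient.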

@adamjstewart
Member Author

Before this can be merged, I want to make sure a sizeable portion of Python packages don't break. Personally, I use the following spack.yaml:

spack:
  specs:
  - py-black
  - py-cartopy
  - py-cmocean
  - py-codecov
  - py-fiona
  - py-flake8
  - py-geopandas
  - py-geoplot
  - py-jupyterlab
  - py-matplotlib
  - py-mypy
  - py-numpy
  - py-pandas
  - py-pygeos
  - py-pyinstrument
  - py-pytest
  - py-pytest-cov
  - py-pytest-mock
  - py-pyyaml
  - py-rasterio
  - py-scikit-learn
  - py-scipy
  - py-setuptools
  - py-shapely
  - py-sphinx
  - py-sphinx-rtd-theme
  - py-sphinxcontrib-programoutput
  - py-tables
  - py-torch
  - py-torchvision
  - py-twine
  - py-vermin

Other users may have more packages they care about, but this is probably enough for testing.

@robertu94
Contributor

OK, I've set up a mass build-and-test script with this information. I'll let you know how it goes.

@robertu94
Contributor

@adamjstewart Here are the high level results:

  • The build log for the new feature branch is much longer, which generally doesn't bode well.
  • Several packages, such as py-fiona, took longer to build with this patch.
  • Almost all of the runtime import failures were caused by loading libgeos_c.so.
  • cartopy and geopandas did not pass runtime import tests on either the feature branch or develop.
  • I believe shapely passed one or two more tests on the feature branch.

I'll have to dig into the results more carefully to know exactly what worked/didn't.

spack_build.tar.zip

@adamjstewart
Member Author

Hmm, so there are some packages that this doesn't help. Are there any packages that this hurts? Are there any packages that this does help?

@robertu94
Contributor

@adamjstewart I've had more time to read the build logs. This change breaks the build for py-matplotlib and py-tables, and possibly a few of their dependents; py-scipy's build time nearly doubled with this patch.

@adamjstewart
Member Author

Let me investigate some of these examples a bit. In the meantime, I may open a separate PR that only includes the parallel build changes.

@adamjstewart
Member Author

scipy's build time nearly doubled with this patch.

@robertu94 I was unable to reproduce this locally.

py-scipy build time on develop:

  Fetch: 0.11s.  Build: 19m 45.79s.  Total: 19m 45.90s.

py-scipy build time on features/build-ext:

  Fetch: 0.15s.  Build: 19m 30.79s.  Total: 19m 30.94s.

I do notice the following warning message:

NOTE: -j build option not supported. Set NPY_NUM_BUILD_JOBS=4 for parallel build.

If I set NPY_NUM_BUILD_JOBS to str(make_jobs) (which happens to be 4 on my ancient laptop), the new build time is:

  Fetch: 0.14s.  Build: 19m 55.85s.  Total: 19m 56.00s.
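
Based on that warning, a hedged sketch of the workaround: export NPY_NUM_BUILD_JOBS before the build, since numpy.distutils ignores build_ext's -j flag. The function name here is illustrative; in a Spack package this would presumably go in a setup_build_environment hook:

```python
import os

def enable_numpy_parallel_build(jobs):
    """numpy.distutils reads NPY_NUM_BUILD_JOBS rather than build_ext's
    -j/--parallel flag, so export it before invoking setup.py."""
    os.environ["NPY_NUM_BUILD_JOBS"] = str(jobs)

enable_numpy_parallel_build(4)
```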

@robertu94
Contributor

@adamjstewart in my testing it was about 11m for the parallel build and 7m40s for the serial build, on a 40-core node on my school's cluster. The file system is served to the nodes over NFS, so perhaps I hit several load spikes on the file system that pushed the build time higher.

@adamjstewart
Member Author

Hmm, so even building in parallel might be problematic. Do you encounter that same issue with non-Python packages? Does limiting the parallelism to 4 help?

@robertu94
Contributor

I do occasionally hit that with non-Python packages as well; it seems to be a feature of our cluster from time to time. I don't blame Spack for NFS being slow.

@mjwoods
Contributor

mjwoods commented Feb 21, 2021

Hi all, sorry for being late to join the discussion. @adamjstewart mentioned at the start that the compiler wrappers are filtered out of python installations, and I'm wondering if that is the cause of many problems mentioned here. I can't remember the reason for filtering out the wrappers, but I'm guessing one reason is that users want to install python packages outside of spack commands. If so, would it be possible to enhance the wrappers so they behave correctly outside of spack commands?

@adamjstewart
Member Author

I believe #21149 is designed to do just that.

@adamjstewart adamjstewart deleted the features/build-ext branch October 26, 2021 01:17