Add new TensorFlow package#13112
Conversation
|
Thank you all for creating this PR! I gave it a try as soon as I saw it. Now bazel 0.19.0 and the rest of the dependencies build just fine. However, I am getting the following error during the tensorflow build: Wondering if there is a simple fix for it. |
|
I guess the above error was reported before at #11850 |
|
Commented out some lines, and the build process went beyond the above mentioned error: However, a new error is showing up now: |
|
I'm still in the process of doing a test build, so I haven't hit your errors yet (some flaky downloader failed overnight...). There is at least one additional "problematic" hack that I had to introduce in the bazel package. @Sinan81, can you see if you can spin up a patch that circumvents the Spack compiler wrappers for tensorflow (similarly to what I did for bazel)? In general, before merging this PR we need to at least come to some decision on the following points:
I think 1-3 are the more important ones. If someone urgently needs TF, he/she can use this PR, but should be cautioned that weird things are done. |
|
@Sinan81 sorry for the noise of the deleted comment (in case you got an email or such); our "solution" to your problem was a bit less drastic in that we left the option to say "not parallel". But if I understand @alalazo from #11850 correctly, then we should manually set the |
|
Feel free to add me as a maintainer: `maintainers = ['adamjstewart']`. I'm more of a PyTorch person myself, but I'm maintaining all of the other ML/DL packages in Spack, so I don't mind helping out with this one as well. |
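For context, a minimal sketch of how that attribute sits in a Spack package file (the base class here is a stand-in to keep the sketch self-contained, not the real Spack `Package`):

```python
# Stand-in for Spack's Package base class, only to make the sketch runnable.
class Package(object):
    pass


class Tensorflow(Package):
    """Sketch: Spack reads this class attribute to know whom to ping."""
    maintainers = ['adamjstewart']


assert Tensorflow.maintainers == ['adamjstewart']
```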
|
I cleaned up the flake8 issues and the bazel callable. This version builds as-is for me; please check and review. |
|
I confirm that [email protected] and [email protected] build just fine! Thanks @healther. I should mention, though, that [email protected] didn't build with [email protected]_202-b08; rather, it builds with [email protected]_202. |
|
While compiling 2.0.0-alpha0, I am getting the following error: |
|
We never build against openjdk, so it is perfectly possible that it only works with the Oracle one. Maybe @dodo47 knows something about the state of |
|
A month ago or so I managed to build [email protected] with [email protected], but something has changed since then and I can't build with openjdk anymore. It is certainly good enough that we're able to build [email protected] one way or another. |
|
Actually, today I realized that so far I've been building tensorflow with default options, i.e. without cuda. When I try to build |
|
Is there a generic way to go about rewriting the checksums to sha256? @adamjstewart: I think, except for the checksums, this is as good as I'm going to get it on my own. I'd like to get your review, in particular regarding the |
Last time I tried, I only had to add the filter_file changes in the tensorflow package (if I remember correctly) for 2.0 to work, with Bazel 0.19. However, here https://www.tensorflow.org/install/source they state that 2.0 now builds with Bazel 0.26.1. I guess a wrong Bazel version might explain the error message? |
|
So today again I tried compiling |
|
The warning is completely normal (i.e. in all our builds it gets raised) and should not be the reason why it locks up. Unfortunately you would have to modify the bazel invocation to get it to be verbose too, i.e. |
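For reference, a sketch of what a more verbose Bazel invocation could look like (the flags are standard Bazel options; the target name is TF's usual pip-package target, assumed here):

```shell
# --subcommands prints every command Bazel runs; --verbose_failures prints
# the full command line of any action that fails.
bazel build --subcommands --verbose_failures \
    //tensorflow/tools/pip_package:build_pip_package
```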
|
@healther I am wondering if you might want to add the following to the package file for the sake of completeness:

```python
version('2.0.0', sha256='49b5f0495cd681cbcb5296a4476853d4aea19a43bdd9f179c928a977308a0617')
version('1.15.0', sha256='a5d49c00a175a61da7431a9b289747d62339be9cf37600330ad63b611f7f5dc9')
version('1.14.0', sha256='aa2a6a1daafa3af66807cfe0bc77bfe1144a9a53df9a96bab52e3e575b3047ed')

depends_on('[email protected]', type='build', when='@1.14.0')
depends_on('[email protected]:0.26.1', type='build', when='@1.15.0,2.0.0')
```

P.S. still working on compiling |
|
For future reference, I should also mention that I have been playing around with compiling

```python
depends_on('[email protected]:', type=('build', 'run'), when='@1.14.0:')
```
|
|
So, it seems which will later on trigger automatic downloading of nccl@1 from GitHub. Or else we should add an nccl dependency and do Furthermore, the following nccl paths in After the nccl setup, there are a few more inputs needed. So I also added the following to the package file, and now it proceeds to the build step. Aside: I think it might make sense to make compute_capability a multi-valued variant of the package (or is it possible to indicate as many capabilities as we need?). Finally, so far I haven't managed to build with gcc > 5 as suggested by TF developers. |
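For illustration, the extra configure inputs referred to above could look roughly like this (the variable names come from TF's configure script; every path and version below is a made-up example, not taken from this PR):

```shell
# Hypothetical answers to TF's interactive configure, supplied as env vars.
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0                   # example CUDA version
export TF_CUDA_COMPUTE_CAPABILITIES=3.5,7.0   # example capabilities
export TF_NCCL_VERSION=2                      # example NCCL major version
export NCCL_INSTALL_PATH=/path/to/nccl        # example install prefix
```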
|
@Sinan81 I updated the PR with all of your suggestions except for the last one, where I didn't understand the @adamjstewart I suggest rebasing this, adding a edit: For some reason my push failed; will report back when I've fixed that up. |
|
This PR has some conflicts, probably due to the recent switch from MD5 to SHA256. You'll need to update your new versions to SHA256 as well. |
|
In compiling if I manually overwrite GCC_HOST_COMPILER_PATH="/mnt/local/spack/lib/spack/env/gcc/gcc" with P.S. @healther I agree that the package should be merged as soon as possible so that others can start contributing directly. |
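A sketch of the manual override described above (the wrapper path is the one from the comment; the replacement compiler path is just an example):

```shell
# Point TF's configure at the underlying compiler instead of Spack's wrapper.
# Before: GCC_HOST_COMPILER_PATH=/mnt/local/spack/lib/spack/env/gcc/gcc
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc  # example path to the real gcc
```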
|
Try setting |
Still getting same error :( |
|
Finally managed to build
To be specific, I built v1.13.1 with the following versions of dependencies:
|
```python
# only supports compute capabilities >= 3.5
if spec.variants['cuda_arch'].value != 'none':
    capabilities = ','.join('{0:.1f}'.format(
        float(i) / 10.0) for i in spec.variants['cuda_arch'].value)
```
@Sinan81 I think this string formatting will be much more stable so that you don't get weird things like '6.499999999' instead of '6.5'.
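To illustrate the point, a small self-contained sketch: formatting the capability by string slicing never goes through floating-point division at all (this assumes Spack-style values such as '35' or '70'; it is an alternative, not the code in this PR):

```python
def format_capability(arch):
    """Insert a decimal point before the last digit, e.g. '35' -> '3.5'."""
    return '{0}.{1}'.format(arch[:-1], arch[-1])


# No float arithmetic, so no chance of artifacts like '6.499999999'.
capabilities = ','.join(format_capability(a) for a in ['35', '52', '70'])
print(capabilities)  # 3.5,5.2,7.0
```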
What should we put if cuda_arch=none? I'm trying to build on a local cluster. CUDA is installed, but I don't see any NVIDIA GPU on the cluster. I want to test anyway, but I don't know what my CUDA capabilities are.
```python
env.set('TEST_TMPDIR', tmp_path)
# TODO: Is setting this necessary? It breaks `spack build-env`
# because Bash can't find my .bashrc
env.set('HOME', tmp_path)
```
It looks like bazel has a --output_user_root option you can use to change where things are written instead of $HOME.
See what I did to the bazel package for an example of its usage.
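A sketch of that option in use (`--output_user_root` is a Bazel startup option, so it goes before the command; the path and target are examples):

```shell
# Keeps Bazel's output tree out of $HOME without overriding HOME itself.
bazel --output_user_root="$TMPDIR/bazel-root" \
    build //tensorflow/tools/pip_package:build_pip_package
```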
* use bazel commit in spack#13112, and add version 0.24.1, and corresponding cc_env patch
* undo preferred java version by dodo47
* patch for v0.26
* Update install steps
* Add patches for more versions
* Add unit tests
* Update patches for new Spack env vars
* env is already defined, use spackEnv
|
Okay, I successfully installed @Sinan81 @healther: how close do you think we are to merging this PR? I haven't yet managed to test Linux due to some numpy build problems. I would like to make sure that |
|
Last time I tried compiling v2, I got the error mentioned in the ticket you opened above. I should mention, though, that v2.1 always compiled fine. I will give v2 another try today. On a machine with 16 physical CPUs, the build takes 2 hours as of v1.14, while it takes only 0.5 hours up to v1.13. |
|
I decided to enable |
I'm a bit out of the loop on our current CI status. But IIUC, with the last version of this PR that we checked out, TF didn't build. But it is likely that this was Maybe @obreitwi can summarise what the current status of our CI is? |
|
While building v2.0.0 on Linux, I am getting the following error: Time permitting I will look into it today. |
|
After clearing the bazel cache, I am getting the following error: |
|
When I try to build with [email protected] (as suggested in TF docs), the error is |
|
Your first error looks like a Cython thing. You could try adding a build dependency on Your second error looks like a Bazel thing, which is far out of my area of expertise. |
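In Spack terms, that build dependency would be a one-line directive (a sketch; `py-cython` is Spack's Cython package, and leaving the version unconstrained is an assumption):

```python
# Build-only dependency: Cython generates C sources at build time and is
# not needed at runtime.
depends_on('py-cython', type='build')
```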
|
I think I'm going to merge this PR as is, since it seems like we've gotten 2.0 to work on macOS and 1.14/1.13 to work on Linux. I think this is a pretty good start, and we can keep hacking on the package in smaller PRs to get individual versions/environments working. Thanks for the hard work everyone! |
|
@adamjstewart do you have an idea where we could visibly say: "This is an experimental package; it is likely that not everything works. Please let us know and we will add the appropriate " We should probably have something as a general note on the terminal in case of a build fail, but especially for TF (if people use the package from Spack) we will run into loads of different setups... edit: doing a test build either now or over the weekend and will report back |
|
@healther I don’t think there’s any way to do this right now. Hopefully the post on Slack will reach enough people |
The tensorflow package stopped building with the latest upstream merge and will hence be disabled until spack#13112 is merged. Change-Id: I713c2352119e6fd931ede64a2e5b491e7618f9ec
As @healther promised in #9352 here is the current state of our tf-package.
This is essentially a cherry-pick onto the current HEAD and we did exactly one test build which is still running right now (@healther will report whether it worked), so use with caution. For example, it explicitly breaks the air-gapped-ability of spack, as bazel insists on downloading its own stuff.
All versions I added worked at some point, but I didn't retest them. I hope it is helpful :)
Closes #9352
Closes #3244
Closes #2043