Add new TensorFlow package#13112

Merged
adamjstewart merged 76 commits intospack:developfrom
dodo47:PR/tensorflow
Dec 6, 2019

Conversation

@dodo47
Contributor

@dodo47 dodo47 commented Oct 9, 2019

As @healther promised in #9352 here is the current state of our tf-package.

This is essentially a cherry-pick onto the current HEAD, and we did exactly one test build, which is still running right now (@healther will report whether it worked), so use it with caution. For example, it explicitly breaks Spack's ability to work air-gapped, as bazel insists on downloading its own stuff.

All versions I added worked at some point, but I didn't retest them. I hope it is helpful :)

Closes #9352
Closes #3244
Closes #2043

@Sinan81
Contributor

Sinan81 commented Oct 10, 2019

Thank you all for creating this PR!

I gave it a try as soon as I saw it. Now bazel 0.19.0 and the rest of the dependencies build just fine. However, I am getting the following error during the tensorflow build:

==> Installing tensorflow
==> Searching for binary cache of tensorflow
==> Warning: No Spack mirrors are currently configured
==> No binary for tensorflow found: installing from source
==> Error: AttributeError: module 'spack.pkg.builtin.tensorflow' has no attribute 'make_jobs'

/mnt/local/spack/var/spack/repos/builtin/packages/bazel/package.py:100, in setup_dependent_package:
         97        dependent_module = inspect.getmodule(dependent_spec.package)
         98        if not dependent_spec.package.parallel:
         99            jobs = 1
  >>    100        elif dependent_module.make_jobs:
        101            jobs = dependent_module.make_jobs
        102        module.bazel = BazelExecutable('bazel', 'build', jobs)

Wondering if there is a simple fix for it.
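A defensive rewrite of the bazel logic above could avoid the AttributeError by falling back when the dependent package's module doesn't define make_jobs (a sketch with a hypothetical helper, not the actual Spack fix):

```python
# Hypothetical helper mirroring bazel's setup_dependent_package logic,
# but tolerant of dependent modules that lack a make_jobs attribute.
def pick_jobs(dependent_module, parallel, default_jobs):
    if not parallel:
        return 1
    # getattr with a default avoids the AttributeError reported above
    return getattr(dependent_module, 'make_jobs', None) or default_jobs
```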

@Sinan81
Contributor

Sinan81 commented Oct 10, 2019

I guess the above error was reported before at #11850

@Sinan81
Contributor

Sinan81 commented Oct 10, 2019

I commented out some lines, and the build process went beyond the above-mentioned error:

$ git diff
diff --git a/var/spack/repos/builtin/packages/bazel/package.py b/var/spack/repos/builtin/packages/bazel/package.py
index c2e79af..2e4ff18 100644
--- a/var/spack/repos/builtin/packages/bazel/package.py
+++ b/var/spack/repos/builtin/packages/bazel/package.py
@@ -95,8 +95,9 @@ class Bazel(Package):
 
         jobs = cpu_count()
         dependent_module = inspect.getmodule(dependent_spec.package)
-        if not dependent_spec.package.parallel:
-            jobs = 1
-        elif dependent_module.make_jobs:
-            jobs = dependent_module.make_jobs
+#        if not dependent_spec.package.parallel:
+#            jobs = 1
+#        elif dependent_module.make_jobs:
+#            jobs = dependent_module.make_jobs
+#        module.bazel = BazelExecutable('bazel', 'build')
         module.bazel = BazelExecutable('bazel', 'build', jobs)

However, a new error is showing up now:

...
     56    INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (340 packages loaded, 15095 targets configured).
     57    INFO: Found 1 target...
     58    [0 / 14] [-----] BazelWorkspaceStatusAction stable-status.txt ... (2 actions, 0 running)
     59    ERROR: /tmp/spack/tf/_bazel_sbulut/2289c2ebad06feb964ca6be8c038c7d3/external/com_google_absl/absl/numeric/BUILD.bazel:25:1: C++ compilation of rule '@com_google_absl//absl/numeric:int128' failed (Exit 1)
     60    Spack compiler must be run from Spack! Input 'SPACK_TARGET_ARGS' is missing.
     61    Target //tensorflow/tools/pip_package:build_pip_package failed to build

See build log for details:
  /cache/sbulut/spack-stage/tensorflow-1.13.1-yre37hoxr22737q3tfe5g35kbkns2jro/spack-build-out.txt

@adamjstewart

This comment has been minimized.

@healther
Contributor

I'm still in the process of doing a test build, so I haven't hit your errors yet (some flaky downloader failed overnight...).

There is at least one additional "problematic" hack that I had to introduce in the bazel package (which, from a cursory glance, is the same problem that @Sinan81 found within the tensorflow build). @adamjstewart, do you know the inner workings of the compiler wrappers (or who does know these magic things :))?
This behaviour must have been introduced in the last 2-3 months, as our old merge works without that bypass, but it feels like it should hit everything that is built with bazel (including bazel itself).

@Sinan81 can you see if you can spin up a patch that circumvents the spack compiler wrappers for tensorflow (similarly to what I did for bazel)?

In general before merging this PR we need to at least come to some decision for the following points:

  1. How do we communicate to users that tensorflow needs to be built with internet access?
  2. How do we communicate that this means the same hash is not guaranteed to provide the same tensorflow?
  3. How do we deal with the "bazel breaks the compiler wrapper" thing? Is my workaround good enough? Should this be considered a bug in spack?
  4. Decide what to do about the history of this package; I'm not sure whether old versions still build
  5. Fix flake8 errors

I think 1-3 are the more important ones. If someone urgently needs TF, they can use this PR, but by doing so should be cautioned that weird things are done.

@healther
Contributor

@Sinan81 sorry for the noise of the deleted comment (in case you got an email or such). Our "solution" to your problem was a bit less drastic, in that we left the option to say "not parallel", but if I understand @alalazo from #11850 correctly, we should manually set the '--jobs={}'.format(make_jobs) argument in the bazel() call of the tensorflow package. I'm still trying to get one build to succeed, though.

@adamjstewart
Member

Feel free to add me as a maintainer:

maintainers = ['adamjstewart']

I'm more of a PyTorch person myself, but I'm maintaining all of the other ML/DL packages in Spack so I don't mind helping out with this one as well.

@healther
Contributor

I cleaned up the flake8 issues and the bazel callable. This version builds as-is for me; please check and review.

@Sinan81
Contributor

Sinan81 commented Oct 11, 2019

I confirm that [email protected] and [email protected] build just fine! Thanks @healther! I should mention, though, that [email protected] didn't build with [email protected]_202-b08; rather, it builds with [email protected]_202.

@Sinan81
Contributor

Sinan81 commented Oct 11, 2019

While compiling 2.0.0-alpha0, I am getting the following error:

     30    $TEST_TMPDIR defined: output root default is '/tmp/spack/tf' and max_idle_secs default is '15'.
     31    Starting local Bazel server and connecting to it...
     32    Loading:
     33    Loading: 0 packages loaded
  >> 34    ERROR: error loading package '': Encountered error while reading extension file 'swift/repositories.bzl': no such package '@@build_bazel_rules_swift//swift': Traceback (most recent call last):
     35         File "/tmp/spack/tf/_bazel_sbulut/cfafda2acdfc5895b3378343fe1e417e/external/bazel_tools/tools/build_defs/repo/git.bzl", line 166
     36                 _clone_or_update(ctx)
     37         File "/tmp/spack/tf/_bazel_sbulut/cfafda2acdfc5895b3378343fe1e417e/external/bazel_tools/tools/build_defs/repo/git.bzl", line 72, in _clone_or_update
     38                 fail(("error cloning %s:\n%s" % (ctx....)))
     39    error cloning build_bazel_rules_swift:
     40    + cd /tmp/spack/tf/_bazel_sbulut/cfafda2acdfc5895b3378343fe1e417e/external

     ...

@healther
Contributor

We never built against openjdk, so it is perfectly possible that it only works with the Oracle one. Maybe @dodo47 knows something about the state of @2:, but since we have a preferred statement on the 1.something, I would guess that it didn't work and we said: good enough.

@Sinan81
Contributor

Sinan81 commented Oct 12, 2019

A month or so ago I managed to build [email protected] with [email protected], but something has changed since then and I can't build with openjdk anymore. It is certainly good enough that we're able to build [email protected] one way or another.

@Sinan81
Contributor

Sinan81 commented Oct 14, 2019

Actually, today I realized that so far I've been building tensorflow with default options, i.e. without cuda. When I try to build tensorflow+cuda, it gets stuck at the install phase. I will look further into this on Tuesday.

@healther
Contributor

healther commented Oct 14, 2019

Is there a generic way to go about rewriting the checksums to sha256?

@adamjstewart: I think that, except for the checksums, this is as good as I'm going to clean it up on my own. I'd like to get your review, in particular regarding the env['CC'] = env['SPACK_CC'] hack that works around the compiler wrappers being wrongly used by bazel.
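For the sha256 question: `spack checksum <package>` can regenerate sums interactively, and the digest for an already-downloaded tarball can also be computed directly with Python's hashlib (a sketch; sha256_of is a hypothetical helper, not part of Spack):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file and return its hex SHA-256 digest,
    suitable for a version(..., sha256='...') line."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()
```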

@healther healther requested a review from adamjstewart October 14, 2019 11:01
@dodo47
Contributor Author

dodo47 commented Oct 14, 2019

Maybe @dodo47 knows something about the state of @2:, but since we have a preferred statement on the 1.something I would guess that it didn't work and we said: good enough

Last time I tried, I only had to add the filter_file changes in the tensorflow package (if I remember correctly) for 2.0 to work with Bazel 0.19. However, https://www.tensorflow.org/install/source states that 2.0 now builds with Bazel 0.26.1. I guess a wrong Bazel version might explain the error message?

@Sinan81
Contributor

Sinan81 commented Oct 19, 2019

So today I again tried compiling tensorflow+cuda, this time on a different machine with multiple GPUs as opposed to my workstation, to see if it makes a difference; the build still gets stuck at the install phase as follows. I am wondering if others have experienced the same issue.

sbulut@compute-004 ~ 
$ spack -d install -v tensorflow+cuda ^[email protected]_202                                                                                                                         
==> [2019-10-18-18:18:46.142488] Reading config file /disk/software/lib/hpcpm/spdev/etc/spack/defaults/modules.yaml
...
==> [2019-10-18-18:19:19.164106] WRITE LOCK: /disk/software/lib/hpcpm/spdev/opt/spack/.spack-db/prefix_lock[104350938852925909:1] [Acquired]
==> [2019-10-18-18:19:19.168487] Executing phase: 'install'
==> [2019-10-18-18:19:19.178719] FILTER FILE: configure.py [replacing "if workspace_has_any_android_rule\(\)"]
==> [2019-10-18-18:19:19.184351] './configure'
$TEST_TMPDIR defined: output root default is '/tmp/spack/tf' and max_idle_secs default is '15'.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".

@healther
Copy link
Copy Markdown
Contributor

The warning is completely normal (i.e. it gets raised in all our builds) and should not be the reason why it locks up. Unfortunately, you would have to modify the bazel invocation to get it to be verbose too; i.e. spack install -v does not produce additional output for bazel runs. I seem to remember (hazily) deadlocks when invoking bazel due to the build directory being located on a network drive (the reason why we put /tmp/... in as a hardcoded value).

@Sinan81
Contributor

Sinan81 commented Oct 20, 2019

@healther I am wondering if you might want to add the following to the package file for the sake of completeness:

version('2.0.0',  sha256='49b5f0495cd681cbcb5296a4476853d4aea19a43bdd9f179c928a977308a0617')
version('1.15.0',  sha256='a5d49c00a175a61da7431a9b289747d62339be9cf37600330ad63b611f7f5dc9')
version('1.14.0',  sha256='aa2a6a1daafa3af66807cfe0bc77bfe1144a9a53df9a96bab52e3e575b3047ed')

depends_on('[email protected]',        type='build', when='@1.14.0')
depends_on('[email protected]:0.26.1', type='build', when='@1.15.0,2.0.0')

P.S. I'm still working on compiling the +cuda variant. I will let you know when I get more information about the build problem.

@Sinan81
Contributor

Sinan81 commented Oct 20, 2019

For future reference, I should also mention that I have been playing around with compiling tensorflow~cuda versions 1.14.0, 1.15.0, and 2.0.0 using an externally installed [email protected], and I've found the following:

  • versions 1.14.0, 1.15.0, and 2.0.0 depend on the py-future package; otherwise the build fails with ImportError: No module named builtins. Specifically, I added the following line to the package file to take care of this error (tested only with [email protected] so far):
depends_on('[email protected]:',   type=('build', 'run'), when='@1.14.0:')
diff --git a/WORKSPACE b/WORKSPACE
index 74ea14d..0b09a6e 100644
--- a/WORKSPACE
+++ b/WORKSPACE
@@ -34,6 +34,13 @@ load(
 
 bazel_toolchains_repositories()
 
+http_archive(
+    name = "io_bazel_rules_docker",
+    sha256 = "413bb1ec0895a8d3249a01edf24b82fd06af3c8633c9fb833a0cb1d4b234d46d",
+    strip_prefix = "rules_docker-0.12.0",
+    urls = ["https://github.com/bazelbuild/rules_docker/releases/download/v0.12.0/rules_docker-v0.12.0.tar.gz"],
+)
+
 load(
     "@io_bazel_rules_docker//repositories:repositories.bzl",
     container_repositories = "repositories",

@Sinan81
Contributor

Sinan81 commented Oct 20, 2019

So, it seems the tensorflow+cuda configuration was getting stuck at the nccl step. A quick fix for it is to set:

            env['TF_NCCL_VERSION'] = "1"

which will later trigger automatic downloading of nccl@1 from GitHub.

Or else we should add an nccl dependency and do:

...
    depends_on('nccl', when='+cuda')
...
            env['TF_NCCL_VERSION'] = str(spec['nccl'].version)

Furthermore, the following nccl paths in .tf_configure.bazelrc need to be corrected after the configure step via file filtering, since the configure script doesn't extract these paths from environment variables:

TF_NCCL_INSTALL_PATH = str(spec['nccl'].prefix)
TF_NCCL_HDR_PATH = str(spec['nccl'].prefix.include)
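A Spack package would typically apply these corrections with the built-in filter_file helper after the configure step; the same substitution in plain Python, to show the idea (a sketch — the exact line format of .tf_configure.bazelrc is an assumption, and set_bazelrc_var is a hypothetical helper):

```python
import re

def set_bazelrc_var(text, key, value):
    """Rewrite a `KEY=...` assignment in .tf_configure.bazelrc-style text,
    since ./configure does not pick these paths up from the environment."""
    return re.sub(r'{0}=.*'.format(re.escape(key)),
                  '{0}={1}'.format(key, value), text)
```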

After the nccl setup, there are a few more inputs needed. So I also added the following to the package file, and now it proceeds to the build step:

            env['TF_CUDA_CLANG'] = '0'
            env['TF_NEED_ROCM'] = '0'
            env['TF_NEED_TENSORRT'] = '0'

Aside: I think it might make sense to make compute_capability a multi-valued variant of the package. (Or is it possible to indicate as many capabilities as we need?)

Finally, so far I haven't managed to build the +cuda variant. Speaking of which, it might be necessary to add '--cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"' to the build command:

            bazel('--jobs={}'.format(make_jobs), '-c', 'opt', '--config=cuda', '--cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"', '//tensorflow/tools/pip_package:build_pip_package') 

for gcc > 5, as suggested by the TF developers.

@healther
Contributor

healther commented Oct 21, 2019

@Sinan81 I updated the PR with all of your suggestions except for the last one, where I didn't understand the .tf_configure.bazelrc reference. Truth be told, I'd prefer to merge this and have you make the changes directly.

@adamjstewart I suggest rebasing this, adding a conflicts('+cuda') with a reference to this discussion and merging.

edit: For some reason my push failed -> will report back when I fixed that up

@adamjstewart
Member

This PR has some conflicts, probably due to the recent switch from MD5 to SHA256. You'll need to update your new versions to SHA256 as well.

@Sinan81
Contributor

Sinan81 commented Oct 25, 2019

While compiling the +cuda variant, I am getting the following error during the build step:

...
     63    ERROR: /tmp/spack/tf/_bazel_sbulut/cd57fa472520cfea2b78ad111fb0dd88/external/fft2d/BUILD.bazel:26:1: C++ compilation of rule '@fft2d//:fft2d' failed (Exit 1)
     64    Spack compiler must be run from Spack! Input 'SPACK_TARGET_ARGS' is missing.
     65    Target //tensorflow/tools/pip_package:build_pip_package failed to build

If I manually overwrite GCC_HOST_COMPILER_PATH="/mnt/local/spack/lib/spack/env/gcc/gcc" with GCC_HOST_COMPILER_PATH="/path/to/actual/gcc/bin/gcc", the build seems to proceed further. I am wondering if any of you are familiar with this error.

P.S. @healther I agree that the package should be merged as soon as possible so that others can start contributing directly.

@healther
Contributor

Try setting env['CC'] = env['SPACK_CC'] and env['CXX'] = env['SPACK_CXX']. This is due to bazel not being correctly patched: when bazel is invoked, the SPACK_* variables aren't set, so the compiler wrappers aren't working.

@Sinan81
Contributor

Sinan81 commented Oct 26, 2019

Try setting env['CC'] = env['SPACK_CC'] and env['CXX'] = env['SPACK_CXX'], this is due to bazel not being correctly patched and when bazel is invoked the SPACK_* variables aren't set, so the compiler wrappers aren't working

Still getting the same error :(

     65    INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (345 packages loaded, 15553 targets configured).
     66    INFO: Found 1 target...
     67    [0 / 10] [-----] BazelWorkspaceStatusAction stable-status.txt ... (2 actions, 0 running)
     68    ERROR: /mnt/local/cache/sbulut/spack-stage/tensorflow-1.13.1-xbfnoxqivyjwnjgbjvbjfrhif3rxyhgh/spack-src/tensorflow/core/BUILD:310:1: C++ compilation of rule '//tensorflow/core:platform_base' failed (Exit 1)
     69    Spack compiler must be run from Spack! Input 'SPACK_TARGET_ARGS' is missing.
     70    Target //tensorflow/tools/pip_package:build_pip_package failed to build

@Sinan81
Contributor

Sinan81 commented Oct 30, 2019

Finally managed to build the +cuda variant using spack! There were three key steps in building the +cuda variant:

  1. set nccl version, and fix nccl paths after config step
  2. set LD_LIBRARY_PATH in .tf_configure.bazelrc (so that bazel can detect libcudnn)
  3. use self.compiler.cc as GCC_HOST_COMPILER as opposed to SPACK_CC
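Steps 1 and 3 amount to a couple of environment assignments; sketched here with made-up stand-ins for Spack's objects (in a real package, `env`, `spec['nccl'].version`, and `self.compiler.cc` are provided by Spack, and the version and path below are example values):

```python
# Sketch of steps 1 and 3 with stand-ins for Spack's objects.
env = {}

nccl_version = '2.4.8'                    # stand-in for str(spec['nccl'].version)
real_gcc = '/path/to/actual/gcc/bin/gcc'  # stand-in for self.compiler.cc

env['TF_NCCL_VERSION'] = nccl_version       # step 1: pin the nccl version
env['GCC_HOST_COMPILER_PATH'] = real_gcc    # step 3: real gcc, not the wrapper
```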

To be specific, I built v1.13.1 with the following specific versions of dependencies:

# only supports compute capabilities >= 3.5
if spec.variants['cuda_arch'].value != 'none':
    capabilities = ','.join('{0:.1f}'.format(
        float(i) / 10.0) for i in spec.variants['cuda_arch'].value)

@Sinan81 I think this string formatting will be much more stable, so that you don't get weird things like '6.499999999' instead of '6.5'.
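For illustration, the fixed-point format pins every capability to one decimal place (the capability strings here are made-up examples):

```python
# cuda_arch values as Spack stores them: integer strings like '35' or '70'
caps = ['35', '61', '70']

# '{0:.1f}' pins the result to one decimal place, so '70' becomes '7.0'
# rather than whatever repr(float) would print for the division
formatted = ','.join('{0:.1f}'.format(float(i) / 10.0) for i in caps)
print(formatted)  # → 3.5,6.1,7.0
```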


What should we put if cuda_arch=none? I'm trying to build on a local cluster. CUDA is installed, but I don't see any NVIDIA GPU on the cluster. I want to test anyway, but I don't know what my CUDA capabilities are.

env.set('TEST_TMPDIR', tmp_path)
# TODO: Is setting this necessary? It breaks `spack build-env`
# because Bash can't find my .bashrc
env.set('HOME', tmp_path)

It looks like bazel has an --output_user_root option you can use to change where things are written instead of $HOME.


See what I did to the bazel package for an example of its usage.

Sinan81 added a commit to Sinan81/spack that referenced this pull request Dec 5, 2019
* use bazel commit in spack#13112, and add version 0.24.1, and corresponding cc_env patch

* undo preferred java version by dodo47

* patch for v0.26

* Update install steps

* Add patches for more versions

* Add unit tests

* Update patches for new Spack env vars

* env is already defined, use spackEnv
@adamjstewart
Member

Okay, I successfully installed [email protected]~cuda~nccl+gcp %[email protected] platform=darwin! Unfortunately, it takes 6.5 hours to install, at least on my circa-2013 MBP. I can't wait for binary caches to become available for everything.

@Sinan81 @healther how close do you think we are to merging this PR? I haven't yet managed to test Linux due to some numpy build problems. I would like to make sure that [email protected]+cuda+nccl %gcc platform=linux builds before merging. Also, if anyone can figure out my ~gcp build problems, let me know; otherwise I might open an issue with the TF devs.

@Sinan81
Contributor

Sinan81 commented Dec 5, 2019

Last time I tried compiling v2, I got the error mentioned in the ticket you opened above. I should mention, though, that v2.1 always compiled fine. I will give v2 another try today. On a machine with 16 physical CPUs, the build takes 2 hours as of v1.14, while it takes only 0.5 hours up to v1.13.
I do think that we're very close to merging this PR!

@adamjstewart
Member

I decided to enable +gcp by default until we figure out why it isn't working.

@healther
Contributor

healther commented Dec 5, 2019

how close do you think we are to merging this PR?

I'm a bit out of the loop on our current CI status, but IIUC, with the last version that we checked out of this PR, TF didn't build. It is likely that this was 2.0; I can do a test build on our system tomorrow. In general, I think merging this should be a priority, even if we have to paper over some cases with conflicts() statements in the near future (as we become aware of other people's build problems). At least we will get that feedback.

Maybe @obreitwi can summarise what the current status of our CI is?

@Sinan81
Contributor

Sinan81 commented Dec 5, 2019

While building v2.0.0 on Linux, I am getting the following error:

WARNING: /mnt/local/cache/sbulut/spack-stage/spack-stage-py-tensorflow-2.0.0-ntp6zs7vrfmej3zygnanatx7xmmgwimt/spack-src/tensorflow/python/BUILD:3693:1: in py_library rule //tensorflow/python:standard_ops: target '//tensorflow/python:standard_ops' depends on deprecated target '//tensorflow/python/ops/distributions:distributions': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.distributions will not receive new features, and will be removed by early 2019. You should update all usage of `tf.distributions` to `tfp.distributions`.
WARNING: /mnt/local/cache/sbulut/spack-stage/spack-stage-py-tensorflow-2.0.0-ntp6zs7vrfmej3zygnanatx7xmmgwimt/spack-src/tensorflow/python/BUILD:86:1: in py_library rule //tensorflow/python:no_contrib: target '//tensorflow/python:no_contrib' depends on deprecated target '//tensorflow/python/ops/distributions:distributions': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.distributions will not receive new features, and will be removed by early 2019. You should update all usage of `tf.distributions` to `tfp.distributions`.
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (275 packages loaded, 20934 targets configured).
INFO: Found 1 target...
[0 / 23] [Prepa] Creating source manifest for //tensorflow/tools/compatibility:tf_upgrade_v2 ... (8 actions, 0 running)
ERROR: /tmp/spack/tf/_bazel_sbulut/fd9e55f784f5ddec7f40495f8db3ac70/external/com_google_protobuf/BUILD:687:1: undeclared inclusion(s) in rule '@com_google_protobuf//:python/google/protobuf/internal/_api_implementation.so':
this rule is missing dependency declarations for the following files included by 'external/com_google_protobuf/python/google/protobuf/internal/api_implementation.cc':
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/Python.h'
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/patchlevel.h'
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/pyconfig.h'
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/pymacconfig.h'
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/pyport.h'

Time permitting, I will look into it today.

@Sinan81
Contributor

Sinan81 commented Dec 5, 2019

After clearing bazel cache, I am getting the following error:

INFO: Found 1 target...
[0 / 13] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (6 actions, 0 running)
ERROR: /mnt/local/cache/spack/tf/_bazel_sbulut/fd9e55f784f5ddec7f40495f8db3ac70/external/nasm/BUILD.bazel:8:1: C++ compilation of rule '@nasm//:nasm' failed (Exit 1)
In file included from external/nasm/output/outaout.c:52:0:
/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/eval.h:10:12: error: unknown type name 'PyObject'
 PyAPI_FUNC(PyObject *) PyEval_EvalCode(PyObject *, PyObject *, PyObject *);
            ^~~~~~~~
/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/eval.h:12:12: error: unknown type name 'PyObject'
 PyAPI_FUNC(PyObject *) PyEval_EvalCodeEx(PyObject *co,
            ^~~~~~~~
/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/eval.h:21:12: error: unknown type name 'PyObject'
 PyAPI_FUNC(PyObject *) _PyEval_EvalCodeWithName(
            ^~~~~~~~
/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/python-3.7.4-g2foj2ak4eyf7m4gt6lbu6q7xslvu74v/include/python3.7m/eval.h:31:12: error: unknown type name 'PyObject'
 PyAPI_FUNC(PyObject *) _PyEval_CallTracing(PyObject *func, PyObject *args);
            ^~~~~~~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 16.158s, Critical Path: 0.55s
INFO: 13 processes: 13 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

@Sinan81
Contributor

Sinan81 commented Dec 6, 2019

When I try to build with [email protected] (as suggested in the TF docs), the error is:

INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (275 packages loaded, 20934 targets configured).
INFO: Found 1 target...
[0 / 16] [Prepa] Expanding template tensorflow/python/tools/freeze_graph ... (2 actions, 0 running)
ERROR: /mnt/local/cache/spack/tf/_bazel_sbulut/7bd192b021ed9cc1727ca7bd9e51e3e6/external/com_google_protobuf/BUILD:106:1: undeclared inclusion(s) in rule '@com_google_protobuf//:protobuf_lite':
this rule is missing dependency declarations for the following files included by 'external/com_google_protobuf/src/google/protobuf/stubs/int128.cc':
  '/disk/software/lib/hpcpm/sptfdev/opt/spack/linux-centos7-broadwell/gcc-7.4.0/gettext-0.20.1-kywjimcikgjvmssndclpfpiiax6fcvgm/include/libintl.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@adamjstewart
Member

Your first error looks like a Cython thing. You could try adding a build dependency on py-cython.

Your second error looks like a Bazel thing, which is far out of my area of expertise.

@adamjstewart
Member

I think I'm going to merge this PR as is, since it seems like we've gotten 2.0 to work on macOS and 1.14/1.13 to work on Linux. I think this is a pretty good start, and we can keep hacking on the package in smaller PRs to get individual versions/environments working. Thanks for the hard work everyone!

@adamjstewart adamjstewart merged commit c3eafde into spack:develop Dec 6, 2019
@adamjstewart adamjstewart deleted the PR/tensorflow branch December 6, 2019 00:48
@healther
Contributor

healther commented Dec 6, 2019

@adamjstewart do you have an idea where we could visibly say: "This is an experimental package; it is likely that not everything works. Please let us know and we will add the appropriate conflicts()"?

We should probably have something as a general note on the terminal in case of a build failure, but especially for TF (if people use the package from spack) we will run into loads of different setups...

edit: I'm doing a test build either now or over the weekend and will report back

@adamjstewart
Member

@healther I don’t think there’s any way to do this right now. Hopefully the post on Slack will reach enough people

muffgaga pushed a commit to electronicvisions/spack that referenced this pull request Dec 11, 2019
The tensorflow package stopped building with the latest upstream merge
and will hence be disabled until
spack#13112 is merged.

Change-Id: I713c2352119e6fd931ede64a2e5b491e7618f9ec