Skip to content

add better support for libfabrics#6978

Closed
baberlevi wants to merge 40 commits intospack:developfrom
ResearchIT:mpifabrics
Closed

add better support for libfabrics#6978
baberlevi wants to merge 40 commits intospack:developfrom
ResearchIT:mpifabrics

Conversation

@baberlevi
Copy link
Copy Markdown
Member

This PR builds on #4168.

The idea is to have a libfabric in spack use other libraries from spack (e.g. verbs and psm2), and in turn have openmpi use libfabric instead of building directly against verbs or psm2.

The +spackfabrics variant of libfabrics is a bit clunky feeling, but allows people to do nothing & continue using system libraries. If that's not needed, I'm happy just making opa-psm2 and rdma-core default dependencies of libfabrics instead, and letting them be overridden as external packages instead as needed.

@baberlevi
Copy link
Copy Markdown
Member Author

Tagging a few people involved in the PR I built on to get your feedback: @eschnett @adamjstewart @shamisp @svenevs

@svenevs
Copy link
Copy Markdown
Member

svenevs commented Jan 22, 2018

Hey @baberlevi sorry to say, this is beyond my expertise. I don't know anything about these packages :/ I was only on the other thread because at the time I was working on #4188 which had different stuff going on with respect to libnl. I gave up on that long ago though, primarily because of the no root installation dilemma with spack (PCL dependencies that would be highly preferred require installation of e.g. driver startup files in /etc/).

patch('configure.patch', when="@1.10.1")
patch('fix_multidef_pmi_class.patch', when="@2.0.0:2.0.1")

fabrics = ('psm', 'psm2', 'pmi', 'verbs', 'mxm', 'libfabric')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add ucx as a fabrics dependency here. It is used instead of mxm.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openmpi 3 still says it supports mxm (i.e. the configure script still has the option). I'm not that familiar with mxm/ucx, so I'm hesitant to remove that option in case it's being used by someone.
Ucx support is there as a variant, though it may make more sense to be an option in the fabrics list. @junghans - looks like you added the +ucx variant. Have an opinion on that ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@baberlevi - ypu are right, it supports ucx and mxm. Probably we should have both options htere. ucx is a new layer that gradually replacing mxm.

'verbs',
'usnic',
'mxm',
'mlx',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLX is how libfabrics calls UCX. So for sure the dependency has to be updated.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I added the ucx dependency when mlx is enabled.

'mxm',
'mlx',
'gni',
'xpmem',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check this with project maintainers but in the git I do not see xpmem code. Maybe there are some proprietary tarballs

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only xpmem code I can find is
https://code.google.com/archive/p/xpmem/downloads
https://github.com/hjelmn/xpmem
And it doesn't seem like either of these are 'official' distributions.

From reading a bit more, my understanding is that xpmem is an optional feature to enable within the gni provider, and not a fabrics provider itself ? Based on that, I have removed it from the libfabric fabrics list.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is referenced in ugni, but in this case it will come pre-installed with cray system. But as for now, I don't think it is a stand alone transport.

UCX has stanalone transport for XPMEM and I'm involved in https://github.com/hjelmn/xpmem as well.
XPMEM has both user level and kernel module components. So, I'm not sure if kernel modules are under spack scope.

"system libraries")
)

depends_on('rdma-core', when='+spackfabrics fabrics=verbs')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update ucx package as well
var/spack/repos/builtin/packages/ucx/package.py

it depends on rdma-core

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the rdma-core dependency to ucx. It was building successfully without it, but I think disabling verbs support. With the dependency added, linked libraries look good.

@eschnett
Copy link
Copy Markdown
Contributor

I like the general idea, but haven't looked at the implementation in detail.

If the goal is to use a Spack-provided libfabrics, then the fabrics keyword should do that, while a new internal-fabrics keyword would use the OpenMPI-provided variants.

To make sure the change-over is smooth, you'd need to find out (email the OpenMPI developers?) whether they have significant changes accumulated in their version of libfabrics.

@baberlevi
Copy link
Copy Markdown
Member Author

@eschnett - sorry I'm not sure what you mean by 'openmpi provided variants'. My understanding is that openmpi typically builds against system libraries for psm, libfabric, etc. but doesn't distribute their own. Unless you mean the OFED release that includes openmpi ?

@shamisp
Copy link
Copy Markdown

shamisp commented Jan 24, 2018

@baberlevi OMPI does not require libfabrics, ucx, etc. It has BTLs which are very similar to ucx, libfabrics, etc. So, you can build OpenMPI with internal + external transports. Based on transport priorities it will pick one during the runtime.

@baberlevi
Copy link
Copy Markdown
Member Author

I've tested this on hardware with Omnipath, Intel TrueScale, and Mellanox cards, and with psm, psm2, verbs, tcp - all seem to work.

I have also moved the ucx variant into the fabrics list where I think it belongs.

I also built a version on hardware with no fabrics, and compared to a hand-built version. The built-in BTL's look the same according to ompi_info.

I think this is ready to go, unless there are further changes anyone would like to see.

@alalazo @weijianwen - I see you've both recently worked on the openmpi fabrics list - perhaps you have some opinions on this PR ?

@weijianwen
Copy link
Copy Markdown
Contributor

@baberlevi This commit looks good to me. My PR intents to remvoe PMI (Process Management Interface) away from the fabric list. I plan to submit follow-up PR after yours being merged.

@weijianwen
Copy link
Copy Markdown
Contributor

@baberlevi could you filter out file changes irrelevant to openmpi in your commits? It seems augustus and other non-openmpi packages are affacted by this PR.

@adamjstewart
Copy link
Copy Markdown
Member

Ping @baberlevi

@baberlevi
Copy link
Copy Markdown
Member Author

I merged the latest devel in, and I think it looks ok now. Let me know if you still see issues. The only files changed I see are relevant to this PR now. I would also really like to see #7226 merged.

Copy link
Copy Markdown
Member

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds numactl as a dependency to several packages. However, numactl does not build for me on macOS 10.13.3 with Clang 9.0.0:

25 errors found in build log:
     28     libtoolize: copying file 'm4/ltsugar.m4'
     29     libtoolize: copying file 'm4/ltversion.m4'
     30     libtoolize: copying file 'm4/lt~obsolete.m4'
     31     autoreconf: running: /Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/autoconf-2.69-hxmoo4qjvffpyqduwjj53sdvolcpvudo/bin/au
            toconf --include=/Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/pkgconf-1.4.0-nwvvjt6hf4m3234zt4exuxgti7q4jnbq/share/aclo
            cal --force
     32     autoreconf: running: /Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/autoconf-2.69-hxmoo4qjvffpyqduwjj53sdvolcpvudo/bin/au
            toheader --include=/Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/pkgconf-1.4.0-nwvvjt6hf4m3234zt4exuxgti7q4jnbq/share/ac
            local --force
     33     autoreconf: running: automake --add-missing --copy --force-missing
  >> 34     configure.ac:13: installing 'build-aux/compile'
  >> 35     configure.ac:13: installing 'build-aux/config.guess'
  >> 36     configure.ac:13: installing 'build-aux/config.sub'
  >> 37     configure.ac:9: installing 'build-aux/install-sh'
  >> 38     configure.ac:9: installing 'build-aux/missing'
     39     Makefile.am: installing 'build-aux/depcomp'
     40     parallel-tests: installing 'build-aux/test-driver'
     41     autoreconf: Leaving directory `.'
     42     ==> Executing phase: 'configure'
     43     ==> '/Users/Adam/spack/var/spack/stage/numactl-2.0.11-guzzryiqnf2uvlrh2gx6v3qvyorjrscf/numactl-2.0.11/configure' '--prefix=/Users/Adam/spack/opt/
            spack/darwin-highsierra-x86_64/clang-9.0.0-apple/numactl-2.0.11-guzzryiqnf2uvlrh2gx6v3qvyorjrscf'
     44     checking for a BSD-compatible install... /usr/bin/install -c

     ...

     145    ==> 'make' '-j4'
     146    /Applications/Xcode.app/Contents/Developer/usr/bin/make  all-am
     147      CC       libnuma.lo
     148      CC       syscall.lo
     149      CC       affinity.lo
     150      CC       distance.lo
  >> 151    syscall.c:18:10: fatal error: 'asm/unistd.h' file not found
     152    #include <asm/unistd.h>
     153             ^~~~~~~~~~~~~~
     154    In file included from distance.c:24:
     155    ./numaint.h:53:9: warning: 'howmany' macro redefined [-Wmacro-redefined]
     156    #define howmany(x,y) (((x)+((y)-1))/(y))
     157            ^

     ...

     163    #define howmany(x,y) (((x)+((y)-1))/(y))
     164            ^
     165    /usr/include/sys/types.h:185:9: note: previous definition is here
     166    #define howmany(x, y)   __DARWIN_howmany(x, y)  /* # y's == x bits? */
     167            ^
     168    1 error generated.
  >> 169    libnuma.c:317:1: error: only weak aliases are supported on darwin
     170    make_internal_alias(numa_pagesize);
     171    ^
     172    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     173    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     174                                                                            ^
  >> 175    make[1]: *** [syscall.lo] Error 1
     176    make[1]: *** Waiting for unfinished jobs....
  >> 177    libnuma.c:670:1: error: only weak aliases are supported on darwin
     178    make_internal_alias(numa_max_node);
     179    ^
     180    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     181    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     182                                                                            ^
  >> 183    libnuma.c:691:1: error: only weak aliases are supported on darwin
     184    make_internal_alias(numa_max_possible_node_v1);
     185    ^
     186    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     187    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     188                                                                            ^
  >> 189    libnuma.c:692:1: error: only weak aliases are supported on darwin
     190    make_internal_alias(numa_max_possible_node_v2);
     191    ^
     192    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     193    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     194                                                                            ^
  >> 195    libnuma.c:780:1: error: only weak aliases are supported on darwin
     196    make_internal_alias(numa_node_size64);
     197    ^
     198    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     199    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     200                                                                            ^
  >> 201    libnuma.c:857:1: error: only weak aliases are supported on darwin
     202    make_internal_alias(numa_police_memory);
     203    ^
     204    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     205    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     206                                                                            ^
     207    libnuma.c:873:8: warning: implicit declaration of function 'mremap' is invalid in C99 [-Wimplicit-function-declaration]
     208            mem = mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE);
     209                  ^
  >> 210    libnuma.c:873:45: error: use of undeclared identifier 'MREMAP_MAYMOVE'
     211            mem = mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE);
     212                                                       ^
  >> 213    libnuma.c:916:1: error: only weak aliases are supported on darwin
     214    1 warning generated.
     215    make_internal_alias(numa_alloc_interleaved_subset_v1);
     216    ^
     217    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     218    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     219                                                                            ^
  >> 220    libnuma.c:917:1: error: only weak aliases are supported on darwin
     221    make_internal_alias(numa_alloc_interleaved_subset_v2);
     222    ^
     223    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     224    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     225                                                                            ^
  >> 226    libnuma.c:1051:1: error: only weak aliases are supported on darwin
     227    make_internal_alias(numa_set_membind_v2);
     228    ^
     229    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     230    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     231                                                                            ^
  >> 232    libnuma.c:1157:1: error: only weak aliases are supported on darwin
     233    make_internal_alias(numa_get_mems_allowed);
     234    ^
     235    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     236    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     237                                                                            ^
  >> 238    libnuma.c:1399:1: error: only weak aliases are supported on darwin
     239    make_internal_alias(numa_node_to_cpus_v1);
     240    ^
     241    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     242    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     243                                                                            ^
  >> 244    libnuma.c:1400:1: error: only weak aliases are supported on darwin
     245    make_internal_alias(numa_node_to_cpus_v2);
     246    ^
     247    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     248    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     249                                                                            ^
  >> 250    libnuma.c:1538:1: error: only weak aliases are supported on darwin
     251    make_internal_alias(numa_run_on_node_mask_v2);
     252    ^
     253    ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
     254    #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
     255                                                                            ^
     256    2 warnings and 14 errors generated.
  >> 257    make[1]: *** [libnuma.lo] Error 1
  >> 258    affinity.c:40:10: fatal error: 'linux/rtnetlink.h' file not found
     259    #include <linux/rtnetlink.h>
     260             ^~~~~~~~~~~~~~~~~~~
     261    1 error generated.
  >> 262    make[1]: *** [affinity.lo] Error 1
  >> 263    make: *** [all] Error 2

We either need to figure out why numactl isn't building, or figure out if it is actually needed. Either way, the default MPI provider no longer building on macOS is a no-go for this PR.


def cmake_args(self):
cmake_args = ["-DCMAKE_INSTALL_SYSCONFDIR=" +
join_path(self.spec.prefix, 'etc')]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use self.spec.prefix.etc here instead of join_path.

def check_platform(self):
if not (sys.platform.startswith('freebsd') or
sys.platform.startswith('linux')):
raise InstallError("libnl requires FreeBSD or Linux")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about macOS? It's BSD-based.

Copy link
Copy Markdown
Member Author

@baberlevi baberlevi Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed libnl won't install on OSX:
150 /Users/baber/spack/var/spack/stage/libnl-3.3.0-vaxm4jqob6rr7i2ignz7tqbkknfforql/libnl-3.3.0/include/netlink-private/netlink.h:41:10: fatal error: 'linux/types.h' file not found 151 #include <linux/types.h>

@baberlevi
Copy link
Copy Markdown
Member Author

numactl not needed on os-x, so added an exclusion for that, just like #7509

@baberlevi
Copy link
Copy Markdown
Member Author

baberlevi commented Mar 23, 2018

confirmed default install on osx working again

@baberlevi
Copy link
Copy Markdown
Member Author

Rebased to remove some extraneous commits. I think this is ready for another review & hopefully merge.

@baberlevi
Copy link
Copy Markdown
Member Author

Closing in favor of broken up PRs #7833 #7834 #7835 #7836 #7837 #7838 #7842

@baberlevi baberlevi closed this Apr 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants