add better support for libfabrics#6978
Conversation
|
Tagging a few people involved in the PR I built on to get your feedback: @eschnett @adamjstewart @shamisp @svenevs |
|
Hey @baberlevi sorry to say, this is beyond my expertise. I don't know anything about these packages :/ I was only on the other thread because at the time I was working on #4188 which had different stuff going on with respect to libnl. I gave up on that long ago though, primarily because of the no |
| patch('configure.patch', when="@1.10.1") | ||
| patch('fix_multidef_pmi_class.patch', when="@2.0.0:2.0.1") | ||
|
|
||
| fabrics = ('psm', 'psm2', 'pmi', 'verbs', 'mxm', 'libfabric') |
There was a problem hiding this comment.
I would add ucx as a fabrics dependency here. It is used instead of mxm.
There was a problem hiding this comment.
openmpi 3 still says it supports mxm (i.e. the configure script still has the option). I'm not that familiar with mxm/ucx, so I'm hesitant to remove that option in case it's being used by someone.
Ucx support is there as a variant, though it may make more sense to be an option in the fabrics list. @junghans - looks like you added the +ucx variant. Have an opinion on that ?
There was a problem hiding this comment.
@baberlevi - ypu are right, it supports ucx and mxm. Probably we should have both options htere. ucx is a new layer that gradually replacing mxm.
| 'verbs', | ||
| 'usnic', | ||
| 'mxm', | ||
| 'mlx', |
There was a problem hiding this comment.
MLX is how libfabrics calls UCX. So for sure the dependency has to be updated.
There was a problem hiding this comment.
Thanks - I added the ucx dependency when mlx is enabled.
| 'mxm', | ||
| 'mlx', | ||
| 'gni', | ||
| 'xpmem', |
There was a problem hiding this comment.
I would check this with project maintainers but in the git I do not see xpmem code. Maybe there are some proprietary tarballs
There was a problem hiding this comment.
The only xpmem code I can find is
https://code.google.com/archive/p/xpmem/downloads
https://github.com/hjelmn/xpmem
And it doesn't seem like either of these are 'official' distributions.
From reading a bit more, my understanding is that xpmem is an optional feature to enable within the gni provider, and not a fabrics provider itself ? Based on that, I have removed it from the libfabric fabrics list.
There was a problem hiding this comment.
Yes, it is referenced in ugni, but in this case it will come pre-installed with cray system. But as for now, I don't think it is a stand alone transport.
UCX has stanalone transport for XPMEM and I'm involved in https://github.com/hjelmn/xpmem as well.
XPMEM has both user level and kernel module components. So, I'm not sure if kernel modules are under spack scope.
| "system libraries") | ||
| ) | ||
|
|
||
| depends_on('rdma-core', when='+spackfabrics fabrics=verbs') |
There was a problem hiding this comment.
please update ucx package as well
var/spack/repos/builtin/packages/ucx/package.py
it depends on rdma-core
There was a problem hiding this comment.
I added the rdma-core dependency to ucx. It was building successfully without it, but I think disabling verbs support. With the dependency added, linked libraries look good.
|
I like the general idea, but haven't looked at the implementation in detail. If the goal is to use a Spack-provided libfabrics, then the To make sure the change-over is smooth, you'd need to find out (email the OpenMPI developers?) whether they have significant changes accumulated in their version of libfabrics. |
|
@eschnett - sorry I'm not sure what you mean by 'openmpi provided variants'. My understanding is that openmpi typically builds against system libraries for psm, libfabric, etc. but doesn't distribute their own. Unless you mean the OFED release that includes openmpi ? |
|
@baberlevi OMPI does not require libfabrics, ucx, etc. It has BTLs which are very similar to ucx, libfabrics, etc. So, you can build OpenMPI with internal + external transports. Based on transport priorities it will pick one during the runtime. |
|
I've tested this on hardware with Omnipath, Intel TrueScale, and Mellanox cards, and with psm, psm2, verbs, tcp - all seem to work. I have also moved the ucx variant into the fabrics list where I think it belongs. I also built a version on hardware with no fabrics, and compared to a hand-built version. The built-in BTL's look the same according to ompi_info. I think this is ready to go, unless there are further changes anyone would like to see. @alalazo @weijianwen - I see you've both recently worked on the openmpi fabrics list - perhaps you have some opinions on this PR ? |
|
@baberlevi This commit looks good to me. My PR intents to remvoe PMI (Process Management Interface) away from the fabric list. I plan to submit follow-up PR after yours being merged. |
|
@baberlevi could you filter out file changes irrelevant to openmpi in your commits? It seems |
|
Ping @baberlevi |
|
I merged the latest devel in, and I think it looks ok now. Let me know if you still see issues. The only files changed I see are relevant to this PR now. I would also really like to see #7226 merged. |
adamjstewart
left a comment
There was a problem hiding this comment.
This PR adds numactl as a dependency to several packages. However, numactl does not build for me on macOS 10.13.3 with Clang 9.0.0:
25 errors found in build log:
28 libtoolize: copying file 'm4/ltsugar.m4'
29 libtoolize: copying file 'm4/ltversion.m4'
30 libtoolize: copying file 'm4/lt~obsolete.m4'
31 autoreconf: running: /Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/autoconf-2.69-hxmoo4qjvffpyqduwjj53sdvolcpvudo/bin/au
toconf --include=/Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/pkgconf-1.4.0-nwvvjt6hf4m3234zt4exuxgti7q4jnbq/share/aclo
cal --force
32 autoreconf: running: /Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/autoconf-2.69-hxmoo4qjvffpyqduwjj53sdvolcpvudo/bin/au
toheader --include=/Users/Adam/spack/opt/spack/darwin-highsierra-x86_64/clang-9.0.0-apple/pkgconf-1.4.0-nwvvjt6hf4m3234zt4exuxgti7q4jnbq/share/ac
local --force
33 autoreconf: running: automake --add-missing --copy --force-missing
>> 34 configure.ac:13: installing 'build-aux/compile'
>> 35 configure.ac:13: installing 'build-aux/config.guess'
>> 36 configure.ac:13: installing 'build-aux/config.sub'
>> 37 configure.ac:9: installing 'build-aux/install-sh'
>> 38 configure.ac:9: installing 'build-aux/missing'
39 Makefile.am: installing 'build-aux/depcomp'
40 parallel-tests: installing 'build-aux/test-driver'
41 autoreconf: Leaving directory `.'
42 ==> Executing phase: 'configure'
43 ==> '/Users/Adam/spack/var/spack/stage/numactl-2.0.11-guzzryiqnf2uvlrh2gx6v3qvyorjrscf/numactl-2.0.11/configure' '--prefix=/Users/Adam/spack/opt/
spack/darwin-highsierra-x86_64/clang-9.0.0-apple/numactl-2.0.11-guzzryiqnf2uvlrh2gx6v3qvyorjrscf'
44 checking for a BSD-compatible install... /usr/bin/install -c
...
145 ==> 'make' '-j4'
146 /Applications/Xcode.app/Contents/Developer/usr/bin/make all-am
147 CC libnuma.lo
148 CC syscall.lo
149 CC affinity.lo
150 CC distance.lo
>> 151 syscall.c:18:10: fatal error: 'asm/unistd.h' file not found
152 #include <asm/unistd.h>
153 ^~~~~~~~~~~~~~
154 In file included from distance.c:24:
155 ./numaint.h:53:9: warning: 'howmany' macro redefined [-Wmacro-redefined]
156 #define howmany(x,y) (((x)+((y)-1))/(y))
157 ^
...
163 #define howmany(x,y) (((x)+((y)-1))/(y))
164 ^
165 /usr/include/sys/types.h:185:9: note: previous definition is here
166 #define howmany(x, y) __DARWIN_howmany(x, y) /* # y's == x bits? */
167 ^
168 1 error generated.
>> 169 libnuma.c:317:1: error: only weak aliases are supported on darwin
170 make_internal_alias(numa_pagesize);
171 ^
172 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
173 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
174 ^
>> 175 make[1]: *** [syscall.lo] Error 1
176 make[1]: *** Waiting for unfinished jobs....
>> 177 libnuma.c:670:1: error: only weak aliases are supported on darwin
178 make_internal_alias(numa_max_node);
179 ^
180 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
181 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
182 ^
>> 183 libnuma.c:691:1: error: only weak aliases are supported on darwin
184 make_internal_alias(numa_max_possible_node_v1);
185 ^
186 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
187 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
188 ^
>> 189 libnuma.c:692:1: error: only weak aliases are supported on darwin
190 make_internal_alias(numa_max_possible_node_v2);
191 ^
192 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
193 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
194 ^
>> 195 libnuma.c:780:1: error: only weak aliases are supported on darwin
196 make_internal_alias(numa_node_size64);
197 ^
198 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
199 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
200 ^
>> 201 libnuma.c:857:1: error: only weak aliases are supported on darwin
202 make_internal_alias(numa_police_memory);
203 ^
204 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
205 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
206 ^
207 libnuma.c:873:8: warning: implicit declaration of function 'mremap' is invalid in C99 [-Wimplicit-function-declaration]
208 mem = mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE);
209 ^
>> 210 libnuma.c:873:45: error: use of undeclared identifier 'MREMAP_MAYMOVE'
211 mem = mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE);
212 ^
>> 213 libnuma.c:916:1: error: only weak aliases are supported on darwin
214 1 warning generated.
215 make_internal_alias(numa_alloc_interleaved_subset_v1);
216 ^
217 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
218 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
219 ^
>> 220 libnuma.c:917:1: error: only weak aliases are supported on darwin
221 make_internal_alias(numa_alloc_interleaved_subset_v2);
222 ^
223 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
224 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
225 ^
>> 226 libnuma.c:1051:1: error: only weak aliases are supported on darwin
227 make_internal_alias(numa_set_membind_v2);
228 ^
229 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
230 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
231 ^
>> 232 libnuma.c:1157:1: error: only weak aliases are supported on darwin
233 make_internal_alias(numa_get_mems_allowed);
234 ^
235 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
236 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
237 ^
>> 238 libnuma.c:1399:1: error: only weak aliases are supported on darwin
239 make_internal_alias(numa_node_to_cpus_v1);
240 ^
241 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
242 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
243 ^
>> 244 libnuma.c:1400:1: error: only weak aliases are supported on darwin
245 make_internal_alias(numa_node_to_cpus_v2);
246 ^
247 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
248 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
249 ^
>> 250 libnuma.c:1538:1: error: only weak aliases are supported on darwin
251 make_internal_alias(numa_run_on_node_mask_v2);
252 ^
253 ./numaint.h:18:73: note: expanded from macro 'make_internal_alias'
254 #define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden")))
255 ^
256 2 warnings and 14 errors generated.
>> 257 make[1]: *** [libnuma.lo] Error 1
>> 258 affinity.c:40:10: fatal error: 'linux/rtnetlink.h' file not found
259 #include <linux/rtnetlink.h>
260 ^~~~~~~~~~~~~~~~~~~
261 1 error generated.
>> 262 make[1]: *** [affinity.lo] Error 1
>> 263 make: *** [all] Error 2
We either need to figure out why numactl isn't building, or figure out if it is actually needed. Either way, the default MPI provider no longer building on macOS is a no-go for this PR.
|
|
||
| def cmake_args(self): | ||
| cmake_args = ["-DCMAKE_INSTALL_SYSCONFDIR=" + | ||
| join_path(self.spec.prefix, 'etc')] |
There was a problem hiding this comment.
You could use self.spec.prefix.etc here instead of join_path.
| def check_platform(self): | ||
| if not (sys.platform.startswith('freebsd') or | ||
| sys.platform.startswith('linux')): | ||
| raise InstallError("libnl requires FreeBSD or Linux") |
There was a problem hiding this comment.
What about macOS? It's BSD-based.
There was a problem hiding this comment.
confirmed libnl won't install on OSX:
150 /Users/baber/spack/var/spack/stage/libnl-3.3.0-vaxm4jqob6rr7i2ignz7tqbkknfforql/libnl-3.3.0/include/netlink-private/netlink.h:41:10: fatal error: 'linux/types.h' file not found 151 #include <linux/types.h>
|
numactl not needed on os-x, so added an exclusion for that, just like #7509 |
|
confirmed default install on osx working again |
…bpsm2 seem to be working
…ith-libfabric line
…library links not there without the opa-psm2 module loaded.
|
Rebased to remove some extraneous commits. I think this is ready for another review & hopefully merge. |
This PR builds on #4168.
The idea is to have a libfabric in spack use other libraries from spack (e.g. verbs and psm2), and in turn have openmpi use libfabric instead of building directly against verbs or psm2.
The +spackfabrics variant of libfabrics is a bit clunky feeling, but allows people to do nothing & continue using system libraries. If that's not needed, I'm happy just making opa-psm2 and rdma-core default dependencies of libfabrics instead, and letting them be overridden as external packages instead as needed.