Description
As reported by @jfsantos
A simple import torch gives the error:
ImportError: dlopen: cannot load any more object with static TLS
Analysis of the issue
glibc has a table called the DTV. There is a slot for every dlopen'd library with TLS. Its use is not important for this discussion.
The DTV is resizable. However, in older versions of glibc, adding a library with static TLS does not resize the DTV. Instead, glibc does a conservative check that amounts to "have I loaded more than 14 libraries with TLS?". You can observe this empirically by dlopen'ing a bunch of libraries and then querying their DTV number using dlinfo(handle, RTLD_DI_TLS_MODID, &modid). What you will find is that a library only gets a DTV entry if it contains thread-local state. Libraries with zero thread-local state do not contribute to the DTV.
That's why #24911 failed to fix anything: what counts is the number of loaded libraries that have TLS, so reducing the amount of thread-local storage is irrelevant.
That's also why changing import order can fix things: if you change it in a way that loads all your "static TLS" libraries first, then future "dynamic TLS" libraries will resize the DTV like normal.
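The dlinfo experiment described above can be sketched from Python with ctypes. This is a rough sketch, not something taken from the PyTorch codebase: the helper name tls_modid is made up for illustration, the constant 9 is the value of RTLD_DI_TLS_MODID in glibc's <dlfcn.h> (an assumption that does not hold on other C libraries), and the code relies on the private _handle attribute of ctypes.CDLL.

```python
import ctypes
import ctypes.util

# Value of RTLD_DI_TLS_MODID in glibc's <dlfcn.h>. Assumption: glibc only.
RTLD_DI_TLS_MODID = 9


def tls_modid(path):
    """Return the TLS module ID glibc assigned to the library at `path`.

    A result of 0 means the library has no TLS segment, so it does not
    occupy a DTV slot. (Hypothetical helper, for illustration only.)
    """
    lib = ctypes.CDLL(path)  # dlopen() the library under test

    # dlinfo lives in libdl on older glibc and in libc on newer releases.
    # If libdl cannot be located, find_library returns None and
    # CDLL(None) falls back to the process-wide symbol namespace.
    libdl = ctypes.CDLL(ctypes.util.find_library("dl"))
    dlinfo = libdl.dlinfo
    dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
    dlinfo.restype = ctypes.c_int

    modid = ctypes.c_size_t(0)
    # lib._handle is the raw dlopen() handle (private ctypes attribute).
    if dlinfo(lib._handle, RTLD_DI_TLS_MODID, ctypes.byref(modid)) != 0:
        raise OSError("dlinfo(RTLD_DI_TLS_MODID) failed for %s" % path)
    return modid.value
```

Calling tls_modid on each library your program loads, in load order, should show which ones consume DTV slots and which report 0.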
It seems this issue was fixed by a glibc patch in 2014, which eliminates this check and lazily updates the DTV.
Some addenda:
- One confusing thing: "static" refers to the TLS access model, not the static storage specifier in C/C++. So the existence of static thread_local variables is not directly relevant to the issue.
As far as I know, no shared library that PyTorch creates uses static TLS (this can be verified by doing readelf -d foo.so | grep STATIC_TLS). However, conda's libgomp has static TLS. Proof:
root@7d1c5be3f092:/remote/subbtest# readelf -a -W /opt/conda/lib/libgomp.so | grep TLS
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
TLS 0x02ad38 0x000000000002bd38 0x000000000002bd38 0x000000 0x000078 R 0x8
0x000000000000001e (FLAGS) STATIC_TLS
We link against this libgomp in our conda builds. Therefore the problem.
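The readelf check above can be scripted when there are many libraries to inspect. The following is a minimal sketch, assuming readelf from binutils is on PATH; the function names has_static_tls_flag and library_uses_static_tls are invented for this example.

```python
import subprocess


def has_static_tls_flag(readelf_dynamic_output):
    """Return True if `readelf -d` output has STATIC_TLS on its FLAGS line."""
    for line in readelf_dynamic_output.splitlines():
        # The dynamic section prints e.g. "0x...1e (FLAGS)  STATIC_TLS",
        # possibly with other flags such as BIND_NOW on the same line.
        if "(FLAGS)" in line and "STATIC_TLS" in line.split():
            return True
    return False


def library_uses_static_tls(path):
    """Run `readelf -d` on `path` and look for the STATIC_TLS flag."""
    result = subprocess.run(
        ["readelf", "-d", "-W", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return has_static_tls_flag(result.stdout)
```

For example, library_uses_static_tls("/opt/conda/lib/libgomp.so") would be expected to return True for the libgomp build shown above.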
A more direct way to solve the problem, besides reordering imports, is to set LD_PRELOAD=/opt/conda/lib/libgomp.so when launching your program.
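If the program is started from a Python launcher script rather than a shell, the same LD_PRELOAD workaround can be applied to a child process. This is only a sketch: env_with_preload is a hypothetical helper, and the script name in the usage comment is a placeholder.

```python
import os
import subprocess
import sys


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with `lib_path` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts colon-separated (or space-separated) entries; putting
    # libgomp first keeps anything the user already preloads.
    env["LD_PRELOAD"] = lib_path + (":" + existing if existing else "")
    return env


def run_with_libgomp_preloaded(argv, lib_path="/opt/conda/lib/libgomp.so"):
    """Run `argv` in a child process with libgomp preloaded."""
    return subprocess.run(argv, env=env_with_preload(lib_path))


# Example (placeholder script name):
# run_with_libgomp_preloaded([sys.executable, "train.py"])
```

LD_PRELOAD has to be set before the process starts, which is why the sketch spawns a child instead of modifying os.environ in place.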
A few takeaways from this:
- As long as PyTorch has a dependency on libgomp.so with static TLS, there is literally nothing we can do if some of our users decide to import a bunch of third-party libraries that have dynamic TLS, without importing libgomp. They'll gobble up all of the DTV space and libgomp will fail. Note that we exacerbate the problem by depending on libraries ourselves which have dynamic TLS, so that the ceiling is lower, but if the user imports enough libraries they will hit this problem, no matter how much or little TLS we use.
- Reiterating what suo said above, reducing use of TLS in PyTorch has no effect on this problem. There is no reason to reduce the amount of TLS you have in your program. Go wild with thread local. (The only marginal benefit is if you manage to eliminate ALL TLS. Which ain't happening.)
- Importing something that gets libgomp early enough should be the recommended workaround. We should use this to fix our CI, and we should detect this error and give our users an error message to this effect too. I'm not sure if there is a way to conveniently load libgomp from Python in Conda.
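The last takeaway (load libgomp early, and detect the error to give a better message) might look roughly like the sketch below. It is not PyTorch's actual handling: the function names are invented, the default libgomp location under sys.prefix is a guess at a typical conda layout, and whether a ctypes load is early enough in a given program is untested here.

```python
import ctypes
import importlib
import os
import sys

STATIC_TLS_MARKER = "cannot load any more object with static TLS"


def preload_libgomp(path=None):
    """Try to dlopen libgomp before other libraries fill the DTV.

    Returns the ctypes handle, or None if the library could not be loaded.
    """
    if path is None:
        # Assumed conda layout; adjust for your environment.
        path = os.path.join(sys.prefix, "lib", "libgomp.so.1")
    try:
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return None


def static_tls_hint(exc):
    """Return a user-facing hint if `exc` is the static TLS error, else None."""
    if STATIC_TLS_MARKER in str(exc):
        return (
            "This glibc cannot load more libraries with static TLS. "
            "Try importing torch before other native libraries, set "
            "LD_PRELOAD to your libgomp.so, or upgrade glibc."
        )
    return None


def import_with_hint(module_name):
    """Import `module_name`, attaching the hint to the static TLS error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = static_tls_hint(exc)
        if hint is None:
            raise
        raise ImportError("%s\n%s" % (exc, hint)) from exc
```

A program would call preload_libgomp() as its first statement and then import_with_hint("torch"), so that users who still hit the limit see the workaround in the error text.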
One additional takeaway: this issue was fixed by a 2014 glibc patch. This patch is in the glibc distributed with Xenial.
Trusty is considered unsupported by Ubuntu (the LTS commitment expired in April of this year). We are also removing Trusty from our CI (cc @jamesr66a). So if you want this problem to really really (tm) go away, upgrading your Linux distribution is another path.