Bug report
TSAN reports a data race in a Python program that uses both threading.local() and a native extension that tries to acquire the GIL from a separate, native thread.
To reproduce, build both the Python interpreter and the native module with TSAN. An example native module follows; all that matters is that it spawns a native thread that attempts to acquire the GIL:
thread_haver.cc:
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

extern "C" {

static pthread_t t;

static void* DoWork(void* arg) {
  printf("Thread trying to acquire GIL\n");
  fflush(stdout);
  PyGILState_STATE py_threadstate = PyGILState_Ensure();  // Race here!!
  printf("Thread called with arg %p\n", arg);
  PyGILState_Release(py_threadstate);
  printf("Thread has released GIL\n");
  return arg;
}

static PyObject* SomeNumber(PyObject* module, PyObject* object) {
  if (pthread_create(&t, nullptr, DoWork, nullptr) != 0) {
    fprintf(stderr, "pthread_create failed\n");
    abort();
  }
  return PyLong_FromLong(reinterpret_cast<uintptr_t>(object));
}

static PyObject* AnotherNumber(PyObject* module, PyObject* object) {
  if (pthread_join(t, nullptr) != 0) {
    fprintf(stderr, "pthread_join failed\n");
    abort();
  }
  return PyLong_FromLong(reinterpret_cast<uintptr_t>(object));
}

}  // extern "C"

static PyMethodDef thbmod_methods[] = {
    {"some_number", SomeNumber, METH_O, "Makes a number."},
    {"another_number", AnotherNumber, METH_O, "Makes a number."},
    {nullptr, nullptr, 0, nullptr}  /* Sentinel */
};

static struct PyModuleDef thbmod = {
    PyModuleDef_HEAD_INIT,
    "thread_haver",  /* name of module */
    nullptr,         /* module documentation, may be NULL */
    1024,            /* size of per-interpreter state of the module,
                        or -1 if the module keeps state in global variables. */
    thbmod_methods,
};

PyMODINIT_FUNC PyInit_thread_haver() {
  return PyModule_Create(&thbmod);
}
Compile this into a shared object with TSAN enabled, using, say, ${CC} -fPIC -shared -O2 -fsanitize=thread -o thread_haver.so thread_haver.cc.
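The command above only works if Python.h is already on the compiler's include path. A minimal sketch of assembling the same command with an explicit include directory, assuming sysconfig in the TSAN-built interpreter reports the right headers. The helper name tsan_build_command and the added -I flag are illustrative additions, not part of the original reproduction:

```python
import os
import shlex
import sysconfig


def tsan_build_command(source="thread_haver.cc", output="thread_haver.so"):
    """Return the compiler command from the report plus an -I flag.

    Hypothetical helper: the -I flag is an assumption added here so that
    Python.h is found; the flags otherwise mirror the command in the report.
    """
    include_dir = sysconfig.get_paths()["include"]  # directory holding Python.h
    compiler = os.environ.get("CC", "cc")  # same ${CC} convention as the report
    return [
        compiler, "-fPIC", "-shared", "-O2", "-fsanitize=thread",
        "-I" + include_dir,
        "-o", output, source,
    ]


# Print the command so it can be copied into a shell.
print(shlex.join(tsan_build_command()))
```

Run it with the interpreter that was built with TSAN so that the include directory matches that build.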
Now the data race happens if we create and destroy a threading.local() object in Python:
demo.py:
import threading
import thread_haver

print("Number: {}".format(thread_haver.some_number(thread_haver)))  # starts a thread
for _ in range(10000):
    _ = threading.local()  # race here (?)
print("Number: {}".format(thread_haver.another_number(threading)))  # joins the thread
Concretely, here is the TSAN output, for Python 3.9:
Thread trying to acquire GIL
Number: 135325830047024
==================
WARNING: ThreadSanitizer: data race (pid=[...])
Read of size 8 at 0x7b4400019f98 by main thread:
#0 local_clear [...]/Modules/_threadmodule.c:819:25 (python+0xcbc14e)
#1 local_dealloc [...]/Modules/_threadmodule.c:838:5 (python+0xcbbd1d)
#2 _Py_DECREF [...]/Include/object.h:447:9 (python+0x104efaa)
#3 _Py_XDECREF [...]/Include/object.h:514:9 (python+0x104efaa)
#4 insertdict [...]/Objects/dictobject.c:1123:5 (python+0x104efaa)
Previous write of size 8 at 0x7b4400019f98 by thread T1:
#0 malloc [...]/tsan/rtl/tsan_interceptors_posix.cpp:683:5 (python+0xbd28f1)
#1 _PyMem_RawMalloc [...]/Objects/obmalloc.c:116:11 (python+0x1083956)
Location is heap block of size 264 at 0x7b4400019f00 allocated by thread T1:
#0 malloc [...]/tsan/rtl/tsan_interceptors_posix.cpp:683:5 (python+0xbd28f1)
#1 _PyMem_RawMalloc [...]/Objects/obmalloc.c:116:11 (python+0x1083956)
Thread T1 (tid=3039745, running) created by main thread at:
#0 pthread_create [...]/tsan/rtl/tsan_interceptors_posix.cpp:1038:3 (python+0xbd4679)
#1 SomeNumber(_object*, _object*) thread_haver.cc:23:7 (thread_haver.so+0xb58)
#2 cfunction_vectorcall_O [...]/Objects/methodobject.c:516:24 (python+0x107cb3d)
SUMMARY: ThreadSanitizer: data race [...]/Modules/_threadmodule.c:819:25 in local_clear
==================
Thread called with arg (nil)
Thread has released GIL
Number: 135325829912464
ThreadSanitizer: reported 1 warnings
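When re-running the reproduction, it can help to pull out just the SUMMARY lines from a log like the one above. A small sketch, assuming the summary line always has the shape seen in this report; the names SUMMARY_RE and tsan_summaries are hypothetical, and the regular expression has only been checked against this one line:

```python
import re

# Matches lines such as:
# SUMMARY: ThreadSanitizer: data race <file:line:col> in <function>
SUMMARY_RE = re.compile(
    r"^SUMMARY: ThreadSanitizer: (?P<kind>.+?) (?P<where>\S+) in (?P<func>\S+)$",
    re.MULTILINE,
)


def tsan_summaries(output):
    """Return (kind, location, function) for each TSAN summary line."""
    return [(m["kind"], m["where"], m["func"]) for m in SUMMARY_RE.finditer(output)]


# Excerpt copied from the report above.
SAMPLE = """\
WARNING: ThreadSanitizer: data race (pid=[...])
SUMMARY: ThreadSanitizer: data race [...]/Modules/_threadmodule.c:819:25 in local_clear
==================
"""

for kind, where, func in tsan_summaries(SAMPLE):
    print(kind, "in", func, "at", where)
```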
Unfortunately, the backtrace does not go into the details of PyGILState_Ensure(), but the race seems to be on tstate->dict in cpython/Modules/_threadmodule.c (line 819 in 5ef90ee):
    if (tstate->dict && PyDict_GetItem(tstate->dict, self->key)) {
from the threading.local() deallocation function, and on the tstate struct being allocated by malloc (by the GIL acquisition?).
Your environment
- CPython versions tested on: 3.9 and 3.10
- Operating system and architecture: Linux (Debian-derived)
I suspect that both the threading.local() deallocation function and the GIL acquisition perform some TLS access that is not sufficiently synchronised. It could also be a bug in TSAN, which may not track TLS access correctly.
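One way to picture the suspected missing synchronisation, assuming (the backtrace does not confirm this) that the deallocation path walks a shared list of thread states while PyGILState_Ensure() in the native thread links a freshly allocated state into that list. The pure-Python model below is only an analogy: ThreadState, Interpreter, and head_lock are hypothetical names, not CPython's structures, and the locking shown is one possible discipline rather than the actual fix.

```python
import threading


class ThreadState:
    """Toy stand-in for a per-thread state with a lazily created dict."""

    def __init__(self, name):
        self.name = name
        self.dict = None
        self.next = None


class Interpreter:
    """Toy interpreter that owns a singly linked list of thread states."""

    def __init__(self):
        self.head = None
        self.head_lock = threading.Lock()  # guards self.head and the .next links

    def new_thread_state(self, name):
        # Writer side: a new thread links its state in under the list lock.
        state = ThreadState(name)
        with self.head_lock:
            state.next = self.head
            self.head = state
        return state

    def clear_key(self, key):
        # Reader side: follow the links under the same lock, so a state that
        # is only partly linked is never observed; edit the dicts afterwards.
        with self.head_lock:
            states = []
            state = self.head
            while state is not None:
                states.append(state)
                state = state.next
        removed = 0
        for state in states:
            if state.dict and key in state.dict:
                del state.dict[key]
                removed += 1
        return removed


interp = Interpreter()
main_state = interp.new_thread_state("main")
main_state.dict = {"local-key": object()}

# A second thread registers new states while the main thread clears the key.
worker = threading.Thread(
    target=lambda: [interp.new_thread_state(f"native-{i}") for i in range(1000)]
)
worker.start()
removed = interp.clear_key("local-key")
worker.join()
print("entries removed:", removed)
```

The model covers only the list traversal; it says nothing about how tstate->dict itself is protected in CPython.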
Linked PRs
- threading.local #100922
- threading.local (GH-100922). #100937
- threading.local (GH-100922). #100938
- threading.local (GH-100922) #100939
- HEAD_LOCK/HEAD_UNLOCK macros #100953