Bug Description
Back when we've got a support for the GC protocol, while it is not usual for __traverse__ to do anything more than calling visit.call, it was assumed that running any Python code is not harmful and currently __traverse__ trampoline is no more than an ordinary trampoline with a slightly different signature.
It turns out that not only this assumption is wrong but also any currently generated tp_traverse implementation is prone to crash. CPython has reused PyGC_Head fields to store intermediate refcounts for a long time, so tp_traverse should be never able to read or write any PyGC_Head fields.
In particular the __traverse__ trampoline would first allocate a new GILPool and apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching temporarily corrupted PyGC_Head (e.g. PyObject_GC_Untrack or PyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggests that PyGC_Head is corrupted and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).
Reproduction
Too lengthy to open by default
I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 in x86-64 Linux. Given the following src/lib.rs and typical Cargo.toml (which has been omitted):
use pyo3::{prelude::*, PyTraverseError, PyVisit};
use std::{cell::RefCell, thread};
#[pyclass]
struct Ref {
inner: RefCell<Option<Py<Self>>>,
clear_on_traverse: bool,
}
impl Ref {
fn new(clear_on_traverse: bool) -> Self {
Self {
inner: RefCell::new(None),
clear_on_traverse,
}
}
}
#[pymethods]
impl Ref {
fn __traverse__(&self, visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
if self.clear_on_traverse {
self.inner.take();
} else {
if let Some(inner) = &*self.inner.borrow() {
visit.call(inner)?;
}
}
Ok(())
}
fn __clear__(&mut self) {
self.inner.take();
}
}
#[pyfunction]
fn crash1() {
thread::spawn(move || {
_ = Python::with_gil(|py| -> PyResult<_> {
let a = Py::new(py, Ref::new(false))?;
a.borrow_mut(py).inner.replace(Some(a.clone_ref(py)));
let b = Py::new(py, Ref::new(false)))?;
Ok((a, b))
});
eprintln!("thread finished");
});
Ok(())
}
#[pyfunction]
fn crash2() {
thread::spawn(move || {
_ = Python::with_gil(|py| -> PyResult<_> {
let a1 = Py::new(py, Ref::new(true))?;
let a2 = Py::new(py, Ref::new(false))?;
a1.borrow_mut(py).inner.replace(Some(a2.clone_ref(py)));
a2.borrow_mut(py).inner.replace(Some(a1.clone_ref(py)));
let b = Py::new(py, Ref::new(false)))?;
Ok((a1, b))
});
eprintln!("thread finished");
});
}
#[pymodule]
#[pyo3(name = "pyo3_crash_test")]
fn module_entry(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_function(wrap_pyfunction!(crash1, m)?)?;
m.add_function(wrap_pyfunction!(crash2, m)?)?;
Ok(())
}
Running the following Python code will crash inside gc.collect():
import sys
import gc
import time
import pyo3_crash_test
pyo3_crash_test.crash1() # or crash2
time.sleep(0.1) # not strictly required, but ensures that the thread finishes before GC
print('collecting...', file=sys.stderr)
gc.collect()
assert False, 'crash expected during GC'
Here Ref is holding an optional reference to other Ref, possibly itself. Setting clear_on_traverse will clear the reference during the traversal, simulating an otherwise totally safe operation to do.
In the crash1 the refcounts for a and b will be 2 and 1 respectively at the time of GC. GC will overwrite PyGC_Head._gc_prev for both a and b (effectively converting the list to singly-linked), and then __traverse__ will be called for a, creating a new GILPool and triggering ReferencePool::update_counts. As both values are registered for Py_DECREF the refcount of b will be now 0, and this will eventually call PyObject_GC_Del which assumes that the list is still doubly-linked and crashes.
The situation is similar for crash2: a2 gets dropped with the GIL still acquired, so the refcounts for a1, a2 and b are 2, 1 and 1 respectively. Now assume that GILPool::new somehow recognizes __traverse__ and doesn't call update_counts. Even in this case a1 will now remove a reference to a2 during __traverse__, falsely beliving Py_DECREF is safe to call as the GIL is acquired, so the same thing will happen for a2.
Possible Solution
It seems evident to me that __traverse__ should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside __traverse__.
I believe it is also possible to have multiple nested __traverse__ calls. The GC process drains the linked list of objects while updating PyGC_Head and it sets gcstate->collecting to 1 during the process anyway, so __traverse__ cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):
- Creating new objects may trigger
PyObject_GC_Track, which implicitly calls tp_traverse in debug builds.
gc.get_referrers and gc.get_referrents use tp_traverse to collect objects, and are not affected by gcstate->collecting. They can be independently nested if __traverse__ somehow calls those functions.
- While directly calling
gc.collect() checks for gcstate->collecting and returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC) and some opcode may end up calling gc.collect() inbetween. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.
Probably it is sufficient to add the second "traversal" counter for the nested __traverse__ calls in addition to GIL_COUNT. If the traversal counter is non-zero the GIL cannot be acquired and gil_is_acquired should return false regardless of GIL_COUNT. The GIL_COUNT is still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which point update_counts should be considered (but only when GIL_COUNT was non-zero). But I'm not very much confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API from tp_traverse and someday CPython may touch PyGC_Head from unexpected functions---I mean, even gcstate->collecting check is not complete so who knows? If possible, a fully static solution to prevent anything but PyVisit would be much more desirable.
Suggestions
Given a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately GILPool::new call is mandatory for the trampoline and the only indirect approach would be temporarily untracking objects via PyObject_GC_UnTrack, but that pretty much contradicts why we have tp_traverse in the first place.
So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside __traverse__. No promises here, but I hope to file a small PR to disable update_counts from the trampoline soon. In my local testing this fixes crash1 but not crash2.
Metadata
Your operating system and version
x86-64 Linux (at least)
Your Python version (python --version)
3.8.10, 3.9.16, 3.10.11, 3.11.13
Your Rust version (rustc --version)
1.69.0
Your PyO3 version
0.18.3
How did you install python? Did you use a virtualenv?
deadsnakes PPA, the bundled version of venv, pip and the most recent version of maturin (0.15.2).
Bug Description
Back when we've got a support for the GC protocol, while it is not usual for
__traverse__to do anything more than callingvisit.call, it was assumed that running any Python code is not harmful and currently__traverse__trampoline is no more than an ordinary trampoline with a slightly different signature.It turns out that not only this assumption is wrong but also any currently generated
tp_traverseimplementation is prone to crash. CPython has reusedPyGC_Headfields to store intermediate refcounts for a long time, sotp_traverseshould be never able to read or write anyPyGC_Headfields.In particular the
__traverse__trampoline would first allocate a newGILPooland apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching temporarily corruptedPyGC_Head(e.g.PyObject_GC_UntrackorPyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggests thatPyGC_Headis corrupted and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).Reproduction
Too lengthy to open by default
I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 in x86-64 Linux. Given the following
src/lib.rsand typicalCargo.toml(which has been omitted):Running the following Python code will crash inside
gc.collect():Here
Refis holding an optional reference to otherRef, possibly itself. Settingclear_on_traversewill clear the reference during the traversal, simulating an otherwise totally safe operation to do.In the
crash1the refcounts foraandbwill be 2 and 1 respectively at the time of GC. GC will overwritePyGC_Head._gc_prevfor bothaandb(effectively converting the list to singly-linked), and then__traverse__will be called fora, creating a newGILPooland triggeringReferencePool::update_counts. As both values are registered forPy_DECREFthe refcount ofbwill be now 0, and this will eventually callPyObject_GC_Delwhich assumes that the list is still doubly-linked and crashes.The situation is similar for
crash2:a2gets dropped with the GIL still acquired, so the refcounts fora1,a2andbare 2, 1 and 1 respectively. Now assume thatGILPool::newsomehow recognizes__traverse__and doesn't callupdate_counts. Even in this casea1will now remove a reference toa2during__traverse__, falsely belivingPy_DECREFis safe to call as the GIL is acquired, so the same thing will happen fora2.Possible Solution
It seems evident to me that
__traverse__should be handled as if no GIL is (and can be) acquired. This implies thatPython::with_gilshould probably fail andgil_is_acquiredshould return false inside__traverse__.I believe it is also possible to have multiple nested
__traverse__calls. The GC process drains the linked list of objects while updatingPyGC_Headand it setsgcstate->collectingto 1 during the process anyway, so__traverse__cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):PyObject_GC_Track, which implicitly callstp_traversein debug builds.gc.get_referrersandgc.get_referrentsusetp_traverseto collect objects, and are not affected bygcstate->collecting. They can be independently nested if__traverse__somehow calls those functions.gc.collect()checks forgcstate->collectingand returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC) and some opcode may end up callinggc.collect()inbetween. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.Probably it is sufficient to add the second "traversal" counter for the nested
__traverse__calls in addition toGIL_COUNT. If the traversal counter is non-zero the GIL cannot be acquired andgil_is_acquiredshould return false regardless ofGIL_COUNT. TheGIL_COUNTis still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which pointupdate_countsshould be considered (but only whenGIL_COUNTwas non-zero). But I'm not very much confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API fromtp_traverseand someday CPython may touchPyGC_Headfrom unexpected functions---I mean, evengcstate->collectingcheck is not complete so who knows? If possible, a fully static solution to prevent anything butPyVisitwould be much more desirable.Suggestions
Given a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately
GILPool::newcall is mandatory for the trampoline and the only indirect approach would be temporarily untracking objects viaPyObject_GC_UnTrack, but that pretty much contradicts why we havetp_traversein the first place.So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside
__traverse__. No promises here, but I hope to file a small PR to disableupdate_countsfrom the trampoline soon. In my local testing this fixescrash1but notcrash2.Metadata
Your operating system and version
x86-64 Linux (at least)
Your Python version (
python --version)3.8.10, 3.9.16, 3.10.11, 3.11.13
Your Rust version (
rustc --version)1.69.0
Your PyO3 version
0.18.3
How did you install python? Did you use a virtualenv?
deadsnakesPPA, the bundled version ofvenv,pipand the most recent version ofmaturin(0.15.2).