Skip to content

Commit bf8982b

Browse files
committed
fix(global_cache): Py_IsFinalizing guards on dict-watcher + clear()
ARM64 pydebug shutdown SEGV root-cause: GlobalCacheKey holds raw PyDictObject* pointers to globals/builtins. When globals is freed during Py_FinalizeEx (interpreter_clear → PyDict_Clear), the cache entry remains in map_ + watch_map_[builtins]. Subsequent PyDict_EVENT_CLEARED on builtins triggers updateCache, which derefs cache.key().globals → use-after-free. Pydebug poison fill (0xdd...) makes the UAF a hard SEGV; x86_64 release reads stale-but-readable type pointer and silently passes the JIT_CHECK. Two surgical guards address the FINALIZE-SPECIFIC manifestation: Part A — Python/jit_support/phoenix_init.cpp phoenix_dict_watcher early-returns when Py_IsFinalizing(). Cache state is about to be destroyed in phoenix_free; processing events on tearing-down dicts is moot. Pattern matches pyjit.cpp:1642 funcDestroyed precedent. Part B — Python/jit/global_cache.cpp GlobalCacheManager::clear() short-circuits when Py_IsFinalizing(). Avoids two failure modes: 1. Ci_Watchers_UnwatchDict (PyDict_Unwatch) reads dict internals on freed memory → UAF under pydebug 2. ci_watcher_state_fini ran first in phoenix_free → watcher_id is -1 → JIT_CHECK 'Invalid dict watcher ID -1' abort (observed in earlier gate logs, regression historically) SCOPE HONESTY: this is FINALIZE-SPECIFIC protection, NOT full root-cause for the raw-pointer-aliasing class. Mid-execution module unload (refcount→0 on globals while cache holds raw pointer) remains unaddressed. Filed as W27 GlobalCacheKey raw-pointer lifecycle re-architecture (docs/w27-globalcachekey-lifecycle.md, Tier 6+ scope). The 314d0f2 'weak reference semantics' comment was FALSIFIED — no mechanism actually provided weak-reference semantics; comments updated to name the actual mechanisms (Py_IsFinalizing guard, finalize-only protection). Diagnostic evidence (per Debug-First protocol): lldb backtrace on ARM64 pydebug push 59 falsifier: PyType_HasFeature(type=0xdddddddddddddddd) object.h:966 <-- SEGV hasOnlyUnicodeKeys() dict.h:52 GlobalCacheManager::updateCache() global_cache.cpp:325 GlobalCacheManager::notifyDictClear() global_cache.cpp:144 phoenix_dict_watcher() phoenix_init.cpp:58 PyDict_Clear → interpreter_clear → Py_FinalizeEx Test plan post-commit: testkeeper ARM64 pydebug --clean rebuild + falsifier rerun (test_phoenix_jit_inline_except_closure under timeout+redirect) + push 58 baseline regression check (8b7b935 still passes). testkeeper x86_64 compile-only PASS verified pre-commit at binary timestamp 1776874355 (chat L1872 build verification). Authorization chain: supervisor option (i) at chat L1862, theologian LEVEL-1 ratification at chat L1861. Pythia python#75 python#1 (trigger isolation) gates downstream emitCallExceptionHandler queue, NOT this fix.
1 parent b44f075 commit bf8982b

3 files changed

Lines changed: 137 additions & 6 deletions

File tree

Python/jit/global_cache.cpp

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
#include "cinderx/Jit/global_cache.h"
44

55
#include "cinderx/Common/dict.h"
6+
#include "cinderx/Common/py-portability.h"
67
#include "cinderx/Common/util.h"
78
#include "cinderx/Common/watchers_c.h"
89
#include "cinderx/Jit/threaded_compile.h"
@@ -181,6 +182,28 @@ void GlobalCacheManager::notifyDictUnwatch(PyDictObject* dict) {
181182
}
182183

183184
void GlobalCacheManager::clear() {
185+
// During Py_FinalizeEx, watched dicts may already be torn down by
186+
// interpreter_clear (PyDict_Clear on globals/builtins). Two
187+
// failure modes if we proceed:
188+
// 1. Ci_Watchers_UnwatchDict (PyDict_Unwatch) reads dict internals →
189+
// use-after-free on poison-filled freed memory under pydebug.
190+
// 2. ci_watcher_state_fini ran first in phoenix_free, so the dict
191+
// watcher_id is -1 → PyDict_Unwatch returns -1 → JIT_CHECK abort
192+
// (observed historically in gate logs as 'Invalid dict watcher ID -1').
193+
//
194+
// Short-circuit: drop in-memory cache state without notifying anyone.
195+
// The map_ + watch_map_ destruct naturally when ~GlobalCacheManager
196+
// returns; Ref<PyUnicodeObject> in GlobalCacheKey targets immortal
197+
// interned strings, so destruction is safe.
198+
//
199+
// NOTE: this is FINALIZE-SPECIFIC protection paired with the
200+
// Py_IsFinalizing guard in phoenix_dict_watcher. Mid-execution module
201+
// unload (cache holds raw ptr to a globals/builtins dict that gets
202+
// freed) remains unaddressed — see W27 GlobalCacheKey raw-pointer
203+
// lifecycle re-architecture.
204+
if (Py_IsFinalizing()) {
205+
return;
206+
}
184207
std::vector<PyDictObject*> keys;
185208
for (auto& pair : watch_map_) {
186209
keys.push_back(pair.first);

Python/jit_support/phoenix_init.cpp

Lines changed: 27 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
#include "cinderx/Jit/frame.h"
1616
#include "cinderx/Common/code.h"
1717
#include "cinderx/Common/log.h"
18+
#include "cinderx/Common/py-portability.h"
1819
#include "cinderx/Common/watchers_c.h"
1920
#include "cinderx/Common/util.h"
2021
#include "cinderx/module_c_state.h"
@@ -30,6 +31,25 @@ static int phoenix_code_watcher(PyCodeEvent event, PyCodeObject* co) {
3031
static int phoenix_dict_watcher(
3132
PyDict_WatchEvent event, PyObject* dict_obj,
3233
PyObject* key_obj, PyObject* new_value) {
34+
/* During Py_FinalizeEx, dicts are torn down (interpreter_clear,
35+
module finalization). The GlobalCacheKey holds raw pointers to
36+
globals + builtins dicts — once one of those is freed, processing
37+
a CLEARED/MODIFIED event on the surviving companion would deref
38+
the freed pointer (poisoned 0xdd... fill under pydebug → SEGV).
39+
40+
Skipping events during finalize is correct semantics: the cache
41+
and its watch state are about to be destroyed in phoenix_free.
42+
Same pattern as pyjit.cpp:1642 (funcDestroyed during finalize).
43+
44+
NOTE: this is FINALIZE-SPECIFIC protection, NOT a full root-cause
45+
fix for the raw-pointer-aliasing class. Mid-execution module
46+
unload (refcount→0 on globals while cache holds raw pointer)
47+
remains unaddressed — filed as W27 GlobalCacheKey raw-pointer
48+
lifecycle re-architecture. */
49+
if (Py_IsFinalizing()) {
50+
return 0;
51+
}
52+
3353
auto state = cinderx::getModuleState();
3454
jit::IGlobalCacheManager* globalCaches =
3555
state != nullptr ? state->cacheManager() : nullptr;
@@ -59,12 +79,13 @@ static int phoenix_dict_watcher(
5979
break;
6080
case PyDict_EVENT_CLONED:
6181
case PyDict_EVENT_DEALLOCATED:
62-
/* Skip notifyDictUnwatch during deallocation/clone —
63-
the dict may be partially freed during GC, and
64-
notifyDictUnwatch accesses dict internals that can
65-
trigger use-after-free. The GlobalCacheManager will
66-
naturally stop watching when the dict pointer becomes
67-
invalid (weak reference semantics). */
82+
/* Skip notifyDictUnwatch during deallocation/clone outside
83+
finalize — the dict may be partially freed during GC, and
84+
notifyDictUnwatch could call Ci_Watchers_UnwatchDict on a
85+
companion dict whose tp_dealloc is concurrent (the
86+
2026-03-30 GC re-entrancy crash, fix 314d0f2310). The
87+
finalize-specific class is now handled by the
88+
Py_IsFinalizing early-return above. */
6889
break;
6990
}
7091
return 0;
Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,87 @@
1+
# W27 — GlobalCacheKey Raw-Pointer Lifecycle Re-Architecture
2+
3+
**Status:** Filed (generalist, 2026-04-22, supervisor-authorized [chat ~16:04Z])
4+
**Owner:** unassigned (Tier 6+ scope)
5+
**Schedule:** Post-Tier-5-close; not blocking any current work
6+
**Origin:** push 59 ARM64 pydebug SEGV root-cause (lldb @ global_cache.cpp:325)
7+
8+
---
9+
10+
## 1. Problem statement
11+
12+
`jit::GlobalCacheKey` (Python/jit/global_cache.h) holds **raw `PyDictObject*` pointers** to a function's `globals` and `builtins` dicts:
13+
14+
```cpp
15+
struct GlobalCacheKey {
16+
PyDictObject* builtins; // raw pointer
17+
PyDictObject* globals; // raw pointer
18+
Ref<PyUnicodeObject> name; // refcounted
19+
...
20+
};
21+
```
22+
23+
These pointers can outlive the dicts they reference. `GlobalCacheManager` watches each dict via the dict-watcher API, but the **PyDict_EVENT_DEALLOCATED** handler is intentionally a no-op (commit 314d0f2310, March 2026) due to a 2026-03-30 GC re-entrancy crash in `notifyDictUnwatch`.
24+
25+
Result: when `globals` is freed while `builtins` is still alive, the cache entry remains in `map_` with a stale `globals` pointer. Subsequent events on `builtins` (e.g. `notifyDictClear`) call `updateCache(cache, builtins, nullptr)`, which dereferences `cache.key().globals` → use-after-free.
26+
27+
## 2. Observed manifestation
28+
29+
- **Push 59** (b44f0752eb, 2026-04-22): falsifier `test_phoenix_jit_inline_except_closure` triggered ARM64 pydebug SEGV.
30+
- lldb backtrace: `phoenix_dict_watcher → notifyDictClear → updateCache → hasOnlyUnicodeKeys → PyType_HasFeature(0xdddd…)`.
31+
- Pattern: pydebug poison-fill on freed PyTypeObject reached via stale `globals` pointer.
32+
- x86_64-release silent because freed memory still contains stale-but-readable type pointer.
33+
34+
## 3. Finalize-phase mitigation (LANDED, this commit)
35+
36+
Two defensive guards address the **finalize-specific** manifestation:
37+
- `phoenix_dict_watcher`: early-return on `Py_IsFinalizing()` — skip processing all dict events during shutdown.
38+
- `GlobalCacheManager::clear()`: short-circuit on `Py_IsFinalizing()` — drop in-memory state without calling `Ci_Watchers_UnwatchDict` (watcher infra also torn down by then).
39+
40+
These DO NOT address the deeper class.
41+
42+
## 4. Unaddressed class
43+
44+
**Mid-execution module unload**. If a Python module is unloaded (`del sys.modules[name]`, refcount on its `__dict__` reaches 0, dict deallocates) while the JIT cache holds a `GlobalCacheKey` keyed on that dict, the same UAF triggers — but `Py_IsFinalizing()` is `false`, so neither guard fires.
45+
46+
Reproduction sketch (not yet tested):
47+
```python
48+
import sys, weakref, gc
49+
import some_jit_compiled_module as m
50+
m.compiled_function() # registers global cache entry
51+
ref = weakref.ref(m.__dict__)
52+
del sys.modules['some_jit_compiled_module']
53+
del m
54+
gc.collect()
55+
assert ref() is None # globals freed
56+
# Any subsequent event on builtins that touches the stale cache → UAF
57+
```
58+
59+
## 5. Re-architecture options
60+
61+
### Option A: Refcount the cached dicts
62+
Hold `Ref<PyDictObject>` instead of raw `PyDictObject*` in `GlobalCacheKey`. Pro: simplest. Con: extends dict lifetime, may interfere with module unload semantics (modules expected to be GC'd after `del sys.modules[name]`).
63+
64+
### Option B: Weak references via dict watcher
65+
Implement true weak-reference semantics by handling `PyDict_EVENT_DEALLOCATED` correctly. Requires resolving the 2026-03-30 GC re-entrancy crash. Suggested approach: defer cache cleanup to a per-thread queue drained on the next safe point, rather than synchronous `notifyDictUnwatch` from inside the watcher callback.
66+
67+
### Option C: Cache validation on read
68+
Mark cache entries invalid via a separate shadow map indexed by dict pointer; check validity at every cache read. Pro: lifecycle-independent. Con: per-read overhead.
69+
70+
### Option D: Restructure cache key
71+
Drop `globals`/`builtins` raw pointers from the key entirely; key on the JIT'd function's identity, look up dicts dynamically. Pro: no aliasing issue. Con: changes hash semantics, cache invalidation gets complex.
72+
73+
Likely best path: **Option B** with deferred-cleanup queue. Matches CPython's "watcher callback should not mutate watched state" guidance and resolves the 2026-03-30 re-entrancy.
74+
75+
## 6. Sequencing
76+
77+
- Pre-req: Tier 5 close (zero C++ in builder.cpp emit methods)
78+
- Falsifier needed before fix lands: mid-execution module-unload reproducer
79+
- Estimated scope: 3-4 hours for Option B, 1-2 hours for Option A (with caveats)
80+
- Risk: medium — touches global cache invariants, requires re-verification of all global-cache-related tests
81+
82+
## 7. Cross-references
83+
84+
- Push 59 root-cause analysis: chat 2026-04-22 ~15:55-16:04Z
85+
- Original DEALLOCATED-skip: 314d0f2310 (Alex, 2026-03-30)
86+
- Finalize-phase mitigation: this commit (paired with push 59)
87+
- Related lifecycle pattern: feedback_no_workarounds.md

0 commit comments

Comments
 (0)