Commit 9a0fc0e

W-PYTORCH-CM-(ii): park behind pure-C JIT roadmap with failing-test sentinel
Adds a regression-test sentinel + consolidated known-bug entry for the StoreAttr managed-dict tag-flip corruption (W-PYTORCH-CM-(ii)) per Alex 2026-04-27T07:12:25Z (D-1777270945) and supervisor cascade 07:13:18Z: "make sure this bug (ii) has a failing test ... fix it _after_ getting the whole project to pure C".

Lib/test/test_phoenix_jit_storeattr_managed_dict_tag_flip.py
  @unittest.expectedFailure subprocess test running the canonical /tmp/repro_s3.py harness verbatim (preserved byte-equivalent to avoid perturbing the timing-sensitive trigger; D-1777190733). SEGV in the subprocess crashes the harness, not the unittest runner. CI reports "expected failure"; an "unexpectedSuccess" indicates the bug has been fixed and the decorator should be removed.

docs/known-bugs/bug-ii-storeattr-corruption.md
  Consolidates the source-trace from docs/w-pytorch-cm-tooling-note.md: symptom (NULL+0xAB SEGV in PyDict_SetItem), mechanism (LSB-clear at obj+0x18 flips PEP 697 IsValues to IsDict misinterpretation), 5-class hypothesis enumeration ((a)/(b)/(d) OPEN, (c)/(e) FALSIFIED with cited evidence), trigger-sensitivity caveat, parking rationale, resumption gate, and pointers to the two heavy-tier instrumentation designs already on disk (tp_alloc-watchpoint + allocate-counter; both ~200 LOC + rebuild, heavy-tier-authorization gated).

Verification:
  ./python_bench /tmp/repro_s3.py -> exit 139 (SEGV)
  ./python_bench -m test test_phoenix_jit_storeattr_managed_dict_tag_flip -v -> "expected failures=1", framework exit 0

No JIT/build changes; test exercises the existing JIT via subprocess + cinderjit.force_compile.
1 parent f97cf34 commit 9a0fc0e

2 files changed

Lines changed: 317 additions & 0 deletions

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
"""W-PYTORCH-CM-(ii) parked-bug regression test.

Reproduces the runtime StoreAttr managed-dict tag-flip corruption that
fires under ``Tools/benchmark_phoenix.py:bench_pytorch_cm`` after a
5,000-iter warmup, ``cinderjit.force_compile``, and a 50,000-iter run.
Symptom is a SEGV at PyDict_SetItem (NULL+0xAB deref through
Py_TYPE(NULL)->tp_flags), caused by an LSB-clear write at ``obj + 0x18``
flipping the PEP 697 managed-dict tag from IsValues (low bits 0b111) to
IsDict (low bit 0). The IsDict misinterpretation hands the slow path a
non-pointer as the dict, so the next slot write dereferences NULL.

Per supervisor 2026-04-27T07:13:18Z (cascading Alex 2026-04-27T07:12:25Z
"make sure this bug (ii) has a failing test ... fix it _after_ getting
the whole project to pure C"): the bug is parked behind the pure-C JIT
roadmap. Heavy-tier instrumentation designs for the writer hunt are on
disk at ``docs/w-pytorch-cm-tp-alloc-watchpoint-design.md`` and
``docs/w-pytorch-cm-allocate-counter-design.md``; full source-trace at
``docs/w-pytorch-cm-tooling-note.md`` and the consolidated parked-bug
entry at ``docs/known-bugs/bug-ii-storeattr-corruption.md``.

The test runs the canonical repro (``/tmp/repro_s3.py`` content,
preserved here verbatim) in a subprocess so a SEGV crashes the harness,
not the unittest runner. ``@unittest.expectedFailure`` swallows the
resulting ``AssertionError`` so the parked bug does not block CI; an
``unexpectedSuccess`` (subprocess returns 0 with ``S3 OK`` on stdout)
signals the bug has been fixed and the decorator should be removed.

Trigger sensitivity caveat (per ``docs/w-pytorch-cm-tooling-note.md``
D-1777190733): the bug is timing-sensitive. A Python ``__enter__``
wrapper around the workload was already shown to perturb JIT timing
enough to suppress the trigger. The test therefore execs the repro
verbatim through ``sys.executable -c`` rather than wrapping it in
unittest scaffolding.
"""

import os
import subprocess
import sys
import textwrap
import unittest
from pathlib import Path

try:
    import _cinderx  # noqa: F401
    import cinderjit  # noqa: F401
    HAS_JIT = True
except ImportError:
    HAS_JIT = False


REPO_ROOT = Path(__file__).resolve().parents[2]
TOOLS_DIR = REPO_ROOT / "Tools"


# Canonical repro from /tmp/repro_s3.py (228 bytes, 7 LOC). Preserved
# byte-equivalent: any edit risks perturbing the timing-sensitive
# trigger.
HARNESS_SOURCE = textwrap.dedent(
    """\
    import sys; sys.path.insert(0, 'Tools')
    import _cinderx, cinderjit
    from benchmark_phoenix import bench_pytorch_cm
    bench_pytorch_cm(5000) # warmup
    cinderjit.force_compile(bench_pytorch_cm)
    bench_pytorch_cm(50000)
    print("S3 OK")
    """
)


@unittest.skipUnless(HAS_JIT, "requires cinderjit")
@unittest.skipUnless(
    (TOOLS_DIR / "benchmark_phoenix.py").exists(),
    "Tools/benchmark_phoenix.py not present",
)
class TestStoreAttrManagedDictTagFlip(unittest.TestCase):
    """Parked-bug oracle for W-PYTORCH-CM-(ii)."""

    @unittest.expectedFailure
    def test_pytorch_cm_no_segv_after_force_compile(self):
        """Subprocess runs the canonical repro; expects clean exit + 'S3 OK'.

        Currently parked: subprocess SEGVs (returncode != 0) and the
        AssertionError is swallowed by @expectedFailure. When the LSB-clear
        writer is identified and fixed (see docs/known-bugs/
        bug-ii-storeattr-corruption.md), this test will pass and surface
        as 'unexpectedSuccess'; at that point remove the decorator.
        """
        proc = subprocess.run(
            [sys.executable, "-c", HARNESS_SOURCE],
            cwd=str(REPO_ROOT),
            capture_output=True,
            text=True,
            timeout=300,
        )
        self.assertEqual(
            proc.returncode,
            0,
            msg=(
                "bench_pytorch_cm(50000) post-force_compile crashed "
                "(rc={rc}); see docs/known-bugs/bug-ii-storeattr-"
                "corruption.md\nstdout:\n{out}\nstderr (last 40 lines):\n{err}"
            ).format(
                rc=proc.returncode,
                out=proc.stdout,
                err="\n".join(proc.stderr.splitlines()[-40:]),
            ),
        )
        self.assertIn(
            "S3 OK",
            proc.stdout,
            "harness completed without SEGV but did not print 'S3 OK'",
        )


if __name__ == "__main__":
    unittest.main()
Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
# W-PYTORCH-CM-(ii): StoreAttr managed-dict tag-flip corruption

**Status:** PARKED behind pure-C JIT roadmap completion (Alex
2026-04-27T07:12:25Z, supervisor cascade 07:13:18Z; D-1777270945).
Failing-test sentinel:
``Lib/test/test_phoenix_jit_storeattr_managed_dict_tag_flip.py``
(``@unittest.expectedFailure``).

**Workstream history:** D-1777180692 state-of-knowledge brief +
``docs/w-pytorch-cm-tooling-note.md`` running investigation log.
W-PYTORCH-CM was split 2026-04-26T11:37Z into (i) compile-time
type-confusion (FIXED at push 63 by adding
``hir_c_primitive_compare_op`` accessor) and (ii) the runtime
StoreAttr corruption documented here. (ii) is structurally
INDEPENDENT of (i) per testkeeper valgrind discriminator
2026-04-26T11:37:15Z (D2 LSB transition still captured post-(i)
fix).

## Symptom

```
$ ./python /tmp/repro_s3.py
... (50,000-iter bench_pytorch_cm post-force_compile) ...
Segmentation fault (core dumped)
```

Crash is a NULL+0xAB deref inside ``PyDict_SetItem`` reaching
``Py_TYPE(NULL)->tp_flags`` (offset 0xAB into ``PyTypeObject``).
Confirmed via ASAN on push 63 (testkeeper 2026-04-26T13:03Z): the
SEGV is a **downstream consequence** of an LSB-clear at
``obj + 0x18``, not a wild write or UAF.
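
For readers mapping the shell message above to what the sentinel test sees, here is a minimal sketch, assuming a POSIX host, of how death by SIGSEGV surfaces in ``subprocess``. The child below raises the signal on itself as a stand-in; it does not run the repro, and the variable names are illustrative only.

```python
import signal
import subprocess
import sys

# Stand-in child: kills itself with SIGSEGV (POSIX only, not the real repro).
CHILD = "import signal; signal.raise_signal(signal.SIGSEGV)"

proc = subprocess.run([sys.executable, "-c", CHILD], capture_output=True)

# subprocess reports death-by-signal as the negated signal number.
killed_by_segv = proc.returncode == -signal.SIGSEGV

# A shell reports the same death as 128 + signal number.
shell_status = 128 + int(signal.SIGSEGV)

print(killed_by_segv, shell_status)
```

Either representation is non-zero, which is all the sentinel's ``assertEqual(proc.returncode, 0)`` check relies on.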

## Mechanism (narrowed; writer un-localized)

PEP 697 managed-dict encoding stores ``(char*)values_ptr - 1`` in the
slot at ``obj + 0x18``. 8-aligned addresses end in ``0x0`` / ``0x8``,
so the encoded form ends in ``0x7`` / ``0xF`` (low 3 bits ``0b111``)
when IsValues is set. ``IsDict`` is signalled by LSB == 0.

Sequence observed in the repro:

1. ``D2[0]`` snapshot: slot byte 0 = ``0x97`` (correct IsValues
   encoding for ``values_ptr = 0x98``; T2.5 confirmed ``0x98`` is the
   heavily-recycled values chunk).
2. ``D2[1]`` snapshot at the same ``obj`` address: slot byte 0 =
   ``0x96``, exactly one bit cleared (LSB).
3. The IsDict path reads the now-LSB-zero word as a ``PyDictObject*``;
   ``ob_type`` at offset 8 of ``0x96`` is NULL/junk.
4. ``PyDict_SetItem`` is called with that NULL dict and SEGVs at
   ``Py_TYPE(NULL)->tp_flags``.
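
The tag arithmetic in the sequence above can be checked with a few lines of Python. This is a sketch of the encoding as described in this document, not the runtime's code: the helper names are hypothetical, and the low byte ``0x98`` is used as a stand-in for the full values pointer.

```python
TAG_MASK = 0x1  # LSB set means IsValues; LSB clear means IsDict


def encode_values(values_ptr: int) -> int:
    """Tagged slot word for an 8-aligned values pointer (pointer minus one)."""
    assert values_ptr % 8 == 0, "values pointer must be 8-aligned"
    return values_ptr - 1


def is_values(slot_word: int) -> bool:
    """True when the slot word carries the IsValues tag."""
    return bool(slot_word & TAG_MASK)


def decode_values(slot_word: int) -> int:
    """Recover the values pointer from a tagged slot word."""
    assert is_values(slot_word)
    return slot_word + 1


good = encode_values(0x98)  # low byte of the slot in the D2[0] snapshot
bad = good & ~TAG_MASK      # the single-bit LSB clear seen in D2[1]

print(hex(good), is_values(good))
print(hex(bad), is_values(bad), bad % 8)
```

After the clear, the word no longer carries the IsValues tag and is not 8-aligned either, so treating it as a dict pointer cannot yield a valid object.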

**Class-invariant pattern:** byte 0 of ``obj + 0x18`` for ``_NoGrad``
instances allocated by ``Tools/benchmark_phoenix.py:bench_pytorch_cm``
gets its low bit cleared. Pattern cannot result from any vanilla
CPython slot write (writes ``0x97`` IsValues, ``0x00`` NULL, or an
8-aligned dict pointer ending ``0x0`` / ``0x8``).
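
The "cannot result from any vanilla write" argument reduces to a check on the low three bits of the slot. A minimal sketch, assuming only the three legitimate write shapes listed above; ``vanilla_write_possible`` is a hypothetical helper, not a function in the codebase.

```python
def vanilla_write_possible(low_byte: int) -> bool:
    """True if the slot's low byte matches one of the legitimate writes."""
    low3 = low_byte & 0b111
    # IsValues encoding: (8-aligned pointer) - 1  -> low three bits 0b111.
    # NULL or an 8-aligned dict pointer           -> low three bits 0b000.
    return low3 in (0b111, 0b000)


observed = {"D2[0]": 0x97, "D2[1]": 0x96}
verdict = {name: vanilla_write_possible(byte) for name, byte in observed.items()}
print(verdict)
```

``0x96`` has low bits ``0b110``, which none of the three write shapes produces.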

**Source-level audit (Phoenix Python/cinderx + Python/jit):** NO
direct writes to ``obj + 0x18``. The Phoenix source only READS via
``_PyObject_DictOrValuesPointer`` (e.g. ``SplitMutator::setAttr`` /
``getAttr``) using the correct macros.

**JIT-emit caveat (pythia #156 #1):** the source-grep audit covers
source-level writes only. JIT-emitted machine-code writes (Phoenix
runtime helpers, JIT-emitted prologues) are NOT testable by source
grep. The writer for the LSB-clear remains undischarged by the
audit.

## Hypothesis classes after cheap-tier exhaustion

(2026-04-26T14:21:09Z: discriminator-saturated, GENUINE PAUSE called
by supervisor). Five candidates were enumerated. Two are FALSIFIED:
(c), and (e) with one residual cheap-tier check still unrun. Three
remain OPEN: (a), (b), and (d).

| Class | Description | Status |
|-------|-------------|--------|
| (a) | Narrow 1-byte writer at ``obj+0x18`` byte 0 (AND-with-~1, sub-1, or direct ``0x96`` store) | OPEN |
| (b) | Wider write clipping LSB (2/4/8-byte store whose low byte happens to be ``0x96``) | OPEN |
| (c) | Wild write / UAF coincidentally LSB-aligned at ``obj+0x18`` | FALSIFIED (ASAN on push 63: crash is NULL+0xAB deref, not UAF; LSB-clear is the cause not the corruption itself) |
| (d) | Two-instance conflation: ``D2[0]`` and ``D2[1]`` are different recycled instances at the same address; no single-instance mutation occurred | OPEN (cannot be discriminated from header bytes: refcnt + type_ptr + first8 identical for fresh ``_NoGrad`` instances; testkeeper 2026-04-26T14:20:44Z) |
| (e) | Cache-load-side: ``TypeAttrCache`` value slot baked into JIT'd code at compile, racing with cache-slot writer → JIT loads torn value → STORE writes corrupted value to ``obj+0x18`` | FALSIFIED at the per-frame SEGV-site enumeration (3/3 cache slots tested by hardware-watchpoint: TYPE 0xd33020, VALUE 0xd42018, ``cache_`` 0xd5b2a0 all stable post-fill, no runtime writes during workload). RESIDUAL CHEAP-TIER UNRUN: broader objdump-grep across compile-unit cache-load immediates not enumerated. |
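
One way to see why classes (a) and (b) stay open together: every candidate write shape leaves the same byte behind, so a before/after snapshot of byte 0 cannot tell them apart. The sketch below is illustrative arithmetic only; the 8-byte store value is made up for the example.

```python
BEFORE = 0x97  # slot byte 0 in the D2[0] snapshot

candidates = {
    # Class (a): narrow 1-byte writers.
    "and_not_1": BEFORE & ~1 & 0xFF,  # AND-with-~1
    "sub_1": (BEFORE - 1) & 0xFF,     # subtract one
    "store_0x96": 0x96,               # direct store
}

# Class (b): a wider store whose low byte happens to be 0x96.
# The value is hypothetical and chosen only to end in 0x96.
wide_store = 0x0000_7F00_1234_5696
candidates["wide_8_byte"] = wide_store & 0xFF

distinct_results = set(candidates.values())
print({name: hex(value) for name, value in candidates.items()})
```

All four paths end at ``0x96``, so separating (a) from (b) needs the faulting instruction itself rather than the resulting byte.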

## Trigger sensitivity

Bug is **TIMING-SENSITIVE** (D-1777190733, testkeeper 2026-04-26T08:03Z):
the original LSB=0 trigger DID NOT reproduce when the workload was
wrapped in a Python ``__enter__`` context manager. Wrapper added
~100µs/iter of Python interpreter overhead, shifting JIT-call-counter
timing relative to the auto-compile threshold and thereby evading the
trigger window.

Implication for instrumentation: any printf-class observer that adds
Python-level overhead may also evade. Heavy-tier discriminators
(C-side allocate-counter, hardware watchpoint via ``tp_alloc`` hook)
are the next observability tier.

## Reproducer

``/tmp/repro_s3.py`` (228 bytes, preserved verbatim in
``Lib/test/test_phoenix_jit_storeattr_managed_dict_tag_flip.py`` as
the test's HARNESS_SOURCE):

```python
import sys; sys.path.insert(0, 'Tools')
import _cinderx, cinderjit
from benchmark_phoenix import bench_pytorch_cm
bench_pytorch_cm(5000) # warmup
cinderjit.force_compile(bench_pytorch_cm)
bench_pytorch_cm(50000)
print("S3 OK")
```

``bench_pytorch_cm`` is a self-contained
``Tools/benchmark_phoenix.py`` benchmark exercising nested context
managers (``_NoGrad`` / ``_Autocast`` / ``_ProfileScope``), the
PyTorch-style pattern that prompted the workstream name. No
``torch`` runtime dependency.

## Heavy-tier instrumentation designs (on disk, un-implemented)

Both ~200 LOC + rebuild; gated on heavy-tier authorization (Alex
direction OR explicit team auth) per governance D-1777190699. Both
documented by theologian under the 2026-04-26 stand-down and ready
for resumption.

### tp_alloc hardware watchpoint
``docs/w-pytorch-cm-tp-alloc-watchpoint-design.md``

Hook ``_NoGrad`` ``tp_alloc``; on each allocation set a 1-byte
hardware watchpoint (DR0-DR3) on ``obj + 0x18`` with write-only
trigger; SIGTRAP handler captures ``RIP`` + backtrace + register
dump. Discriminates (a) narrow 1-byte writer vs (b) wider clipping
write directly from the faulting instruction. (d) two-instance
conflation manifests as "watchpoint never fires on watched instance
even though ``D2`` captures the LSB transition on a different
recycled instance".

### Allocate-counter side-table
``docs/w-pytorch-cm-allocate-counter-design.md``

Add a 64-bit monotonic ``alloc_id`` per ``_NoGrad`` instance via a
hash-table side-table (keyed on ``obj`` pointer; populated at
``init_inline_values``, looked up at the ``D2`` print site).
Discriminates (d) instance conflation from (a)/(b) single-instance
mutation by comparing ``D2[0].alloc_id`` to ``D2[1].alloc_id`` at
the same ``obj`` address.
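
The comparison logic of the allocate-counter design can be sketched in Python. This models the decision rule only: the real design is C-side, because Python-level observers may evade the trigger. The class, function names, and address are hypothetical.

```python
import itertools


class AllocSideTable:
    """Maps an object address to the id of the allocation living there."""

    def __init__(self):
        self._next_id = itertools.count(1)  # monotonic alloc_id source
        self._by_addr = {}

    def on_alloc(self, addr: int) -> int:
        """Record a new allocation; a recycled address gets a fresh id."""
        alloc_id = next(self._next_id)
        self._by_addr[addr] = alloc_id
        return alloc_id

    def lookup(self, addr: int) -> int:
        return self._by_addr[addr]


def classify(id_at_d2_0: int, id_at_d2_1: int) -> str:
    """Apply the decision rule to the ids seen at the two snapshots."""
    if id_at_d2_0 == id_at_d2_1:
        return "single-instance mutation: (a) or (b)"
    return "two-instance conflation: (d)"


table = AllocSideTable()
ADDR = 0x7F00_0000_1000      # hypothetical obj address

table.on_alloc(ADDR)
first = table.lookup(ADDR)   # taken at the D2[0] snapshot
table.on_alloc(ADDR)         # address recycled for a new instance
second = table.lookup(ADDR)  # taken at the D2[1] snapshot

print(classify(first, second))
```

Equal ids at both snapshots would instead point to a single instance being mutated in place.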

**Recommended ordering** (per
``w-pytorch-cm-tp-alloc-watchpoint-design.md`` §"Comparison"): if
only one design is authorized, run ``tp_alloc`` watchpoint first:
it directly identifies the writer when (a) or (b) holds. If the
watchpoint never fires on the watched instance during a confirmed
``D2`` transition, (d) becomes the load-bearing hypothesis and the
allocate-counter design is then run.

## Why parked (Alex 2026-04-27T07:12:25Z)

Bug only fires under the contrived ``repro_s3.py`` 50,000-iter
workload after explicit ``force_compile``. Not seen in:

- The CinderX prod codebase (``cinderx_dev`` oracle PASS;
  D-1775658159 11-day Alex prior-art).
- The regular Phoenix test suite (480-test x86_64 + 483-test ARM64
  runs).
- The 24-benchmark ABBA + per-commit 4-benchmark gate.

The fix-class falsifier (cinderx_dev oracle) shows core Cinder is
structurally immune to this bug; Phoenix introduced it. Per
``feedback_assume_phoenix_regression.md`` the bug is presumed
Phoenix-introduced and warrants a real fix, but Alex's 07:12:25Z
direction sequences it after the pure-C JIT roadmap is complete.

## Resumption gate

Before re-engaging the writer hunt:

1. Re-confirm ``Lib/test/test_phoenix_jit_storeattr_managed_dict_tag_flip.py``
   still ``expectedFailure``s on the current HEAD (subprocess SEGV
   reproducible).
2. Read ``docs/w-pytorch-cm-tooling-note.md`` for the full
   investigation log including 6 falsified hypotheses, the 3-cycle
   D8 + T2.5 reconciliation, and the
   ``shouldSkipCompilation``-skip-list anti-pattern warning (pythia
   #154 #4).
3. Choose ``tp_alloc``-watchpoint, allocate-counter, or both per
   the comparison table in
   ``w-pytorch-cm-tp-alloc-watchpoint-design.md``.
4. Heavy-tier authorization required per governance D-1777190699 +
   D-1777270945 (Alex parking decision; resumption is the trigger to
   re-engage).
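
For step 1, it may help to recall how ``unittest`` reports the two outcomes of an ``@unittest.expectedFailure`` test. The sketch below uses a toy test case with a flag in place of the real subprocess check; ``Sentinel`` and ``run_sentinel`` are illustrative names, not part of the repository.

```python
import io
import unittest


class Sentinel(unittest.TestCase):
    bug_present = True  # toggled below to model "parked" vs "fixed"

    @unittest.expectedFailure
    def test_repro_exits_cleanly(self):
        # Stand-in for the subprocess returncode check in the real test.
        returncode = 1 if self.bug_present else 0
        self.assertEqual(returncode, 0)


def run_sentinel(bug_present: bool) -> unittest.TestResult:
    """Run the toy sentinel once with the bug modelled as present or fixed."""
    Sentinel.bug_present = bug_present
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Sentinel)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


fixed = run_sentinel(bug_present=False)
parked = run_sentinel(bug_present=True)

print(len(parked.expectedFailures), parked.wasSuccessful())
print(len(fixed.unexpectedSuccesses), fixed.wasSuccessful())
```

The ``parked`` result is what step 1 should still observe. The ``fixed`` result makes the run unsuccessful, which is the cue to remove the decorator.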

## Anti-pattern (do not adopt)

Per pythia #154 #4 + ``feedback_no_workarounds.md``: the path of
least resistance after a multi-pivot investigation is appending
``_NoGrad`` / ``_Autocast`` / context-manager types to Phoenix's
``shouldSkipCompilation`` skip-list (``pyjit.cpp``). That is a
WORKAROUND: it preserves the underlying bug class for future
managed-dict types to re-trigger. Resumption agent must root-cause
the LSB-clear writer; do NOT extend the skip-list.
