feat(test-fill,test-specs): add t8n call caching to test specs by danceratopz · Pull Request #2084 · ethereum/execution-specs

danceratopz · 2026-01-27T13:21:19Z

🗒️ Description

Add an in-memory output cache for transition tool results during fixture generation. When multiple fixture formats share the same t8n inputs (e.g., blockchain_test and blockchain_test_engine), the cache eliminates redundant t8n calls.

Changes

Add OutputCache class that stores t8n results per test, keyed by call counter.
Add strip_fixture_format_from_node() to derive consistent cache keys across fixture formats.
Integrate cache lookup/store in TransitionTool.evaluate(), used by BlockchainTest.generate_block_data().
Add xdist_group markers to keep related formats on the same worker.
Sort test items during collection: slow tests first (LPT scheduling), cacheable formats grouped together, deterministic order.

Display cache hit/miss statistics in the pytest session summary.

=  T8n cache: 100% hit rate (29454/29454 tests expected), 32822 t8n calls saved =
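The single-key cache described in the changes above can be sketched as follows. This is a minimal illustration that assumes results are stored under one per-test key and indexed by a call counter; the class name `SketchOutputCache` and its method signatures are hypothetical and are not the PR's actual `OutputCache` API.

```python
from typing import Any, Dict, Optional


class SketchOutputCache:
    """Hold t8n results for one cache key at a time, indexed by call counter."""

    def __init__(self) -> None:
        self.key: Optional[str] = None
        self._cache: Dict[int, Any] = {}
        self.hits = 0
        self.misses = 0

    def set_key(self, key: str) -> None:
        """Switch to a key; results stored under a different key are evicted."""
        if key != self.key:
            self._cache.clear()
            self.key = key

    def get(self, call_counter: int) -> Optional[Any]:
        """Return the cached result for this call, or None on a miss."""
        result = self._cache.get(call_counter)
        if result is not None:
            self.hits += 1
        else:
            self.misses += 1
        return result

    def set(self, call_counter: int, result: Any) -> None:
        """Store the result of the n-th t8n call for the current key."""
        self._cache[call_counter] = result

    def clear(self) -> None:
        """Drop all stored results and reset the key."""
        self._cache.clear()
        self.key = None


cache = SketchOutputCache()
cache.set_key("test_foo[fork_Cancun]")      # first format, e.g. blockchain_test
first_lookup = cache.get(0)                 # miss: t8n has to run
cache.set(0, "t8n-output-for-block-0")
cache.set_key("test_foo[fork_Cancun]")      # second format, same key
second_lookup = cache.get(0)                # hit: the t8n call is skipped
cache.set_key("test_bar[fork_Cancun]")      # a new test evicts the old entries
third_lookup = cache.get(0)                 # miss again
```

Keeping only one key alive at a time bounds memory use, but it only pays off if the formats that share a key run back to back on the same worker, which is what the grouping and sorting changes are for.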

Performance

The t8n-call-cache-specs branch reduces the py3 fixture fill step duration by 31% compared to other recent PRs (mean 18:58 vs 27:35, n=3 vs n=15).
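The 31% figure follows directly from the two mean durations quoted above; a quick check (the helper `to_seconds` is introduced here only for illustration):

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'MM:SS' duration string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


baseline = to_seconds("27:35")    # mean of the 15 baseline runs
with_cache = to_seconds("18:58")  # mean of the 3 runs from this PR
reduction = (baseline - with_cache) / baseline
print(f"{reduction:.1%}")
```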

Methodology

The analysis compares the duration of the "Run py3 tests" step in the py3 job for this PR against the same step in other PRs.

Metric: Duration of the "Run py3 tests" step within the py3 CI job (GitHub Actions test.yaml workflow).
Baseline: 15 successful runs from other PRs, filtered to runs created on or after 2026-02-12. This date cutoff ensures all baseline runs include the xdist worker count change from #2120 (merged Feb 12), which changed CI xdist parallelism - making the comparison fair.

Notes on the Achieved vs Expected Gain

The numbers below regarding % of t8n overhead shouldn't be taken too seriously, but it was fun to try and derive these 🙂

The py3 CI job generates 2 fixture formats for BlockchainTest specs and up to 3 for StateTest specs (blockchain_test_engine_x is not yet part of py3). The cache stores t8n output from blockchain_test and replays it for blockchain_test_engine, which shares the same transition_tool_cache_key.

Test item counts (`--collect-only`, `py3` tox environment)

| Format | Count | Cached |
|---|---|---|
| `state_test` | 23,422 | no |
| `blockchain_test` (native) | ~11,900 | no (generates cache) |
| `blockchain_test_engine` (native) | ~11,900 | yes |
| `blockchain_test_from_state_test` | 23,414 | no (generates cache) |
| `blockchain_test_engine_from_state_test` | 18,765 | yes |
| Total | 89,401 | |

Native blockchain counts are estimated as an equal split of the 23,800 native blockchain items (65,979 total blockchain minus 23,414 and 18,765 derived-from-state-test items).

Not all StateTest specs generate blockchain_test_engine: 4,649 state tests produce blockchain_test but not the engine variant (due to tests for forks that predate the engine format).
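The derived counts can be reproduced from the collected totals quoted above. The variable names are illustrative only, and the equal split of native items is the same estimate the text makes:

```python
state_test = 23_422
bt_from_state = 23_414      # blockchain_test_from_state_test
bte_from_state = 18_765     # blockchain_test_engine_from_state_test
total_blockchain = 65_979   # all blockchain items, native and derived

# Native items are whatever is left after removing the derived ones.
native_blockchain = total_blockchain - bt_from_state - bte_from_state
native_per_format = native_blockchain // 2  # estimated equal split

total_items = state_test + total_blockchain
no_engine_variant = bt_from_state - bte_from_state

print(native_blockchain, native_per_format, total_items, no_engine_variant)
```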

Expected vs observed savings

| Quantity | Value |
|---|---|
| Cached items | ~30,665 (11,900 + 18,765) |
| Total items | 89,401 |
| t8n call reduction | 30,665 / 89,401 = 34.3% |
| Observed fill time reduction | 31.2% |
| Implied t8n share of fill time | 31.2 / 34.3 = ~91% |

The remaining ~9% of fill time is non-t8n overhead: fixture serialization, pre-allocation, and pytest machinery.
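The estimate above amounts to the following arithmetic. It treats every test item as costing the same amount of t8n time, which is a simplifying assumption, so the result is indicative at best:

```python
cached_items = 11_900 + 18_765   # native engine + engine_from_state_test
total_items = 89_401

call_reduction = cached_items / total_items   # share of items served from cache
observed_reduction = 0.312                    # measured fill time reduction

# If t8n accounted for 100% of fill time, the two figures would match.
implied_t8n_share = observed_reduction / call_reduction
non_t8n_share = 1 - implied_t8n_share

print(f"{call_reduction:.1%} {implied_t8n_share:.0%} {non_t8n_share:.0%}")
```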

🔗 Related Issues or PRs

N/A.

✅ Checklist

All: Ran fast tox checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
```
uvx tox -e static
```
All: PR title adheres to the repo standard - it will be used as the squash commit message and should start with `type(scope):`.
All: Considered updating the online docs in the ./docs/ directory.
All: Set appropriate labels for the changes (only maintainers can apply labels).
Tests: Ran mkdocs serve locally and verified the auto-generated docs for new tests in the Test Case Reference are correctly formatted.

Cute Animal Picture

codecov · 2026-01-27T14:22:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.11%. Comparing base (4e9ec28) to head (910c76d).

Additional details and impacted files

@@               Coverage Diff                @@
##           forks/amsterdam    #2084   +/-   ##
================================================
  Coverage            86.11%   86.11%           
================================================
  Files                  599      599           
  Lines                39472    39472           
  Branches              3780     3780           
================================================
  Hits                 33992    33992           
  Misses                4852     4852           
  Partials               628      628

Flag	Coverage Δ
unittests	`86.11% <ø> (ø)`


packages/testing/src/execution_testing/cli/pytest_commands/plugins/filler/filler.py

marioevz

Running the latest version I see that we are adding @t8n-cache-<md5-hash> to the names of all tests for some reason. This seems incorrect to me, so I'd like to give this a proper re-review.

danceratopz · 2026-02-02T13:45:48Z

> Running the latest version I see that we are adding @t8n-cache-<md5-hash> to the names of all tests for some reason. This seems incorrect to me, so I'd like to give this a proper re-review.

This ensures that all parametrized test formats (state_test, blockchain_test, blockchain_test_engine) for a single test case are distributed to the same xdist worker, so that they use the same per-worker cache; the cache is not global across workers.

SamWilsn · 2026-02-03T21:23:53Z

Just a general comment (haven't looked at the code yet), but are the caches per-thread/per-process or global? If they aren't global, you might need to load group them to get the biggest benefit.

danceratopz · 2026-02-16T16:33:29Z

> Just a general comment (haven't looked at the code yet), but are the caches per-thread/per-process or global? If they aren't global, you might need to load group them to get the biggest benefit.

The caches are per xdist process, not global. So yes, to make this work, fill marks tests with xdist_group markers and uses --dist=loadgroup so that each group shares a cache by running on the same xdist worker.
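A sketch of how such a group name might be derived, so that every fixture format of one test case lands in the same group. The function name `cache_group_name` is hypothetical, and truncating the MD5 digest to 8 hex characters is an assumption based on the `@t8n-cache-abc12345` example mentioned in this PR:

```python
import hashlib


def cache_group_name(base_nodeid: str) -> str:
    """Derive a stable xdist group name from a format-independent node ID."""
    digest = hashlib.md5(base_nodeid.encode("utf-8")).hexdigest()[:8]
    return f"t8n-cache-{digest}"


# Both formats of the same test share a base node ID, hence a group.
base = "tests/test_x.py::test_foo[fork_Cancun]"
group_for_blockchain_test = cache_group_name(base)
group_for_engine_test = cache_group_name(base)
other_group = cache_group_name("tests/test_x.py::test_bar[fork_Cancun]")
```

In a collection hook the name would then be attached with `item.add_marker(pytest.mark.xdist_group(name=...))`; pytest-xdist only honours the marker when run with `--dist=loadgroup`.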

- Enable pytest-xdist loadgroup distribution mode by default.
- Required for xdist_group markers to control worker assignment.

- Add strip_fixture_format_from_nodeid() to extract base nodeid.
- Add get_all_fixture_format_names() for format name lookup.
- Used to ensure related fixture formats share cache keys.
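A rough sketch of what extracting a base nodeid could look like. It assumes the fixture format appears as one `-`-separated token inside the parametrization brackets of the node ID; that layout, the `FORMAT_NAMES` list, and the function name `strip_format` are all assumptions for illustration, not the helper added in this commit:

```python
# Hypothetical stand-in for the result of a format name lookup.
FORMAT_NAMES = {
    "state_test",
    "blockchain_test",
    "blockchain_test_engine",
    "blockchain_test_from_state_test",
    "blockchain_test_engine_from_state_test",
}


def strip_format(nodeid: str) -> str:
    """Remove the fixture format token from a parametrized node ID."""
    if "[" not in nodeid or not nodeid.endswith("]"):
        return nodeid  # not parametrized, nothing to strip
    head, params = nodeid[:-1].split("[", 1)
    # Compare whole tokens so blockchain_test never matches inside
    # blockchain_test_engine.
    kept = [p for p in params.split("-") if p not in FORMAT_NAMES]
    return f"{head}[{'-'.join(kept)}]"


a = strip_format("tests/test_x.py::test_foo[fork_Cancun-blockchain_test]")
b = strip_format("tests/test_x.py::test_foo[fork_Cancun-blockchain_test_engine]")
```

Because both variants reduce to the same string, that string can serve as a shared cache key and as the input for the xdist group name.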

- Add T8nOutputCache LRU cache class for storing t8n outputs.
- Add t8n_output_cache field to FillingSession.
- Add xdist_group markers during collection for --dist=loadgroup.
- Use t8n-cache-{hash} prefix to distinguish from user-defined groups.
- Strip cache-specific @t8n-cache-* suffix from nodeids in TestInfo.

- Add cache key helpers to BaseTest (_get_base_nodeid, _get_t8n_cache_key).
- Add _get_filling_session() to access cache from test instances.
- Cache t8n outputs in _generate_block_data() for reuse across formats.
- Skip caching for engine_x and engine_sync variants (different execution).

- Test T8nOutputCache LRU behavior, eviction, and hit/miss tracking.
- Test strip_fixture_format_from_nodeid for various nodeid patterns.
- Test get_all_fixture_format_names ordering and contents.
- Test cache key consistency across fixture format variants.
- Test _strip_xdist_group_suffix preserves non-cache group markers.

Add tests to verify that test items are sorted during collection to ensure deterministic cache hits. The tests demonstrate:

- Sorting groups related fixture formats by base nodeid.
- Without xdist, items are correctly sorted.
- With xdist, items are NOT sorted (BUG causing high variance).
- Expected vs actual behavior comparison.

The xfail test `test_xdist_sorting_required_for_cache_hits` asserts the correct behavior (sorting with xdist) and fails until the fix is applied.

- Add helper methods to TransitionToolCacheStats for serialization.
- Initialize aggregated stats on xdist controller in pytest_configure.
- Send worker stats via workeroutput in fixture teardown.
- Add pytest_testnodedown hook to aggregate stats from workers.
- Update pytest_terminal_summary to display aggregated stats.
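The aggregation step can be sketched without pytest as a serialize-and-merge round trip. The class `SketchCacheStats`, its fields, and the `"t8n_cache_stats"` key are hypothetical; in a real run each worker would place the dict in its `workeroutput` and the controller would read `node.workeroutput` inside `pytest_testnodedown`:

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class SketchCacheStats:
    hits: int = 0
    misses: int = 0

    def to_dict(self) -> Dict[str, int]:
        """Serialize to plain ints so the data can cross process boundaries."""
        return asdict(self)

    def merge(self, data: Dict[str, int]) -> None:
        """Add one worker's counters to the running totals."""
        self.hits += data.get("hits", 0)
        self.misses += data.get("misses", 0)


# Stand-ins for what each worker would report at teardown.
worker_outputs: List[Dict[str, Dict[str, int]]] = [
    {"t8n_cache_stats": SketchCacheStats(hits=4, misses=1).to_dict()},
    {"t8n_cache_stats": SketchCacheStats(hits=2, misses=0).to_dict()},
    {},  # a worker that reported nothing
]

aggregated = SketchCacheStats()
for output in worker_outputs:
    aggregated.merge(output.get("t8n_cache_stats", {}))
```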

- Clear `_cache` in `remove_cache()` to prevent stale data leakage.
- Tests without `transition_tool_cache_key` (e.g., state_test) could previously retrieve cached results from prior tests via matching `call_counter` subkeys.

Sort test items by (is_slow, base_nodeid, nodeid) to optimize execution:

- Slow tests first (LPT scheduling for xdist load balance).
- Related fixture formats grouped together (cache locality).
- Deterministic order within groups.

If ANY fixture format variant of a test is marked slow, ALL variants are treated as slow to keep them grouped together for cache hits. Reuses the base_nodeid cache for xdist marker generation to avoid redundant strip_fixture_format_from_node calls.
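A minimal sketch of this ordering, using a hypothetical `Item` tuple in place of pytest items. The slow flag is propagated to every variant that shares a base nodeid before sorting, and `False` sorts before `True`, so the key uses "not slow" to put slow groups first:

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    nodeid: str
    base_nodeid: str
    slow: bool


def sort_items(items: List[Item]) -> List[Item]:
    """Order items slow-first, grouped by base nodeid, then by nodeid."""
    # A group is slow if any of its format variants is marked slow.
    slow_bases = {item.base_nodeid for item in items if item.slow}
    return sorted(
        items,
        key=lambda item: (
            item.base_nodeid not in slow_bases,
            item.base_nodeid,
            item.nodeid,
        ),
    )


items = [
    Item("test_a[state_test]", "test_a", False),
    Item("test_b[blockchain_test_engine]", "test_b", True),
    Item("test_b[blockchain_test]", "test_b", False),
    Item("test_a[blockchain_test]", "test_a", False),
]
ordered = [item.nodeid for item in sort_items(items)]
```

Here only one `test_b` variant is marked slow, yet both `test_b` items move to the front together, which keeps the pair adjacent for the cache.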

BlockchainEngineXFixture and BlockchainEngineSyncFixture had can_use_cache=False which was dead code (never checked anywhere). Replace with transition_tool_cache_key="" which is the actual mechanism that controls caching — empty string means no caching.

For StateTest specs with --generate-all-formats, the _from_state_test label suffixes cause alphabetical sort to interleave cacheable and non-cacheable formats: blockchain_test_engine_from_state_test (cacheable) → blockchain_test_engine_x_from_state_test (non-cacheable, clears cache) → blockchain_test_from_state_test (cacheable, but cache is gone). Add has_cache_key to the sort key so cacheable formats cluster together within each base nodeid group, ensuring the second cacheable format hits the warm cache before any non-cacheable format clears it.
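The interleaving is easy to reproduce with the three labels named above. The `CACHEABLE` set is a stand-in for whatever `has_cache_key` evaluates to per format, and the sort key is a sketch rather than the PR's exact key:

```python
labels = [
    "blockchain_test_from_state_test",
    "blockchain_test_engine_x_from_state_test",
    "blockchain_test_engine_from_state_test",
]
# Stand-in for formats with a non-empty transition_tool_cache_key.
CACHEABLE = {
    "blockchain_test_from_state_test",
    "blockchain_test_engine_from_state_test",
}

# Plain alphabetical order puts engine_x between the two cacheable formats.
alphabetical = sorted(labels)

# Sorting on "not cacheable" first clusters the cacheable formats together.
clustered = sorted(labels, key=lambda label: (label not in CACHEABLE, label))
```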

node_id_for_entropy strips fixture format and fork names from the node ID before hashing it for deterministic address generation. However, it did not strip the xdist @group_name suffix (e.g., @t8n-cache-abc12345), causing different addresses when running with vs without xdist workers. Strip the suffix so addresses are deterministic regardless of whether xdist is active.
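A sketch of the suffix stripping, assuming the group name is a `t8n-cache-` prefix followed by hex digits. The regular expression and the function name `strip_cache_group_suffix` are illustrative; user-defined group suffixes are deliberately left alone:

```python
import re

# Matches only the cache-specific group suffix at the end of a node ID.
_CACHE_GROUP_SUFFIX = re.compile(r"@t8n-cache-[0-9a-f]+$")


def strip_cache_group_suffix(nodeid: str) -> str:
    """Remove a trailing @t8n-cache-<hash> suffix, if present."""
    return _CACHE_GROUP_SUFFIX.sub("", nodeid)


with_xdist = "tests/test_x.py::test_foo[fork_Cancun-state_test]@t8n-cache-abc12345"
without_xdist = "tests/test_x.py::test_foo[fork_Cancun-state_test]"
user_group = "tests/test_x.py::test_foo[fork_Cancun-state_test]@my-group"

stripped = strip_cache_group_suffix(with_xdist)
```

Hashing `stripped` instead of the raw node ID gives the same entropy input with and without xdist workers.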

Replace the raw hit/miss counts with an efficiency metric where 100% means all tests that could have hit the cache did hit it. Track unique cache keys to compute expected hits (total cacheable - unique keys). Also filter subkey stats to only count cacheable tests, eliminating phantom misses from non-cacheable tests that still interact with the OutputCache after remove_cache().

Before: T8n cache: key_hits=6, key_misses=6 (50.0%), subkey_hits=6, subkey_misses=18 (25.0%)

After: T8n cache: 100% hit rate (6/6 expected), 6 t8n calls saved
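The metric can be sketched as below. The function name and signature are hypothetical; the inputs 12 and 6 are one reading of the "before" line (6 key hits plus 6 key misses gives 12 cacheable tests, and each first lookup of a key is a miss, giving 6 unique keys):

```python
from typing import Tuple


def cache_efficiency(
    cacheable_tests: int, unique_keys: int, hits: int
) -> Tuple[int, float]:
    """Return (expected hits, hit rate relative to expected hits)."""
    # The first test for each key must miss, so it is not an expected hit.
    expected = cacheable_tests - unique_keys
    rate = hits / expected if expected else 1.0
    return expected, rate


expected, rate = cache_efficiency(cacheable_tests=12, unique_keys=6, hits=6)
print(f"T8n cache: {rate:.0%} hit rate ({6}/{expected} expected)")
```

Measured against raw lookups the same run shows 50%, even though no avoidable t8n call was made; measuring against expected hits removes the unavoidable first miss per key from the denominator.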

Pydantic's ModelMetaclass caches __init__ wrappers for dynamically created classes. When pytester runs multiple fill sessions in-process, cached wrappers from an earlier session re-invoke __init__ re-entrantly in later sessions, causing generate() to run twice per test and doubling the opcode count.

- Switch fill tests to runpytest_subprocess() for process isolation.
- Normal `fill` runs are unaffected (each invocation is a fresh process).

The rebase introduced a second pytest_testnodedown definition for cache stats aggregation, shadowing the existing timing logs hook. Extract the cache stats logic into _aggregate_cache_stats() and call it from the single hook.

Remove dead code that was never called: _get_base_nodeid(), _get_t8n_cache_key(), _get_filling_session() from BaseTest and remove_opcode_count() from TransitionTool. Also remove the now-unused strip_fixture_format_from_node import, TYPE_CHECKING import, and guarded FillingSession import block from base.py.

Add a clear() method to OutputCache to encapsulate clearing the internal _cache dict and resetting the key, instead of having TransitionTool.remove_cache() reach into private members directly. Also fix the set_cache docstring ("LRU behavior" → "single-key eviction"), fix the cached_result truthiness check to use `is not None` for defensive correctness, and fix the node() docstring to say "pytest node" instead of "node ID".

The @t8n-cache-* xdist group suffix was leaking into the test_id and group_salt stored in pre-alloc groups, causing fixture output to differ from runs without xdist. Use _strip_xdist_group_suffix() (already used by node_to_test_info) on both the group_salt fallback and the test_id passed to add_test_pre.

Test set_key eviction, get/set round-trip, hit/miss counter accuracy, clear() behavior, and state across key changes. Uses sentinel objects as lightweight stand-ins for TransitionToolOutput.

danceratopz · 2026-02-17T15:02:18Z

I ran hasher compare on the output of the py3 env fixtures locally with (~30 mins) and without (45 mins) this PR and got an exact match. I also tested static tests with --generate-all-formats for EngineX.

danceratopz · 2026-02-17T15:26:33Z

There's a further possible optimization for the state_test format with Paris and Shanghai forks. I didn't want to complicate this PR any more so made a follow-up issue for this to evaluate whether it's worth the additional (fork-dependent) complexity:

Enable t8n caching for state_test on Paris/Shanghai #2225

marioevz

Looks great, thanks for implementing this!

marioevz · 2026-02-20T15:58:10Z

I'm fixing the merge conflict in a bit and then merging 👍

danceratopz added labels C-feat (Category: an improvement or new feature), A-test-fill (Area: execution_testing.cli.pytest_commands.plugins.filler), and A-test-specs (Area: execution_testing.specs) Jan 27, 2026

marioevz self-requested a review January 27, 2026 14:15

danceratopz mentioned this pull request Jan 29, 2026

T8n call cache specs danceratopz/execution-specs#56

Merged

marioevz reviewed Jan 29, 2026


marioevz requested changes Jan 30, 2026


danceratopz force-pushed the t8n-call-cache-specs branch 2 times, most recently from a294da7 to 0b598ec Compare February 2, 2026 13:52

danceratopz marked this pull request as draft February 2, 2026 13:56

danceratopz force-pushed the t8n-call-cache-specs branch from 0b598ec to bfae73d Compare February 2, 2026 16:50

danceratopz mentioned this pull request Feb 2, 2026

chore(ci): only fill native test formats in tox's pypy3 env #2116

Merged

3 tasks

danceratopz force-pushed the t8n-call-cache-specs branch from bfae73d to 069d10c Compare February 2, 2026 17:16

danceratopz force-pushed the t8n-call-cache-specs branch 2 times, most recently from f57d1d5 to bebe55a Compare February 16, 2026 14:22

danceratopz and others added 11 commits February 17, 2026 09:32

feat(fill): Add --dist loadgroup for xdist test grouping

dd23619

- Enable pytest-xdist loadgroup distribution mode by default.
- Required for xdist_group markers to control worker assignment.

feat(fixtures): Add nodeid helpers for cache keys and xdist grouping

d362e85

- Add strip_fixture_format_from_nodeid() to extract base nodeid.
- Add get_all_fixture_format_names() for format name lookup.
- Used to ensure related fixture formats share cache keys.

refactor(fill): Keep xdist grouping without sorting for cache testing

390d17e

refactor(testing): Use fixture format directly

fe95fa5

refactor(testing): Move properties

bc59386

fix(testing): Remove fixture_format as property of the spec

fe8e586

fix(testing): Fill always count opcodes

601fbee

marioevz and others added 17 commits February 17, 2026 09:32

fix: typo

bb0d369

fix(testing): reset cache counters to avoid double-counting stats

1f66e95

fix(testing): use hash for xdist_group name to fix loadgroup scheduling

e1e0e1d

refactor(testing): Simplify OutputCache to single-key design

c3200bd

fix(testing): Clear cache data when test doesn't use caching

0d8f3d4

- Clear `_cache` in `remove_cache()` to prevent stale data leakage.
- Tests without `transition_tool_cache_key` (e.g., state_test) could previously retrieve cached results from prior tests via matching `call_counter` subkeys.

fix(testing): Merge duplicate pytest_testnodedown hooks.

1badce3

The rebase introduced a second pytest_testnodedown definition for cache stats aggregation, shadowing the existing timing logs hook. Extract the cache stats logic into _aggregate_cache_stats() and call it from the single hook.

test(testing): Add unit tests for OutputCache.

eb2d1be

Test set_key eviction, get/set round-trip, hit/miss counter accuracy, clear() behavior, and state across key changes. Uses sentinel objects as lightweight stand-ins for TransitionToolOutput.

danceratopz force-pushed the t8n-call-cache-specs branch from bebe55a to eb2d1be Compare February 17, 2026 09:03

danceratopz marked this pull request as ready for review February 17, 2026 14:00

danceratopz requested a review from marioevz February 17, 2026 15:02

danceratopz mentioned this pull request Feb 17, 2026

Enable t8n caching for state_test on Paris/Shanghai #2225

Open

marioevz self-assigned this Feb 18, 2026

marioevz approved these changes Feb 20, 2026


marioevz added 2 commits February 20, 2026 16:59

Merge branch 'forks/amsterdam' into t8n-call-cache-specs

6d0e6a1

fix: typing

910c76d

marioevz merged commit b3d9be1 into ethereum:forks/amsterdam Feb 20, 2026
18 checks passed

marioevz merged 30 commits into ethereum:forks/amsterdam from danceratopz:t8n-call-cache-specs
