JIT: Make code patching during (un-/)linking thread-safe#4528

Merged
Sonicadvance1 merged 2 commits into FEX-Emu:main from neobrain:feature_unlink_threadsafe
Apr 25, 2025
Conversation

@neobrain
Member

@neobrain neobrain commented Apr 21, 2025

Overview

LSE2 platforms give us free (?) atomic 128-bit stores. We can use these to make the block (de-/)linking code apply their patches atomically, which allows blocks to be shared across threads without adding further thread-safety issues.

Credits for this idea go to @Sonicadvance1 .

I think all potential tearing issues are considered, but I'm not fully confident my atomicity/cache behavior assumptions are correct. I've done my best 😇

Details

For context, the unpatched code generally looks like this:

0x00000000: ldr x1, +8
0x00000004: blr x1
0x00000008: <address of ExitFunctionLinker>
0x0000000c: <address of ExitFunctionLinker>
0x00000010: <GuestRIP>
0x00000014: <GuestRIP>
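A host-side sketch of this layout (hypothetical struct and field names, not FEX's actual type) makes the offsets concrete; the first 16 bytes are exactly what the 128-bit patch overwrites:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the unpatched thunk above. The patched region is
// the first 16 bytes: two 32-bit instruction words plus the 64-bit linker
// address. For the LSE2 store to be single-copy atomic, that region must
// be 16-byte aligned, hence the alignment concerns discussed in this PR.
struct BlockThunk {
  uint32_t LdrX1;        // 0x00: ldr x1, +8
  uint32_t BlrX1;        // 0x04: blr x1
  uint64_t LinkerTarget; // 0x08: <address of ExitFunctionLinker>
  uint64_t GuestRIP;     // 0x10: <GuestRIP>
};
```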

On LSE2 platforms (as specified at build-time via -mcpu), DirectBlockDelinker will compile roughly to the following assembly:

# load ExitFunctionLinker
ldr     x9, [x0, #2304]
# Load `ldr x1, +8` and `blr x1` encodings
mov     x10, #0x40                      // #64
movk    x10, #0x5800, lsl #16
mov     x0, x8
movk    x10, #0xd63f, lsl #48
# Apply 128-bit patch
stp     x10, x9, [x1, #-8]
# Flush caches
bl      109420 <__clear_cache@plt>
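The same patch can be sketched in host C++ (hypothetical function and parameter names; FEX emits this as JIT'd code, not a C++ function). The `memcpy` stands in for the single `stp x10, x9` above, which on a 16-byte-aligned LSE2 target is single-copy atomic; portable C++ has no direct equivalent for that store:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hedged sketch of DirectBlockDelinker's patch. Packs the two replacement
// instruction encodings into the low/high halves of one 64-bit word (matching
// little-endian instruction order), pairs it with the linker address, and
// writes all 16 bytes in one go.
void DelinkBlock(void* Code, uint32_t LdrEnc, uint32_t BlrEnc, uint64_t LinkerAddr) {
  uint64_t Insns = uint64_t(LdrEnc) | (uint64_t(BlrEnc) << 32);
  uint64_t Patch[2] = {Insns, LinkerAddr};
  // Stand-in for the atomic `stp`; the real sequence then calls
  // __clear_cache on the patched range.
  std::memcpy(Code, Patch, sizeof(Patch));
}
```

The encoding values are passed in rather than hardcoded here, since the exact immediates depend on the thunk layout.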

Without LSE2 enabled at build-time, less efficient code will be generated, which seems fine considering we'll be dropping support for those platforms soon.

Potential race conditions

There are some potential races to consider. Crucially, these concern only the robustness of code execution; ordering between linking and unlinking itself requires external synchronization:

Two threads simultaneously unlinking the same block

The atomic 128-bit store in DirectBlockDelinker ensures there is no tearing.

Two threads simultaneously linking the same block

The threads will link the block to the same target, so either:

  • they will both atomically patch the branch instruction at the very beginning (ldr x1, +8 -> b +0x1234)
  • they will both atomically patch the 64 bits of data at offset 8 (<address of ExitFunctionLinker> -> <address of target>)

Since the patches are applied atomically, there is no tearing.
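Both variants amount to one naturally aligned store. A minimal sketch (hypothetical helper names; shown with GCC/Clang `__atomic` builtins, while the PR itself uses the equivalent `std::atomic_ref`):

```cpp
#include <cassert>
#include <cstdint>

// Strategy 1: replace the first instruction word (ldr x1, +8 -> b <target>)
// with a single aligned 32-bit store. Two threads racing to install the same
// value cannot tear the word.
void LinkViaBranch(uint32_t* FirstInsn, uint32_t BranchEnc) {
  __atomic_store_n(FirstInsn, BranchEnc, __ATOMIC_SEQ_CST);
}

// Strategy 2: replace the 64-bit data slot at offset 8
// (<address of ExitFunctionLinker> -> <address of target>) with a single
// aligned 64-bit store.
void LinkViaData(uint64_t* TargetSlot, uint64_t HostCode) {
  __atomic_store_n(TargetSlot, HostCode, __ATOMIC_SEQ_CST);
}
```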

One thread unlinking a block that is being executed in another thread

We don't have external synchronization to prevent this from happening, but we can look at potential tearing issues with regards to code execution.

The only critical state of the executing thread is at offset 4. If the unlinking thread's patch becomes visible at this point, the executing thread will jump to the unpatched target address. The code itself is robust, but this race condition requires external synchronization to be resolved.

One thread linking a block that is being executed in another thread

This is similar to the previous case, with the small nuance that the linking code will only patch the first 32 bits. The blr x1 instruction (and data) will remain untouched, so in case the other thread already loaded the value at 0x00000008, execution will indeed be able to continue properly.

The fallback case will patch the 64 bits at offset 0x8 instead, which is also fine since the load in the other thread is atomic due to being aligned.
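The reader side of that assumption can be made explicit: a naturally aligned 64-bit load cannot tear on AArch64. A sketch (hypothetical helper name, again using the compiler builtin):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the executing thread's view of the data slot at offset 8. The
// slot is naturally aligned, so even a plain load is tear-free on AArch64;
// writing it as an explicit atomic load documents the assumption.
uint64_t ReadLinkTarget(uint64_t* TargetSlot) {
  return __atomic_load_n(TargetSlot, __ATOMIC_ACQUIRE);
}
```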

One thread unlinking a block that is simultaneously being linked

I'm not sure if this can happen, but atomicity on both sides ensures that the resulting code is valid and tear-free even though the linking thread patches only a subset of the code. The result will be either of two code blocks, which is a race condition that requires external synchronization to be resolved.

Further resources

There's a good overview of cache/barrier instructions on ARM at https://mariokartwii.com/armv8/ch30.html .

}
emit.dc64(Frame->Pointers.Common.ExitFunctionLinker);

static void IndirectBlockDelinker(FEXCore::Core::CpuStateFrame* Frame, FEXCore::Context::ExitFunctionLinkData* Record) {
Member Author


Dropped this function for simplicity, but on second thought perhaps it should be kept around since this only needs to be an atomic 64-bit write without extra cache flushes?

Then again not sure if this is hit often enough in practice for the overhead to matter. Dropping it in favor of the unified 128-bit patch makes it much easier to consider potential races.

@neobrain
Member Author

Not sure about these InstCountCI changes - I'm guessing the script gets confused by the added padding? Otherwise, I don't see why the blocks would become smaller in some cases.


// Add de-linking handler
Thread->LookupCache->AddBlockLink(GuestRip, Record, IndirectBlockDelinker);
std::atomic_ref<uint64_t>(Record->HostBranch).store(HostCode, std::memory_order::seq_cst);
Member Author


Does this need a cache clear, or is all that taken care of by seq_cst?

Member

@Sonicadvance1 Sonicadvance1 Apr 21, 2025


That's actually a good question. I think because the other threads executing are loading data without atomics then there needs to be a dc cvau, <line>; dsb ish; to ensure visibility?

Member Author

@neobrain neobrain Apr 22, 2025


Yes, I've also done more research and you're right. We either need a barrier in the reading thread (bad), or a cache clean in the writing thread (good). The other parts should be good, since the more expensive ClearICache already includes those operations.
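As a sketch of this conclusion: GCC and Clang expose the required maintenance sequence (roughly `dc cvau; dsb ish; ic ivau; isb` on AArch64) through a builtin, so the writing thread can clean the patched range without requiring any barrier on the reader side. The function name here is hypothetical:

```cpp
#include <cassert>

// Writer-side cache maintenance after patching code. On AArch64 this cleans
// the data cache to the point of unification and invalidates the instruction
// cache for the range; on hosts with coherent I/D caches (e.g. x86) it is a
// no-op.
void FlushPatchedRange(char* Begin, char* End) {
  __builtin___clear_cache(Begin, End);
}
```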

btw here's a good reference for cache and barrier instructions on arm: https://mariokartwii.com/armv8/ch30.html . (Certainly impressive that the Mario Kart community writes clearer documentation on this than arm themselves.)

@neobrain neobrain force-pushed the feature_unlink_threadsafe branch from fcfb5ba to ee7b866 Compare April 22, 2025 13:09
@neobrain
Member Author

neobrain commented Apr 22, 2025

Latest version fixes InstCountCI: This required carefully aligning the compiled code (4 bytes after a 16-byte boundary) so that 1) results continue to be deterministic 2) single-instruction NOP and FENCE blocks always compile to the same size. We'll want to make this behavior specific to InstCountCI instead of unconditionally applying it in JIT.cpp however.

The other InstCountCI changes are hopefully fine as is.

@neobrain neobrain marked this pull request as draft April 22, 2025 13:16
@neobrain neobrain marked this pull request as ready for review April 22, 2025 13:21
@neobrain neobrain force-pushed the feature_unlink_threadsafe branch 3 times, most recently from ea5432a to 0cc689f Compare April 22, 2025 13:32
CTX->ClearCodeCache(ThreadState);
}

// TODO: Carefully aligned to ensure InstCountCI produces consistent results. Should be moved to CodeSizeValidation.
Member


Is this TODO planned to be fixed in this PR or later?

Member Author


I wasn't entirely sure if it's a good way to address the InstCountCI issue to begin with. If it is, I'll add a proper interface as part of this PR.

@bylaws
Collaborator

bylaws commented Apr 23, 2025

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

Member

@Sonicadvance1 Sonicadvance1 left a comment


I'm not too concerned about the added NOPs. I was planning on fixing the JITTail struct alignment padding problems anyway, which I'll be doing after I solve this Gravel CPUID problem.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Apr 23, 2025
Due to how jit block tail padding is working, there's no real good way
to determine the true "implementation size" of an instruction without
the backend being aware of wanting to investigate it.

Trying to inject another instruction, or another IR operation actually
subtly changes codegen in a way that gives invalid results. The only
real way to get around this is to inject a known token in to the
instruction stream as we `ExitFunction`.

So inject a `udf #0x420f`, and change the scanning behaviour to find the
first one and cut everything else off afterwards.

This already scoops out some code in some game blocks that were
accidentally landing ExitFunction code in the json.

This also has been tested to work with FEX-Emu#4528 with its InstCountCI
specific changes reverted.

This means we don't need to play subtle padding tricks in the JIT to get
the information we want in InstCountCI.
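The scanning idea can be sketched as follows (hypothetical helper, not FEX's actual code). `udf #0x420f` encodes as the 32-bit word `0x0000420f`, since A64 `UDF` places its 16-bit immediate in the low bits of an otherwise-zero encoding, so the scan simply truncates at the first occurrence of that marker word:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Cut an instruction stream at the injected marker token, discarding the
// ExitFunction tail that would otherwise pollute the InstCountCI json.
std::vector<uint32_t> TruncateAtMarker(const std::vector<uint32_t>& Insns) {
  constexpr uint32_t Marker = 0x0000420fu; // udf #0x420f
  std::vector<uint32_t> Out;
  for (uint32_t Insn : Insns) {
    if (Insn == Marker) {
      break; // first marker ends the block under measurement
    }
    Out.push_back(Insn);
  }
  return Out;
}
```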
@Sonicadvance1
Member

Now that the InstCountCI stuff is fixed, the additional NOPs can be removed when this is rebased.

@neobrain neobrain force-pushed the feature_unlink_threadsafe branch from 0cc689f to e89a913 Compare April 25, 2025 07:47
@neobrain
Member Author

neobrain commented Apr 25, 2025

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

I ran FEXOfflineCompiler on a set of codemaps generated during Steam startup + launching a few games, once with the Align16B() line and once without. The total output size went from 548 MB to 574 MB, so roughly a 5% increase.

EDIT: My local setup is borked, I'll need to fix and re-run the numbers.

@neobrain
Member Author

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

I ran FEXOfflineCompiler for the main executable of Ender Lilies, once with the Align16B() line and once without. The total output size went from 1060.8 MiB to 1062.2 MiB, so roughly a 0.1% increase.

With multiblock disabled, the difference is more noticeable: 150.3 MiB to 153.1 MiB (1.9%).

@Sonicadvance1
Member

Seems reasonable enough to me.

@bylaws
Collaborator

bylaws commented Apr 25, 2025

Oh nice that's a lot better than I expected with MB

@Sonicadvance1 Sonicadvance1 merged commit 99114e1 into FEX-Emu:main Apr 25, 2025
12 checks passed
@neobrain neobrain deleted the feature_unlink_threadsafe branch July 2, 2025 15:07