JIT: Make code patching during (un-/)linking thread-safe#4528

Merged
Sonicadvance1 merged 2 commits into FEX-Emu:main from neobrain:feature_unlink_threadsafe
Apr 25, 2025
Conversation

@neobrain
Member

@neobrain neobrain commented Apr 21, 2025

Overview

LSE2 platforms give us free (?) atomic 128-bit stores. We can use these to make the block (de-/)linking code apply their patches atomically, which allows blocks to be shared across threads without adding further thread-safety issues.

Credits for this idea go to @Sonicadvance1 .

I think all potential tearing issues are considered, but I'm not fully confident my atomicity/cache behavior assumptions are correct. I've done my best 😇

Details

For context, the unpatched code generally looks like this:

0x00000000: ldr x1, +8
0x00000004: blr x1
0x00000008: <address of ExitFunctionLinker>
0x0000000c: <address of ExitFunctionLinker>
0x00000010: <GuestRIP>
0x00000014: <GuestRIP>
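A host-side sketch of this layout (hypothetical struct and field names, not FEX's actual type) makes the offsets concrete; the first 16 bytes are exactly what the 128-bit patch overwrites:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the unpatched thunk above. The patched region is
// the first 16 bytes: two 32-bit instruction words plus the 64-bit linker
// address. For the LSE2 store to be single-copy atomic, that region must
// be 16-byte aligned, hence the alignment concerns discussed in this PR.
struct BlockThunk {
  uint32_t LdrX1;        // 0x00: ldr x1, +8
  uint32_t BlrX1;        // 0x04: blr x1
  uint64_t LinkerTarget; // 0x08: <address of ExitFunctionLinker>
  uint64_t GuestRIP;     // 0x10: <GuestRIP>
};
```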

On LSE2 platforms (as specified at build-time via -mcpu), DirectBlockDelinker will compile roughly to the following assembly:

# load ExitFunctionLinker
ldr     x9, [x0, #2304]
# Load `ldr x1, +8` and `blr x1` encodings
mov     x10, #0x40                      // #64
movk    x10, #0x5800, lsl #16
mov     x0, x8
movk    x10, #0xd63f, lsl #48
# Apply 128-bit patch
stp     x10, x9, [x1, #-8]
# Flush caches
bl      109420 <__clear_cache@plt>
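The same patch can be sketched in host C++ (hypothetical function and parameter names; FEX emits this as JIT'd code, not a C++ function). The `memcpy` stands in for the single `stp x10, x9` above, which on a 16-byte-aligned LSE2 target is single-copy atomic; portable C++ has no direct equivalent for that store:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hedged sketch of DirectBlockDelinker's patch. Packs the two replacement
// instruction encodings into the low/high halves of one 64-bit word (matching
// little-endian instruction order), pairs it with the linker address, and
// writes all 16 bytes in one go.
void DelinkBlock(void* Code, uint32_t LdrEnc, uint32_t BlrEnc, uint64_t LinkerAddr) {
  uint64_t Insns = uint64_t(LdrEnc) | (uint64_t(BlrEnc) << 32);
  uint64_t Patch[2] = {Insns, LinkerAddr};
  // Stand-in for the atomic `stp`; the real sequence then calls
  // __clear_cache on the patched range.
  std::memcpy(Code, Patch, sizeof(Patch));
}
```

The encoding values are passed in rather than hardcoded here, since the exact immediates depend on the thunk layout.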

Without LSE2 enabled at build-time, less efficient code will be generated, which seems fine considering we'll be dropping support for those platforms soon.

Potential race conditions

There are some potential races to consider. Crucially, these concern only the robustness of code execution; ordering between linking and unlinking itself requires external synchronization:

Two threads simultaneously unlinking the same block

The atomic 128-bit store in DirectBlockDelinker ensures there is no tearing.

Two threads simultaneously linking the same block

The threads will link the block to the same target, so either:

  • they will both atomically patch the branch instruction at the very beginning (ldr x1, +8 -> b +0x1234)
  • they will both atomically patch the 64 bits of data at offset 8 (<address of ExitFunctionLinker> -> <address of target>)

Since the patches are applied atomically, there is no tearing.
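Both variants amount to one naturally aligned store. A minimal sketch (hypothetical helper names; shown with GCC/Clang `__atomic` builtins, while the PR itself uses the equivalent `std::atomic_ref`):

```cpp
#include <cassert>
#include <cstdint>

// Strategy 1: replace the first instruction word (ldr x1, +8 -> b <target>)
// with a single aligned 32-bit store. Two threads racing to install the same
// value cannot tear the word.
void LinkViaBranch(uint32_t* FirstInsn, uint32_t BranchEnc) {
  __atomic_store_n(FirstInsn, BranchEnc, __ATOMIC_SEQ_CST);
}

// Strategy 2: replace the 64-bit data slot at offset 8
// (<address of ExitFunctionLinker> -> <address of target>) with a single
// aligned 64-bit store.
void LinkViaData(uint64_t* TargetSlot, uint64_t HostCode) {
  __atomic_store_n(TargetSlot, HostCode, __ATOMIC_SEQ_CST);
}
```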

One thread unlinking a block that is being executed in another thread

We don't have external synchronization to prevent this from happening, but we can look at potential tearing issues with regards to code execution.

The only critical state of the executing thread is at offset 4. If the unlinking thread's patch becomes visible at this point, the executing thread will jump to the unpatched target address. The code itself is robust, but this race condition requires external synchronization to be resolved.

One thread linking a block that is being executed in another thread

This is similar to the previous case, with the small nuance that the linking code will only patch the first 32 bits. The blr x1 instruction (and data) will remain untouched, so in case the other thread already loaded the value at 0x00000008, execution will indeed be able to continue properly.

The fallback case will patch the 64 bits at offset 0x8 instead, which is also fine since the load in the other thread is atomic due to being aligned.
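The reader side of that assumption can be made explicit: a naturally aligned 64-bit load cannot tear on AArch64. A sketch (hypothetical helper name, again using the compiler builtin):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the executing thread's view of the data slot at offset 8. The
// slot is naturally aligned, so even a plain load is tear-free on AArch64;
// writing it as an explicit atomic load documents the assumption.
uint64_t ReadLinkTarget(uint64_t* TargetSlot) {
  return __atomic_load_n(TargetSlot, __ATOMIC_ACQUIRE);
}
```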

One thread unlinking a block that is simultaneously being linked

I'm not sure if this can happen, but atomicity on both sides ensures that the resulting code is valid and tear-free even though the linking thread patches only a subset of the code. The result will be either of two code blocks, which is a race condition that requires external synchronization to be resolved.

Further resources

There's a good overview of cache/barrier instructions on ARM at https://mariokartwii.com/armv8/ch30.html .

}
emit.dc64(Frame->Pointers.Common.ExitFunctionLinker);

static void IndirectBlockDelinker(FEXCore::Core::CpuStateFrame* Frame, FEXCore::Context::ExitFunctionLinkData* Record) {
Member Author


Dropped this function for simplicity, but on second thought perhaps it should be kept around since this only needs to be an atomic 64-bit write without extra cache flushes?

Then again not sure if this is hit often enough in practice for the overhead to matter. Dropping it in favor of the unified 128-bit patch makes it much easier to consider potential races.

@neobrain
Member Author

Not sure about these InstCountCI changes - I'm guessing the script gets confused by the added padding? Otherwise, I don't see why the blocks would become smaller in some cases.


// Add de-linking handler
Thread->LookupCache->AddBlockLink(GuestRip, Record, IndirectBlockDelinker);
std::atomic_ref<uint64_t>(Record->HostBranch).store(HostCode, std::memory_order::seq_cst);
Member Author


Does this need a cache clear, or is all that taken care of by seq_cst?

Member

@Sonicadvance1 Sonicadvance1 Apr 21, 2025


That's actually a good question. I think because the other threads executing are loading data without atomics then there needs to be a dc cvau, <line>; dsb ish; to ensure visibility?

Member Author

@neobrain neobrain Apr 22, 2025


Yes, I've also done more research and you're right. We either need a barrier in the reading thread (bad), or a cache clean in the writing thread (good). The other parts should be good, since the more expensive ClearICache already includes those operations.
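As a sketch of this conclusion: GCC and Clang expose the required maintenance sequence (roughly `dc cvau; dsb ish; ic ivau; isb` on AArch64) through a builtin, so the writing thread can clean the patched range without requiring any barrier on the reader side. The function name here is hypothetical:

```cpp
#include <cassert>

// Writer-side cache maintenance after patching code. On AArch64 this cleans
// the data cache to the point of unification and invalidates the instruction
// cache for the range; on hosts with coherent I/D caches (e.g. x86) it is a
// no-op.
void FlushPatchedRange(char* Begin, char* End) {
  __builtin___clear_cache(Begin, End);
}
```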

btw here's a good reference for cache and barrier instructions on arm: https://mariokartwii.com/armv8/ch30.html . (Certainly impressive that the Mario Kart community writes clearer documentation on this than arm themselves.)

@neobrain neobrain force-pushed the feature_unlink_threadsafe branch from fcfb5ba to ee7b866 Compare April 22, 2025 13:09
@neobrain
Member Author

neobrain commented Apr 22, 2025

Latest version fixes InstCountCI: This required carefully aligning the compiled code (4 bytes after a 16-byte boundary) so that 1) results continue to be deterministic 2) single-instruction NOP and FENCE blocks always compile to the same size. We'll want to make this behavior specific to InstCountCI instead of unconditionally applying it in JIT.cpp however.

The other InstCountCI changes are hopefully fine as is.

@neobrain neobrain marked this pull request as draft April 22, 2025 13:16
@neobrain neobrain marked this pull request as ready for review April 22, 2025 13:21
@neobrain neobrain force-pushed the feature_unlink_threadsafe branch 3 times, most recently from ea5432a to 0cc689f Compare April 22, 2025 13:32
CTX->ClearCodeCache(ThreadState);
}

// TODO: Carefully aligned to ensure InstCountCI produces consistent results. Should be moved to CodeSizeValidation.
Member


Is this TODO planned to be fixed in this PR or later?

Member Author


I wasn't entirely sure if it's a good way to address the InstCountCI issue to begin with. If it is, I'll add a proper interface as part of this PR.

@bylaws
Collaborator

bylaws commented Apr 23, 2025

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

Member

@Sonicadvance1 Sonicadvance1 left a comment


I'm not too concerned about the added NOPs. I was planning on fixing the JITTail struct alignment padding problems anyway, which I'll be doing after I solve this Gravel CPUID problem.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Apr 23, 2025
Due to how jit block tail padding is working, there's no real good way
to determine the true "implementation size" of an instruction without
the backend being aware of wanting to investigate it.

Trying to inject another instruction, or another IR operation actually
subtly changes codegen in a way that gives invalid results. The only
real way to get around this is to inject a known token in to the
instruction stream as we `ExitFunction`.

So inject a `udf #0x420f`, and change the scanning behaviour to find the
first one and cut everything else off afterwards.

This already scoops out some code in some game blocks that were
accidentally landing ExitFunction code in the json.

This also has been tested to work with FEX-Emu#4528 with its InstCountCI
specific changes reverted.

This means we don't need to play subtle padding tricks in the JIT to get
the information we want in InstCountCI.
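The scanning idea can be sketched as follows (hypothetical helper, not FEX's actual code). `udf #0x420f` encodes as the 32-bit word `0x0000420f`, since A64 `UDF` places its 16-bit immediate in the low bits of an otherwise-zero encoding, so the scan simply truncates at the first occurrence of that marker word:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Cut an instruction stream at the injected marker token, discarding the
// ExitFunction tail that would otherwise pollute the InstCountCI json.
std::vector<uint32_t> TruncateAtMarker(const std::vector<uint32_t>& Insns) {
  constexpr uint32_t Marker = 0x0000420fu; // udf #0x420f
  std::vector<uint32_t> Out;
  for (uint32_t Insn : Insns) {
    if (Insn == Marker) {
      break; // first marker ends the block under measurement
    }
    Out.push_back(Insn);
  }
  return Out;
}
```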
@Sonicadvance1
Member

Now that the InstCountCI stuff is fixed, the additional NOPs can be removed when this is rebased.

@neobrain neobrain force-pushed the feature_unlink_threadsafe branch from 0cc689f to e89a913 Compare April 25, 2025 07:47
@neobrain
Member Author

neobrain commented Apr 25, 2025

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

I ran FEXOfflineCompiler on a set of codemaps generated during Steam startup + launching a few games, once with the Align16B() line and once without. The total output size went from 548 MB to 574 MB, so roughly a 5% increase.

EDIT: My local setup is borked, I'll need to fix and re-run the numbers.

@neobrain
Member Author

Would be curious for stats on the size of the blowup from nops here (perhaps a bytemark run to confirm negligible performance differences)

I ran FEXOfflineCompiler for the main executable of Ender Lilies, once with the Align16B() line and once without. The total output size went from 1060.8 MiB to 1062.2 MiB, so roughly a 0.1% increase.

With multiblock disabled, the difference is more noticeable: 150.3 MiB to 153.1 MiB (1.9%).

@Sonicadvance1
Member

Seems reasonable enough to me.

@bylaws
Collaborator

bylaws commented Apr 25, 2025

Oh nice that's a lot better than I expected with MB

@Sonicadvance1 Sonicadvance1 merged commit 99114e1 into FEX-Emu:main Apr 25, 2025
12 checks passed
@neobrain neobrain deleted the feature_unlink_threadsafe branch July 2, 2025 15:07