JIT: Make code patching during (un-/)linking thread-safe #4528

Sonicadvance1 merged 2 commits into FEX-Emu:main
Conversation
Reviewed change:

```cpp
  }
  emit.dc64(Frame->Pointers.Common.ExitFunctionLinker);
```

```cpp
static void IndirectBlockDelinker(FEXCore::Core::CpuStateFrame* Frame, FEXCore::Context::ExitFunctionLinkData* Record) {
```
Dropped this function for simplicity, but on second thought perhaps it should be kept around since this only needs to be an atomic 64-bit write without extra cache flushes?
Then again not sure if this is hit often enough in practice for the overhead to matter. Dropping it in favor of the unified 128-bit patch makes it much easier to consider potential races.
Not sure about these InstCountCI changes - I'm guessing the script gets confused by the added padding? Otherwise, I don't see why the blocks would become smaller in some cases.
Reviewed change:

```cpp
// Add de-linking handler
Thread->LookupCache->AddBlockLink(GuestRip, Record, IndirectBlockDelinker);
std::atomic_ref<uint64_t>(Record->HostBranch).store(HostCode, std::memory_order::seq_cst);
```
Does this need a cache clear, or is all that taken care of by seq_cst?
That's actually a good question. I think because the other threads executing are loading data without atomics then there needs to be a dc cvau, <line>; dsb ish; to ensure visibility?
Yes, I've also done more research and you're right. We either need a barrier in the reading thread (bad), or a cache clean in the writing thread (good). The other parts should be good, since the more expensive ClearICache already includes those operations.
btw, here's a good reference for cache and barrier instructions on Arm: https://mariokartwii.com/armv8/ch30.html . (Certainly impressive that the Mario Kart community writes clearer documentation on this than Arm themselves.)
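To make the conclusion above concrete, here is a minimal sketch of the writer-side publication pattern (names are illustrative, not FEX's actual code): store the patch atomically, then clean the affected cache range so other cores observe it without needing a barrier on the reader side. `__builtin___clear_cache` is the GCC/Clang builtin that expands to the `dc cvau` / `dsb ish` / `ic ivau` sequence on AArch64.

```cpp
#include <cstdint>

// Hypothetical 8-byte patch slot inside a JIT'd code block.
alignas(8) static uint64_t PatchSlot = 0;

// Writer-side publication: one atomic store, then cache maintenance so
// instruction fetches on other cores see the update. On AArch64 the
// builtin emits the dc cvau / dsb ish / ic ivau / dsb ish sequence; on
// x86 it is effectively a no-op thanks to coherent I/D caches.
void PublishPatch(uint64_t NewTarget) {
  __atomic_store_n(&PatchSlot, NewTarget, __ATOMIC_SEQ_CST);
  char* Begin = reinterpret_cast<char*>(&PatchSlot);
  __builtin___clear_cache(Begin, Begin + sizeof(PatchSlot));
}
```

This matches the "cache clean in the writing thread (good)" option: the cost is paid once per patch rather than on every execution of the block.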
Force-pushed fcfb5ba to ee7b866.
Latest version fixes InstCountCI: this required carefully aligning the compiled code (4 bytes after a 16-byte boundary) so that 1) results continue to be deterministic and 2) single-instruction […]. The other InstCountCI changes are hopefully fine as is.
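For illustration, the "4 bytes after a 16-byte boundary" placement amounts to padding the emission cursor as follows (a sketch under assumed names, not FEX's actual helper):

```cpp
#include <cstdint>

// Advance a code-emission address to the next position sitting 4 bytes
// past a 16-byte boundary (result % 16 == 4). Since AArch64 instructions
// are 4 bytes, the pad inserted is always a whole number of NOPs.
uint64_t AlignTo16Plus4(uint64_t Addr) {
  uint64_t Pad = (4 + 16 - (Addr % 16)) % 16;
  return Addr + Pad;
}
```

With this placement the 8-byte literal inside the stub lands on a 16-byte boundary, keeping emitted block sizes deterministic for the instruction-count comparison.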
Force-pushed ea5432a to 0cc689f.
Reviewed change:

```cpp
  CTX->ClearCodeCache(ThreadState);
}
```

```cpp
// TODO: Carefully aligned to ensure InstCountCI produces consistent results. Should be moved to CodeSizeValidation.
```
Is this TODO planned to be fixed in this PR or later?
I wasn't entirely sure if it's a good way to address the InstCountCI issue to begin with. If it is, I'll add a proper interface as part of this PR.
Would be curious for stats on the size of the blowup from NOPs here (perhaps a bytemark run to confirm negligible performance differences).
Sonicadvance1
left a comment
I'm not too concerned about the added NOPs. I was planning on fixing the JITTail struct alignment problems anyway, which I'll be doing after I solve this Gravel CPUID problem.
Due to how JIT block tail padding works, there's no good way to determine the true "implementation size" of an instruction without the backend being aware that it is under investigation. Trying to inject another instruction, or another IR operation, actually subtly changes codegen in a way that gives invalid results.

The only real way to get around this is to inject a known token into the instruction stream as we `ExitFunction`: inject a `udf #0x420f`, and change the scanning behaviour to find the first one and cut everything else off afterwards. This already scoops out some code in some game blocks that were accidentally landing ExitFunction code in the JSON.

This has also been tested to work with FEX-Emu#4528 with its InstCountCI-specific changes reverted, which means we don't need to play subtle padding tricks in the JIT to get the information we want in InstCountCI.
Now that the InstCountCI stuff is fixed, the additional NOPs can be removed when this is rebased.
Force-pushed 0cc689f to e89a913.
EDIT: My local setup is borked; I'll need to fix it and re-run the numbers.
I ran FEXOfflineCompiler for the main executable of Ender Lilies, once with the Align16B() line and once without. The total output size went from 1060.8 MiB to 1062.2 MiB, so roughly a 0.1% increase. With multiblock disabled, the difference is more noticeable: 150.3 MiB to 153.1 MiB (1.9%).
Seems reasonable enough to me.
Oh nice, that's a lot better than I expected with multiblock.
Overview
LSE2 platforms give us free (?) atomic 128-bit stores. We can use these to make the block (de-/)linking code apply their patches atomically, which allows blocks to be shared across threads without adding further thread-safety issues.
Credits for this idea go to @Sonicadvance1 .
I think all potential tearing issues are considered, but I'm not fully confident my atomicity/cache behavior assumptions are correct. I've done my best 😇
Details
For context, the unpatched code generally looks like this: […]

On LSE2 platforms (as specified at build-time via `-mcpu`), DirectBlockDelinker will compile roughly to the following assembly: […]

Without LSE2 enabled at build-time, less efficient code will be generated, which seems fine considering we'll be dropping support for those platforms soon.
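The original code listings did not survive here. As a rough sketch of the shape being discussed (field names and offsets are assumptions for illustration, not FEX's actual definitions), the patchable unit is 16 bytes and 16-byte aligned, which is exactly what makes a single LSE2 128-bit store sufficient to rewrite all of it atomically:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative layout of the patchable linker stub:
//   offset 0x0: ldr x1, #8   ; load the 64-bit target stored below
//   offset 0x4: blr x1       ; branch to it
//   offset 0x8: 64-bit target (ExitFunctionLinker, or the linked block)
// Keeping the whole unit 16-byte aligned allows DirectBlockDelinker to
// rewrite it with one atomic 128-bit store on LSE2 hardware.
struct alignas(16) LinkerStub {
  uint32_t LoadInsn;   // ldr x1, #8
  uint32_t BranchInsn; // blr x1
  uint64_t Target;     // absolute target address
};
static_assert(sizeof(LinkerStub) == 16, "must fit one 128-bit store");
static_assert(offsetof(LinkerStub, Target) == 8, "target is the aligned second half");
```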
Potential race conditions
There are some potential races to consider. Crucially, this concerns only robustness of code execution. The order of linking and unlinking itself requires external synchronization:
Two threads simultaneously unlinking the same block

The atomic 128-bit store in DirectBlockDelinker ensures there is no tearing.

Two threads simultaneously linking the same block

The threads will link the block to the same target, so either:
- both patch the first instruction to the same direct branch (`ldr x1, +8` -> `b +0x1234`), or
- both patch the 64-bit data word to the same value (`<address of ExitFunctionLinker>` -> address of target).

Since the patches are applied atomically, there is no tearing.
One thread unlinking a block that is being executed in another thread
We don't have external synchronization to prevent this from happening, but we can look at potential tearing issues with regard to code execution.

The only critical state of the executing thread is at offset 0x4: if the unlinking thread's patch becomes visible at this point, the executing thread will jump to the pre-patch target address. The code itself is robust, but this race condition requires external synchronization to be resolved.
One thread linking a block that is being executed in another thread
This is similar to the previous case, with the small nuance that the linking code will only patch the first 32 bits. The `blr x1` instruction (and data) will remain untouched, so if the other thread has already loaded the value at offset 0x8, execution will still be able to continue properly.

The fallback case will patch the 64 bits at offset 0x8 instead, which is also fine since the load in the other thread is atomic due to being aligned.
One thread unlinking a block that is simultaneously being linked
I'm not sure if this can happen, but atomicity on both sides ensures that the resulting code is valid and tear-free even though the linking thread patches only a subset of the code. The result will be either of two code blocks, which is a race condition that requires external synchronization to be resolved.
Further resources
There's a good overview of cache/barrier instructions on ARM at https://mariokartwii.com/armv8/ch30.html .