Reduce JIT time by 25% by sharing code buffers between threads#4479

Merged
Sonicadvance1 merged 14 commits intoFEX-Emu:mainfrom
neobrain:feature_codebuffer_sharing
Jun 2, 2025
Conversation

@neobrain
Member

@neobrain neobrain commented Apr 2, 2025

Overview

To ease state management, FEX currently dedicates a separate CodeBuffer to each thread for storing the output of JIT binary recompilation. Effectively, this means lots of code must be compiled multiple times, particularly during startup and load screens of an application. This PR addresses that by sharing CodeBuffers between threads wherever possible, drastically reducing JIT time and mitigating microstutters (see below for benchmarks).

The broader goal here is that CodeBuffer memory should never be discarded unless strictly necessary, establishing the invariant "every guest block is compiled exactly once". This will also prepare us for effective on-disk code caching (which will further reduce startup times).

Implementation: Partially persistent data structures

... partially what now?

This is an idea from functional programming to ensure thread-safe data access: trying to modify an object branches its state into a new version for the active thread while preserving the old version as a read-only copy for other threads. This can be implemented without CPU overhead as long as write accesses are mutex-protected.

This PR turns CodeBuffer into such a partially persistent data structure. In practice this means:

  • Exactly one CodeBuffer is now designated as "active", which means data can be appended to it
  • Lossy modifications to the active CodeBuffer will not invalidate any data in use by other threads (which is what enables safe CodeBuffer sharing across threads)
  • Instead, such lossy modifications trigger a new "version" of the data in the modifying thread. Old versions of the CodeBuffer persist as read-only data for use by the other threads.
  • The other threads can update their version of the CodeBuffer. This will decrease the reference count and eventually trigger deallocation of the old version

With the code in this PR, starting a new CodeBuffer version will wipe its entire contents (similar to today's semantics), but that's an implementation detail. In the future we can carry over old data to the new CodeBuffer using relocations, which would still be in line with the persistence model.
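
The versioning scheme described above can be sketched as follows. This is a minimal illustrative model, not FEX's actual interface: `CodeBufferManager`, `GetLatest`, and `StartNewVersion` are hypothetical names, and `shared_ptr` refcounting stands in for the real lifetime management.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// One refcounted version of the JIT output buffer.
struct CodeBuffer {
  std::vector<uint8_t> Data; // emitted host code
};

class CodeBufferManager {
public:
  // Threads fetch the latest version; shared_ptr refcounting keeps old
  // versions alive as long as any thread still references them.
  std::shared_ptr<CodeBuffer> GetLatest() {
    std::lock_guard<std::mutex> lk {Mutex};
    if (!Latest) {
      Latest = std::make_shared<CodeBuffer>();
    }
    return Latest;
  }

  // A lossy modification branches a fresh version. Readers of the old
  // version are unaffected until they cooperatively update via GetLatest;
  // once the last reference drops, the old version is deallocated.
  std::shared_ptr<CodeBuffer> StartNewVersion() {
    std::lock_guard<std::mutex> lk {Mutex};
    Latest = std::make_shared<CodeBuffer>();
    return Latest;
  }

private:
  std::mutex Mutex;
  std::shared_ptr<CodeBuffer> Latest;
};
```

Append-only writes go to the single active version and are visible to all sharing threads, while lossy events (resize, TSO migration, single-stepping) go through `StartNewVersion`.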

Measurements

Preliminary testing yielded the following numbers (gathered from Tracy):

|                           | Mirror's Edge1       | God of War2          | Yooka-Laylee3        | Hollow Knight4       |
| ------------------------- | -------------------- | -------------------- | -------------------- | -------------------- |
| JIT invocations (million) | 6.9 ⇨ 3.1 (⇨ 2.95)   | 0.97 ⇨ 0.72 (⇨ 0.56) | 0.80 ⇨ 0.72 (⇨ 0.53) | 0.71 ⇨ 0.68 (⇨ 0.41) |
| Total JIT time (sec)      | 20.9 ⇨ 15.5 (⇨ 13.0) | 10.3 ⇨ 8.4 (⇨ 5.1)   | 6.3 ⇨ 5.1 (⇨ 3.9)    | 5.0 ⇨ 4.4 (⇨ 2.7)    |
| JIT time savings (%)      | 26% (38%)            | 19% (50%)            | 19% (38%)            | 11% (47%)            |

I also measured the Steam startup time (up to the completed first render of the Library view). Since we can't profile JIT time across multiple processes, I had to measure the total startup time. The results are perhaps weakly favorable towards this PR (best measurement 35s⇨33s), but the fluctuation is too strong to say with confidence.

Implementation notes

Events that trigger a new CodeBuffer version

  • Running out of CodeBuffer space
  • TSO auto-migration (i.e. a one-time event upon starting the second application thread)
  • Single-stepping

Notably, switching between threads that compile new code does not trigger a new version. However, modifying accesses must be serialized, so a global lock is used when running the JIT backend (see next point).

Global JIT lock

This approach to sharing code buffers requires all JIT (backend) compilation to be logically single-threaded, so a global CodeBufferWriteMutex is used. This incurs some contention during startup of heavily multithreaded applications. In Mirror's Edge, I measured lock contention of 9% (5%⁵) of total time spent in JIT (worst case was 14% in smaller time windows). That overhead is included in the net improvements above, so there is practical room for further improvement.

It seems unlikely that even strongly multithreaded applications would suffer from this: with this PR, threads merely contend for a global lock, whereas previously they wasted time recompiling the same code over and over again.

In the future, the lock contention can be minimized by compiling to a short-lived thread-local buffer and relocating the resulting code to the actual CodeBuffer. Only the relocation would need to be mutex-protected then.
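
The two-phase pattern described above can be sketched like this, assuming hypothetical names (`SharedCodeBuffer`, `CompileToScratch`, `CommitToShared` are illustrative; only `CodeBufferWriteMutex` is from the PR):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Shared destination buffer, protected by the global write mutex.
struct SharedCodeBuffer {
  std::vector<uint8_t> Data;
  std::mutex WriteMutex; // stands in for CodeBufferWriteMutex
};

// Unlocked phase: compile into thread-local scratch storage. Real
// compilation happens here; copying the guest bytes is just a placeholder.
std::vector<uint8_t> CompileToScratch(const std::vector<uint8_t>& GuestBlock) {
  return GuestBlock;
}

// Locked phase: only relocating the finished output into the shared
// buffer is serialized, keeping the critical section short.
size_t CommitToShared(SharedCodeBuffer& Shared, const std::vector<uint8_t>& Scratch) {
  std::lock_guard<std::mutex> lk {Shared.WriteMutex};
  size_t Offset = Shared.Data.size();
  Shared.Data.insert(Shared.Data.end(), Scratch.begin(), Scratch.end());
  return Offset; // host code now lives at Shared.Data.data() + Offset
}
```

The payoff is that slow work (IR lowering, instruction emission) runs in parallel, and only the final memcpy-like relocation contends on the lock.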

Interaction with L3 LookupCache

We must rethink the semantics of the L3 LookupCache: it's not really a cache but rather the authoritative source for mapping guest addresses to host code pointers. Accordingly, this PR decouples it from the L1/L2 LookupCaches, renames it to GuestToHostMap, and embeds it into CodeBuffer. This means the same versioning applies to all data contained in GuestToHostMap. This makes sense overall: without sharing this mapping, threads would use the same CodeBuffer but wouldn't know where to find blocks already compiled on other threads.
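
Conceptually, the embedded map provides a lookup from guest address to a location in the shared buffer. A simplified sketch (the member names and `UINT64_MAX` sentinel are illustrative, not FEX's actual types):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Lives inside a CodeBuffer version, so sharing the buffer also shares
// knowledge of which guest blocks are already compiled.
struct GuestToHostMap {
  std::map<uint64_t, uint64_t> Blocks; // guest address -> offset into buffer

  void Insert(uint64_t GuestAddr, uint64_t HostOffset) {
    Blocks[GuestAddr] = HostOffset;
  }

  // Returns the host offset, or UINT64_MAX if the block isn't compiled yet
  // (the caller would then invoke the JIT).
  uint64_t Lookup(uint64_t GuestAddr) const {
    auto It = Blocks.find(GuestAddr);
    return It == Blocks.end() ? UINT64_MAX : It->second;
  }
};
```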

Since this data was previously assumed to be thread-local, care must be taken when invalidating it (e.g. for self-modifying code). I checked existing invalidation code paths and verified they are either idempotent (i.e. they run the same action multiple times without order-dependence or other change in effect) or already unused/broken today. The latter group includes SMC-full mode, which has been non-functional on main for an unknown amount of time and still needs to be fixed.

Signal handlers

Currently, FEX has dedicated CodeBuffer code to cope with signal handlers that must extend the lifetime of a CodeBuffer. This mechanism is now based on persistence as well, which makes it much more straightforward. Signal handlers will now simply increase the refcount of a CodeBuffer so that it doesn't get deallocated too early.
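
The refcount-based lifetime extension can be pictured as follows. The names (`ThreadState`, `HandleSignal`, `Keepalive`) are illustrative, not FEX's actual signal code; the point is only that an extra `shared_ptr` copy keeps the buffer mapped for the handler's duration:

```cpp
#include <cassert>
#include <memory>

struct CodeBuffer { /* host code pages */ };

struct ThreadState {
  std::shared_ptr<CodeBuffer> CurrentCodeBuffer;
};

// On signal entry, grab an extra reference. It is dropped automatically
// when the handler scope ends, so the buffer can't be deallocated while
// the handler may still return into code inside it.
long HandleSignal(ThreadState& Thread) {
  std::shared_ptr<CodeBuffer> Keepalive = Thread.CurrentCodeBuffer;
  // ... the thread may branch to a new CodeBuffer version meanwhile ...
  Thread.CurrentCodeBuffer = std::make_shared<CodeBuffer>();
  // The old buffer is still alive because Keepalive references it.
  return Keepalive.use_count();
}
```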

Stale CodeBuffer versions

Old CodeBuffer versions only get deallocated once all threads that reference them cooperatively update to a newer CodeBuffer. In some cases this may never happen (imagine e.g. a background thread spawned at start that enters a syscall waiting for an event that's never raised until program end). Due to geometric growth, the combined worst-case memory overhead of all such stale CodeBuffers is the size of the active CodeBuffer.

It's unclear if this can be fully resolved. However, we can largely mitigate this by minimizing the number of events that trigger version branching in the first place:

  • Use a larger initial size of the first CodeBuffer (possibly using a heuristic based on the program size itself)
  • Maybe more?

TODO

  • Clean up
  • Fix SMC-full mode (actually broken on main)
  • Revisit single-stepping code (ThreadManager::Step)
  • WoA support? (nothing special to do)
  • Verify invalidation logic for thread-local LookupCaches is functionally equivalent
  • Deallocate previously compiled code on TSO auto-migration (will just carry around dead code until a CodeBuffer clear for now)

Future

See #4514 for ideas to mitigate single-threaded compilation and other follow-up optimizations.

Footnotes

  1. Tested first 100s of the game, trying to gain control of the character as quickly as possible

  2. Tested first 60s of the game without entering any inputs

  3. Tested first 45s of the game, trying to enter one level in my save file as quickly as possible and waiting for the camera to stop moving

  4. Tested first 36s of the game, loading my last save file and jumping once

  5. Potential for follow-up change to carry over CodeBuffer data on resize

@bylaws
Collaborator

bylaws commented Apr 3, 2025

Have you considered allowing for some holes in the codebuffer during multithreaded compilation? Perhaps I'm missing a reason for it not to work but something like

LocalCodeBufferOffset = 0
locked {
  // Skip past the region reserved by this thread's previous compilation
  if (CurThreadWriterMaxSize) {
    CodeBufferOffset += CurThreadWriterMaxSize;
  }
  LocalCodeBufferOffset = CodeBufferOffset
  // Reserve a conservative upper bound for this block's output size
  CurThreadWriterMaxSize = SSANodes * 12
}

<Compile from LocalCodeBufferOffset>

locked {
  // If no other thread claimed space meanwhile, release the reservation
  if (LocalCodeBufferOffset == CodeBufferOffset) CurThreadWriterMaxSize = 0
}

Invalidation I suppose would need a barrier now but doesn't intuitively seem complex.

@neobrain
Member Author

neobrain commented Apr 3, 2025

Have you considered allowing for some holes in the codebuffer during multithreaded compilation?

The size estimates would be pretty rough - not sure how large the gaps we'd leave would be on average (wasting committed memory and possibly impacting host icache efficiency). When implementing the idea of relocation post-compile, this approach sounds better than using temporary storage though.

@bylaws
Collaborator

bylaws commented Apr 3, 2025

The size estimates would be pretty rough - not sure how large the gaps we'd leave would be on average (wasting committed memory and possibly impacting host icache efficiency). When implementing the idea of relocation post-compile, this approach sounds better than using temporary storage though.

Right - I wonder if this would be worth instrumenting just to get some numbers

@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from 979d4ff to b210ad8 Compare April 10, 2025 19:14
@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from 38b52e4 to 41c44a4 Compare April 18, 2025 07:39
@neobrain neobrain marked this pull request as ready for review April 18, 2025 07:52
@neobrain neobrain changed the title RFC: Reduce JIT time by 25% by sharing code buffers between threads Reduce JIT time by 25% by sharing code buffers between threads Apr 18, 2025
@neobrain
Member Author

Haven't encountered any issues in 2 weeks of testing, so let's land this!

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

@bylaws
Collaborator

bylaws commented Apr 18, 2025

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

I had this issue on master, not this pr

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

I had this issue on master, not this pr

So this PR...doesn't have a race between multiple threads trying to execute the same code and link/unlink at the same time?

This approach to sharing code buffers requires all JIT (backend) compilation to be logically single-threaded

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

@bylaws
Collaborator

bylaws commented Apr 18, 2025

So this PR...doesn't have a race between multiple threads trying to execute the same code and link/unlink at the same time?

The race with delinking happens even on master, though actually you're right that a linking one would be new - but is avoided with the global lock in ExitFunctionLink.

@neobrain
Member Author

neobrain commented Apr 18, 2025

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

Yeah we all talked past each other that meeting, I think. I hadn't seen any issues (otherwise I would've noted it down here) and ByLaws was referring to an issue on main.

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

Would've been nice to bring this up without letting this PR labeled as RFC sit for 2 weeks, but sure let me know if you find anything different from my measurements on this. It's plausible that single-threaded compilation is just as effective as the previous behavior of parallel-but-redundant compilation.

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

Yeah we all talked past each other that meeting, I think. I hadn't seen any issues (otherwise I would've noted it down here) and ByLaws was referring to an issue on main.

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

Would've been nice to bring this up without letting this PR labeled as RFC sit for 2 weeks, but sure let me know if you find anything different from my measurements on this. It's plausible that single-threaded compilation is just as effective as the previous behavior of parallel-but-redundant compilation.

So this introduces a known race in block linking that can result in a tear in visibility when one thread is patching the code and another thread is executing it.
I would very much want this to be fixed before this is merged. Make sure that for the matrix of four race conditions (link-direct, link-indirect, unlink-direct, unlink-indirect), the data being modified is aligned to a 16-byte granule and written with a 128-bit atomic store.

Also, some testing shows the global JIT lock does make stutters worse in games that hammer compilation at the same time, like RUINER-Linux and Cyberpunk 2077. RUINER even lost 10FPS of perf, dropping from 70FPS to 60FPS. So having the threads JIT into a transient buffer and then copy into the shared one would be preferred: a little additional work that gets thrown away when multiple threads compile the same code beats stuttering caused by job queue systems hitting the same code and stalling.

Still need to throw some more UE5 games at this to see how they behave.

void Erase(FEXCore::Core::CpuStateFrame* Frame, uint64_t Address, const LockToken&) {
// Sever any links to this block
auto lower = BlockLinks->lower_bound({Address, nullptr});
auto upper = BlockLinks->upper_bound({Address, reinterpret_cast<FEXCore::Context::ExitFunctionLinkData*>(UINTPTR_MAX)});
Collaborator

Perhaps numeric_limits would be cleaner here?

Member Author

This code just moved around, but I also don't think numeric_limits works for pointers (since this UINTPTR_MAX cast is technically UB anyway).

FEXCore::Allocator::VirtualFree(Buffer.Ptr, Buffer.Size);
fextl::shared_ptr<CodeBuffer> CodeBufferManager::GetCurrentCodeBuffer() {
if (!Latest) {
static constexpr size_t INITIAL_CODE_SIZE = 1024 * 1024 * 16;
Collaborator

Probably cleaner to keep such an important constant at the top of the file still


fextl::shared_ptr<CodeBuffer> GetCurrentCodeBuffer();

fextl::shared_ptr<CodeBuffer> Latest;
Collaborator

This seems best private, then GetCurrentCodeBuffer can always be used (maybe also drop GetCurrentCodeBufferSize? GetCurrentCodeBuffer()->Size is equally as short)

Member Author

Actually going the opposite way here, I think. Since Core.cpp doesn't use this interface anymore, we can just make all state public and access variables directly. CodeBuffers.Latest isn't quite the mouthful that CodeBuffers.GetCurrentCodeBuffer() was either.

Will drop GetCurrentCodeBufferSize however.

Member Author

Err, nevermind, I wasn't thinking. GetCurrentCodeBuffer isn't trivial, so we can't just drop it. Will follow your suggestion after all then.

CurrentCodeBuffer = CodeBuffers.GetCurrentCodeBuffer();
} else {
auto NewCodeBufferSize = CodeBuffers.GetCurrentCodeBufferSize();
NewCodeBufferSize = std::min<size_t>(NewCodeBufferSize * 2.0, MaxCodeSize);
Collaborator

Any particular reason for changing this to 2.0?

Member Author

@neobrain neobrain Apr 21, 2025


It's not a critically important change, but in principle this should reduce runtime use of committed memory. Here's the reasoning:

Two conflicting metrics could be considered here: Size of the current code buffer, and combined size of all CodeBuffers. The former metric prefers smaller factors (increasing only as much as necessary), but it's also less important since any unused excess memory is uncommitted (on Linux, anyway). The latter metric measures overhead of already committed memory and prefers bigger factors (since that reduces the total number of CodeBuffer versions needed due to regrowth, and since it reduces their individual relative size compared to the current CodeBuffer).

The previous factor 1.5 wastes up to twice the size of the latest CodeBuffer (1/1.5+1/(1.5**2)+1/(1.5**3)+...). The factor 2.0 in contrast will never waste more than the size of the latest CodeBuffer itself. Higher factors would theoretically fare even better, but my intuition says that it would just be diminishing returns for the average case (where most CodeBuffer versions are eventually discarded and where the latest CodeBuffer will contain some uncommitted memory).
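
The waste bounds follow from the geometric series: with growth factor f, the stale versions sum to at most 1/f + 1/f² + ... = 1/(f-1) of the latest buffer's size. A quick illustrative check (not FEX code):

```cpp
#include <cassert>
#include <cmath>

// Worst-case stale-buffer overhead, as a fraction of the latest buffer's
// size, when every older version is still referenced: sum of 1/f^k for
// k = 1..Versions. As Versions grows this converges to 1/(f-1).
double StaleOverheadBound(double Factor, int Versions) {
  double Sum = 0.0;
  for (int k = 1; k <= Versions; ++k) {
    Sum += 1.0 / std::pow(Factor, k);
  }
  return Sum;
}
```

With factor 1.5 the bound converges to 2.0 (twice the latest buffer), while factor 2.0 caps it at 1.0, matching the reasoning above.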

Collaborator

Right, makes sense - might wanna drop the .0 now that it's an integer

// This is the current code buffer that we are tracking
CodeBuffer* CurrentCodeBuffer {};
// This is the code buffer actively used by this thread
fextl::shared_ptr<CodeBuffer> CurrentCodeBuffer;
Collaborator

Maybe rename to ActiveCodeBuffer

Member Author

I think CurrentCodeBuffer is more fitting. Active has a ring of "being dynamic" to it, which would be misleading here: The current CodeBuffer may be read-only and hence will need to be updated when new code is compiled.

This is a weak opinion though, if you're convinced "active" is better I can change it.

Member Author

I'll update the comment though: // This is the code buffer containing the main code under execution by this thread

@neobrain
Member Author

Thanks for testing!

So this introduces a known race in block linking that can result in a tear in visibility when one thread is patching the code and another thread is executing it. I would very much want this to be fixed before this is merged. Make sure that for the matrix of four race conditions (link-direct, link-indirect, unlink-direct, unlink-indirect), the data being modified is aligned to a 16-byte granule and written with a 128-bit atomic store.

This has been addressed in #4528 now.

Also with some testing shows the global JIT lock does make stutters worse in games that hammer compilation at the same time like RUINER-Linux, and Cyberpunk 2077. RUINER even just lost 10FPS of perf, dropping from 70FPS to 60FPS. So having the threads JIT in a transient buffer and then copy to share would be preferred. A little bit of additional work that gets thrown away if multiple threads are compiling the same code is preferred to stuttering due to job queue systems hitting the same code and stalling.

RUINER with its >700 thread spawns per second is a rather pathological case, but making stutters in Cyberpunk noticeably worse would indeed be an undesirable side effect.

I'll look into re-enabling parallel compilation but will work on advancing the code caching work a bit more first. Putting this back to draft mode for now.

@neobrain
Member Author

neobrain commented May 5, 2025

I realized we can hold the CodeBufferWriteMutex for a much shorter time, as only the actual code emission and offset update must be protected by it. This reduced lock contention in Yooka Laylee by 66%. Compiling to a temporary buffer can yield an additional 11% reduction (from the baseline). I implemented a prototype for this (see b6af251) but the design decisions are nontrivial so I would rather do this in a follow-up PR.

@Sonicadvance1 Could you retest Cyberpunk with the new changes? If the stuttering persists, can you quantify how much worse it is? Did you find any other reasonable games that were impacted negatively?

Ultimately there is a trade-off between different types of workloads: Stuttering might be reduced in only some games while battery drain should be reduced across the board. Hopefully this latest update shifts the trade-off closer towards the optimum, though, and on-disk code caching should more than make up the few regressions eventually.

@bylaws
Collaborator

bylaws commented May 5, 2025

Quite interesting how long the ICache flush takes, if it meaningfully increases contention more than having the actual emission under the lock. I wonder if on some occasion after this PR it's a good idea to profile that.

@neobrain neobrain force-pushed the feature_codebuffer_sharing branch 3 times, most recently from 2a749cd to cc92250 Compare May 9, 2025 15:40
@neobrain
Member Author

neobrain commented May 9, 2025

Compiling to a temporary buffer can yield an additional 11% reduction (from the baseline). I implemented a prototype for this (see b6af251) but the design decisions are nontrivial so I would rather do this in a follow-up PR.

Implementing the same approach but using ThreadPoolAllocator to manage the temporary buffer improved the reduction to around 20% (from the baseline). It also resolves most of the design questions, so I implemented it as part of this PR now after all. Every thread will compile to a temporary buffer now, and the global lock is only taken to move the temporary output to the current CodeBuffer. (I still need to hunt down the source for the one unit test failure, but it's all working fine from real-world testing.)

Unfortunately we have to take a hit on space overhead for this when multiple threads compile the same block: Preventing this would require checking the LookupCache while CodeBufferWriteMutex is still locked, however that in turn requires waiting for the LookupCache mutex to become available (which increases overall lock contention again). We should probably look into maintaining some "in-progress" queue instead that threads can check before even attempting to compile in the first place, but this is out of scope for this PR.

Testing also revealed that our codegen size estimate has not been conservative enough, which was a latent source of bugs/crashes.

@Art-Chen

Testing with the latest head of this MR on Wine ARM64EC, I hit hangs when:

  • running the CPU-Z benchmark
  • exiting MiSide (which is a Windows Unity game)
  • starting Windows Steam (hangs after the login screen)

FPS also drops in GTA 5.

Even after rebasing this MR onto the main branch, the results are the same.

Maybe this needs some checking on Windows ARM64EC?

@neobrain
Member Author

neobrain commented May 26, 2025

Thanks for testing, @Art-Chen! Just some clarifying questions to see about reproducing this.

testing with latest head of this mr on Wine ARM64EC, i meet hung when :

cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

fps drop on GTA 5

How much did fps drop?

Wine ARM64EC

Are you using upstream Wine or ByLaws's fork?

@Art-Chen

testing with latest head of this mr on Wine ARM64EC, i meet hung when :
cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

Every time.

How much did fps drop?

About 15-20fps (testing with Wine client sync patches, i.e. esync/fsync and so on).

Are you using upstream Wine or ByLaws's fork?

Upstream Wine plus some patches picked from ByLaws (AVX / SVE / some hacks)

@neobrain
Member Author

Every time.

about 15-20fps. (testing with wine client sync patches, ie. esync / fsync and so on.)

Upstream Wine and picked some patch by ByLaws (AVX / SVE / some hack)

Interesting, though cpu-z is crashing for me with upstream Wine (git) and upstream FEX (2505) already. Could you double-check that the issues are fixed when going back to FEX d358768 ?

@Art-Chen

though cpu-z is crashing for me with upstream Wine (git) and upstream FEX (2505) already

The CPU-Z crash is a known issue and needs a workaround patch for Wine to get it running.

You can just try running MiSide to test it.
https://store.steampowered.com/app/2527500/_MiSide/

@bylaws
Collaborator

bylaws commented May 26, 2025

testing with latest head of this mr on Wine ARM64EC, i meet hung when :
cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

Every time.

How much did fps drop?

about 15-20fps. (testing with wine client sync patches, ie. esync / fsync and so on.)

Are you using upstream Wine or ByLaws's fork?

Upstream Wine and picked some patch by ByLaws (AVX / SVE / some hack)

There's a known bug here from a couple of weeks back that I'm not sure I pushed a fix for to my upstream-arm64ec branch. Might also be missing suspend patches. Could you use the arm64ec-10 branch please?

@Art-Chen

Might also be missing suspend patches

I tried to pick up the suspend-related patches from your branch, but they caused CEF-based applications to hang when using client sync (esync/fsync), so I reverted them.

Could you use the arm64ec-10 branch please?

I'll try later. Could you give more detail about this issue, and which patch addressed it? Thanks!

@Sonicadvance1
Member

Needs a rebase but I'm now looking at this.

@Sonicadvance1
Member

Cyberpunk 2077 is worse off with this PR because its worker threads constantly invalidate and JIT code, causing significant per-frame contention, but I don't think it is a blocker.

UE5 games so far don't seem worse off; they might even be slightly better off, since previously their worker threads would end up redundantly JITting new code that a different thread had already executed.

@Sonicadvance1
Member

Recent changes to this seem to improve RUINER on native Linux. Around 65FPS with this PR versus around 30 on main due to all the VLC JITting overhead.

neobrain added 14 commits June 1, 2025 22:42
This data isn't really a cache, since the JIT is directly responsible for
writing its contents. Instead it should be considered the source from which
the L1/L2 caches are populated.

Furthermore, splitting off this data allows it to be shared across threads
in the future without affecting L1/L2 caches.
This is required for sharing CodeBuffers between threads anyway, but it also
allows use of the constructor/destructor to manage memory automatically.
This changes the interface of CodeBuffer to that of a partially persistent
data structure based on reference counting:
- Exactly one CodeBuffer is now designated as "active", which means data can
  be *appended* to it
- Lossy modifications to the active CodeBuffer will not invalidate any data
  in use by other threads, which enables safe sharing across threads
- Instead, such lossy modifications trigger a new "version" of the data in
  the modifying thread. Old versions of the CodeBuffer persist as read-only
  data for use by the other threads.
- The other threads can update their version of the CodeBuffer. This will
  decrease the reference count and eventually trigger deallocation of the
  old version
This further reduces lock contention by skipping the backend phase in case
another thread raced the active one for the same block.
The previous bound was exceeded during Steam startup.
This no longer works since the JIT output is now relocated before execution.
@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from d46e431 to a792dd0 Compare June 1, 2025 20:48
@Sonicadvance1 Sonicadvance1 merged commit e794584 into FEX-Emu:main Jun 2, 2025
12 checks passed
@neobrain
Member Author

neobrain commented Jun 2, 2025

Forgot to reply to this before, but thanks for re-testing! I'll note down Cyberpunk for future benchmarking since there's some more low-hanging fruit for optimization here.

@neobrain neobrain deleted the feature_codebuffer_sharing branch July 2, 2025 15:07