Reduce JIT time by 25% by sharing code buffers between threads#4479

Merged
Sonicadvance1 merged 14 commits intoFEX-Emu:mainfrom
neobrain:feature_codebuffer_sharing
Jun 2, 2025
Conversation

@neobrain
Member

@neobrain neobrain commented Apr 2, 2025

Overview

To ease state management, FEX currently dedicates a separate CodeBuffer to each thread for storing the output of JIT binary recompilation. Effectively, this means lots of code must be compiled multiple times, particularly during startup and load screens of an application. This PR addresses that by sharing CodeBuffers between threads wherever possible, drastically reducing JIT time and mitigating microstutters (see below for benchmarks).

The broader goal here is that CodeBuffer memory should never be discarded unless strictly necessary, establishing the invariant "every guest block is compiled exactly once". This will also prepare us for effective on-disk code caching (which will further reduce startup times).

Implementation: Partially persistent data structures

... partially what now?

This is an idea from functional programming to ensure thread-safe data access: trying to modify an object branches its state into a new version for the active thread while preserving the old version as a read-only copy for other threads. This can be implemented without CPU overhead as long as write accesses are mutex-protected.

This PR turns CodeBuffer into such a partially persistent data structure. In practice this means:

  • Exactly one CodeBuffer is now designated as "active", which means data can be appended to it
  • Lossy modifications to the active CodeBuffer will not invalidate any data in use by other threads (which is what enables safe CodeBuffer sharing across threads)
  • Instead, such lossy modifications trigger a new "version" of the data in the modifying thread. Old versions of the CodeBuffer persist as read-only data for use by the other threads.
  • The other threads can update their version of the CodeBuffer. This will decrease the reference count and eventually trigger deallocation of the old version

With the code in this PR, starting a new CodeBuffer version will wipe its entire contents (similar to today's semantics), but that's an implementation detail. In the future we can carry over old data to the new CodeBuffer using relocations, which would still be in line with the persistence model.
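
The versioning scheme described above can be sketched as follows. This is a minimal illustrative model, not FEX's actual interface: `CodeBufferManager`, `GetLatest`, and `StartNewVersion` are hypothetical names, and `shared_ptr` refcounting stands in for the real lifetime management.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// One refcounted version of the JIT output buffer.
struct CodeBuffer {
  std::vector<uint8_t> Data; // emitted host code
};

class CodeBufferManager {
public:
  // Threads fetch the latest version; shared_ptr refcounting keeps old
  // versions alive as long as any thread still references them.
  std::shared_ptr<CodeBuffer> GetLatest() {
    std::lock_guard<std::mutex> lk {Mutex};
    if (!Latest) {
      Latest = std::make_shared<CodeBuffer>();
    }
    return Latest;
  }

  // A lossy modification branches a fresh version. Readers of the old
  // version are unaffected until they cooperatively update via GetLatest;
  // once the last reference drops, the old version is deallocated.
  std::shared_ptr<CodeBuffer> StartNewVersion() {
    std::lock_guard<std::mutex> lk {Mutex};
    Latest = std::make_shared<CodeBuffer>();
    return Latest;
  }

private:
  std::mutex Mutex;
  std::shared_ptr<CodeBuffer> Latest;
};
```

Append-only writes go to the single active version and are visible to all sharing threads, while lossy events (resize, TSO migration, single-stepping) go through `StartNewVersion`.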

Measurements

Preliminary testing yielded the following numbers (gathered from Tracy):

|                           | Mirror's Edge1       | God of War2          | Yooka-Laylee3        | Hollow Knight4       |
| ------------------------- | -------------------- | -------------------- | -------------------- | -------------------- |
| JIT invocations (million) | 6.9 ⇨ 3.1 (⇨ 2.95)   | 0.97 ⇨ 0.72 (⇨ 0.56) | 0.80 ⇨ 0.72 (⇨ 0.53) | 0.71 ⇨ 0.68 (⇨ 0.41) |
| Total JIT time (sec)      | 20.9 ⇨ 15.5 (⇨ 13.0) | 10.3 ⇨ 8.4 (⇨ 5.1)   | 6.3 ⇨ 5.1 (⇨ 3.9)    | 5.0 ⇨ 4.4 (⇨ 2.7)    |
| JIT time savings (%)      | 26% (38%)            | 19% (50%)            | 19% (38%)            | 11% (47%)            |

I also measured the Steam startup time (up to the completed first render of the Library view). Since we can't profile JIT time across multiple processes, I had to measure the total startup time. The results are perhaps weakly favorable towards this PR (best measurement 35s⇨33s), but the fluctuation is too strong to say with confidence.

Implementation notes

Events that trigger a new CodeBuffer version

  • Running out of CodeBuffer space
  • TSO auto-migration (i.e. a one-time event upon starting the second application thread)
  • Single-stepping

Notably, switching between threads that compile new code does not trigger a new version. However, modifying accesses must be serialized, so a global lock is used when running the JIT backend (see next point).

Global JIT lock

This approach to sharing code buffers requires all JIT (backend) compilation to be logically single-threaded, so a global CodeBufferWriteMutex is used. This incurs some contention during startup of heavily multithreaded applications. In Mirror's Edge, I measured lock contention of 9% (5%⁵) of total time spent in JIT (worst case was 14% in smaller time windows). That overhead is included in the net improvements above, so there is practical room for further improvement.

It seems unlikely that even strongly multithreaded applications would suffer from this: with this PR, threads merely contend for a global lock, whereas previously they wasted time recompiling the same code over and over again.

In the future, the lock contention can be minimized by compiling to a short-lived thread-local buffer and relocating the resulting code to the actual CodeBuffer. Only the relocation would need to be mutex-protected then.
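
The two-phase pattern described above can be sketched like this, assuming hypothetical names (`SharedCodeBuffer`, `CompileToScratch`, `CommitToShared` are illustrative; only `CodeBufferWriteMutex` is from the PR):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Shared destination buffer, protected by the global write mutex.
struct SharedCodeBuffer {
  std::vector<uint8_t> Data;
  std::mutex WriteMutex; // stands in for CodeBufferWriteMutex
};

// Unlocked phase: compile into thread-local scratch storage. Real
// compilation happens here; copying the guest bytes is just a placeholder.
std::vector<uint8_t> CompileToScratch(const std::vector<uint8_t>& GuestBlock) {
  return GuestBlock;
}

// Locked phase: only relocating the finished output into the shared
// buffer is serialized, keeping the critical section short.
size_t CommitToShared(SharedCodeBuffer& Shared, const std::vector<uint8_t>& Scratch) {
  std::lock_guard<std::mutex> lk {Shared.WriteMutex};
  size_t Offset = Shared.Data.size();
  Shared.Data.insert(Shared.Data.end(), Scratch.begin(), Scratch.end());
  return Offset; // host code now lives at Shared.Data.data() + Offset
}
```

The payoff is that slow work (IR lowering, instruction emission) runs in parallel, and only the final memcpy-like relocation contends on the lock.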

Interaction with L3 LookupCache

We must rethink the semantics of the L3 LookupCache: it's not really a cache but rather the authoritative source for mapping guest addresses to host code pointers. Accordingly, this PR decouples it from the L1/L2 LookupCaches, renames it to GuestToHostMap, and embeds it into CodeBuffer. This means the same versioning applies to all data contained in GuestToHostMap. This makes sense overall: without sharing this mapping, threads would use the same CodeBuffer but wouldn't know where to find blocks already compiled on other threads.
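
Conceptually, the embedded map provides a lookup from guest address to a location in the shared buffer. A simplified sketch (the member names and `UINT64_MAX` sentinel are illustrative, not FEX's actual types):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Lives inside a CodeBuffer version, so sharing the buffer also shares
// knowledge of which guest blocks are already compiled.
struct GuestToHostMap {
  std::map<uint64_t, uint64_t> Blocks; // guest address -> offset into buffer

  void Insert(uint64_t GuestAddr, uint64_t HostOffset) {
    Blocks[GuestAddr] = HostOffset;
  }

  // Returns the host offset, or UINT64_MAX if the block isn't compiled yet
  // (the caller would then invoke the JIT).
  uint64_t Lookup(uint64_t GuestAddr) const {
    auto It = Blocks.find(GuestAddr);
    return It == Blocks.end() ? UINT64_MAX : It->second;
  }
};
```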

Since this data was previously assumed to be thread-local, care must be taken when invalidating it (e.g. for self-modifying code). I checked existing invalidation code paths and verified they are either idempotent (i.e. they run the same action multiple times without order-dependence or other change in effect) or already unused/broken today. The latter group includes SMC-full mode, which has been non-functional on main for an unknown amount of time and still needs to be fixed.

Signal handlers

Currently, FEX has dedicated CodeBuffer code to cope with signal handlers that must extend the lifetime of a CodeBuffer. This mechanism is now based on persistence as well, which makes it much more straightforward. Signal handlers will now simply increase the refcount of a CodeBuffer so that it doesn't get deallocated too early.
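
The refcount-based lifetime extension can be pictured as follows. The names (`ThreadState`, `HandleSignal`, `Keepalive`) are illustrative, not FEX's actual signal code; the point is only that an extra `shared_ptr` copy keeps the buffer mapped for the handler's duration:

```cpp
#include <cassert>
#include <memory>

struct CodeBuffer { /* host code pages */ };

struct ThreadState {
  std::shared_ptr<CodeBuffer> CurrentCodeBuffer;
};

// On signal entry, grab an extra reference. It is dropped automatically
// when the handler scope ends, so the buffer can't be deallocated while
// the handler may still return into code inside it.
long HandleSignal(ThreadState& Thread) {
  std::shared_ptr<CodeBuffer> Keepalive = Thread.CurrentCodeBuffer;
  // ... the thread may branch to a new CodeBuffer version meanwhile ...
  Thread.CurrentCodeBuffer = std::make_shared<CodeBuffer>();
  // The old buffer is still alive because Keepalive references it.
  return Keepalive.use_count();
}
```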

Stale CodeBuffer versions

Old CodeBuffer versions only get deallocated once all threads that reference them cooperatively update to a newer CodeBuffer. In some cases this may never happen (imagine e.g. a background thread spawned at start that enters a syscall waiting for an event that's never raised until program end). Due to geometric growth, the combined worst-case memory overhead of all such stale CodeBuffers is the size of the active CodeBuffer.

It's unclear if this can be fully resolved. However, we can largely mitigate this by minimizing the number of events that trigger version branching in the first place:

  • Use a larger initial size of the first CodeBuffer (possibly using a heuristic based on the program size itself)
  • Maybe more?

TODO

  • Clean up
  • Fix SMC-full mode (actually broken on main)
  • Revisit single-stepping code (ThreadManager::Step)
  • WoA support? (nothing special to do)
  • Verify invalidation logic for thread-local LookupCaches is functionally equivalent
  • Deallocate previously compiled code on TSO auto-migration (will just carry around dead code until a CodeBuffer clear for now)

Future

See #4514 for ideas to mitigate single-threaded compilation and other follow-up optimizations.

Footnotes

  1. Tested first 100s of the game, trying to gain control of the character as quickly as possible

  2. Tested first 60s of the game without entering any inputs

  3. Tested first 45s of the game, trying to enter one level in my save file as quickly as possible and waiting for the camera to stop moving

  4. Tested first 36s of the game, loading my last save file and jumping once

  5. Potential for follow-up change to carry over CodeBuffer data on resize

@bylaws
Collaborator

bylaws commented Apr 3, 2025

Have you considered allowing for some holes in the codebuffer during multithreaded compilation? Perhaps I'm missing a reason for it not to work but something like

LocalCodeBufferOffset = 0
locked {
  // Skip past the region reserved by this thread's previous compilation
  if (CurThreadWriterMaxSize) {
    CodeBufferOffset += CurThreadWriterMaxSize;
  }
  LocalCodeBufferOffset = CodeBufferOffset
  // Reserve a conservative upper bound for this block's output size
  CurThreadWriterMaxSize = SSANodes * 12
}

<Compile from LocalCodeBufferOffset>

locked {
  // If no other thread claimed space meanwhile, release the reservation
  if (LocalCodeBufferOffset == CodeBufferOffset) CurThreadWriterMaxSize = 0
}

Invalidation I suppose would need a barrier now but doesn't intuitively seem complex.

@neobrain
Member Author

neobrain commented Apr 3, 2025

Have you considered allowing for some holes in the codebuffer during multithreaded compilation?

The size estimates would be pretty rough - not sure how large the gaps we'd leave would be on average (wasting committed memory and possibly impacting host icache efficiency). When implementing the idea of relocation post-compile, this approach sounds better than using temporary storage though.

@bylaws
Collaborator

bylaws commented Apr 3, 2025

The size estimates would be pretty rough - not sure how large the gaps we'd leave would be on average (wasting committed memory and possibly impacting host icache efficiency). When implementing the idea of relocation post-compile, this approach sounds better than using temporary storage though.

Right - I wonder if this would be worth instrumenting just to get some numbers

@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from 979d4ff to b210ad8 Compare April 10, 2025 19:14
@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from 38b52e4 to 41c44a4 Compare April 18, 2025 07:39
@neobrain neobrain marked this pull request as ready for review April 18, 2025 07:52
@neobrain neobrain changed the title RFC: Reduce JIT time by 25% by sharing code buffers between threads Reduce JIT time by 25% by sharing code buffers between threads Apr 18, 2025
@neobrain
Member Author

Haven't encountered any issues in 2 weeks of testing, so let's land this!

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

@bylaws
Collaborator

bylaws commented Apr 18, 2025

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

I had this issue on master, not this pr

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

I had this issue on master, not this pr

So this PR...doesn't have a race between multiple threads trying to execute the same code and link/unlink at the same time?

This approach to sharing code buffers requires all JIT (backend) compilation to be logically single-threaded

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

@bylaws
Collaborator

bylaws commented Apr 18, 2025

So this PR...doesn't have a race between multiple threads trying to execute the same code and link/unlink at the same time?

The race with delinking happens even on master, though actually you're right that a linking one would be new - but is avoided with the global lock in ExitFunctionLink.

@neobrain
Member Author

neobrain commented Apr 18, 2025

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

Yeah we all talked past each other that meeting, I think. I hadn't seen any issues (otherwise I would've noted it down here) and ByLaws was referring to an issue on main.

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

Would've been nice to bring this up without letting this PR labeled as RFC sit for 2 weeks, but sure let me know if you find anything different from my measurements on this. It's plausible that single-threaded compilation is just as effective as the previous behavior of parallel-but-redundant compilation.

@Sonicadvance1
Member

I haven't yet looked at this since we had the discussion about block linking and unlinking being a race. What was decided on that?

Yeah we all talked past each other that meeting, I think. I hadn't seen any issues (otherwise I would've noted it down here) and ByLaws was referring to an issue on main.

This also spooks me a bit. I'll need to try some games like Cyberpunk, UE5 games, and RUINER to see if this harms performance on those highly threaded games.

Would've been nice to bring this up without letting this PR labeled as RFC sit for 2 weeks, but sure let me know if you find anything different from my measurements on this. It's plausible that single-threaded compilation is just as effective as the previous behavior of parallel-but-redundant compilation.

So this introduces a known race in block linking that can result in a tear in visibility when one thread is patching the code and another thread is executing it.
I would very much want this to be fixed before this is merged. Make sure that for the matrix of four race conditions (link-direct, link-indirect, unlink-direct, unlink-indirect), the data being modified is aligned to a 16-byte granule and written with a 128-bit atomic store.

Also, some testing shows the global JIT lock does make stutters worse in games that hammer compilation at the same time, like RUINER-Linux and Cyberpunk 2077. RUINER even lost 10FPS of perf, dropping from 70FPS to 60FPS. So having the threads JIT into a transient buffer and then copy into the shared one would be preferred: a little additional work that gets thrown away when multiple threads compile the same code beats stuttering caused by job queue systems hitting the same code and stalling.

Still need to throw some more UE5 games at this to see how they behave.

void Erase(FEXCore::Core::CpuStateFrame* Frame, uint64_t Address, const LockToken&) {
// Sever any links to this block
auto lower = BlockLinks->lower_bound({Address, nullptr});
auto upper = BlockLinks->upper_bound({Address, reinterpret_cast<FEXCore::Context::ExitFunctionLinkData*>(UINTPTR_MAX)});
Collaborator

Perhaps numeric_limits would be cleaner here?

Member Author

This code just moved around, but I also don't think numeric_limits works for pointers (since this UINTPTR_MAX cast is technically UB anyway).

FEXCore::Allocator::VirtualFree(Buffer.Ptr, Buffer.Size);
fextl::shared_ptr<CodeBuffer> CodeBufferManager::GetCurrentCodeBuffer() {
if (!Latest) {
static constexpr size_t INITIAL_CODE_SIZE = 1024 * 1024 * 16;
Collaborator

Probably cleaner to keep such an important constant at the top of the file still


fextl::shared_ptr<CodeBuffer> GetCurrentCodeBuffer();

fextl::shared_ptr<CodeBuffer> Latest;
Collaborator

This seems best private, then GetCurrentCodeBuffer can always be used (maybe also drop GetCurrentCodeBufferSize? GetCurrentCodeBuffer()->Size is equally as short)

Member Author

Actually going the opposite way here, I think. Since Core.cpp doesn't use this interface anymore, we can just make all state public and access variables directly. CodeBuffers.Latest isn't quite the mouthful that CodeBuffers.GetCurrentCodeBuffer() was either.

Will drop GetCurrentCodeBufferSize however.

Member Author

Err, nevermind, I wasn't thinking. GetCurrentCodeBuffer isn't trivial, so we can't just drop it. Will follow your suggestion after all then.

CurrentCodeBuffer = CodeBuffers.GetCurrentCodeBuffer();
} else {
auto NewCodeBufferSize = CodeBuffers.GetCurrentCodeBufferSize();
NewCodeBufferSize = std::min<size_t>(NewCodeBufferSize * 2.0, MaxCodeSize);
Collaborator

Any particular reason for changing this to 2.0?

Member Author

@neobrain neobrain Apr 21, 2025


It's not a critically important change, but in principle this should reduce runtime use of committed memory. Here's the reasoning:

Two conflicting metrics could be considered here: Size of the current code buffer, and combined size of all CodeBuffers. The former metric prefers smaller factors (increasing only as much as necessary), but it's also less important since any unused excess memory is uncommitted (on Linux, anyway). The latter metric measures overhead of already committed memory and prefers bigger factors (since that reduces the total number of CodeBuffer versions needed due to regrowth, and since it reduces their individual relative size compared to the current CodeBuffer).

The previous factor 1.5 wastes up to twice the size of the latest CodeBuffer (1/1.5+1/(1.5**2)+1/(1.5**3)+...). The factor 2.0 in contrast will never waste more than the size of the latest CodeBuffer itself. Higher factors would theoretically fare even better, but my intuition says that it would just be diminishing returns for the average case (where most CodeBuffer versions are eventually discarded and where the latest CodeBuffer will contain some uncommitted memory).
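
The waste bounds follow from the geometric series: with growth factor f, the stale versions sum to at most 1/f + 1/f² + ... = 1/(f-1) of the latest buffer's size. A quick illustrative check (not FEX code):

```cpp
#include <cassert>
#include <cmath>

// Worst-case stale-buffer overhead, as a fraction of the latest buffer's
// size, when every older version is still referenced: sum of 1/f^k for
// k = 1..Versions. As Versions grows this converges to 1/(f-1).
double StaleOverheadBound(double Factor, int Versions) {
  double Sum = 0.0;
  for (int k = 1; k <= Versions; ++k) {
    Sum += 1.0 / std::pow(Factor, k);
  }
  return Sum;
}
```

With factor 1.5 the bound converges to 2.0 (twice the latest buffer), while factor 2.0 caps it at 1.0, matching the reasoning above.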

Collaborator

Right, makes sense - might wanna drop the .0 now that it's an integer

// This is the current code buffer that we are tracking
CodeBuffer* CurrentCodeBuffer {};
// This is the code buffer actively used by this thread
fextl::shared_ptr<CodeBuffer> CurrentCodeBuffer;
Collaborator

Maybe rename to ActiveCodeBuffer

Member Author

I think CurrentCodeBuffer is more fitting. Active has a ring of "being dynamic" to it, which would be misleading here: The current CodeBuffer may be read-only and hence will need to be updated when new code is compiled.

This is a weak opinion though, if you're convinced "active" is better I can change it.

Member Author

I'll update the comment though: // This is the code buffer containing the main code under execution by this thread

@neobrain
Member Author

Thanks for testing!

So this introduces a known race in block linking that can result in a tear in visibility when one thread is patching the code and another thread is executing it. I would very much want this to be fixed before this is merged. Make sure that for the matrix of four race conditions (link-direct, link-indirect, unlink-direct, unlink-indirect), the data being modified is aligned to a 16-byte granule and written with a 128-bit atomic store.

This has been addressed in #4528 now.

Also with some testing shows the global JIT lock does make stutters worse in games that hammer compilation at the same time like RUINER-Linux, and Cyberpunk 2077. RUINER even just lost 10FPS of perf, dropping from 70FPS to 60FPS. So having the threads JIT in a transient buffer and then copy to share would be preferred. A little bit of additional work that gets thrown away if multiple threads are compiling the same code is preferred to stuttering due to job queue systems hitting the same code and stalling.

RUINER with its >700 thread spawns per second is a rather pathological case, but making stutters in Cyberpunk noticeably worse would indeed be an undesirable side effect.

I'll look into re-enabling parallel compilation but will work on advancing the code caching work a bit more first. Putting this back to draft mode for now.

@neobrain
Member Author

neobrain commented May 5, 2025

I realized we can hold the CodeBufferWriteMutex for a much shorter time, as only the actual code emission and offset update must be protected by it. This reduced lock contention in Yooka Laylee by 66%. Compiling to a temporary buffer can yield an additional 11% reduction (from the baseline). I implemented a prototype for this (see b6af251) but the design decisions are nontrivial so I would rather do this in a follow-up PR.

@Sonicadvance1 Could you retest Cyberpunk with the new changes? If the stuttering persists, can you quantify how much worse it is? Did you find any other reasonable games that were impacted negatively?

Ultimately there is a trade-off between different types of workloads: Stuttering might be reduced in only some games while battery drain should be reduced across the board. Hopefully this latest update shifts the trade-off closer towards the optimum, though, and on-disk code caching should more than make up the few regressions eventually.

@bylaws
Collaborator

bylaws commented May 5, 2025

Quite interesting how long the ICache flush takes, if it meaningfully increases contention more than having the actual emission under the lock. I wonder if on some occasion after this PR it's a good idea to profile that.

@neobrain neobrain force-pushed the feature_codebuffer_sharing branch 3 times, most recently from 2a749cd to cc92250 Compare May 9, 2025 15:40
@neobrain
Member Author

neobrain commented May 9, 2025

Compiling to a temporary buffer can yield an additional 11% reduction (from the baseline). I implemented a prototype for this (see b6af251) but the design decisions are nontrivial so I would rather do this in a follow-up PR.

Implementing the same approach but using ThreadPoolAllocator to manage the temporary buffer improved the reduction to around 20% (from the baseline). It also resolves most of the design questions, so I implemented it as part of this PR now after all. Every thread will compile to a temporary buffer now, and the global lock is only taken to move the temporary output to the current CodeBuffer. (I still need to hunt down the source for the one unit test failure, but it's all working fine from real-world testing.)

Unfortunately we have to take a hit on space overhead for this when multiple threads compile the same block: Preventing this would require checking the LookupCache while CodeBufferWriteMutex is still locked, however that in turn requires waiting for the LookupCache mutex to become available (which increases overall lock contention again). We should probably look into maintaining some "in-progress" queue instead that threads can check before even attempting to compile in the first place, but this is out of scope for this PR.

Testing also revealed that our codegen size estimate has not been conservative enough, which was a latent source of bugs/crashes.

@Art-Chen

Testing with the latest head of this MR on Wine ARM64EC, I hit hangs when:

  • running the CPU-Z benchmark
  • exiting MiSide (which is a Windows Unity game)
  • starting Windows Steam (hangs after the login screen)

FPS also drops in GTA 5.

Even after rebasing this MR onto the main branch, the results are the same.

Maybe this needs some checking on Windows ARM64EC?

@neobrain
Member Author

neobrain commented May 26, 2025

Thanks for testing, @Art-Chen! Just some clarifying questions to see about reproducing this.

testing with latest head of this mr on Wine ARM64EC, i meet hung when :

cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

fps drop on GTA 5

How much did fps drop?

Wine ARM64EC

Are you using upstream Wine or ByLaws's fork?

@Art-Chen

testing with latest head of this mr on Wine ARM64EC, i meet hung when :
cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

Every time.

How much did fps drop?

About 15-20fps (testing with Wine client sync patches, i.e. esync/fsync and so on).

Are you using upstream Wine or ByLaws's fork?

Upstream Wine plus some patches picked from ByLaws (AVX / SVE / some hacks)

@neobrain
Member Author

Every time.

about 15-20fps. (testing with wine client sync patches, ie. esync / fsync and so on.)

Upstream Wine and picked some patch by ByLaws (AVX / SVE / some hack)

Interesting, though cpu-z is crashing for me with upstream Wine (git) and upstream FEX (2505) already. Could you double-check that the issues are fixed when going back to FEX d358768 ?

@Art-Chen

though cpu-z is crashing for me with upstream Wine (git) and upstream FEX (2505) already

The CPU-Z crash is a known issue and needs a workaround patch for Wine to get it running.

You can just try running MiSide to test it.
https://store.steampowered.com/app/2527500/_MiSide/

@bylaws
Collaborator

bylaws commented May 26, 2025

testing with latest head of this mr on Wine ARM64EC, i meet hung when :
cpu-z benchmark
exiting MiSide(which is an Windows Unity Game)
Starting Windows Steam and hung after login screen

How often does this happen? Does it happen every time or only sometimes?

Every time.

How much did fps drop?

about 15-20fps. (testing with wine client sync patches, ie. esync / fsync and so on.)

Are you using upstream Wine or ByLaws's fork?

Upstream Wine and picked some patch by ByLaws (AVX / SVE / some hack)

There's a known bug here from a couple of weeks back that I'm not sure I pushed a fix for to my upstream-arm64ec branch. Might also be missing suspend patches. Could you use the arm64ec-10 branch please?

@Art-Chen

Might also be missing suspend patches

I tried to pick up the suspend-related patches from your branch, but they caused CEF-based applications to hang when using client sync (esync/fsync), so I reverted them.

Could you use the arm64ec-10 branch please?

I'll try later. Could you give more detail about this issue, and which patch addressed it? Thanks!

@Sonicadvance1
Member

Needs a rebase but I'm now looking at this.

@Sonicadvance1
Member

Cyberpunk 2077 is worse off with this PR because its worker threads constantly invalidate and JIT code, causing significant per-frame contention, but I don't think it is a blocker.

UE5 games so far don't seem worse off; they might even be slightly better off, since previously their worker threads would end up redundantly JITting new code that a different thread had already executed.

@Sonicadvance1
Member

Recent changes to this seem to improve RUINER on native Linux. Around 65FPS with this PR versus around 30 on main due to all the VLC JITting overhead.

neobrain added 14 commits June 1, 2025 22:42
This data isn't really a cache, since the JIT is directly responsible for
writing its contents. Instead it should be considered the source from which
the L1/L2 caches are populated.

Furthermore, splitting off this data allows it to be shared across threads
in the future without affecting L1/L2 caches.
This is required for sharing CodeBuffers between threads anyway, but it also
allows use of the constructor/destructor to manage memory automatically.
This changes the interface of CodeBuffer to that of a partially persistent
data structure based on reference counting:
- Exactly one CodeBuffer is now designated as "active", which means data can
  be *appended* to it
- Lossy modifications to the active CodeBuffer will not invalidate any data
  in use by other threads, which enables safe sharing across threads
- Instead, such lossy modifications trigger a new "version" of the data in
  the modifying thread. Old versions of the CodeBuffer persist as read-only
  data for use by the other threads.
- The other threads can update their version of the CodeBuffer. This will
  decrease the reference count and eventually trigger deallocation of the
  old version
This further reduces lock contention by skipping the backend phase in case
another thread raced the active one for the same block.
The previous bound was exceeded during Steam startup.
This no longer works since the JIT output is now relocated before execution.
@neobrain neobrain force-pushed the feature_codebuffer_sharing branch from d46e431 to a792dd0 Compare June 1, 2025 20:48
@Sonicadvance1 Sonicadvance1 merged commit e794584 into FEX-Emu:main Jun 2, 2025
12 checks passed
@neobrain
Member Author

neobrain commented Jun 2, 2025

Forgot to reply to this before, but thanks for re-testing! I'll note down Cyberpunk for future benchmarking since there's some more low-hanging fruit for optimization here.

@neobrain neobrain deleted the feature_codebuffer_sharing branch July 2, 2025 15:07