Skip to content

Multithreaded evaluator#10938

Closed
edolstra wants to merge 82 commits intoNixOS:masterfrom
DeterminateSystems:multithreaded-eval
Closed

Multithreaded evaluator#10938
edolstra wants to merge 82 commits intoNixOS:masterfrom
DeterminateSystems:multithreaded-eval

Conversation

@edolstra
Copy link
Member

@edolstra edolstra commented Jun 19, 2024

Motivation

This PR makes the evaluator thread-safe. Currently, only nix flake search and nix flake show make use of multi-threaded evaluation to achieve a speedup on multicore systems.

Unlike the previous attempt at a multi-threaded evaluator, this one locks thunks to prevent them from being evaluated more than once. The life of a thunk is now:

  • On the first forceValue() call, the thunk type goes from tThunk to tPending.
  • If another thread does a forceValue() on a thunk in the tPending state, it acquires a lock to register itself as "awaiting" that value, and sets the type to tAwaited.
  • Once the first thread finished the value and its type is tAwaited, it updates the value and wakes up the threads that are waiting. If the type is tPending, it just updates the value normally.

Also, there now is a tFailed value type that stores an exception pointer to represent the case where thunk evaluation throws an exception. In that case, every thread that forces the thunk should get the same exception.

To enable multi-threaded evaluation, you need to set the NR_CORES environment variable to the number of threads to use. You can also set NIX_SHOW_THREAD_STATS=1 to get some debug statistics.

Some benchmark results on a Ryzen 5900X with 12 cores and 24 hyper-threads:

  • GC_INITIAL_HEAP_SIZE=8G nix flake show --eval-cores 12 --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298 went from 23.70s to 5.77s.
  • GC_INITIAL_HEAP_SIZE=6G nix search --eval-cores 16 --no-eval-cache github:NixOS/nixpkgs/bf8462aeba50cc753971480f613fbae0747cffc0?narHash=sha256-bPyv7hsbtuxyL6LLKtOYL6QsmPeFWP839BZQMd3RoUg%3D ^ went from 11.82s to 3.88s.

Note: it's good to set GC_INITIAL_HEAP_SIZE to a high value because stop-the-world garbage collection is expensive.

To do:

  • Infinite recursion detection through blackholing is disabled when parallel eval is enabled.
  • More commands should be multi-threaded, in particular nix flake check.
  • We should have some auto-parallelization of single evaluations (like NixOS system configurations). One way to do this would be to evaluate all attributes of a derivation in parallel.
  • This PR make some high contention data structures (in particular the symbol table) more multi-thread friendly, but there is more that can be done.
  • The Executor class currently executes work items in random order to reduce the probability that we execute a bunch of items at the same time that all depend on the same thunk, causing all but one to be blocked. This can probably be improved.

Context

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

edolstra added 30 commits May 20, 2024 10:09
This is a mapping from paths to "resolved" paths (i.e. with
`default.nix` added, if appropriate). `fileParseCache` and
`fileEvalCache` are now keyed on the resolved path *only*.
Previously, the optimistic concurrency approach in `evalFile()` meant
that a `nix search nixpkgs ^` would do hundreds of duplicated
parsings/evaluations. Now, we reuse the thunk locking mechanism to
ensure it's done only once.
This refactoring allows the symbol table to be stored as something
other than std::strings.
This allows symbol IDs to be offsets into an arena whose base offset
never moves, and can therefore be dereferenced without any locks.
This makes it less likely that we concurrently execute tasks that
would block on a common subtask, e.g. evaluating `libfoo` and
`libfoo_variant` are likely to have common dependencies.
@roberth
Copy link
Member

roberth commented Dec 6, 2024

  • Infinite recursion detection through blackholing is currently disabled.

Relevant paper with a design that implements this.

It's quite a different architecture, but maybe we need that? Specifically, they check blackholes on the stack during GC.

I guess the alternative is to add more bookkeeping so that perhaps another thread could be tasked with periodically checking that no thread is deadlocked, and if they are, insert exception thunks, unblock them, and let them unwind their stacks.
That hinges on the overhead of the added bookkeeping.

Fwiw, they have

lock-free thunks

It sure would be nice not to regress single-threaded performance, and have a boost in multi-threaded as well.

I wouldn't oppose taking control of the stack instead of delegating all of that completely to bdwgc. Doing so would open the door to other benefits, such as more efficient --debugger, printing warning traces with aborting, and a more reliable and/or higher max-call-depth. (OT: but not necessarily tail calls, because that's more or less s/if/while in forceValue)

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/determinate-nix-3-0/61202/189

@crertel crertel mentioned this pull request Mar 10, 2025
@edolstra edolstra force-pushed the multithreaded-eval branch from a134740 to d300e13 Compare March 11, 2025 21:32
@crertel
Copy link

crertel commented Mar 12, 2025

@roberth @Ericson2314 was there more to be done on here? Reading through the history, it looked like there were some concerns about single-threaded performance regression, but was there something else that makes this not ready for prime-time yet?

@CertainLach
Copy link
Member

CertainLach commented Mar 17, 2025

Reading through the history, it looked like there were some concerns about single-threaded performance regression, but was there something else that makes this not ready for prime-time yet?

Blackholing is still missing, thus in case of infinite recursion nix is just freezing on this branch, instead of throwing an error.

@crertel
Copy link

crertel commented Mar 17, 2025

Reading through the history, it looked like there were some concerns about single-threaded performance regression, but was there something else that makes this not ready for prime-time yet?

Blackholing is still missing, thus in case of infinite recursion nix is just freezing on this branch, instead of throwing an error.

Is that error condition easy to provoke, or a nice-to-have?

@roberth
Copy link
Member

roberth commented Mar 18, 2025

Blackholing

easy to provoke, or a nice-to-have?

Easy to provoke and very important for users.

const char * c_str() const
{
return s->c_str();
return s.data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String views are not guaranteed to be null-terminated, so this might be incorrect in certain cases.

@dsalaza4
Copy link

Hi! Really looking forward to this. Are there any plans on releasing it soon?

@edolstra
Copy link
Member Author

Closing this PR because it's outdated. An up-to-date version based on the new 16-byte Value type can be found here.

@edolstra edolstra closed this Jul 15, 2025
@bjornfor
Copy link
Contributor

Closing this PR because it's outdated. An up-to-date version based on the new 16-byte Value type can be found here.

Why is the PR made against https://github.com/DeterminateSystems/nix-src and not https://github.com/NixOS/nix?

@waltmck
Copy link

waltmck commented Jul 15, 2025

Closing this PR because it's outdated. An up-to-date version based on the new 16-byte Value type can be found here.

Why is the PR made against https://github.com/DeterminateSystems/nix-src and not https://github.com/NixOS/nix?

Is this surprising? Determinate Systems stands to directly monetarily benefit from competing with the upstream Nix implementation. Why then would you expect its employees to act against their interests by prioritizing upstream contributions?

This is not obviously an entirely bad thing. The GPL means that any downstream contributions can eventually be upstreamed, so everyone stands to benefit from this work (and contributors have a right to try to monetize their contributions in the meantime). I am just confused as to why everybody seems to expect the DetNix contributors to act against their own self-interest---maybe there is something else here that I am misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c api Nix as a C library with a stable interface fetching Networking with the outside (non-Nix) world, input locking new-cli Relating to the "nix" command repl The Read Eval Print Loop, "nix repl" command and debugger store Issues and pull requests concerning the Nix store with-tests Issues related to testing. PRs with tests have some priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.