reduce the size of Attr from 3 pointers to 2 on 64 bit machines#6218
reduce the size of Attr from 3 pointers to 2 on 64 bit machines#6218thufschmitt merged 12 commits intoNixOS:masterfrom pennae:pos-symbol-tables
Conversation
|
🎉 All dependencies have been resolved ! |
|
rebased to current master now. :) |
|
Do you have benchmarks on the eval perf improvement here? |
|
@lovesegfault benchmarks results are in the final commit message, copied here for convenience: |
| SymbolPtr name; | ||
| Value * value; | ||
| PosPtr pos; | ||
| Value * value; |
There was a problem hiding this comment.
Maybe add a comment that the order of these was choosen purposely?
And perhaps some test/assertion that ensure that this expectation still holds true in the future?
There was a problem hiding this comment.
added a comment describing the reasoning here. not sure an assertion has much value though, do you have a scenario in mind that might interfere here?
There was a problem hiding this comment.
Since it has a signitifcant impact I think the trivial issue could be that someone adds another field here without fully understanding the consequences. Also just some refactoring (prior to your comment) might have lead to someone moving them around (for various reasons) without knowing that the layout was chosen deliberately.
There was a problem hiding this comment.
added an assertion for the expected size with a warning about why it's important.
|
puzzling that the edit: turns out, use-after-free of parser data in string_view. c++ needs lifetimes 🙄 |
|
@edolstra this has been getting a lot of merge conflicts recently, does it make sense to keep resolving them or should we wait until this is officially approved? |
|
NIX_SHOW_STATS trimmed for brevity |
|
rebased again. also renamed |
|
I am not up to date on these things and so could not review very way, but perhaps @layus can after #6204. I think it's good to have the people that do performance and diagnostic quality reviewing each other, as those goals in the general case have some tension. (To be clear, I don't see a sign of diagnostic quality regression in this case.) |
|
installer test has been stuck for nine days. nice. |
layus
left a comment
There was a problem hiding this comment.
No real concern here for me, except for the slight readability decrease that often comes with performance tuning.
As this is a huge amount of work, I think a decision should be taken soon on whether we want to continue in this direction or not.
src/libexpr/parser.y
Outdated
| { $$ = new ExprLambda(CUR_POS, data->symbols.create($1), 0, $3); } | ||
| | '{' formals '}' ':' expr_function | ||
| { $$ = new ExprLambda(CUR_POS, data->symbols.create(""), toFormals(*data, $2), $5); } | ||
| { $$ = new ExprLambda(CUR_POS, {}, toFormals(*data, $2), $5); } |
There was a problem hiding this comment.
This could use a comment for future readers. A dummy function with a proper name would help a lot.
There was a problem hiding this comment.
how about
| { $$ = new ExprLambda(CUR_POS, {}, toFormals(*data, $2), $5); } | |
| { $$ = new ExprLambda(CUR_POS, SymbolIdx::none, toFormals(*data, $2), $5); } |
which would be a new constant akin to noPos?
There was a problem hiding this comment.
added another constructor to ExprLambda instead because C++ really doesn't like constexpr data members of a type to be declared inside the type, and adding a noSymbol to the surrounding namespace for a single use seemed rather cluttering.
src/libexpr/eval.hh
Outdated
| [[gnu::noinline]] | ||
| void addErrorTrace(Error & e, const char * s, const std::string & s2) const; | ||
| [[gnu::noinline]] | ||
| void addErrorTrace(Error & e, const Pos & pos, const char * s, const std::string & s2) const; |
There was a problem hiding this comment.
This is the big pain point for me. Now these hacky functions leak into a much used, public header. I understand the purpose of these functions, but I would really like to find a more DRY alternative to this.
There was a problem hiding this comment.
could also make them private and make all the *Expr classes friends of EvalState, but that feels even more of a hack than this. we might be able to make them inline instead and define them only in eval-state.cpp to make all other uses compile time errors, but that only adds to the hackiness as well 😕
There was a problem hiding this comment.
How about going the other way around, and lookup the {Symbol,Position}Idx before passing them to throw*Error ? I My understanding is that the stack usage should not change. If this machinery is indeed about reducing stack space, then it is about the same.
That would avoid having to expose all these dummy and ill-defined functions in the headers, and let them settle when we should find a way to get rid of them.
Obviously it would make the code a bit more verbose, due to the repeated lookups, but that is not a big deal for that code in particular. If we wanted it terse, we should first take care of the try/catch and re-throwing bits, that clutter the eval and primops files.
There was a problem hiding this comment.
we did that at first, but it does clutters the call sites something fierce. this was mostly a readability choice as well and not something we're particularly attached to or did for runtime metrics. the symbols table is a public member of EvalState though, so we could make that parameter explicit and remove the declarations from the header again. (which then creates another problem: where does the argument go? first, last, somewhere in between?)
| { | ||
| auto aDrvPath = getAttr(root->state.sDrvPath, true); | ||
| auto aDrvPath = getAttr("drvPath", true); | ||
| auto drvPath = root->state.store->parseStorePath(aDrvPath->getString()); |
There was a problem hiding this comment.
Do I read correctly that this disables the former optimization of turning common strings into symbols ?
There was a problem hiding this comment.
in a manner of speaking. the optimization doesn't have a great impact here since the symbol is soon turned into a string again and used in an sqlite query, so the extra symbol table lookup we may have to do isn't very expensive in comparison.
There was a problem hiding this comment.
Right, I guess the few places where you had to move back from symbols to strings are the ones where you do not have access to a symbol table ? I see hot paths like the one in pirmops.cc still use the state->sFooBar "global" symbols.
There was a problem hiding this comment.
we do have access to the symbol in this location too (through root->state), in this case it just seemed more readable to not use it. wouldn't be a problem to use it here as well though if it bothers anyone.
There was a problem hiding this comment.
Just trying to understand the code really, so not bothered by anything here. From the diff it looks like a de-optimisation, but it is indeed restricted to a few places. All fine for me.
|
Does the memory usage regress when evaluating on 32-bit platforms? |
|
haven't tested 32 bit, but it should also decrease. Pos now elides a Symbol (aka pointer) per instances at the cost of two ints and a std::string per file. that should break even at five positions per file on average for 32 bit machines, and if nixpkgs is anything to go by that shouldn't be hard to achieve :) |
thufschmitt
left a comment
There was a problem hiding this comment.
Thanks, the speedup and memory reduction is much appreciated.
The diff is quite massive, so I might have missed something, but apart from the few comments I’ve left, looks good
src/nix/repl.cc
Outdated
| if (v.type() == nPath || v.type() == nString) { | ||
| PathSet context; | ||
| auto filename = state->coerceToString(noPos, v, context); | ||
| return {state->symbols.create(*filename), 0}; |
There was a problem hiding this comment.
Don’t you just create a symbol out of the string, and then convert it back into a string here?
There was a problem hiding this comment.
for the moment yes, but that's just to keep the visible diff small-ish. a later commit removes the symbol creation here
| // must always be invalid by default, add() replaces this with the actual value. | ||
| // subsequent add() calls use this index as a token to quickly check whether the | ||
| // current origins.back() can be reused or not. | ||
| mutable uint32_t idx = std::numeric_limits<uint32_t>::max(); |
There was a problem hiding this comment.
Rather than setting this after the fact, can’t we just require an Origin to only be created from a PosTable (and so get a valid idx right when initialized) ? Like it’s done for Symbols
There was a problem hiding this comment.
no, instances of this type double as origin information within the symbol table and as cookie values for users of the symbol table to enable reuse of origin entries. the idea is to have each site that creates positions allocate an Origin with an invalid idx, which the symbol table fills with a valid value when a position is created. each subsequent position creation need only check that the idx of the cookie is the idx of the last origin in the table and reuse that origin if it is, otherwise create a new origin and change the idx of the cookie. it turns out doing it this way is a lot more efficient than having each position creation pass the full file/origin pair and check those.
edit to clarify: multiple Origin cookies are allowed to exist at once, eg when doing nested parsing for some reason. once the second level has created a position the first level cookie is invalid and has to be updated transparently, otherwise very obscure bugs will happen
There was a problem hiding this comment.
Oh yes, didn’t think about nesting the parsing (dunno whether that happens in practice, but it’s definitely safer to assume that it can)
src/libexpr/nixexpr.hh
Outdated
| static constexpr unsigned CHUNK_SIZE = 8192; | ||
|
|
||
| std::vector<Origin> origins; | ||
| std::vector<std::vector<Offset>> offsets; |
There was a problem hiding this comment.
Do these nested vectors really provide a performance gain? And if so, could that be moved to its own class so that we decouple the storing logic from the actual caching? I think that would help a lot in making this easier to understand
There was a problem hiding this comment.
(and similar for the symbols ofc)
There was a problem hiding this comment.
they do once you have a lot of them. allocating multiple fixed-size vectors is faster than reallocating a vector that keeps growing and has less overhead once the tables get large, at the expense of an additional indirection during lookup. extracting that to a separate type won't be a problem.
src/libexpr/eval-cache.cc
Outdated
| } | ||
|
|
||
| ref<AttrCursor> AttrCursor::getAttr(Symbol name, bool forceErrors) | ||
| std::shared_ptr<AttrCursor> AttrCursor::getAttr(std::string_view name, bool forceErrors) |
There was a problem hiding this comment.
Any reason why this is a std::shared_ptr and not a ref?
There was a problem hiding this comment.
can't remember, to be honest. it might just be wrong.
|
addressed @thufschmitt review comments :) |
we don't *need* symbols here. the only advantage they have over strings is making call-counting slightly faster, but that's a diagnostic feature and thus needn't be optimized. this also fixes a move bug that previously didn't show up: PrimOp structs were accessed after being moved from, which technically invalidates them. previously the names remained valid because Symbol copies on move, but strings are invalidated. we now copy the entire primop struct instead of moving since primop registration happen once and are not performance-sensitive.
the only use of this function is to determine whether a lambda has a non-set formal, but this use is arguably better served by Symbol::set and using a non-Symbol instead of an empty symbol in the parser when no such formal is present.
a future commit will remove the ability to convert the symbol type used in bindings to strings. since we only have two users we can inline the error check.
only file and line of the returned position were ever used, it wasn't actually used a position. as such we may as well use a path+int pair for only those two values and remove a use of Pos that would not work well with a position table.
when we introduce position and symbol tables we'll need to do lookups to turn indices into those tables into actual positions/symbols. having the error functions as members of EvalState will avoid a lot of churn for adding lookups into the tables for each caller.
Pos objects are somewhat wasteful as they duplicate the origin file name and input type for each object. on files that produce more than one Pos when parsed this a sizeable waste of memory (one pointer per Pos). the same goes for ptr<Pos> on 64 bit machines: parsing enough source to require 8 bytes to locate a position would need at least 8GB of input and 64GB of expression memory. it's not likely that we'll hit that any time soon, so we can use a uint32_t index to locate positions instead.
PosTable deduplicates origin information, so using symbols for paths is no longer necessary. moving away from path Symbols also reduces the usage of symbols for things that are not keys in attribute sets, which will become important in the future when we turn symbols into indices as well.
this slightly increases the amount of memory used for any given symbol, but this increase is more than made up for if the symbol is referenced more than once in the EvalState that holds it. on average every symbol should be referenced at least twice (once to introduce a binding, once to use it), so we expect no increase in memory on average. symbol tables are limited to 2³² entries like position tables, and similar arguments apply to why overflow is not likely: 2³² symbols would require as many string instances (at 24 bytes each) and map entries (at 24 bytes or more each, assuming that the map holds on average at most one item per bucket as the docs say). a full symbol table would require at least 192GB of memory just for symbols, which is well out of reach. (an ofborg eval of nixpks today creates less than a million symbols!)
with position and symbol tables in place we can now shrink Attr by a full
pointer with some simple field reordering. since Attr is a very hot struct this
has substantial impact on memory use, decreasing GC allocations and heap size by
10-15% each. we also get a ~15% performance improvement due to reduced GC
loading.
pure parsing has taken a hit over the branch base because positions are now
slightly more expensive to create, but overall we get a noticeable improvement.
before (on memory-friendliness):
Benchmark 1: nix search --no-eval-cache --offline ../nixpkgs hello
Time (mean ± σ): 6.960 s ± 0.028 s [User: 5.832 s, System: 0.897 s]
Range (min … max): 6.886 s … 7.005 s 20 runs
Benchmark 2: nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 328.1 ms ± 1.7 ms [User: 295.8 ms, System: 32.2 ms]
Range (min … max): 324.9 ms … 331.2 ms 20 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.688 s ± 0.029 s [User: 2.365 s, System: 0.238 s]
Range (min … max): 2.642 s … 2.742 s 20 runs
after:
Benchmark 1: nix search --no-eval-cache --offline ../nixpkgs hello
Time (mean ± σ): 6.902 s ± 0.039 s [User: 5.844 s, System: 0.783 s]
Range (min … max): 6.820 s … 6.956 s 20 runs
Benchmark 2: nix eval -f ../nixpkgs/pkgs/development/haskell-modules/hackage-packages.nix
Time (mean ± σ): 330.7 ms ± 2.2 ms [User: 300.6 ms, System: 30.0 ms]
Range (min … max): 327.5 ms … 334.5 ms 20 runs
Benchmark 3: nix eval --raw --impure --expr 'with import <nixpkgs/nixos> {}; system'
Time (mean ± σ): 2.330 s ± 0.027 s [User: 2.040 s, System: 0.234 s]
Range (min … max): 2.272 s … 2.383 s 20 runs
it's no longer needed now that positions aren't really pointers any more.
|
and rebased to master once more, for good measure |
thufschmitt
left a comment
There was a problem hiding this comment.
addressed @thufschmitt review comments :)
Thanks :) I’ve pushed a couple of commits to move ChunkedVector to its own header and add some small tests. I’ll merge as soon as the CI is green
|
not having high hopes for CI to be honest, the installer test has been hating this PR for weeks xD |
|
looks like the auto-merge was a bit over-eager again, the macos build failed: |
|
Duh, that thing is indeed totally broken 😠 |
|
Hi, I was wondering if (modulo the forEach helper, but |
|
@dtzWill was considered but (maybe mistakenly) deemed unsuitable because there are no allocation size guarantees for its chunks. part of the reason for chunking is to keep a constant maximum memory overhead for unused capacity, if we just didn't find the way to get that done with a deque then a deque would work just as well |
| auto name = cursor->getAttr(state.sName)->getString(); | ||
| auto outPath = cursor->getAttr("outPath")->getString(); | ||
| auto outputName = cursor->getAttr("outputName")->getString(); | ||
| auto name = cursor->getAttr("name")->getString(); |
There was a problem hiding this comment.
I don't understand this and similar changes. Not that this is on the hot path, but isn't it worse for performance to replace a lookup of a symbol by a lookup of a string?
There was a problem hiding this comment.
we changed these for readability. ioo it was preferrable to do it like this instead of as
auto name = cursor->getAttr(state.symbols[state.sName])->getString();
but ultimately both of them are totally valid.
|
Question: we now have two ways to uniquely refer to symbols, namely |
| std::vector<Symbol> res; | ||
| for (auto & a : parseAttrPath(s)) | ||
| res.push_back(state.symbols.create(a)); | ||
| res.emplace_back(a); |
There was a problem hiding this comment.
I'm not sure I understand this. How does this guarantee that the symbol is unique (i.e. pointer-comparable)?
There was a problem hiding this comment.
it does not, as of this PR Symbol does not confer a uniqueness guarantee, only a "this is a nix identifier" semantics. we can remove Symbol and rename SymbolIdx -> Symbol to restore that guarantee to the name if that's preferrable
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: |
|
@pennae what do you think of a scheme like this, that replaces the position table by a clever indexing scheme? It would enable some improvements to error messages; see the github reverse links to the issues that come after it. |
this changes position handling to use a table similar to SymbolTable for storing position information, and changes SymbolTable itself to require less memory per Symbol at the expense of lookup convenience. with both of these tables using 32 bit indices we can pack Attr more tightly, reducing GC heap size and allocations by 10-15%. reduced GC load also gives us an eval performance improvement on the same order for free.
depends on #5865