Fix segfault on infinite recursion in some cases by 9999years · Pull Request #9617 · NixOS/nix

9999years · 2023-12-15T21:29:55Z

Context

Closes #9616.

This fixes a segfault on infinite function call recursion (rather than infinite thunk recursion) by tracking the function call depth in EvalState.

Additionally, to avoid printing extremely long stack traces, stack frames are now deduplicated, with a (19997 duplicate traces omitted) message. This should only really be triggered in infinite recursion scenarios.

Before:

$ nix-instantiate --eval --expr '(x: x x) (x: x x)'
Segmentation fault: 11

After:

$ nix-instantiate --eval --expr '(x: x x) (x: x x)'
error: stack overflow

       at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^

$ nix-instantiate --eval --expr '(x: x x) (x: x x)' --show-trace
error:
       … from call site

         at «string»:1:1:

            1| (x: x x) (x: x x)
             | ^

       … while calling anonymous lambda

         at «string»:1:2:

            1| (x: x x) (x: x x)
             |  ^

       … from call site

         at «string»:1:5:

            1| (x: x x) (x: x x)
             |     ^

       … while calling anonymous lambda

         at «string»:1:11:

            1| (x: x x) (x: x x)
             |           ^

       … from call site

         at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^

       (19997 duplicate traces omitted)

       error: stack overflow

       at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^
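
A minimal sketch of the depth-tracking idea described above, not the PR's actual code. `Evaluator` and `maxCallDepth` are hypothetical stand-ins, and the `CallLevel` definition here is an assumption about how such a guard could look; only the 10000 limit and the "stack overflow" message come from the diff in this PR.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical name for the limit; the value 10000 is the one used in the diff.
constexpr std::size_t maxCallDepth = 10000;

// RAII guard: bumps the counter on entry and restores it on exit,
// including when an exception unwinds the stack.
class CallLevel {
    std::size_t & count;
public:
    explicit CallLevel(std::size_t & count) : count(count) { ++count; }
    ~CallLevel() { --count; }
    CallLevel(const CallLevel &) = delete;
    CallLevel & operator=(const CallLevel &) = delete;
};

// Hypothetical stand-in for EvalState.
struct Evaluator {
    std::size_t callDepth = 0;

    // Recurses n levels deep; throws instead of exhausting the native stack.
    std::size_t call(std::size_t n) {
        if (callDepth > maxCallDepth)
            throw std::runtime_error("stack overflow");
        CallLevel level(callDepth);
        return n == 0 ? 0 : 1 + call(n - 1);
    }
};
```

Because the guard decrements in its destructor, the counter returns to zero after the error propagates, so the evaluator object stays usable after a caught overflow.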

Future work

It would be nice if the maximum function call depth could be customized like --option function-call-depth-limit 500.
~~In some cases, this may print a message like (1 duplicate traces omitted), which is kind of unhelpful. It would be nice to only print that message if more than a couple traces are omitted.~~

Priorities

Add 👍 to pull requests you find important.

9999years · 2023-12-15T21:30:12Z

src/libexpr/eval.cc


 void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
 {
+    if (depth > 10000)


Chosen arbitrarily.

in a strong tradition 😆

nix/src/libexpr/parser.y

Line 202 in 1e3d811

size_t minIndent = 1000000;

I don't think 10k stack levels is ever reasonable to do, so it seems fair. Magic number should perhaps be extracted to a constexpr though!

What happens if we stack overflow before that?

Status quo: segmentation fault on macOS because our segv handler is Linux x86 only, or so I've heard, and a less than helpful error message if our segv handler does actually work.

Fixing the handler is beyond the scope of this specific work.

Apparently I'm doing something unreasonable(?), because I'm hitting 12k call depth in two expressions (both doing import-from-derivation). Luckily I can use max-call-depth = 20000 to fix it, but I have to distribute this setting to not leak this detail to users.

What does 10k mean when translated into memory usage? Or, at what call depth would the Nix evaluator have segfaulted before this change? (I guess it varies by platform?) Although 10k sounds like a lot, I have no idea where the segfault would typically hit. 100k? 1M?

src/libexpr/eval.cc

lf- · 2023-12-15T21:48:24Z

src/libexpr/eval.cc


 void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
 {
+    if (depth > 10000)


in a strong tradition 😆

nix/src/libexpr/parser.y

Line 202 in 1e3d811

size_t minIndent = 1000000;

I don't think 10k stack levels is ever reasonable to do, so it seems fair. Magic number should perhaps be extracted to a constexpr though!

src/libexpr/eval.cc

src/libutil/error.cc

thufschmitt

Thanks,

I'm not the right person to judge on the overall mechanism (it feels a bit brittle to me because of the arbitrary magic value, but it's also incredibly nice to have a stack trace – esp on macOS). I'll let others (@edolstra or @roberth in particular) have a more informed opinion here.

Other than that, I left a couple of comments. The implementation is great; my main concern is making sure it doesn't affect overall performance.

thufschmitt · 2023-12-18T09:44:36Z

src/libutil/error.cc

+inline bool operator<(const AbstractPos& lhs, const AbstractPos& rhs)
+{
+    if (lhs.line != rhs.line)
+        return lhs.line < rhs.line;
+    if (lhs.column != rhs.column)
+        return lhs.column < rhs.column;
+    return false;
+}
+inline bool operator> (const AbstractPos& lhs, const AbstractPos& rhs) { return rhs < lhs; }
+inline bool operator<=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs > rhs); }
+inline bool operator>=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs < rhs); }
+
+inline bool operator<(const Trace& lhs, const Trace& rhs)
+{
+    if (lhs.pos != rhs.pos) {
+        if (!lhs.pos)
+            return true;
+        if (!rhs.pos)
+            return false;
+        return *lhs.pos < *rhs.pos;
+    }
+
+    if (lhs.hint.str() != rhs.hint.str())
+        return lhs.hint.str() < rhs.hint.str();
+
+    if (lhs.frame != rhs.frame)
+        return lhs.frame < rhs.frame;
+
+    return false;
+}
+inline bool operator> (const Trace& lhs, const Trace& rhs) { return rhs < lhs; }
+inline bool operator<=(const Trace& lhs, const Trace& rhs) { return !(lhs > rhs); }
+inline bool operator>=(const Trace& lhs, const Trace& rhs) { return !(lhs < rhs); }


Unless you care about the exact semantics of the ordering, I think this could be replaced by a bit of nasty cpp:

Suggested change (replacing the comparison operators quoted above)

GENERATE_CMP_EXT(inline, AbstractPos, me->line, me->column);

GENERATE_CMP_EXT(inline, Trace, me->pos, me->hint.str(), me->frame);

(where GENERATE_CMP_EXT comes from libutil/comparator.hh)

std::tie seems like a great middle ground of having specified semantics, but less code.
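
For illustration, a `std::tie`-based ordering might look like the sketch below. `LineCol` is a hypothetical stand-in for `AbstractPos`, reduced to the two fields the diff compares; this is not code from the PR.

```cpp
#include <tuple>

// Hypothetical stand-in for AbstractPos with only the compared fields.
struct LineCol {
    unsigned line;
    unsigned column;
};

// Lexicographic order: compare line first, then column.
// std::tie builds tuples of references, so nothing is copied.
inline bool operator<(const LineCol & lhs, const LineCol & rhs)
{
    return std::tie(lhs.line, lhs.column) < std::tie(rhs.line, rhs.column);
}

inline bool operator>(const LineCol & lhs, const LineCol & rhs) { return rhs < lhs; }
inline bool operator<=(const LineCol & lhs, const LineCol & rhs) { return !(rhs < lhs); }
inline bool operator>=(const LineCol & lhs, const LineCol & rhs) { return !(lhs < rhs); }
```

The field order inside `std::tie` is the comparison priority, which keeps the semantics explicit without the hand-written `if` chain.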

In the C++20 future, one might want to consider using the three-way comparison operator <=>. Its implementation allows the compiler to deduce the rest.
This will have multiple advantages:

we can get rid of preprocessor macros

strange semantics that are currently implemented individually in the different operator implementations might surface in the light of a unified implementation of the <=> operator

Interesting. Do you know what's blocking this? We do use c++2a. Should we track it somewhere?

I don't know if anything blocks this at all. Maybe it's just that no one thought about using it, yet.

one of the fields here is shared_ptr which by default has very undesirable comparison semantics (pointer address).

Well, if the only reason we need that comparison operator is putting stuff in a Set, pointer comparison is quite reasonable (and as fast as it can get)

Pointer comparisons are not reasonable, because each Trace is newly constructed. There's no interning. So it's quite common to have position values that are semantically equal (same line and column number) with different pointer values. I've tried this change and it breaks the tests.

That said, I can simplify these comparisons somewhat and I can replace the AbstractPos one with the default.
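
To make the pointer-versus-value distinction concrete, here is a small sketch. `SrcPos` and `samePos` are hypothetical names for illustration; the real `Trace` type holds more than this.

```cpp
#include <memory>

// Hypothetical stand-in for a position value.
struct SrcPos {
    unsigned line;
    unsigned column;
};

inline bool operator==(const SrcPos & a, const SrcPos & b)
{
    return a.line == b.line && a.column == b.column;
}

// Compare what the pointers refer to, not their addresses.
// Two null pointers count as equal; null never equals non-null.
inline bool samePos(const std::shared_ptr<SrcPos> & a, const std::shared_ptr<SrcPos> & b)
{
    if (!a || !b)
        return !a && !b;
    return *a == *b;
}
```

With two separately allocated positions at the same line and column, the built-in `shared_ptr` `==` compares addresses and reports them as different, while `samePos` reports them as equal. That is why a defaulted comparison on a struct with a `shared_ptr` member would not deduplicate freshly constructed traces.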

I wonder how costly it would be to intern positions. We quite possibly should do that anyhow, but at a future date.

Looks like LLVM only got <=> for tuples implemented last year: llvm/llvm-project#50396

EDIT: And std::string. Looks like we'll need clang 16 for this.

I wonder how costly it would be to intern positions. We quite possibly should do that anyhow, but at a future date.

We do have PosTable.

This PR's use case is not in the hot path, so we don't need to explore this now I think.

thufschmitt · 2023-12-18T09:57:05Z

src/libexpr/eval.cc


 void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
 {
+    if (depth > 10000)


What happens if we stack overflow before that?

thufschmitt · 2023-12-18T10:02:08Z

src/libexpr/eval.cc

+    if (depth > 10000)
+        error("stack overflow").atPos(pos).template debugThrow<EvalError>();
+    CallLevel _level(depth);


I think we'll want to benchmark this to make sure that it doesn't bear a noticeable evaluation performance impact.
My usual benchmark is a simple nix search nixpkgs blah --option eval-cache false, but maybe @tfc will have something more meaningful to suggest since he's been working heavily on the evaluator performance.

Nix 2.18.1:

Executed in   13.19 secs    fish           external
   usr time   11.33 secs    0.12 millis   11.33 secs
   sys time    1.76 secs    1.38 millis    1.76 secs

This PR:

Executed in   13.57 secs    fish           external
   usr time   12.36 secs    0.15 millis   12.35 secs
   sys time    1.10 secs    2.19 millis    1.10 secs

3% slowdown. Seems OK for preventing a segfault.

3% slowdown. Seems OK for preventing a segfault.

3% isn't really trivial. But I couldn't reliably reproduce it:

Summary 'nixMaster' ran 1.00 ± 0.01 times faster than 'nixFromPR'

So I think it's fine

Oh, I was probably running an unoptimized build. Thanks for checking it on your end!

thufschmitt · 2023-12-18T10:04:23Z

src/libutil/error.cc

+                    }
+                } else {
+                    oss << "\n" << ANSI_WARNING "(" << skippedTraces.size() << " duplicate traces omitted)" ANSI_NORMAL << "\n";
+                    tracesSeen.clear();


(nit): Why do we care about that (and likely clearing skippedTraces below)?

I've added a comment to explain this, but I'll copy it here:

Consider a mutually recursive stack trace with:

10 entries of A

10 entries of B

10 entries of A

If we don't clear tracesSeen here, we would print output like this:

1 entry of A

(9 duplicate traces omitted)

1 entry of B

(19 duplicate traces omitted)

This obscures the control flow, which went from A, to B, and back to A again.

In contrast, if we do clear tracesSeen, the output looks like this:

1 entry of A

(9 duplicate traces omitted)

1 entry of B

(9 duplicate traces omitted)

1 entry of A

(9 duplicate traces omitted)

See: tests/functional/lang/eval-fail-mutual-recursion.nix for a test case exercising this property.
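
The effect of clearing the seen-set can be reproduced with a simplified sketch. This is not the code in src/libutil/error.cc: `dedupTraces` is a hypothetical function, traces are modeled as plain strings, and every repeated trace is omitted regardless of how few there are.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collapse repeated traces into "(N duplicate traces omitted)" markers.
// With clearSeen = true the seen-set is reset after each marker, so a
// trace that reappears after a different run is printed again.
std::vector<std::string> dedupTraces(const std::vector<std::string> & traces, bool clearSeen)
{
    std::vector<std::string> out;
    std::set<std::string> seen;
    std::size_t skipped = 0;

    // Emit the pending marker, if any, and optionally forget what was seen.
    auto flush = [&] {
        if (skipped == 0)
            return;
        out.push_back("(" + std::to_string(skipped) + " duplicate traces omitted)");
        skipped = 0;
        if (clearSeen)
            seen.clear();
    };

    for (const auto & trace : traces) {
        if (seen.count(trace)) {
            ++skipped;
            continue;
        }
        flush();
        seen.insert(trace);
        out.push_back(trace);
    }
    flush();
    return out;
}
```

Fed ten A entries, ten B entries, and ten more A entries, the sketch yields the six-line A/B/A output when `clearSeen` is true and the four-line output ending in a 19-trace marker when it is false, matching the two listings above.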

roberth · 2023-12-18T10:29:23Z

For a bigger picture please also read #9627

src/libexpr/eval.hh

tfc · 2023-12-18T10:34:41Z

src/libutil/error.cc

+        return lhs.line < rhs.line;
+    if (lhs.column != rhs.column)
+        return lhs.column < rhs.column;
+    return false;


What other cases can occur here other than line and column comparison? This makes the code and semantics harder to understand if the code leaves open what else can lead to the false return.

tfc · 2023-12-18T10:37:36Z

src/libutil/error.cc

+inline bool operator<(const AbstractPos& lhs, const AbstractPos& rhs)
+{
+    if (lhs.line != rhs.line)
+        return lhs.line < rhs.line;
+    if (lhs.column != rhs.column)
+        return lhs.column < rhs.column;
+    return false;
+}
+inline bool operator> (const AbstractPos& lhs, const AbstractPos& rhs) { return rhs < lhs; }
+inline bool operator<=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs > rhs); }
+inline bool operator>=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs < rhs); }
+
+inline bool operator<(const Trace& lhs, const Trace& rhs)
+{
+    if (lhs.pos != rhs.pos) {
+        if (!lhs.pos)
+            return true;
+        if (!rhs.pos)
+            return false;
+        return *lhs.pos < *rhs.pos;
+    }
+
+    if (lhs.hint.str() != rhs.hint.str())
+        return lhs.hint.str() < rhs.hint.str();
+
+    if (lhs.frame != rhs.frame)
+        return lhs.frame < rhs.frame;
+
+    return false;
+}
+inline bool operator> (const Trace& lhs, const Trace& rhs) { return rhs < lhs; }
+inline bool operator<=(const Trace& lhs, const Trace& rhs) { return !(lhs > rhs); }
+inline bool operator>=(const Trace& lhs, const Trace& rhs) { return !(lhs < rhs); }


In the C++20 future, one might want to consider using the three-way comparison operator <=>. Its implementation allows the compiler to deduce the rest.
This will have multiple advantages:

we can get rid of preprocessor macros

strange semantics that are currently implemented individually in the different operator implementations might surface in the light of a unified implementation of the <=> operator

roberth

A stack limit is a good thing to have, regardless of the other steps we can take (ie #9627 and related issues).
We do need a setting for the limit to make sure we don't regress. See comment.

Scoped out: a page about stack behavior in the manual. This is very needed, but it would increase the scope too much. It would be appreciated very much (and improve and shorten the release note fwiw).

doc/manual/rl-next/stack-overflow-segfaults.md

src/libutil/error.cc

tests/functional/lang/eval-fail-infinite-recursion-lambda.err.exp

tests/functional/lang/eval-fail-mutual-recursion.nix

roberth · 2023-12-19T15:33:00Z

src/libutil/error.cc

+
+        printSkippedTracesMaybe();
        oss << "\n" << prefix;
    }


thought: If the algorithm could be factored out, it'd be easier to understand and improve.

thought: One possible improvement is to make it understand that clusters of repetition can themselves repeat. This happens when you're doing some iterative thing for each node in a tree for example. Probably there's an easier improvement that I'm not thinking of.

Factored it out, although it should maybe be a class to avoid passing around so much context.

One possible improvement is to make it understand that clusters of repetition can themselves repeat.

Yeah, I considered this but also wasn't sure of an elegant implementation. I'll keep turning it over.

roberth

If you could rename EvalState::depth to EvalState::callDepth and review my suggestions for comments, I think this is good to go.

src/libexpr/eval.hh

src/libutil/error.cc

roberth

Final breadcrumb

EDIT: looks like something was lost. This comment was only supposed to summarize a review with this comment.

src/libexpr/eval.cc


Addressed

roberth · 2023-12-30T09:40:42Z

Thank you @9999years!

Fix segfault on infinite recursion in some cases (cherry picked from commit bf1b294) Change-Id: Id137541426ec8536567835953fccf986a3aebf16

9999years requested a review from edolstra as a code owner December 15, 2023 21:29

github-actions bot added the with-tests label Dec 15, 2023

9999years commented Dec 15, 2023

lf- reviewed Dec 15, 2023

9999years force-pushed the stack-overflow-segfault branch 3 times, most recently from fe7c5de to 9bb7992 December 16, 2023 00:54

thufschmitt previously requested changes Dec 18, 2023

roberth mentioned this pull request Dec 18, 2023: Deterministic or guaranteed recursion depth (stack size) #9627 (Open)

tfc reviewed Dec 18, 2023

tfc reviewed Dec 18, 2023

roberth added the error-messages, language, and bug labels Dec 18, 2023

9999years force-pushed the stack-overflow-segfault branch 2 times, most recently from a359356 to 4c114a2 December 18, 2023 18:28

9999years requested a review from roberth December 18, 2023 21:38

9999years force-pushed the stack-overflow-segfault branch from 4c114a2 to 852bd1d December 18, 2023 21:49

roberth suggested changes Dec 19, 2023

9999years force-pushed the stack-overflow-segfault branch 2 times, most recently from 12fcf78 to be44f2e December 19, 2023 20:57

9999years requested a review from roberth December 20, 2023 17:09

roberth reviewed Dec 24, 2023

9999years force-pushed the stack-overflow-segfault branch from 587417c to 8d39b17 December 24, 2023 23:01

roberth reviewed Dec 25, 2023

roberth reviewed Dec 29, 2023

9999years force-pushed the stack-overflow-segfault branch from 6cd3513 to 7434cac December 30, 2023 06:16

9999years requested a review from roberth December 30, 2023 06:17

roberth approved these changes Dec 30, 2023

roberth merged commit bf1b294 into NixOS:master Dec 30, 2023

roberth mentioned this pull request Mar 13, 2024: coerceToString can overflow the stack, the bad way #10240 (Open)
