Remove the use of panicwrap in the Terraform CLI by jbardin · Pull Request #29825 · hashicorp/terraform

jbardin · 2021-10-28T15:59:23Z

While panicwrap was useful in preventing opaque stack traces from being dumped on users, it disconnected the running process from the terminal and original io streams, leaving us with many subtle behavioral changes from what would be expected when executing the process directly. This has led to numerous bugs, leaving us with clumsy workarounds, and edge cases that still cannot be resolved for all supported systems.

Here we remove panicwrap, and execute the terraform cli directly in the main process. This also allows the removal of wrappedstreams, wrappedreadline and all the special terminal handling associated with running terraform from a child process of panicwrap.

The existing logging.PanicHandler has been repurposed to become a generic panic recovery mechanism, loosely emulating the behavior of panicwrap. It is inserted in all the most common goroutine dispatch points, where panics are most likely to happen. The stack trace will be prefixed by a simple message indicate that a crash has occurred, and to pleas report the issue. Due to the way we must regenerate the stack there will be a couple extra frames at the top, but since it's only for our consumption, that should not be an issue.

!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!

Terraform crashed! This is always indicative of a bug within Terraform.
Please report the crash with Terraform[1] so that we can fix this.

When reporting bugs, please include your terraform version, the stack trace
shown below, and any additional information which may help replicate the issue.

[1]: https://github.com/hashicorp/terraform/issues

!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!
OOPSgoroutine 90 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x88
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x20
github.com/hashicorp/terraform/internal/logging.PanicHandler()
	/Users/jbardin/terraform/internal/logging/panic.go:44 +0xe0
panic({0x1065d5140, 0x106a0c138})
	/usr/local/go/src/runtime/panic.go:1038 +0x21c
github.com/hashicorp/terraform/internal/terraform.(*NodeValidatableResource).Execute(0x14000786208, {0x106ae83a0, 0x14000328700}, 0x4)
	/Users/jbardin/terraform/internal/terraform/node_resource_validate.go:43 +0xa8
github.com/hashicorp/terraform/internal/terraform.(*ContextGraphWalker).Execute(0x140003280e0, {0x106ae83a0, 0x14000328700}, {0x12f5e4758, 0x14000786208})
	/Users/jbardin/terraform/internal/terraform/graph_walk_context.go:133 +0x9c
github.com/hashicorp/terraform/internal/terraform.(*Graph).walk.func1({0x10698f860, 0x14000786208})
	/Users/jbardin/terraform/internal/terraform/graph.go:74 +0x244
github.com/hashicorp/terraform/internal/dag.(*Walker).walkVertex(0x1400051b320, {0x10698f860, 0x14000786208}, 0x14000323400)
	/Users/jbardin/terraform/internal/dag/walk.go:381 +0x31c
created by github.com/hashicorp/terraform/internal/dag.(*Walker).Update
	/Users/jbardin/terraform/internal/dag/walk.go:304 +0xe0c

A possible enhancement here to use the internal tfdiags.Diagnostics type to format an error more consistent with the rest of the CLI output. This however would require getting the correct view for formatting into the locations where we dispatch goroutines, which at the lowest level of the concurrent graph walk is very deep in the call stack. The alternative of collecting the panic and passing it back as an error is also possible, but the fact that an uncaught panic is intended to immediately halt execution because some portion of the program is in an unexpected state means that we risk compounding the problem by not exiting immediately. Perhaps if we can assure ourselves that these layers are sufficiently isolated where handling the error poses little risk, we could turn these into errors, though that has other consequences for code complexity by getting errors generated during deferred called in goroutine back into the normal data flow.

Provider panics are handled in a similar manner, but are unaffected by this change since that mechanism was separated into its own handler during the logging subsystem refactoring.

We also opt to no longer collect all trace logs in a temporary file, and copy them to the working directory when a crash in encountered. With the current stability of terraform we see very few unexplainable crashes that rely on the accompanying logs to troubleshoot, and the tradeoff of not silently collecting large amounts of logs containing possibly sensitive information on disk seems worth the trade off.

Stop using panicwrap, and execute terraform in the main process.

Repurpose logging.PanicHandler to recover internal panics and print a message to users similar to panicwrap.

We no longer have to hide these from panicwrap

apparentlymart · 2021-10-28T16:54:51Z

internal/logging/panic.go

+	// When called from a deferred function, debug.PrintStack will include the
+	// full stack from the point of the pending panic.
+	debug.PrintStack()
+	os.Exit(2)


My mental model of how the Go runtime handles panics is a bit incomplete, but my understanding was that an un-recovered panic normally unwinds right to the top of the goroutine's call stack and only exits if it gets to the top without being recovered.

If I'm correct about that, is it true to say that the only correct way to use this PanicHandler is as a defer PanicHandler() as the very first statement in a new goroutine, so that we can avoid this os.Exit inhibiting any other pending defers? If so, I think it would help to be explicit about that in the docstring above.

I'm also curious as to whether exiting here sufficiently mimics what the Go runtime would do with an unrecovered panic if there are still other non-panicing goroutines running at the time of panic. Does the runtime normally just interrupt those wherever they are and exit immediately, or does it first force them to unwind (via something like runtime.Goexit)? Perhaps this is an immaterial question anyway given how we're using this, but my general worry here is that subtle changes to how we react to panics might upset assumptions we've made elsewhere; convincing ourselves that there are no such assumptions would also be fine. 😀

Finally, it occurs to me that if we're now taking control of the panic exit in most codepaths this could be an opportunity to address our long-standing wart that we happened to select the same exit code for terraform plan -detailed-exitcode to signal that there are no changes to make, and so currently a panic there reads to automation as a success with no changes. Although there might still be places where normal panic handling escapes and thus we still exit with code 2 anyway, maybe we'd like to select a different code for the common case where we handle the panic through here, so that automation can still treat it as an error then.

Yes, a panic unwinds to the very top of the call stack within the goroutine, so the handler does need to be the first defer in the stack, so I can add a note about that specifically.

The runtime exits with exit(2) once the call stack has been unwound to the top of the goroutine (which is why I used 2 here too), so any other running goroutines are aborted immediately as the process exits, and should not different significantly from out behavior here. Perhaps a different exit code would work here for the sake of -detailed-exitcode, though differing from the standard runtime code may make this less consistent than we hope.

Given that our goal here is for most panics to be caught by our handler, and for missed ones to be exceptional, I'd personally vote for using a different exit code so that most of the time our behavior isn't ambiguous if we panic during planning, and the ambiguity can therefore be exceptional. The fact that Terraform is written in Go is an implementation detail rather than a contract, so I don't think we get any strong value from being consistent with the Go runtime behavior here.

With that said, it's an opinion loosely held so I'm happy to go ahead without that if you prefer. We can always change the exit status we use here later anyway, if we change our minds, since we don't make any contractural promises about how we report different kinds of errors via exit codes.

I'm not opposed to changing the exit code here, my next idea would be to use the slightly system dependent -1, which is what panicwrap was substituting.

apparentlymart

This looks good to me!

I have some small worries about landing something subtle and cross-cutting like this right at the end of our v1.1 development window, but I suppose the beta/RC period will hopefully give some opportunities to learn whether we've accidentally been depending on the panicwrap forking and weird pipe stuff for anything. (I can't think of anything myself, but we've been routinely running in a child process with a piped stderr for so long now that there is some risk that panicwrap has accidentally become a crucial structural component of some other subsystem rather than just an implementation detail.)

jbardin · 2021-10-28T19:38:51Z

Since -1 isn't technically a valid exit code, I updated the handler to exit with 11, which avoids the detailed exitcode values, and happens to also be SIGSEGV, which is roughly analogous to what causes most panics.

jbardin · 2021-10-29T18:41:01Z

Merging as-is for now. If we find the exit code isn't satisfactory that can be updated later, or er could choose to simply re-raise the panic and allow the runtime to exit normally and continue to deal with the conflicting exit code of 2.

github-actions · 2021-12-02T02:06:59Z

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

jbardin added 6 commits October 28, 2021 11:51

remove the use of panicwrap

622c4df

Stop using panicwrap, and execute terraform in the main process.

remove wrapped streams and readline

42742c1

logging.PanicHandler can now be for internal use

3f31533

Repurpose logging.PanicHandler to recover internal panics and print a message to users similar to panicwrap.

insert panic handlers

d03a037

no need for TF_FORK=0

f031fca

write provider panics to error

d2d2508

We no longer have to hide these from panicwrap

jbardin requested a review from a team October 28, 2021 15:59

jbardin self-assigned this Oct 28, 2021

apparentlymart reviewed Oct 28, 2021

View reviewed changes

apparentlymart approved these changes Oct 28, 2021

View reviewed changes

jbardin added 2 commits October 28, 2021 13:56

add note about calling PanicHandler first

bd37f43

use 11 for the panic exit code

583e3a5

jbardin merged commit 81e709d into main Oct 29, 2021

jbardin deleted the jbardin/no-panicwrap branch October 29, 2021 18:41

jbardin mentioned this pull request Oct 29, 2021

Can't redirect or pipe terraform console input on Windows #29770

Closed

apparentlymart added this to the v1.1.0 milestone Nov 1, 2021

github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the use of panicwrap in the Terraform CLI#29825

Remove the use of panicwrap in the Terraform CLI#29825
jbardin merged 8 commits intomainfrom
jbardin/no-panicwrap

jbardin commented Oct 28, 2021

Uh oh!

apparentlymart Oct 28, 2021

Uh oh!

jbardin Oct 28, 2021

Uh oh!

apparentlymart Oct 28, 2021

Uh oh!

jbardin Oct 28, 2021

Uh oh!

apparentlymart left a comment

Uh oh!

jbardin commented Oct 28, 2021

Uh oh!

jbardin commented Oct 29, 2021

Uh oh!

github-actions bot commented Dec 2, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jbardin commented Oct 28, 2021

Uh oh!

apparentlymart Oct 28, 2021

Choose a reason for hiding this comment

Uh oh!

jbardin Oct 28, 2021

Choose a reason for hiding this comment

Uh oh!

apparentlymart Oct 28, 2021

Choose a reason for hiding this comment

Uh oh!

jbardin Oct 28, 2021

Choose a reason for hiding this comment

Uh oh!

apparentlymart left a comment

Choose a reason for hiding this comment

Uh oh!

jbardin commented Oct 28, 2021

Uh oh!

jbardin commented Oct 29, 2021

Uh oh!

github-actions bot commented Dec 2, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants