Skip to content

Remove the use of panicwrap in the Terraform CLI#29825

Merged
jbardin merged 8 commits intomainfrom
jbardin/no-panicwrap
Oct 29, 2021
Merged

Remove the use of panicwrap in the Terraform CLI#29825
jbardin merged 8 commits intomainfrom
jbardin/no-panicwrap

Conversation

@jbardin
Copy link
Copy Markdown
Member

@jbardin jbardin commented Oct 28, 2021

While panicwrap was useful in preventing opaque stack traces from being dumped on users, it disconnected the running process from the terminal and original io streams, leaving us with many subtle behavioral changes from what would be expected when executing the process directly. This has led to numerous bugs, leaving us with clumsy workarounds, and edge cases that still cannot be resolved for all supported systems.

Here we remove panicwrap, and execute the terraform cli directly in the main process. This also allows the removal of wrappedstreams, wrappedreadline and all the special terminal handling associated with running terraform from a child process of panicwrap.

The existing logging.PanicHandler has been repurposed to become a generic panic recovery mechanism, loosely emulating the behavior of panicwrap. It is inserted in all the most common goroutine dispatch points, where panics are most likely to happen. The stack trace will be prefixed by a simple message indicate that a crash has occurred, and to pleas report the issue. Due to the way we must regenerate the stack there will be a couple extra frames at the top, but since it's only for our consumption, that should not be an issue.

!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!

Terraform crashed! This is always indicative of a bug within Terraform.
Please report the crash with Terraform[1] so that we can fix this.

When reporting bugs, please include your terraform version, the stack trace
shown below, and any additional information which may help replicate the issue.

[1]: https://github.com/hashicorp/terraform/issues

!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!
OOPSgoroutine 90 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x88
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x20
github.com/hashicorp/terraform/internal/logging.PanicHandler()
	/Users/jbardin/terraform/internal/logging/panic.go:44 +0xe0
panic({0x1065d5140, 0x106a0c138})
	/usr/local/go/src/runtime/panic.go:1038 +0x21c
github.com/hashicorp/terraform/internal/terraform.(*NodeValidatableResource).Execute(0x14000786208, {0x106ae83a0, 0x14000328700}, 0x4)
	/Users/jbardin/terraform/internal/terraform/node_resource_validate.go:43 +0xa8
github.com/hashicorp/terraform/internal/terraform.(*ContextGraphWalker).Execute(0x140003280e0, {0x106ae83a0, 0x14000328700}, {0x12f5e4758, 0x14000786208})
	/Users/jbardin/terraform/internal/terraform/graph_walk_context.go:133 +0x9c
github.com/hashicorp/terraform/internal/terraform.(*Graph).walk.func1({0x10698f860, 0x14000786208})
	/Users/jbardin/terraform/internal/terraform/graph.go:74 +0x244
github.com/hashicorp/terraform/internal/dag.(*Walker).walkVertex(0x1400051b320, {0x10698f860, 0x14000786208}, 0x14000323400)
	/Users/jbardin/terraform/internal/dag/walk.go:381 +0x31c
created by github.com/hashicorp/terraform/internal/dag.(*Walker).Update
	/Users/jbardin/terraform/internal/dag/walk.go:304 +0xe0c

A possible enhancement here to use the internal tfdiags.Diagnostics type to format an error more consistent with the rest of the CLI output. This however would require getting the correct view for formatting into the locations where we dispatch goroutines, which at the lowest level of the concurrent graph walk is very deep in the call stack. The alternative of collecting the panic and passing it back as an error is also possible, but the fact that an uncaught panic is intended to immediately halt execution because some portion of the program is in an unexpected state means that we risk compounding the problem by not exiting immediately. Perhaps if we can assure ourselves that these layers are sufficiently isolated where handling the error poses little risk, we could turn these into errors, though that has other consequences for code complexity by getting errors generated during deferred called in goroutine back into the normal data flow.

Provider panics are handled in a similar manner, but are unaffected by this change since that mechanism was separated into its own handler during the logging subsystem refactoring.

We also opt to no longer collect all trace logs in a temporary file, and copy them to the working directory when a crash in encountered. With the current stability of terraform we see very few unexplainable crashes that rely on the accompanying logs to troubleshoot, and the tradeoff of not silently collecting large amounts of logs containing possibly sensitive information on disk seems worth the trade off.

Stop using panicwrap, and execute terraform in the main process.
Repurpose logging.PanicHandler to recover internal panics and print a
message to users similar to panicwrap.
We no longer have to hide these from panicwrap
@jbardin jbardin requested a review from a team October 28, 2021 15:59
@jbardin jbardin self-assigned this Oct 28, 2021
// When called from a deferred function, debug.PrintStack will include the
// full stack from the point of the pending panic.
debug.PrintStack()
os.Exit(2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mental model of how the Go runtime handles panics is a bit incomplete, but my understanding was that an un-recovered panic normally unwinds right to the top of the goroutine's call stack and only exits if it gets to the top without being recovered.

If I'm correct about that, is it true to say that the only correct way to use this PanicHandler is as a defer PanicHandler() as the very first statement in a new goroutine, so that we can avoid this os.Exit inhibiting any other pending defers? If so, I think it would help to be explicit about that in the docstring above.

I'm also curious as to whether exiting here sufficiently mimics what the Go runtime would do with an unrecovered panic if there are still other non-panicing goroutines running at the time of panic. Does the runtime normally just interrupt those wherever they are and exit immediately, or does it first force them to unwind (via something like runtime.Goexit)? Perhaps this is an immaterial question anyway given how we're using this, but my general worry here is that subtle changes to how we react to panics might upset assumptions we've made elsewhere; convincing ourselves that there are no such assumptions would also be fine. 😀

Finally, it occurs to me that if we're now taking control of the panic exit in most codepaths this could be an opportunity to address our long-standing wart that we happened to select the same exit code for terraform plan -detailed-exitcode to signal that there are no changes to make, and so currently a panic there reads to automation as a success with no changes. Although there might still be places where normal panic handling escapes and thus we still exit with code 2 anyway, maybe we'd like to select a different code for the common case where we handle the panic through here, so that automation can still treat it as an error then.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a panic unwinds to the very top of the call stack within the goroutine, so the handler does need to be the first defer in the stack, so I can add a note about that specifically.

The runtime exits with exit(2) once the call stack has been unwound to the top of the goroutine (which is why I used 2 here too), so any other running goroutines are aborted immediately as the process exits, and should not different significantly from out behavior here. Perhaps a different exit code would work here for the sake of -detailed-exitcode, though differing from the standard runtime code may make this less consistent than we hope.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that our goal here is for most panics to be caught by our handler, and for missed ones to be exceptional, I'd personally vote for using a different exit code so that most of the time our behavior isn't ambiguous if we panic during planning, and the ambiguity can therefore be exceptional. The fact that Terraform is written in Go is an implementation detail rather than a contract, so I don't think we get any strong value from being consistent with the Go runtime behavior here.

With that said, it's an opinion loosely held so I'm happy to go ahead without that if you prefer. We can always change the exit status we use here later anyway, if we change our minds, since we don't make any contractural promises about how we report different kinds of errors via exit codes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to changing the exit code here, my next idea would be to use the slightly system dependent -1, which is what panicwrap was substituting.

Copy link
Copy Markdown
Contributor

@apparentlymart apparentlymart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

I have some small worries about landing something subtle and cross-cutting like this right at the end of our v1.1 development window, but I suppose the beta/RC period will hopefully give some opportunities to learn whether we've accidentally been depending on the panicwrap forking and weird pipe stuff for anything. (I can't think of anything myself, but we've been routinely running in a child process with a piped stderr for so long now that there is some risk that panicwrap has accidentally become a crucial structural component of some other subsystem rather than just an implementation detail.)

@jbardin
Copy link
Copy Markdown
Member Author

jbardin commented Oct 28, 2021

Since -1 isn't technically a valid exit code, I updated the handler to exit with 11, which avoids the detailed exitcode values, and happens to also be SIGSEGV, which is roughly analogous to what causes most panics.

@jbardin
Copy link
Copy Markdown
Member Author

jbardin commented Oct 29, 2021

Merging as-is for now. If we find the exit code isn't satisfactory that can be updated later, or er could choose to simply re-raise the panic and allow the runtime to exit normally and continue to deal with the conflicting exit code of 2.

@jbardin jbardin merged commit 81e709d into main Oct 29, 2021
@jbardin jbardin deleted the jbardin/no-panicwrap branch October 29, 2021 18:41
@apparentlymart apparentlymart added this to the v1.1.0 milestone Nov 1, 2021
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 2, 2021

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants