Remove the use of panicwrap in the Terraform CLI#29825
Conversation
Stop using panicwrap, and execute terraform in the main process.
Repurpose logging.PanicHandler to recover internal panics and print a message to users similar to panicwrap.
We no longer have to hide these from panicwrap
internal/logging/panic.go
Outdated
| // When called from a deferred function, debug.PrintStack will include the | ||
| // full stack from the point of the pending panic. | ||
| debug.PrintStack() | ||
| os.Exit(2) |
There was a problem hiding this comment.
My mental model of how the Go runtime handles panics is a bit incomplete, but my understanding was that an un-recovered panic normally unwinds right to the top of the goroutine's call stack and only exits if it gets to the top without being recovered.
If I'm correct about that, is it true to say that the only correct way to use this PanicHandler is as a defer PanicHandler() as the very first statement in a new goroutine, so that we can avoid this os.Exit inhibiting any other pending defers? If so, I think it would help to be explicit about that in the docstring above.
I'm also curious as to whether exiting here sufficiently mimics what the Go runtime would do with an unrecovered panic if there are still other non-panicing goroutines running at the time of panic. Does the runtime normally just interrupt those wherever they are and exit immediately, or does it first force them to unwind (via something like runtime.Goexit)? Perhaps this is an immaterial question anyway given how we're using this, but my general worry here is that subtle changes to how we react to panics might upset assumptions we've made elsewhere; convincing ourselves that there are no such assumptions would also be fine. 😀
Finally, it occurs to me that if we're now taking control of the panic exit in most codepaths this could be an opportunity to address our long-standing wart that we happened to select the same exit code for terraform plan -detailed-exitcode to signal that there are no changes to make, and so currently a panic there reads to automation as a success with no changes. Although there might still be places where normal panic handling escapes and thus we still exit with code 2 anyway, maybe we'd like to select a different code for the common case where we handle the panic through here, so that automation can still treat it as an error then.
There was a problem hiding this comment.
Yes, a panic unwinds to the very top of the call stack within the goroutine, so the handler does need to be the first defer in the stack, so I can add a note about that specifically.
The runtime exits with exit(2) once the call stack has been unwound to the top of the goroutine (which is why I used 2 here too), so any other running goroutines are aborted immediately as the process exits, and should not different significantly from out behavior here. Perhaps a different exit code would work here for the sake of -detailed-exitcode, though differing from the standard runtime code may make this less consistent than we hope.
There was a problem hiding this comment.
Given that our goal here is for most panics to be caught by our handler, and for missed ones to be exceptional, I'd personally vote for using a different exit code so that most of the time our behavior isn't ambiguous if we panic during planning, and the ambiguity can therefore be exceptional. The fact that Terraform is written in Go is an implementation detail rather than a contract, so I don't think we get any strong value from being consistent with the Go runtime behavior here.
With that said, it's an opinion loosely held so I'm happy to go ahead without that if you prefer. We can always change the exit status we use here later anyway, if we change our minds, since we don't make any contractural promises about how we report different kinds of errors via exit codes.
There was a problem hiding this comment.
I'm not opposed to changing the exit code here, my next idea would be to use the slightly system dependent -1, which is what panicwrap was substituting.
apparentlymart
left a comment
There was a problem hiding this comment.
This looks good to me!
I have some small worries about landing something subtle and cross-cutting like this right at the end of our v1.1 development window, but I suppose the beta/RC period will hopefully give some opportunities to learn whether we've accidentally been depending on the panicwrap forking and weird pipe stuff for anything. (I can't think of anything myself, but we've been routinely running in a child process with a piped stderr for so long now that there is some risk that panicwrap has accidentally become a crucial structural component of some other subsystem rather than just an implementation detail.)
|
Since |
|
Merging as-is for now. If we find the exit code isn't satisfactory that can be updated later, or er could choose to simply re-raise the panic and allow the runtime to exit normally and continue to deal with the conflicting exit code of 2. |
|
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions. |
While panicwrap was useful in preventing opaque stack traces from being dumped on users, it disconnected the running process from the terminal and original io streams, leaving us with many subtle behavioral changes from what would be expected when executing the process directly. This has led to numerous bugs, leaving us with clumsy workarounds, and edge cases that still cannot be resolved for all supported systems.
Here we remove
panicwrap, and execute the terraform cli directly in the main process. This also allows the removal ofwrappedstreams,wrappedreadlineand all the special terminal handling associated with running terraform from a child process of panicwrap.The existing
logging.PanicHandlerhas been repurposed to become a generic panic recovery mechanism, loosely emulating the behavior ofpanicwrap. It is inserted in all the most common goroutine dispatch points, where panics are most likely to happen. The stack trace will be prefixed by a simple message indicate that a crash has occurred, and to pleas report the issue. Due to the way we must regenerate the stack there will be a couple extra frames at the top, but since it's only for our consumption, that should not be an issue.A possible enhancement here to use the internal
tfdiags.Diagnosticstype to format an error more consistent with the rest of the CLI output. This however would require getting the correct view for formatting into the locations where we dispatch goroutines, which at the lowest level of the concurrent graph walk is very deep in the call stack. The alternative of collecting the panic and passing it back as an error is also possible, but the fact that an uncaughtpanicis intended to immediately halt execution because some portion of the program is in an unexpected state means that we risk compounding the problem by not exiting immediately. Perhaps if we can assure ourselves that these layers are sufficiently isolated where handling the error poses little risk, we could turn these into errors, though that has other consequences for code complexity by getting errors generated during deferred called in goroutine back into the normal data flow.Provider panics are handled in a similar manner, but are unaffected by this change since that mechanism was separated into its own handler during the logging subsystem refactoring.
We also opt to no longer collect all trace logs in a temporary file, and copy them to the working directory when a crash in encountered. With the current stability of terraform we see very few unexplainable crashes that rely on the accompanying logs to troubleshoot, and the tradeoff of not silently collecting large amounts of logs containing possibly sensitive information on disk seems worth the trade off.