Watchdog action that will signal a particular thread to abort.#12860
Watchdog action that will signal a particular thread to abort.#12860mattklein123 merged 14 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
antoniovicente
left a comment
There was a problem hiding this comment.
Thanks for implementing this, it should make it easier to get the stack of the stuck thread when the guarddog triggers.
api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto
Outdated
Show resolved
Hide resolved
api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto
Outdated
Show resolved
Hide resolved
| // Signal to test thread that tid has been set. | ||
| { | ||
| absl::MutexLock lock(&mutex_); | ||
| outstanding_notifies_ += 1; |
There was a problem hiding this comment.
Consider using absl::Notification instead of implementing your own version.
Here you could call n.Notify();
You would call n.WaitForNotification() before code that depends on tid.
| action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now); | ||
| }; | ||
|
|
||
| EXPECT_DEATH(die_function(), ""); |
There was a problem hiding this comment.
I assume there's no specific string that we can assert in the EXPECT_DEATH
Does Envoy install a failure signal handler? Should we consider testing by installing a signal handler for SIGABRT that records which thread received the SIGABRT signal?
There was a problem hiding this comment.
Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL is part of the print out)
I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.
There was a problem hiding this comment.
It seems to me that right now you can't distinguish between death due to the kill signal vs the call to panic. The kill signal code could be removed and this test would still pass.
|
|
||
| class AbortActionTest : public testing::Test { | ||
| protected: | ||
| AbortActionTest() |
There was a problem hiding this comment.
You may want to restore signal handlers in the test destructor.
There was a problem hiding this comment.
since sigactions are done within EXPECT_DEATH and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.
Signed-off-by: Kevin Baichoo <[email protected]>
KBaichoo
left a comment
There was a problem hiding this comment.
Thanks for the code review
|
|
||
| // Abort from the action since the signaled thread hasn't yet crashed the process. | ||
| PANIC( | ||
| fmt::format("Failed to kill thread with id {}, aborting from AbortAction instead.", raw_tid)); |
There was a problem hiding this comment.
There are 3 reasons I think this would be handy (vs deferring to the underlying watchdog default actions):
- Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
- It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
- Allows flexibility for using this in places where there isn't a default PANIC afterwards
| // Assume POSIX-compatible system and signal to the thread. | ||
| ENVOY_LOG_MISC(error, "AbortAction sending abort signal to thread with tid {}.", raw_tid); | ||
|
|
||
| if (kill(toPlatformTid(raw_tid), SIGABRT) == 0) { |
There was a problem hiding this comment.
Yep, I think I'll keep it simple for now given that the default envoy handlers views SIGABRT as a fatal signal.
| * This is currently only implemented for systems that support kill to send | ||
| * signals. | ||
| */ | ||
| class AbortAction : public Server::Configuration::GuardDogAction { |
There was a problem hiding this comment.
The terminology of these two get a bit messy; in the docs and at a high level we talk about WatchDog and the Watch dog system, while in the actual implementation we have a watchdog per thread and a guarddog that manages the watchdogs and will actually be executing these functions.
I went with Watchdog since it seemed more friendly to folks who aren't digging down into the implementation details.
|
|
||
| class AbortActionTest : public testing::Test { | ||
| protected: | ||
| AbortActionTest() |
There was a problem hiding this comment.
since sigactions are done within EXPECT_DEATH and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.
| // Signal to test thread that tid has been set. | ||
| { | ||
| absl::MutexLock lock(&mutex_); | ||
| outstanding_notifies_ += 1; |
| action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now); | ||
| }; | ||
|
|
||
| EXPECT_DEATH(die_function(), ""); |
There was a problem hiding this comment.
Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL is part of the print out)
I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
|
|
||
| void handler(int sig, siginfo_t* /*siginfo*/, void* /*context*/) { | ||
| std::cout << "Eating signal :" << std::to_string(sig) << ". will ignore it." << std::endl; | ||
| signal(SIGABRT, SIG_IGN); |
There was a problem hiding this comment.
Is this call to signal(SIGABRT, SIG_IGN); required?
There was a problem hiding this comment.
Not strictly, but it prevents subsequent printouts of Eating signal:
|
PTAL @envoyproxy/senior-maintainers |
| // [#protodoc-title: Watchdog Action that sends a SIGABRT to kill the process.] | ||
| // [#extension: envoy.watchdog.abort_action] | ||
|
|
||
| // Configuration for the profile watchdog action. |
There was a problem hiding this comment.
Comment seems wrong/out of date. Note that this comment is the primary documentation for this in the generated docs. It should include some information about what this does, when/why use it, etc.
There was a problem hiding this comment.
Done. Added additional information from the action's c++ implementation as you suggested.
| namespace AbortAction { | ||
|
|
||
| /** | ||
| * A GuardDogAction that will terminate the process by sending SIGABRT to the |
There was a problem hiding this comment.
Perhaps some/all of this comment block should be included with the proto config, so it gets included in generated docs.
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
ggreenway
left a comment
There was a problem hiding this comment.
Code LGTM, unless we want to pursue making this the default.
| // more useful than the default watchdog kill behaviors since those PANIC | ||
| // from the watchdog's thread. | ||
|
|
||
| // This is currently only implemented for systems that support kill to send |
There was a problem hiding this comment.
Should this be the default, on supported platforms? Is there any downside to this? It seems to me that it gives better information (or a chance of it), and the same net behavior (process terminated). @envoyproxy/maintainers
There was a problem hiding this comment.
I'm fine with that being default in a future PR.
There was a problem hiding this comment.
But also fine here if we want this, seems generally useful.
There was a problem hiding this comment.
Sounds good, I can make it be a default action in a future PR.
The main downside I see of making it a default now would be different behaviors across platforms due to a lack of support on Windows. Should we wait for parity before we make it a default?
There was a problem hiding this comment.
I think it's ok to make it the default, because the only platform difference should be diagnostic output. The process is terminated on all platforms.
I'm also fine with making it the default in a future PR. It's your choice, @KBaichoo .
There was a problem hiding this comment.
Sounds good. I'll submit a follow up to this adding this action to the watchdog actions by default.
There was a problem hiding this comment.
+1 to make this the default.
|
PTAL @envoyproxy/api-shepherds |
|
/lgtm api |
mattklein123
left a comment
There was a problem hiding this comment.
Nice, cool stuff. LGTM with some small comments.
/wait-any
|
|
||
| #ifdef WIN32 | ||
| // TODO(kbaichoo): add support for this with windows. | ||
| ENVOY_LOG_MISC(error, "Watchdog AbortAction is unimplemented for Windows."); |
There was a problem hiding this comment.
Is this actually needed? I think Windows has its own extension bzl file so I'm not sure this is even compiled? Could we actually avoid the ifdef in here now or just replace with an proprocessor error if this extension gets compiled on windows?
There was a problem hiding this comment.
Good to know this is a possibility, thanks for pointing it out.
I've added the extension to WINDOWS_SKIP_TARGETS which IIUC will let it be skipped by windows so I can remove the ifdef #Win32 from code.
There was a problem hiding this comment.
turns out you also need to add tags = ["skip_on_windows"] on tests since excluding the extension in the .bzl doesn't effect the tests.
| // Abort from the action since the signaled thread hasn't yet crashed the process. | ||
| PANIC(fmt::format("Failed to kill thread with id {}, aborting from Watchdog AbortAction instead.", | ||
| raw_tid)); |
There was a problem hiding this comment.
I haven't kept fully up to date on this, but isn't the default action at the end of the action chain to terminate the process? Do we need this or would the action chain just end up aborting anyway?
There was a problem hiding this comment.
Yes, you're right. I discussed this with antonio up above. I've copied the reasoning below (copied from #12860 (comment))
There are 3 reasons I think this [having a panic in the action] would be handy (vs deferring to the underlying watchdog default actions[its panic there]):
- Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
- It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
- Allows flexibility for using this in places where there isn't a default PANIC afterwards
Given that we're planning to make this also a default action on platform where it's supported I'd likely make the following change in the next PR:
- If we're doing kill or multikill, then install the abort action as the final watchdog action for either of those events.
I could possibly see when this has full support across platforms that we'd remove the watchdog's default panic, since it'd end up being dead code.
There was a problem hiding this comment.
OK that make sense. Can you summarize this comment in the code? Thank you.
/wait
Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
|
Can you merge main? It should fix coverage. |
Signed-off-by: Kevin Baichoo <[email protected]>
|
Sorry coverage failure looks legit. Can you check? /wait |
|
the LCOV from the CI is: https://storage.googleapis.com/envoy-pr/12860/coverage/source/extensions/watchdog/abort_action/abort_action.cc.gcov.html But if we look at the tests, we do "cover" those lines, but they end up being wrapped in I will do a per extension percentage override to work around it. |
…H tests. Signed-off-by: Kevin Baichoo <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
mattklein123
left a comment
There was a problem hiding this comment.
I think death tests sometimes do count for coverage, but I don't know when/how that happens. It would be nice to fix this at some point but LGTM for now!
|
Still broken. :( /wait |
|
Ah, 🤦 the comment lines aren't "covered"... so the percentages have dropped slightly. |
… line coverage percent. Signed-off-by: Kevin Baichoo <[email protected]>
947bd06
Signed-off-by: Kevin Baichoo [email protected]
For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md
Commit Message: Watchdog action that will signal a particular thread to abort.
Additional Description:
Risk Level: medium
Testing: Unit tests
Docs Changes: Included
Release Notes: Included
Issue: #11388