Watchdog action that will signal a particular thread to abort. by KBaichoo · Pull Request #12860 · envoyproxy/envoy

KBaichoo · 2020-08-27T22:31:27Z

Signed-off-by: Kevin Baichoo [email protected]

Commit Message: Watchdog action that will signal a particular thread to abort.
Additional Description:
Risk Level: medium
Testing: Unit tests
Docs Changes: Included
Release Notes: Included
Issue: #11388

repokitteh-read-only · 2020-08-27T22:31:34Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12860 was opened by KBaichoo.

antoniovicente

Thanks for implementing this, it should make it easier to get the stack of the stuck thread when the guarddog triggers.

api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto

docs/root/version_history/current.rst

source/extensions/watchdog/abort_action/BUILD

source/extensions/watchdog/abort_action/abort_action.cc

source/extensions/watchdog/abort_action/config.h

antoniovicente · 2020-08-28T23:57:22Z

test/extensions/watchdog/abort_action/abort_action_test.cc

+      // Signal to test thread that tid has been set.
+      {
+        absl::MutexLock lock(&mutex_);
+        outstanding_notifies_ += 1;


Consider using absl::Notification instead of implementing your own version.

Here you could call n.Notify();
You would call n.WaitForNotification() before code that depends on tid.
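The suggestion above can be illustrated with a minimal sketch. To keep it compilable without Abseil, it uses a hypothetical standard-library stand-in that mirrors the two method names mentioned in the review (`Notify` and `WaitForNotification`); the `Notification` class and `workerIdIsVisibleAfterWait` below are illustrative assumptions, not code from this PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for absl::Notification, built on the standard library
// so the sketch compiles without Abseil. It is a one-way "has happened" flag.
class Notification {
public:
  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      notified_ = true;
    }
    cv_.notify_all();
  }

  void WaitForNotification() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return notified_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool notified_ = false;
};

// Returns true if the test thread observes the id published by the worker.
bool workerIdIsVisibleAfterWait() {
  Notification tid_set;
  std::thread::id published;

  std::thread worker([&] {
    published = std::this_thread::get_id(); // Set the value first...
    tid_set.Notify();                       // ...then signal the test thread.
  });

  tid_set.WaitForNotification(); // Blocks until the worker has set `published`.
  const bool matches = (published == worker.get_id());
  worker.join();
  return matches;
}
```

Compared with a hand-rolled counter guarded by a mutex, the waiting side has a single call and no counter to keep consistent.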

antoniovicente · 2020-08-28T23:58:38Z

test/extensions/watchdog/abort_action/abort_action_test.cc

+    action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now);
+  };
+
+  EXPECT_DEATH(die_function(), "");


I assume there's no specific string that we can assert in the EXPECT_DEATH

Does Envoy install a failure signal handler? Should we consider testing by installing a signal handler for SIGABRT that records which thread received the SIGABRT signal?

Envoy installs failure handlers. Under different compilation modes this test reports different failure strings: under ASAN it triggers some ASAN-specific code that isn't triggered in the other test below or when running this test in other modes (ASAN:DEADLYSIGNAL is part of the printout).

I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.

It seems to me that right now you can't distinguish between death due to the kill signal vs the call to panic. The kill signal code could be removed and this test would still pass.
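One way to tell the two paths apart, as floated earlier in this thread, is a SIGABRT handler that records which thread received the signal. The sketch below is an assumption-laden illustration for POSIX systems: it uses `pthread_kill` to target a thread (the PR itself calls `kill`), and every name in it is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <thread>

namespace {
pthread_t g_expected_thread;                  // Written by the worker before signaling.
std::atomic<int> g_signal_count{0};           // How many times the handler ran.
std::atomic<bool> g_on_expected_thread{false};

// Handler: records whether it is running on the thread we meant to signal.
void recordAbort(int /*sig*/) {
  g_on_expected_thread.store(pthread_equal(pthread_self(), g_expected_thread) != 0);
  g_signal_count.fetch_add(1);
}
} // namespace

// Returns true if SIGABRT sent with pthread_kill is handled on the worker thread.
bool abortSignalReachesTargetThread() {
  g_signal_count.store(0);
  g_on_expected_thread.store(false);

  struct sigaction sa {};
  struct sigaction old {};
  sa.sa_handler = recordAbort;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGABRT, &sa, &old);

  std::atomic<bool> ready{false};
  std::atomic<bool> done{false};
  std::thread worker([&] {
    g_expected_thread = pthread_self();
    ready.store(true);
    while (!done.load()) {
      std::this_thread::yield(); // Stand-in for a "stuck" thread.
    }
  });

  while (!ready.load()) {
    std::this_thread::yield();
  }
  pthread_kill(worker.native_handle(), SIGABRT); // Target the worker only.
  while (g_signal_count.load() == 0) {
    std::this_thread::yield();
  }

  done.store(true);
  worker.join();
  sigaction(SIGABRT, &old, nullptr); // Restore the previous disposition.
  return g_on_expected_thread.load();
}
```

Because the handler returns instead of terminating, a test built this way can assert on the recorded thread rather than on a death message.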

antoniovicente · 2020-08-29T00:15:30Z

test/extensions/watchdog/abort_action/abort_action_test.cc

+
+class AbortActionTest : public testing::Test {
+protected:
+  AbortActionTest()


You may want to restore signal handlers in the test destructor.

Since the sigactions are done within EXPECT_DEATH and we end up forking, they don't affect the test suite process but rather the child process, which ends up dying, so we shouldn't need to restore signal handlers.
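The isolation argument can be checked with a small POSIX sketch: a child process changes the SIGABRT disposition and exits, and the parent's disposition is unchanged. The function name is hypothetical, and the sketch uses a plain `fork` as a stand-in for whatever mechanism the death test uses to create its child.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a disposition change made in a child process leaves the
// parent's SIGABRT disposition untouched.
bool parentDispositionSurvivesChild() {
  struct sigaction before {};
  sigaction(SIGABRT, nullptr, &before); // Query only; nothing is installed.

  const pid_t pid = fork();
  if (pid < 0) {
    return false; // fork failed; nothing to compare.
  }
  if (pid == 0) {
    signal(SIGABRT, SIG_IGN); // Only affects the child's copy of the table.
    _exit(0);
  }

  int status = 0;
  waitpid(pid, &status, 0);

  struct sigaction after {};
  sigaction(SIGABRT, nullptr, &after);
  return before.sa_handler == after.sa_handler;
}
```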

KBaichoo

Thanks for the code review

KBaichoo · 2020-08-31T19:33:07Z

source/extensions/watchdog/abort_action/abort_action.cc

+
+  // Abort from the action since the signaled thread hasn't yet crashed the process.
+  PANIC(
+      fmt::format("Failed to kill thread with id {}, aborting from AbortAction instead.", raw_tid));


There are 3 reasons I think this would be handy (vs deferring to the underlying watchdog default actions):

Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.

It also keeps this self-contained if the underlying system behavior in the watchdog kill / multikill default changes.

Allows flexibility for using this in places where there isn't a default PANIC afterwards
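The control flow being argued for can be sketched as follows. This is a hedged illustration, not the PR's implementation: the `AbortHooks` struct, `runAbortAction`, and `traceAbortAction` are hypothetical names, and waiting only after a successful signal is an assumption about what happens inside the `if`.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical injection points so the flow can be exercised without
// actually signaling or aborting anything.
struct AbortHooks {
  std::function<bool()> send_signal;                  // True if the signal was sent.
  std::function<void()> wait;                         // Grace period for the target.
  std::function<void(const std::string&)> panic;      // Last-resort abort.
};

void runAbortAction(long tid, const AbortHooks& hooks) {
  if (hooks.send_signal()) {
    // Assumption: give the signaled thread time to crash the process itself.
    hooks.wait();
  }
  // Still running: abort from the action so the failure is reported here
  // rather than relying on a later default action.
  hooks.panic("Failed to kill thread with id " + std::to_string(tid) +
              ", aborting from AbortAction instead.");
}

// Records the order in which the hooks fire for a given signaling outcome.
std::string traceAbortAction(bool signal_succeeds) {
  std::string trace;
  AbortHooks hooks;
  hooks.send_signal = [&] {
    trace += "signal";
    return signal_succeeds;
  };
  hooks.wait = [&] { trace += ",wait"; };
  hooks.panic = [&](const std::string&) { trace += ",panic"; };
  runAbortAction(42, hooks);
  return trace;
}
```

In both outcomes the panic is reached from inside the action, which is the self-contained behavior the three reasons above describe.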

KBaichoo · 2020-08-31T19:35:49Z

source/extensions/watchdog/abort_action/abort_action.cc

+  // Assume POSIX-compatible system and signal to the thread.
+  ENVOY_LOG_MISC(error, "AbortAction sending abort signal to thread with tid {}.", raw_tid);
+
+  if (kill(toPlatformTid(raw_tid), SIGABRT) == 0) {


Yep, I think I'll keep it simple for now given that the default Envoy handlers view SIGABRT as a fatal signal.

KBaichoo · 2020-08-31T19:44:33Z

source/extensions/watchdog/abort_action/abort_action.h

+ * This is currently only implemented for systems that support kill to send
+ * signals.
+ */
+class AbortAction : public Server::Configuration::GuardDogAction {


The terminology of these two gets a bit messy: in the docs and at a high level we talk about Watchdog and the watchdog system, while in the actual implementation we have a watchdog per thread and a guarddog that manages the watchdogs and actually executes these functions.

I went with Watchdog since it seemed more friendly to folks who aren't digging down into the implementation details.
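The split described above (one watchdog per thread, one guarddog supervising them) can be pictured with a toy model. The types below are hypothetical and far simpler than Envoy's; the pair of thread id and last touch time is an assumption based on the `tid_ltt_pairs` argument seen in the tests.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>
#include <vector>

using TimePoint = std::chrono::steady_clock::time_point;

// Hypothetical per-thread watchdog: the owning thread "touches" it regularly.
struct WatchDog {
  int64_t tid;
  TimePoint last_touch;
  void touch(TimePoint now) { last_touch = now; }
};

// Hypothetical guarddog: inspects every watchdog and reports stale ones.
class GuardDog {
public:
  void watch(const WatchDog* dog) { dogs_.push_back(dog); }

  // Returns (tid, last touch time) for every thread not touched within `timeout`.
  std::vector<std::pair<int64_t, TimePoint>> stuck(TimePoint now,
                                                   std::chrono::milliseconds timeout) const {
    std::vector<std::pair<int64_t, TimePoint>> result;
    for (const WatchDog* dog : dogs_) {
      if (now - dog->last_touch > timeout) {
        result.emplace_back(dog->tid, dog->last_touch);
      }
    }
    return result;
  }

private:
  std::vector<const WatchDog*> dogs_;
};

// Two watched threads; only thread 2 misses its touch and is reported.
std::vector<int64_t> demoStuckTids() {
  const TimePoint start = std::chrono::steady_clock::now();
  WatchDog healthy{1, start};
  WatchDog stale{2, start};
  GuardDog guard;
  guard.watch(&healthy);
  guard.watch(&stale);

  const TimePoint later = start + std::chrono::seconds(5);
  healthy.touch(later);

  std::vector<int64_t> tids;
  for (const auto& pair : guard.stuck(later, std::chrono::seconds(1))) {
    tids.push_back(pair.first);
  }
  return tids;
}
```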

KBaichoo · 2020-09-01T12:56:51Z

/retest

repokitteh-read-only · 2020-09-01T12:56:56Z

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #12860 (comment) was created by @KBaichoo.

antoniovicente · 2020-09-03T21:18:48Z

test/extensions/watchdog/abort_action/abort_action_test.cc

+
+void handler(int sig, siginfo_t* /*siginfo*/, void* /*context*/) {
+  std::cout << "Eating signal :" << std::to_string(sig) << ". will ignore it." << std::endl;
+  signal(SIGABRT, SIG_IGN);


Is this call to signal(SIGABRT, SIG_IGN); required?

Not strictly, but it prevents subsequent printouts of Eating signal:
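The effect of that call can be shown in isolation: a handler that switches the disposition to `SIG_IGN` runs once, and later SIGABRTs are dropped silently. This is a sketch with hypothetical names; it counts invocations with an atomic instead of printing, purely so the result can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>

namespace {
std::atomic<int> g_handled{0};

// Runs on the first SIGABRT, then tells the kernel to ignore further ones.
void eatOnce(int /*sig*/) {
  g_handled.fetch_add(1);
  signal(SIGABRT, SIG_IGN);
}
} // namespace

// Raises SIGABRT twice and returns how many times the handler actually ran.
int handlerRunsForTwoRaises() {
  g_handled.store(0);

  struct sigaction sa {};
  struct sigaction old {};
  sa.sa_handler = eatOnce;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGABRT, &sa, &old);

  raise(SIGABRT); // Handled by eatOnce.
  raise(SIGABRT); // Ignored: the disposition is now SIG_IGN.

  sigaction(SIGABRT, &old, nullptr); // Restore the previous disposition.
  return g_handled.load();
}
```

Without the `signal(SIGABRT, SIG_IGN)` line the handler would run for each raise, which matches the repeated printouts mentioned above.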

KBaichoo · 2020-09-03T22:10:52Z

PTAL @envoyproxy/senior-maintainers

ggreenway

/wait

ggreenway · 2020-09-09T16:47:08Z

api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto

+// [#protodoc-title: Watchdog Action that sends a SIGABRT to kill the process.]
+// [#extension: envoy.watchdog.abort_action]
+
+// Configuration for the profile watchdog action.


Comment seems wrong/out of date. Note that this comment is the primary documentation for this in the generated docs. It should include some information about what this does, when/why use it, etc.

Done. Added additional information from the action's c++ implementation as you suggested.

source/extensions/watchdog/abort_action/abort_action.cc

ggreenway · 2020-09-09T16:50:53Z

source/extensions/watchdog/abort_action/abort_action.h

+namespace AbortAction {
+
+/**
+ * A GuardDogAction that will terminate the process by sending SIGABRT to the


Perhaps some/all of this comment block should be included with the proto config, so it gets included in generated docs.

ggreenway

Code LGTM, unless we want to pursue making this the default.

ggreenway · 2020-09-10T15:45:53Z

api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto

+// more useful than the default watchdog kill behaviors since those PANIC
+// from the watchdog's thread.
+
+// This is currently only implemented for systems that support kill to send


Should this be the default, on supported platforms? Is there any downside to this? It seems to me that it gives better information (or a chance of it), and the same net behavior (process terminated). @envoyproxy/maintainers

I'm fine with that being default in a future PR.

But also fine here if we want this, seems generally useful.

Sounds good, I can make it be a default action in a future PR.

The main downside I see of making it a default now would be different behaviors across platforms due to a lack of support on Windows. Should we wait for parity before we make it a default?

I think it's ok to make it the default, because the only platform difference should be diagnostic output. The process is terminated on all platforms.

I'm also fine with making it the default in a future PR. It's your choice, @KBaichoo .

Sounds good. I'll submit a follow up to this adding this action to the watchdog actions by default.

+1 to make this the default.

KBaichoo · 2020-09-11T19:52:44Z

PTAL @envoyproxy/api-shepherds

htuch · 2020-09-11T19:57:19Z

/lgtm api

mattklein123

Nice, cool stuff. LGTM with some small comments.

/wait-any

mattklein123 · 2020-09-12T00:03:41Z

source/extensions/watchdog/abort_action/abort_action.cc

+
+#ifdef WIN32
+  // TODO(kbaichoo): add support for this with windows.
+  ENVOY_LOG_MISC(error, "Watchdog AbortAction is unimplemented for Windows.");


Is this actually needed? I think Windows has its own extension bzl file, so I'm not sure this is even compiled. Could we avoid the ifdef in here now, or just replace it with a preprocessor error if this extension gets compiled on Windows?

Good to know this is a possibility, thanks for pointing it out.

I've added the extension to WINDOWS_SKIP_TARGETS which, IIUC, will let it be skipped on Windows, so I can remove the #ifdef WIN32 from the code.

Turns out you also need to add tags = ["skip_on_windows"] on tests, since excluding the extension in the .bzl doesn't affect the tests.
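For comparison, the preprocessor-error alternative raised in this thread might look like the sketch below. The macro check and the message text are assumptions, and `abortActionSupported` is a hypothetical marker that exists only so the fragment has something to compile.

```cpp
#include <cassert>

// Hypothetical guard: fail the build loudly instead of compiling a runtime stub.
#if defined(WIN32) || defined(_WIN32)
#error "Watchdog AbortAction is not supported on Windows; exclude it from the build."
#endif

// Trivial marker: reaching this point means the platform check passed.
constexpr bool abortActionSupported() { return true; }
```

Skipping the target in the build configuration, as done here, avoids even that guard because the file is never compiled on Windows.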

mattklein123 · 2020-09-12T00:04:49Z

source/extensions/watchdog/abort_action/abort_action.cc

+  // Abort from the action since the signaled thread hasn't yet crashed the process.
+  PANIC(fmt::format("Failed to kill thread with id {}, aborting from Watchdog AbortAction instead.",
+                    raw_tid));


I haven't kept fully up to date on this, but isn't the default action at the end of the action chain to terminate the process? Do we need this or would the action chain just end up aborting anyway?

Yes, you're right. I discussed this with antonio up above. I've copied the reasoning below (copied from #12860 (comment))

There are 3 reasons I think this [having a panic in the action] would be handy (vs deferring to the underlying watchdog default actions[its panic there]):

Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.

It also keeps this self-contained if the underlying system behavior in the watchdog kill / multikill default changes.

Allows flexibility for using this in places where there isn't a default PANIC afterwards

Given that we're planning to also make this a default action on platforms where it's supported, I'd likely make the following change in the next PR:

If we're doing kill or multikill, then install the abort action as the final watchdog action for either of those events.

I could possibly see when this has full support across platforms that we'd remove the watchdog's default panic, since it'd end up being dead code.

OK, that makes sense. Can you summarize this comment in the code? Thank you.

/wait

mattklein123 · 2020-09-15T16:56:04Z

Can you merge main? It should fix coverage.

mattklein123 · 2020-09-15T21:02:30Z

Sorry coverage failure looks legit.

Can you check?

/wait

KBaichoo · 2020-09-15T21:39:37Z

the LCOV from the CI is: https://storage.googleapis.com/envoy-pr/12860/coverage/source/extensions/watchdog/abort_action/abort_action.cc.gcov.html

But if we look at the tests, we do "cover" those lines, but they end up being wrapped in EXPECT_DEATH since it's going to kill the process. EXPECT_DEATH code paths are run in a sub-process of the main unit testing process. The code coverage tools will never report full coverage for most of the test then.
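The sub-process behavior can be reproduced with a small POSIX sketch: work done in a child that then aborts is invisible to the parent, which only sees the termination status. This assumes, as a simplification, that a coverage runtime keeps its counters in process memory and writes them out on normal exit; `g_lines_executed` is a hypothetical stand-in for such a counter.

```cpp
#include <cassert>
#include <cstdlib>
#include <signal.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for an in-memory coverage counter.
static int g_lines_executed = 0;

struct DeathResult {
  bool died_from_sigabrt; // What the parent learns from the exit status.
  int parent_counter;     // The parent's view of the counter afterwards.
};

DeathResult runInChildAndAbort() {
  g_lines_executed = 0;

  const pid_t pid = fork();
  if (pid == 0) {
    struct rlimit no_core = {0, 0};
    setrlimit(RLIMIT_CORE, &no_core); // Avoid leaving a core file behind.
    signal(SIGABRT, SIG_DFL);
    ++g_lines_executed; // "Covered" only in the child's copy of memory.
    std::abort();       // Dies without running normal exit handlers.
  }

  int status = 0;
  bool died = false;
  if (pid > 0 && waitpid(pid, &status, 0) == pid) {
    died = WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
  }
  return {died, g_lines_executed};
}
```

The parent observes the SIGABRT death but its own counter stays at zero, which is the gap a per-extension threshold override works around.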

I will do a per extension percentage override to work around it.

mattklein123

I think death tests sometimes do count for coverage, but I don't know when/how that happens. It would be nice to fix this at some point but LGTM for now!

mattklein123 · 2020-09-16T00:08:31Z

Still broken. :(

/wait

KBaichoo · 2020-09-16T00:27:47Z

Ah, 🤦 the comment lines aren't "covered"... so the percentages have dropped slightly.

init implementation of abort_action.

9f43939

Signed-off-by: Kevin Baichoo <[email protected]>

repokitteh-read-only bot added the api label Aug 27, 2020

Spelling, minor fixes.

0154034

Signed-off-by: Kevin Baichoo <[email protected]>

mattklein123 assigned antoniovicente Aug 28, 2020

KBaichoo added 3 commits August 28, 2020 21:04

minor test fixes.

e2dcb42

Signed-off-by: Kevin Baichoo <[email protected]>

Merge remote-tracking branch 'upstream/master' into backup-wd-abort

6955b10

Signed-off-by: Kevin Baichoo <[email protected]>

merge conflict fix.

33a4f5e

Signed-off-by: Kevin Baichoo <[email protected]>

antoniovicente reviewed Aug 29, 2020

minor fixes

71f099f

Signed-off-by: Kevin Baichoo <[email protected]>

KBaichoo commented Aug 31, 2020

antoniovicente previously approved these changes Sep 3, 2020

htuch assigned ggreenway Sep 8, 2020

ggreenway requested changes Sep 9, 2020

repokitteh-read-only bot added the waiting label Sep 9, 2020

minor fixes.

847aed0

Signed-off-by: Kevin Baichoo <[email protected]>

KBaichoo dismissed antoniovicente’s stale review via 847aed0 September 10, 2020 02:57

KBaichoo requested a review from htuch as a code owner September 10, 2020 02:57

repokitteh-read-only bot removed the waiting label Sep 10, 2020

Merged in master.

0b3c91c

Signed-off-by: Kevin Baichoo <[email protected]>

ggreenway previously approved these changes Sep 10, 2020

repokitteh-read-only bot removed the api label Sep 11, 2020

mattklein123 reviewed Sep 12, 2020

repokitteh-read-only bot added the waiting:any label Sep 12, 2020

mattklein123 self-assigned this Sep 12, 2020

Added abort_action to WINDOWS_SKIP_TARGETS.

371866f

Signed-off-by: Kevin Baichoo <[email protected]>

KBaichoo dismissed ggreenway’s stale review via 371866f September 14, 2020 14:42

repokitteh-read-only bot added the waiting label Sep 14, 2020

added comment about panic. added skip on windows tags.

80f2dc0

Signed-off-by: Kevin Baichoo <[email protected]>

repokitteh-read-only bot removed the waiting label Sep 15, 2020

mattklein123 added the waiting label Sep 15, 2020

Merge remote-tracking branch 'upstream/master' into backup-wd-abort

11d6066

Signed-off-by: Kevin Baichoo <[email protected]>

repokitteh-read-only bot removed the waiting label Sep 15, 2020

mattklein123 previously approved these changes Sep 15, 2020

repokitteh-read-only bot added the waiting label Sep 15, 2020

Added LCOV threshold that will get covered by tested that aren't DEATH tests.

56b3f82

Signed-off-by: Kevin Baichoo <[email protected]>

KBaichoo dismissed mattklein123’s stale review via 56b3f82 September 15, 2020 21:43

repokitteh-read-only bot removed the waiting label Sep 15, 2020

Added additional death test comments.

5fb84a9

Signed-off-by: Kevin Baichoo <[email protected]>

antoniovicente previously approved these changes Sep 15, 2020

mattklein123 previously approved these changes Sep 15, 2020

repokitteh-read-only bot added the waiting label Sep 16, 2020

Adjusted coverage threshold since comments and empty new lines affect line coverage percent.

947bd06

Signed-off-by: Kevin Baichoo <[email protected]>

KBaichoo dismissed stale reviews from mattklein123 and antoniovicente via 947bd06 September 16, 2020 00:38

repokitteh-read-only bot removed the waiting label Sep 16, 2020

mattklein123 approved these changes Sep 16, 2020

mattklein123 merged commit cd5bdc4 into envoyproxy:master Sep 16, 2020

KBaichoo mentioned this pull request Sep 21, 2020

Watchdog: use abort action as a default #13208

Closed
