@jschaul jschaul commented Aug 25, 2025

We currently sometimes get a few 500s when gundeck restarts: some in-flight requests seem to be aborted mid-request. This is possibly because terminating pods still receive some traffic.

https://wearezeta.atlassian.net/browse/WPB-19694

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@jschaul jschaul requested review from a team as code owners August 25, 2025 16:15
@zebot zebot added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Aug 25, 2025
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]

Unfortunately, I'm not a K8s expert: could you please explain why we need the sleep command? Why isn't it enough to just increase terminationGracePeriodSeconds? 🤔


The default terminationGracePeriodSeconds is 30 seconds, which is plenty.

What the sleep does is add a delay between the pod entering the Terminating state (and thus being removed from any Service that sends traffic to it) and the container receiving SIGTERM.
During that delay, gundeck still accepts and processes any in-flight requests, since it doesn't yet know it is being terminated.

Shamelessly adapted from Stack Overflow to visualize and explain the process:

The full sequence is:

  1. pod deletion is requested (state: Terminating)
  2. the preStop hook kicks in and the terminationGracePeriodSeconds countdown starts:
    • when the preStop hook completes, the kubelet sends SIGTERM to the container
    • if the preStop hook hasn't finished within the terminationGracePeriodSeconds countdown, the kubelet sends SIGKILL to the container

→ this means:

  • we delay the shutdown signal (SIGTERM) to gundeck, which gives Kubernetes time to remove the pod from its Service while gundeck keeps processing in-flight requests
  • once the 10s sleep is finished, the regular SIGTERM signal is sent, causing gundeck to shut down gracefully
  • if the whole thing takes longer than terminationGracePeriodSeconds, it's simply killed.
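
To make the timing concrete, here is a rough sketch of how the hook and the grace period might sit together in a pod template. This is an illustration only: the container name and image are placeholders and are not taken from the actual gundeck chart, and the grace period is just the Kubernetes default written out explicitly.

```yaml
# Illustrative pod template fragment (placeholders marked below).
spec:
  # Kubernetes default, spelled out. The countdown starts when the pod
  # enters Terminating, so it covers the preStop hook as well.
  terminationGracePeriodSeconds: 30
  containers:
    - name: gundeck            # placeholder name
      image: example/gundeck   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent to the container. With a 30s
            # grace period this leaves roughly 20s for the graceful
            # shutdown before SIGKILL.
            command: ["sh", "-c", "sleep 10"]
```

If the sleep were ever raised close to or above the grace period, terminationGracePeriodSeconds would need to grow with it, otherwise the container would get little or no time between SIGTERM and SIGKILL.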

@jschaul jschaul merged commit 387898a into develop Sep 11, 2025
8 checks passed
@jschaul jschaul deleted the gundeck-deployments-500s branch September 11, 2025 09:11
