Skip to content

Fix docker compose up --wait failing when Trillian server isn't healthy#2473

Merged
Hayden-IO merged 1 commit intosigstore:mainfrom
Hayden-IO:fix-healthchecks
May 14, 2025
Merged

Fix docker compose up --wait failing when Trillian server isn't healthy#2473
Hayden-IO merged 1 commit intosigstore:mainfrom
Hayden-IO:fix-healthchecks

Conversation

@Hayden-IO
Copy link
Copy Markdown
Contributor

@Hayden-IO Hayden-IO commented May 6, 2025

As noted in docker/compose#12424, compose --wait doesn't seem to honor healthchecks with restart:always, when the server crashes and restarts a few times and eventually becomes healthy. This was happening with Rekor:

  • MySQL was not yet healthy because the healthcheck wasn't working as expected. Correct health check for MySQL 5.7 to prevent connecting to temporary server docker-library/mysql#930 (comment) suggested using 127.0.0.1 instead of localhost
  • trillian-log-server was not yet healthy even when MySQL reported as healthy, causing trillian-log-server to crash and restart a few times. There was no healthcheck for either Trillian service because the image we're using is based on Distroless, which has no curl/wget.
  • rekor-server tried to start up with an unhealthy trillian-log-server, and crashed. The healthcheck reported as unhealthy, and even though the server eventually became healthy because of the restart:always policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by pulling the binaries out of the images and putting them into Debian 12 containers that include curl, so we can curl the /healthz endpoint. This also fixes the MySQL healthcheck as noted above. Now, docker compose up --wait properly waits for a healthy MySQL before starting trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Summary

Release Note

Documentation

@Hayden-IO Hayden-IO requested a review from a team as a code owner May 6, 2025 02:32
Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025
Same as documented in sigstore/rekor#2473, the MySQL healthcheck was inaccurately reporting healthy when localhost was used. Switching to an address seems to fix it.

Signed-off-by: Hayden B <[email protected]>
@codecov
Copy link
Copy Markdown

codecov bot commented May 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 25.17%. Comparing base (488eb97) to head (ed660e5).
Report is 403 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2473       +/-   ##
===========================================
- Coverage   66.46%   25.17%   -41.29%     
===========================================
  Files          92      191       +99     
  Lines        9258    24790    +15532     
===========================================
+ Hits         6153     6240       +87     
- Misses       2359    17784    +15425     
- Partials      746      766       +20     
Flag Coverage Δ
e2etests 47.03% <ø> (-0.53%) ⬇️
unittests 16.37% <ø> (-31.31%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Hayden-IO Hayden-IO marked this pull request as draft May 6, 2025 02:48
@Hayden-IO Hayden-IO force-pushed the fix-healthchecks branch 5 times, most recently from 28b880a to 54495cb Compare May 6, 2025 04:47
@Hayden-IO Hayden-IO marked this pull request as ready for review May 6, 2025 05:27
cpanato
cpanato previously approved these changes May 6, 2025
Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025
Same as documented in sigstore/rekor#2473, the MySQL healthcheck was inaccurately reporting healthy when localhost was used. Switching to an address seems to fix it.

Signed-off-by: Hayden B <[email protected]>
@Hayden-IO Hayden-IO force-pushed the fix-healthchecks branch from 54495cb to 7750b7a Compare May 6, 2025 16:14
Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025
As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password.

Signed-off-by: Hayden B <[email protected]>
Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025
As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password.

Signed-off-by: Hayden B <[email protected]>
@Hayden-IO Hayden-IO requested review from bobcallaway and cpanato May 6, 2025 16:37
cpanato
cpanato previously approved these changes May 6, 2025
As noted in docker/compose#12424, compose
--wait doesn't seem to honor healthchecks with restart:always, when the
server crashes and restarts a few times and eventually becomes healthy.
This was happening with Rekor:

* MySQL was not yet healthy because the healthcheck wasn't working as
  expected. docker-library/mysql#930 (comment)
  suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as
  healthy, causing trillian-log-server to crash and restart a few times.
  There was no healthcheck for either Trillian service because the image
  we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server,
  and crashed. The healthcheck reported as unhealthy, and even though
  the server eventually became healthy because of the restart:always
  policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by
pulling the binaries out of the images and putting them into Debian
12 containers that include curl, so we can curl the /healthz endpoint.
This also fixes the MySQL healthcheck as noted above. Now, docker
compose up --wait properly waits for a healthy MySQL before starting
trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Signed-off-by: Hayden B <[email protected]>
@Hayden-IO Hayden-IO requested a review from cpanato May 13, 2025 19:10
@Hayden-IO Hayden-IO enabled auto-merge (squash) May 13, 2025 19:11
@Hayden-IO Hayden-IO merged commit 62a6617 into sigstore:main May 14, 2025
16 checks passed
@Hayden-IO Hayden-IO deleted the fix-healthchecks branch May 14, 2025 06:36
Hayden-IO pushed a commit to sigstore/rekor-monitor that referenced this pull request May 20, 2025
With sigstore/rekor#2473, healthchecks have been added for Trillian's log signer and log server, so there are now 5 services rather than 3 to wait for. E2E tests failed because there will never be just 3 healthy services, there will be 5.

This e2e script is a little brittle and can be cleaned up and simplified to use `compose up --wait`, will do in a later PR.

Signed-off-by: Hayden B <[email protected]>
Hayden-IO pushed a commit to sigstore/rekor-monitor that referenced this pull request May 20, 2025
With sigstore/rekor#2473, healthchecks have been added for Trillian's log signer and log server, so there are now 5 services rather than 3 to wait for. E2E tests failed because there will never be just 3 healthy services, there will be 5.

This e2e script is a little brittle and can be cleaned up and simplified to use `compose up --wait`, will do in a later PR.

Signed-off-by: Hayden B <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants