Fix docker compose up --wait failing when Trillian server isn't healthy by Hayden-IO · Pull Request #2473 · sigstore/rekor

Hayden-IO · 2025-05-06T02:32:50Z

As noted in docker/compose#12424, `docker compose up --wait` doesn't seem to honor healthchecks for services with restart:always when a server crashes and restarts a few times before eventually becoming healthy. This was happening with Rekor:

* MySQL was not yet healthy because the healthcheck wasn't working as expected. docker-library/mysql#930 (comment), "Correct health check for MySQL 5.7 to prevent connecting to temporary server", suggested using 127.0.0.1 instead of localhost.
* trillian-log-server was not yet healthy even when MySQL reported as healthy, causing trillian-log-server to crash and restart a few times. There was no healthcheck for either Trillian service because the image we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server, and crashed. Its healthcheck reported unhealthy, and even though the server eventually became healthy because of the restart:always policy, the startup was still reported as unhealthy.
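The MySQL point above can be sketched as a Compose healthcheck. This is a minimal illustration and not the exact change in this PR: the image tag, credentials, and timing values are assumptions. The relevant part is `-h 127.0.0.1`, which makes the client connect over TCP. With `localhost`, the MySQL client uses the Unix socket instead, and as I understand the linked discussion, the temporary server that the image starts during initialization also answers on that socket.

```yaml
# Sketch only: image tag, credentials, and timings are assumptions,
# not values taken from this PR.
services:
  mysql:
    image: mysql:8.0            # hypothetical image
    environment:
      MYSQL_ROOT_PASSWORD: example
      MYSQL_DATABASE: test
      MYSQL_USER: test
      MYSQL_PASSWORD: example
    healthcheck:
      # $$ escapes the variable so it is expanded inside the container,
      # not by Compose when it parses the file.
      # 127.0.0.1 forces a TCP connection instead of the Unix socket.
      test: ["CMD-SHELL", "mysqladmin ping -h 127.0.0.1 -u $$MYSQL_USER --password=$$MYSQL_PASSWORD --silent"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 10s
```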

This change adds healthchecks to trillian-log-server and log-signer by pulling the binaries out of the images and putting them into Debian 12 containers that include curl, so we can curl the /healthz endpoint. This also fixes the MySQL healthcheck as noted above. Now, docker compose up --wait properly waits for a healthy MySQL before starting trillian-log-server, and a healthy Trillian before starting Rekor.
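The startup ordering described above might look roughly like the following Compose fragment. It is a sketch under stated assumptions: the image names are hypothetical, the HTTP port 8091 is assumed, and the timings are arbitrary. Only the `/healthz` endpoint, the use of curl, and the restart:always policy come from the description.

```yaml
# Sketch only: image names, port 8091, and timings are assumptions.
services:
  mysql:
    image: mysql:8.0                            # hypothetical image
    environment:
      MYSQL_ROOT_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "mysqladmin ping -h 127.0.0.1 --silent"]
      interval: 10s
      timeout: 3s
      retries: 3

  trillian-log-server:
    image: example/trillian-log-server-debian   # hypothetical Debian 12 based image that includes curl
    restart: always
    depends_on:
      mysql:
        condition: service_healthy              # start only after MySQL is healthy
    healthcheck:
      # curl -f exits non-zero on HTTP errors, which marks the check as failed
      test: ["CMD", "curl", "-f", "http://localhost:8091/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s

  rekor-server:
    image: example/rekor-server                 # hypothetical image
    restart: always
    depends_on:
      trillian-log-server:
        condition: service_healthy              # start only after Trillian is healthy
```

With `condition: service_healthy` on each dependency, a dependent service is not started until the service it relies on passes its healthcheck, which avoids the crash-and-restart cycle that `--wait` was reporting as an unhealthy startup.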

Also fix minor Dockerfile linting errors.

Same as documented in sigstore/rekor#2473, the MySQL healthcheck was inaccurately reporting healthy when localhost was used. Switching to an address seems to fix it. Signed-off-by: Hayden B <[email protected]>

codecov · 2025-05-06T02:39:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 25.17%. Comparing base (488eb97) to head (ed660e5).
Report is 403 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2473       +/-   ##
===========================================
- Coverage   66.46%   25.17%   -41.29%     
===========================================
  Files          92      191       +99     
  Lines        9258    24790    +15532     
===========================================
+ Hits         6153     6240       +87     
- Misses       2359    17784    +15425     
- Partials      746      766       +20

Flag	Coverage Δ
e2etests	`47.03% <ø> (-0.53%)`	⬇️
unittests	`16.37% <ø> (-31.31%)`	⬇️

As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password. Signed-off-by: Hayden B <[email protected]>

With sigstore/rekor#2473, healthchecks have been added for Trillian's log signer and log server, so there are now 5 services to wait for rather than 3. E2E tests failed because there will never be exactly 3 healthy services; there will be 5. This e2e script is a little brittle and can be cleaned up and simplified to use `compose up --wait`; that will be done in a later PR. Signed-off-by: Hayden B <[email protected]>

Hayden-IO requested a review from a team as a code owner May 6, 2025 02:32

Hayden-IO mentioned this pull request May 6, 2025

Update compose healthcheck for MySQL sigstore/fulcio#2038

Merged

Hayden-IO marked this pull request as draft May 6, 2025 02:48

Hayden-IO force-pushed the fix-healthchecks branch 5 times, most recently from 28b880a to 54495cb Compare May 6, 2025 04:47

Hayden-IO marked this pull request as ready for review May 6, 2025 05:27

cpanato previously approved these changes May 6, 2025

bobcallaway reviewed May 6, 2025

Hayden-IO dismissed cpanato’s stale review via 7750b7a May 6, 2025 16:14

Hayden-IO force-pushed the fix-healthchecks branch from 54495cb to 7750b7a Compare May 6, 2025 16:14

Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025

Update healthcheck test for MySQL

3fbe6c8

As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password. Signed-off-by: Hayden B <[email protected]>

Hayden-IO mentioned this pull request May 6, 2025

Update healthcheck test for MySQL sigstore/fulcio#2039

Merged

Hayden-IO pushed a commit to sigstore/fulcio that referenced this pull request May 6, 2025

Update healthcheck test for MySQL (#2039)

60ee515

As noted in sigstore/rekor#2473 (comment), this did not properly populate username and password. Signed-off-by: Hayden B <[email protected]>

Hayden-IO mentioned this pull request May 6, 2025

improvements to setup-sigstore-env Action sigstore/scaffolding#1555

Closed

8 tasks

Hayden-IO requested review from bobcallaway and cpanato May 6, 2025 16:37

cpanato previously approved these changes May 6, 2025

Hayden-IO dismissed cpanato’s stale review via 496e046 May 13, 2025 18:43

Hayden-IO force-pushed the fix-healthchecks branch from 7750b7a to 496e046 Compare May 13, 2025 18:43

Hayden-IO force-pushed the fix-healthchecks branch from 496e046 to ed660e5 Compare May 13, 2025 18:44

Hayden-IO requested a review from cpanato May 13, 2025 19:10

Hayden-IO enabled auto-merge (squash) May 13, 2025 19:11

cpanato approved these changes May 14, 2025

Hayden-IO merged commit 62a6617 into sigstore:main May 14, 2025
16 checks passed

Hayden-IO deleted the fix-healthchecks branch May 14, 2025 06:36

Hayden-IO mentioned this pull request May 20, 2025

Fix e2e test waiting until services are healthy sigstore/rekor-monitor#670

Merged

Hayden-IO mentioned this pull request May 28, 2025

More linting coverage sigstore/cosign#4213

Merged

Hayden-IO merged 1 commit into sigstore:main from Hayden-IO:fix-healthchecks