Skip to content

fix(http): retry transient HTTP failures with backoff and warn on rescue#9414

Merged
jdx merged 10 commits intomainfrom
claude/agitated-bouman-7eb98c
Apr 26, 2026
Merged

fix(http): retry transient HTTP failures with backoff and warn on rescue#9414
jdx merged 10 commits intomainfrom
claude/agitated-bouman-7eb98c

Conversation

@jdx
Copy link
Copy Markdown
Owner

@jdx jdx commented Apr 26, 2026

Summary

  • Turn on transient-failure retries by default (3 attempts) for all HTTP work in mise. Previously http_retries defaulted to 0, so a one-off 502 from GitHub releases (like the one in run #24963676401) failed installs immediately even though the retry plumbing was already there.
  • Classify errors: retry 5xx, 408, 429, network/connect/timeout, mid-stream body drops; do not retry other 4xx (avoids wasting time on legitimate 404s when probing for tarball variants).
  • Cover the chunk-streaming path in download_file_with_headers too — the original retry only wrapped the initial GET, so a connection drop mid-tarball still failed. Mirrored the same coverage in the vfox crate's download() and lua download_file.
  • Emit a warn! whenever a retry rescues a request, with the original transient error included, so flaky infrastructure doesn't silently mask itself.
  • Tighten the http→https fallback to fire only on connection-level errors, not on HTTP status errors — falling back to https after the server already returned a 4xx made no sense and was silently double-querying every non-2xx response.
  • Hand-rolled backoff schedule (jittered: ~200ms / ~1s / ~4s / ~15s) instead of the previous ExponentialBackoff::from_millis(10) (10/100/1000ms — too short for real server outages). Worst-case added latency at default settings stays around ~20s.

MISE_HTTP_RETRIES=0 still works as an opt-out.

Test plan

  • New unit tests in src/http.rs using a tiny in-process TCP server cover: retry-rescues-after-502s, no-retry-on-404, retry-exhaustion-on-persistent-500, and MISE_HTTP_RETRIES=0 regression check.
  • All 782 existing unit tests still pass (mise run test:unit).
  • vfox crate tests still pass (73 passed).
  • mise run lint clean.
  • Manual smoke test: MISE_DEBUG=1 mise install aqua:mvdan/sh@latest (the originally failing tool).

🤖 Generated with Claude Code


Note

Medium Risk
Changes core HTTP request/download behavior by enabling retries by default and altering retry/backoff and http→https fallback logic, which can affect install/update flows and error surfaces under flaky networks.

Overview
Enables transient HTTP failure retries by default (http_retries now defaults to 3) and documents the new behavior in settings.toml and schema/mise.json.

Replaces tokio-retry with a custom retry_async/backoff implementation in src/http.rs, including transient error classification (5xx/408/429 + network/connect/timeout/body drops), warn logging on each rescued attempt, and a clearer jittered schedule (~200ms/~1s/~4s/~15s). It also restructures downloads to retry the entire streamed body and tightens the http→https fallback to only trigger on connection-level failures.

Mirrors the same retry semantics in the vfox crate (including Lua HTTP download paths) and adds focused unit tests validating retry-on-5xx, no-retry-on-404, exhaustion behavior, and retries-disabled handling.

Reviewed by Cursor Bugbot for commit ccb2d06. Bugbot is set up for automated code reviews on this repo. Configure here.

5xx responses (e.g. GitHub's occasional 502 on release downloads) and
network-layer drops were failing installs immediately because http_retries
defaulted to 0 and the wrapped retry layer never fired. Turn retries on by
default and add a transient-vs-permanent classifier so 4xx still fails fast.

Cover the chunk-streaming path in download_file_with_headers too — the
original retry only wrapped the initial GET, so a connection drop mid-tarball
would still fail. Mirror the same behavior in the vfox crate's downloads.

When a retry rescues a request, log a warn! with the original error so
flaky infrastructure doesn't silently mask itself.

Tighten the http→https fallback to fire only on connection-level errors,
not on HTTP status errors — falling back to https after the server already
returned a 4xx makes no sense and was silently double-querying on every
non-2xx response.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR enables HTTP retries by default (http_retries: 0 → 3), replaces tokio-retry with a hand-rolled retry_async loop using a jittered schedule (~200ms/1s/4s/15s), classifies transient errors (5xx, 408, 429, connect/timeout/body), tightens the http→https fallback to connection-level errors only, and extends retry coverage to the full download stream in download_file_with_headers, vfox, and Lua downloads. Unit tests using an in-process TCP server verify the retry/no-retry/exhaustion paths.

Three P2 gaps noted:

  • get_bytes, get_text, and json_* still read the response body outside retry_async, so mid-stream body drops on those paths aren't retried.
  • 429 retries use the fixed backoff schedule rather than the Retry-After header, potentially burning all retries in ~5 seconds when a server asks for a 60-second wait.
  • The vfox retry_delay jitter uses SystemTime::subsec_nanos, which can be highly correlated across retries in the same ~200ms window.

Confidence Score: 5/5

Safe to merge; all findings are P2 improvement suggestions with no present defects in the changed paths.

All three findings are P2 (non-blocking): partial retry coverage for non-download body reads, missing Retry-After honoring, and correlated jitter in vfox. None cause incorrect behavior for the primary use case (file downloads), and the new unit tests directly validate the retry/exhaustion/no-retry contracts.

src/http.rs (body-read retry gap and 429 Retry-After), crates/vfox/src/http.rs (jitter quality)

Important Files Changed

Filename Overview
src/http.rs Core retry overhaul: replaces tokio-retry with hand-rolled retry_async loop, adds is_transient/is_connection_error classifiers, wraps download_file_with_headers chunk loop inside the retry closure. Two P2 gaps remain: body reads in get_bytes/get_text/json are outside the retry closure, and 429 retries ignore Retry-After.
crates/vfox/src/http.rs New shared retry utilities for the vfox crate. The jitter implementation uses SystemTime subsec_nanos, which can be highly correlated across retries in the same session (P2).
crates/vfox/src/lua_mod/http.rs Refactored to use shared retry utilities. download_file and try_download_file now use retry_async wrapping request+body; get/head paths still use the local send_with_retry loop.
crates/vfox/src/vfox.rs Vfox download() now uses retry_async wrapping the full request+bytes flow. Change is minimal and correct.
settings.toml http_retries default changed from 0 to 3; documentation updated to describe transient-error classification and backoff schedule.
schema/mise.json Schema default for http_retries updated from 0 to 3, matching the settings.toml change.
Cargo.toml tokio-retry dependency removed; no longer needed since retry logic is hand-rolled. Clean removal.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HTTP request entry point] --> B{download?}
    B -- yes --> C[download_file_with_headers]
    B -- no --> D[get_async_with_headers / head_async_with_headers]

    C --> E[retry_async loop max http_retries attempts]
    E --> F[send_once_with_https_fallback]
    F --> G[send_once]
    G --> H{status error?}
    H -- 5xx/408/429 --> I{is_transient?}
    H -- 4xx other --> J[return Err immediately]
    I -- yes + retries left --> K[jittered backoff sleep ~200ms/1s/4s/15s]
    K --> E
    I -- no / retries exhausted --> J
    H -- ok --> L[stream chunks to tempfile]
    L --> M{chunk error?}
    M -- is_body error --> I
    M -- ok --> N[persist to path]

    D --> O[send_with_https_fallback]
    O --> E2[retry_async loop]
    E2 --> F2[send_once_with_https_fallback]
    F2 --> G2[send_once error_for_status_ref]
    G2 --> I2{is_transient?}
    I2 -- yes + retries left --> K2[jittered backoff sleep]
    K2 --> E2
    I2 -- no / exhausted --> J2[return Err]
    G2 -- ok --> P[return Response]
    P --> Q[resp.bytes / .text / .json NOT retried]
Loading

Fix All in Claude Code

Reviews (7): Last reviewed commit: "chore(deps): regenerate Cargo.lock after..." | Re-trigger Greptile

Comment thread src/http.rs
Comment thread crates/vfox/src/http.rs Outdated
Comment thread src/http.rs
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 26, 2026

Hyperfine Performance

mise x -- echo

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.4.22 x -- echo 24.7 ± 0.9 23.5 39.9 1.00
mise x -- echo 25.5 ± 1.2 24.1 35.6 1.03 ± 0.06

mise env

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.4.22 env 23.5 ± 0.7 22.6 29.2 1.00
mise env 24.1 ± 0.9 23.3 36.5 1.03 ± 0.05

mise hook-env

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.4.22 hook-env 25.2 ± 0.8 23.6 28.7 1.00
mise hook-env 26.5 ± 1.1 24.7 44.9 1.05 ± 0.06

mise ls

Command Mean [ms] Min [ms] Max [ms] Relative
mise-2026.4.22 ls 25.6 ± 0.5 24.2 29.4 1.00
mise ls 27.2 ± 0.4 26.1 28.9 1.06 ± 0.03

xtasks/test/perf

Command mise-2026.4.22 mise Variance
install (cached) 174ms 169ms +2%
ls (cached) 88ms 86ms +2%
bin-paths (cached) 87ms 88ms -1%
task-ls (cached) 809ms 798ms +1%

- vfox crate now reads MISE_HTTP_RETRIES at runtime instead of using a
  hardcoded retry count, so the documented opt-out works for vfox-driven
  downloads too.
- Backoff schedule in vfox now matches the main crate (~200ms / ~1s / ~4s
  / ~15s) rather than a linear 200/400/600ms.
- Replace `while let Some(_) = iter.next()` with `for _ in iter` to satisfy
  clippy::while_let_on_iterator (only enforced in CI, not local lint).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jdx
Copy link
Copy Markdown
Owner Author

jdx commented Apr 26, 2026

Thanks for the review. Addressed both issues:

Issue 2 (vfox ignores http_retries) — fixed in 043fc68. vfox now reads MISE_HTTP_RETRIES directly (since it can't depend on mise's Settings layer) and the backoff schedule matches the main crate's ~200ms / ~1s / ~4s / ~15s.

Issue 1 (5xx retry dead for non-download paths) — I think this is a misread. send_once calls resp.error_for_status_ref()? at src/http.rs:435, so a 502 surfaces as Err(reqwest::Error) inside the closure, not as Ok(Response_5xx). The retry loop then sees Err, classifies it as transient via status() == Some(502), and retries.

The two unit tests (test_retry_succeeds_after_two_502s, test_retry_exhausted_on_persistent_500) call client.get_async() — exactly the non-download path the analysis claims is broken — and both pass on macOS CI (unit job). They assert count == 3 and count == 2 respectively, which would fail if no retry was happening.

That said, the redundant error_for_status_ref()? calls in get_async_with_headers and head_async_with_headers are a smell — they're no-ops because send_once already did the check. I'll leave the cleanup for a separate PR to keep this one focused.

This comment was generated by an AI coding assistant.

jdx and others added 2 commits April 26, 2026 17:07
The fixed 4-element schedule with `.take(retries)` capped any
`MISE_HTTP_RETRIES > 4` at 4 retries — a regression vs. the previous
unbounded `ExponentialBackoff::from_millis(10)`. Chain the schedule with
a repeat of the longest delay (15s) so any retry count is honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Previously a transient failure was silent until the retry sequence either
succeeded (warn on rescue) or exhausted (raw error propagated). With the
new ~15s tail of the backoff schedule, the user would sit through long
delays with no feedback about what was going wrong.

Now every transient failure emits a warn! the moment it happens, with the
attempt number and next-retry delay. Successful rescues still warn
("succeeded on attempt N"), and exhausted retries also warn before
propagating the final error so the attempt count is visible.

Mirror the same behavior in the vfox crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Comment thread crates/vfox/src/lua_mod/http.rs Outdated
jdx and others added 2 commits April 26, 2026 17:21
The retry-status arm in vfox's send_with_retry only fires while attempts
remain (`attempt + 1 < attempts`), so the final attempt's 5xx falls
through to the success arm. With prior transient failures, that arm was
warning "succeeded on attempt N" while returning a 5xx response —
exactly the silent-mask the warnings were added to prevent.

Distinguish real success from "ran out of retries with a bad status" by
checking `resp.status().is_success()` and logging "failed after N
attempts: HTTP <status>" otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Per-attempt warnings already announce each transient failure as it
happens. The post-loop "succeeded on attempt N" / "failed after N
attempts" warnings just duplicated information the user already saw or
will see via the propagated response/error.

Keep the per-attempt warns (they're the live signal), drop the rest.
Same simplification across all three retry paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Comment thread src/http.rs
The retry refactor swapped get_async_with_headers for the lower-level
send_once_with_https_fallback to avoid retry-on-retry, which dropped the
offline-mode guard that get_async_with_headers does. Add it back at the
top of download_file_with_headers so downloads honor MISE_OFFLINE again.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2cd5b8f. Configure here.

Comment thread src/http.rs
jdx and others added 3 commits April 26, 2026 17:48
tokio_retry's `jitter` returns a value in [0, d), so a 200ms base could
become 2ms — effectively no backoff. Replace with an "equal jitter"
helper (random in [d/2, d)) matching what the vfox crate already does.
This keeps the documented schedule (~200ms / ~1s / ~4s / ~15s)
honest at the lower bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Replaced its `jitter` helper with a local equal-jitter implementation in
the previous commit; the crate's `Retry` driver was already removed when
the manual loop landed. cargo machete flagged it.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jdx jdx enabled auto-merge (squash) April 26, 2026 23:36
@jdx jdx merged commit 5b9e303 into main Apr 26, 2026
50 of 53 checks passed
@jdx jdx deleted the claude/agitated-bouman-7eb98c branch April 26, 2026 23:40
jdx added a commit that referenced this pull request Apr 26, 2026
## Summary

Add `mise.en.dev` to the list of Cloudflare zones purged at the end of
`scripts/publish-s3.sh`. Previously only `jdx.dev` and `mise.run` were
being purged.

## Why

`install.sh` and `install.sh.minisig` are uploaded to S3 with
`cache-control: max-age=86400,s-maxage=86400,public,immutable`. Without
an explicit purge per CDN zone, each zone keeps serving the previous
release's bytes for up to 24 hours — even after S3 has the new bytes.

Since #9411 made `mise.en.dev` the canonical bootstrap host (used by
`mise generate tool-stub --bootstrap` and `mise generate bootstrap`),
this manifested as: `mise.en.dev/install.sh` serving the v(N-1) script
next to a v(N) `install.sh.minisig`, causing minisign verification to
fail. Caught today as recurring CI failures on
[#9414](#9414) (e2e-0 / e2e-1).

The other half — that `scripts/update-redirect.sh` was deleted in #9411
— turned out not to be related; that script only updated a
`mise-latest-*` redirect rule, not the install.sh path. The real issue
is just the missing purge.

## Test plan

- [x] Bash syntax check (`bash -n scripts/publish-s3.sh`)
- [x] Verified the en.dev zone ID `531d003297f1f4ae2415b41f7f5da8fa`
matches the value previously used in the now-deleted
`scripts/update-redirect.sh` (commit `68075d866`)
- [ ] On the next release, confirm in the workflow logs that all three
purges run, and that `curl https://mise.en.dev/install.sh` returns the
new version's content within seconds of the deploy completing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: only adjusts post-publish CDN cache purging logic to include
an additional zone and reduce duplication; no changes to artifact
generation or upload behavior.
> 
> **Overview**
> After publishing release artifacts to S3, `scripts/publish-s3.sh` now
purges Cloudflare cache for **all** relevant CDN zones via a looped
`ZONES` list, adding the missing `en.dev`/`mise.en.dev` zone.
> 
> This replaces the two hardcoded purge calls with a single per-zone
purge step to prevent mixed-version `install.sh`/signature artifacts
being served from different zones under `immutable` caching.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e083358. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
mise-en-dev added a commit that referenced this pull request Apr 27, 2026
### 🚀 Features

- **(ls-remote)** add `prereleases` setting and `--prerelease` flag by
@jdx in [#9415](#9415)

### 🐛 Bug Fixes

- **(http)** retry transient HTTP failures with backoff and warn on
rescue by @jdx in [#9414](#9414)
- **(release)** purge mise.en.dev CDN zone after each S3 publish by @jdx
in [#9416](#9416)

### 📚 Documentation

- prefix GitHub star count with ★ glyph by @jdx in
[#9417](#9417)
- update intro messaging by @jdx in
[#9418](#9418)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant