fix(http): retry transient HTTP failures with backoff and warn on rescue by jdx · Pull Request #9414 · jdx/mise

jdx · 2026-04-26T21:03:49Z

Summary

Turn on transient-failure retries by default (3 attempts) for all HTTP work in mise. Previously http_retries defaulted to 0, so a one-off 502 from GitHub releases (like the one in run #24963676401) failed installs immediately even though the retry plumbing was already there.
Classify errors: retry 5xx, 408, 429, network/connect/timeout, mid-stream body drops; do not retry other 4xx (avoids wasting time on legitimate 404s when probing for tarball variants).
Cover the chunk-streaming path in download_file_with_headers too — the original retry only wrapped the initial GET, so a connection drop mid-tarball still failed. Mirrored the same coverage in the vfox crate's download() and lua download_file.
Emit a warn! whenever a retry rescues a request, with the original transient error included, so flaky infrastructure doesn't silently mask itself.
Tighten the http→https fallback to fire only on connection-level errors, not on HTTP status errors — falling back to https after the server already returned a 4xx made no sense and was silently double-querying every non-2xx response.
Hand-rolled backoff schedule (jittered: ~200ms / ~1s / ~4s / ~15s) instead of the previous ExponentialBackoff::from_millis(10) (10/100/1000ms — too short for real server outages). Worst-case added latency at default settings stays around ~20s.

MISE_HTTP_RETRIES=0 still works as an opt-out.

Test plan

New unit tests in src/http.rs using a tiny in-process TCP server cover: retry-rescues-after-502s, no-retry-on-404, retry-exhaustion-on-persistent-500, and MISE_HTTP_RETRIES=0 regression check.
All 782 existing unit tests still pass (mise run test:unit).
vfox crate tests still pass (73 passed).
mise run lint clean.
Manual smoke test: MISE_DEBUG=1 mise install aqua:mvdan/sh@latest (the originally failing tool).

🤖 Generated with Claude Code

Note

Medium Risk
Changes core HTTP request/download behavior by enabling retries by default and altering retry/backoff and http→https fallback logic, which can affect install/update flows and error surfaces under flaky networks.

Overview
Enables transient HTTP failure retries by default (http_retries now defaults to 3) and documents the new behavior in settings.toml and schema/mise.json.

Replaces tokio-retry with a custom retry_async/backoff implementation in src/http.rs, including transient error classification (5xx/408/429 + network/connect/timeout/body drops), warn logging on each rescued attempt, and a clearer jittered schedule (~200ms/~1s/~4s/~15s). It also restructures downloads to retry the entire streamed body and tightens the http→https fallback to only trigger on connection-level failures.

Mirrors the same retry semantics in the vfox crate (including Lua HTTP download paths) and adds focused unit tests validating retry-on-5xx, no-retry-on-404, exhaustion behavior, and retries-disabled handling.

^{Reviewed by Cursor Bugbot for commit ccb2d06. Bugbot is set up for automated code reviews on this repo. Configure here.}

5xx responses (e.g. GitHub's occasional 502 on release downloads) and network-layer drops were failing installs immediately because http_retries defaulted to 0 and the wrapped retry layer never fired. Turn retries on by default and add a transient-vs-permanent classifier so 4xx still fails fast. Cover the chunk-streaming path in download_file_with_headers too — the original retry only wrapped the initial GET, so a connection drop mid-tarball would still fail. Mirror the same behavior in the vfox crate's downloads. When a retry rescues a request, log a warn! with the original error so flaky infrastructure doesn't silently mask itself. Tighten the http→https fallback to fire only on connection-level errors, not on HTTP status errors — falling back to https after the server already returned a 4xx makes no sense and was silently double-querying on every non-2xx response. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

gemini-code-assist · 2026-04-26T21:03:52Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

greptile-apps · 2026-04-26T21:08:54Z

Greptile Summary

This PR enables HTTP retries by default (http_retries: 0 → 3), replaces tokio-retry with a hand-rolled retry_async loop using a jittered schedule (~200ms/1s/4s/15s), classifies transient errors (5xx, 408, 429, connect/timeout/body), tightens the http→https fallback to connection-level errors only, and extends retry coverage to the full download stream in download_file_with_headers, vfox, and Lua downloads. Unit tests using an in-process TCP server verify the retry/no-retry/exhaustion paths.

Three P2 gaps noted:

get_bytes, get_text, and json_* still read the response body outside retry_async, so mid-stream body drops on those paths aren't retried.
429 retries use the fixed backoff schedule rather than the Retry-After header, potentially burning all retries in ~5 seconds when a server asks for a 60-second wait.
The vfox retry_delay jitter uses SystemTime::subsec_nanos, which can be highly correlated across retries in the same ~200ms window.

Confidence Score: 5/5

Safe to merge; all findings are P2 improvement suggestions with no present defects in the changed paths.

All three findings are P2 (non-blocking): partial retry coverage for non-download body reads, missing Retry-After honoring, and correlated jitter in vfox. None cause incorrect behavior for the primary use case (file downloads), and the new unit tests directly validate the retry/exhaustion/no-retry contracts.

src/http.rs (body-read retry gap and 429 Retry-After), crates/vfox/src/http.rs (jitter quality)

Important Files Changed

Filename	Overview
src/http.rs	Core retry overhaul: replaces tokio-retry with hand-rolled retry_async loop, adds is_transient/is_connection_error classifiers, wraps download_file_with_headers chunk loop inside the retry closure. Two P2 gaps remain: body reads in get_bytes/get_text/json are outside the retry closure, and 429 retries ignore Retry-After.
crates/vfox/src/http.rs	New shared retry utilities for the vfox crate. The jitter implementation uses SystemTime subsec_nanos, which can be highly correlated across retries in the same session (P2).
crates/vfox/src/lua_mod/http.rs	Refactored to use shared retry utilities. download_file and try_download_file now use retry_async wrapping request+body; get/head paths still use the local send_with_retry loop.
crates/vfox/src/vfox.rs	Vfox download() now uses retry_async wrapping the full request+bytes flow. Change is minimal and correct.
settings.toml	http_retries default changed from 0 to 3; documentation updated to describe transient-error classification and backoff schedule.
schema/mise.json	Schema default for http_retries updated from 0 to 3, matching the settings.toml change.
Cargo.toml	tokio-retry dependency removed; no longer needed since retry logic is hand-rolled. Clean removal.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HTTP request entry point] --> B{download?}
    B -- yes --> C[download_file_with_headers]
    B -- no --> D[get_async_with_headers / head_async_with_headers]

    C --> E[retry_async loop max http_retries attempts]
    E --> F[send_once_with_https_fallback]
    F --> G[send_once]
    G --> H{status error?}
    H -- 5xx/408/429 --> I{is_transient?}
    H -- 4xx other --> J[return Err immediately]
    I -- yes + retries left --> K[jittered backoff sleep ~200ms/1s/4s/15s]
    K --> E
    I -- no / retries exhausted --> J
    H -- ok --> L[stream chunks to tempfile]
    L --> M{chunk error?}
    M -- is_body error --> I
    M -- ok --> N[persist to path]

    D --> O[send_with_https_fallback]
    O --> E2[retry_async loop]
    E2 --> F2[send_once_with_https_fallback]
    F2 --> G2[send_once error_for_status_ref]
    G2 --> I2{is_transient?}
    I2 -- yes + retries left --> K2[jittered backoff sleep]
    K2 --> E2
    I2 -- no / exhausted --> J2[return Err]
    G2 -- ok --> P[return Response]
    P --> Q[resp.bytes / .text / .json NOT retried]

_{Reviews (7): Last reviewed commit: "chore(deps): regenerate Cargo.lock after..." | Re-trigger Greptile}

github-actions · 2026-04-26T21:23:20Z

Hyperfine Performance

`mise x -- echo`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`mise-2026.4.22 x -- echo`	24.7 ± 0.9	23.5	39.9	1.00
`mise x -- echo`	25.5 ± 1.2	24.1	35.6	1.03 ± 0.06

`mise env`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`mise-2026.4.22 env`	23.5 ± 0.7	22.6	29.2	1.00
`mise env`	24.1 ± 0.9	23.3	36.5	1.03 ± 0.05

`mise hook-env`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`mise-2026.4.22 hook-env`	25.2 ± 0.8	23.6	28.7	1.00
`mise hook-env`	26.5 ± 1.1	24.7	44.9	1.05 ± 0.06

`mise ls`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`mise-2026.4.22 ls`	25.6 ± 0.5	24.2	29.4	1.00
`mise ls`	27.2 ± 0.4	26.1	28.9	1.06 ± 0.03

`xtasks/test/perf`

Command	mise-2026.4.22	mise	Variance
install (cached)	174ms	169ms	+2%
ls (cached)	88ms	86ms	+2%
bin-paths (cached)	87ms	88ms	-1%
task-ls (cached)	809ms	798ms	+1%

- vfox crate now reads MISE_HTTP_RETRIES at runtime instead of using a hardcoded retry count, so the documented opt-out works for vfox-driven downloads too. - Backoff schedule in vfox now matches the main crate (~200ms / ~1s / ~4s / ~15s) rather than a linear 200/400/600ms. - Replace `while let Some(_) = iter.next()` with `for _ in iter` to satisfy clippy::while_let_on_iterator (only enforced in CI, not local lint). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jdx · 2026-04-26T22:03:43Z

Thanks for the review. Addressed both issues:

Issue 2 (vfox ignores http_retries) — fixed in 043fc68. vfox now reads MISE_HTTP_RETRIES directly (since it can't depend on mise's Settings layer) and the backoff schedule matches the main crate's ~200ms / ~1s / ~4s / ~15s.

Issue 1 (5xx retry dead for non-download paths) — I think this is a misread. send_once calls resp.error_for_status_ref()? at src/http.rs:435, so a 502 surfaces as Err(reqwest::Error) inside the closure, not as Ok(Response_5xx). The retry loop then sees Err, classifies it as transient via status() == Some(502), and retries.

The two unit tests (test_retry_succeeds_after_two_502s, test_retry_exhausted_on_persistent_500) call client.get_async() — exactly the non-download path the analysis claims is broken — and both pass on macOS CI (unit job). They assert count == 3 and count == 2 respectively, which would fail if no retry was happening.

That said, the redundant error_for_status_ref()? calls in get_async_with_headers and head_async_with_headers are a smell — they're no-ops because send_once already did the check. I'll leave the cleanup for a separate PR to keep this one focused.

This comment was generated by an AI coding assistant.

The fixed 4-element schedule with `.take(retries)` capped any `MISE_HTTP_RETRIES > 4` at 4 retries — a regression vs. the previous unbounded `ExponentialBackoff::from_millis(10)`. Chain the schedule with a repeat of the longest delay (15s) so any retry count is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Previously a transient failure was silent until the retry sequence either succeeded (warn on rescue) or exhausted (raw error propagated). With the new ~15s tail of the backoff schedule, the user would sit through long delays with no feedback about what was going wrong. Now every transient failure emits a warn! the moment it happens, with the attempt number and next-retry delay. Successful rescues still warn ("succeeded on attempt N"), and exhausted retries also warn before propagating the final error so the attempt count is visible. Mirror the same behavior in the vfox crate. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The retry-status arm in vfox's send_with_retry only fires while attempts remain (`attempt + 1 < attempts`), so the final attempt's 5xx falls through to the success arm. With prior transient failures, that arm was warning "succeeded on attempt N" while returning a 5xx response — exactly the silent-mask the warnings were added to prevent. Distinguish real success from "ran out of retries with a bad status" by checking `resp.status().is_success()` and logging "failed after N attempts: HTTP <status>" otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Per-attempt warnings already announce each transient failure as it happens. The post-loop "succeeded on attempt N" / "failed after N attempts" warnings just duplicated information the user already saw or will see via the propagated response/error. Keep the per-attempt warns (they're the live signal), drop the rest. Same simplification across all three retry paths. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The retry refactor swapped get_async_with_headers for the lower-level send_once_with_https_fallback to avoid retry-on-retry, which dropped the offline-mode guard that get_async_with_headers does. Add it back at the top of download_file_with_headers so downloads honor MISE_OFFLINE again. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2cd5b8f. Configure here.}

tokio_retry's `jitter` returns a value in [0, d), so a 200ms base could become 2ms — effectively no backoff. Replace with an "equal jitter" helper (random in [d/2, d)) matching what the vfox crate already does. This keeps the documented schedule (~200ms / ~1s / ~4s / ~15s) honest at the lower bound. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Replaced its `jitter` helper with a local equal-jitter implementation in the previous commit; the crate's `Retry` driver was already removed when the manual loop landed. cargo machete flagged it. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

## Summary Add `mise.en.dev` to the list of Cloudflare zones purged at the end of `scripts/publish-s3.sh`. Previously only `jdx.dev` and `mise.run` were being purged. ## Why `install.sh` and `install.sh.minisig` are uploaded to S3 with `cache-control: max-age=86400,s-maxage=86400,public,immutable`. Without an explicit purge per CDN zone, each zone keeps serving the previous release's bytes for up to 24 hours — even after S3 has the new bytes. Since #9411 made `mise.en.dev` the canonical bootstrap host (used by `mise generate tool-stub --bootstrap` and `mise generate bootstrap`), this manifested as: `mise.en.dev/install.sh` serving the v(N-1) script next to a v(N) `install.sh.minisig`, causing minisign verification to fail. Caught today as recurring CI failures on [#9414](#9414) (e2e-0 / e2e-1). The other half — that `scripts/update-redirect.sh` was deleted in #9411 — turned out not to be related; that script only updated a `mise-latest-*` redirect rule, not the install.sh path. The real issue is just the missing purge. ## Test plan - [x] Bash syntax check (`bash -n scripts/publish-s3.sh`) - [x] Verified the en.dev zone ID `531d003297f1f4ae2415b41f7f5da8fa` matches the value previously used in the now-deleted `scripts/update-redirect.sh` (commit `68075d866`) - [ ] On the next release, confirm in the workflow logs that all three purges run, and that `curl https://mise.en.dev/install.sh` returns the new version's content within seconds of the deploy completing 🤖 Generated with [Claude Code](https://claude.com/claude-code)  --- > [!NOTE] > **Low Risk** > Low risk: only adjusts post-publish CDN cache purging logic to include an additional zone and reduce duplication; no changes to artifact generation or upload behavior. > > **Overview** > After publishing release artifacts to S3, `scripts/publish-s3.sh` now purges Cloudflare cache for **all** relevant CDN zones via a looped `ZONES` list, adding the missing `en.dev`/`mise.en.dev` zone. > > This replaces the two hardcoded purge calls with a single per-zone purge step to prevent mixed-version `install.sh`/signature artifacts being served from different zones under `immutable` caching. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit e083358. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>  Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

@jdx

### 🚀 Features - **(ls-remote)** add `prereleases` setting and `--prerelease` flag by @jdx in [#9415](#9415) ### 🐛 Bug Fixes - **(http)** retry transient HTTP failures with backoff and warn on rescue by @jdx in [#9414](#9414) - **(release)** purge mise.en.dev CDN zone after each S3 publish by @jdx in [#9416](#9416) ### 📚 Documentation - prefix GitHub star count with ★ glyph by @jdx in [#9417](#9417) - update intro messaging by @jdx in [#9418](#9418)

greptile-apps Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread src/http.rs

Comment thread crates/vfox/src/http.rs Outdated

cursor Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread src/http.rs

jdx and others added 2 commits April 26, 2026 17:07

greptile-apps Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread crates/vfox/src/lua_mod/http.rs Outdated

jdx and others added 2 commits April 26, 2026 17:21

cursor Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread src/http.rs

cursor Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread src/http.rs

jdx and others added 3 commits April 26, 2026 17:48

chore(deps): regenerate Cargo.lock after dropping tokio-retry

ccb2d06

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jdx mentioned this pull request Apr 26, 2026

fix(release): purge mise.en.dev CDN zone after each S3 publish #9416

Merged

3 tasks

jdx enabled auto-merge (squash) April 26, 2026 23:36

jdx merged commit 5b9e303 into main Apr 26, 2026
50 of 53 checks passed

jdx deleted the claude/agitated-bouman-7eb98c branch April 26, 2026 23:40

mise-en-dev mentioned this pull request Apr 26, 2026

chore: release 2026.4.24 #9413

Merged

BrewTestBot mentioned this pull request Apr 27, 2026

mise 2026.4.24 Homebrew/homebrew-core#279637

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(http): retry transient HTTP failures with backoff and warn on rescue#9414

fix(http): retry transient HTTP failures with backoff and warn on rescue#9414
jdx merged 10 commits intomainfrom
claude/agitated-bouman-7eb98c

jdx commented Apr 26, 2026 •

edited by cursor Bot

Loading

Uh oh!

gemini-code-assist Bot commented Apr 26, 2026

Uh oh!

greptile-apps Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

jdx commented Apr 26, 2026

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jdx commented Apr 26, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

gemini-code-assist Bot commented Apr 26, 2026

Uh oh!

greptile-apps Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hyperfine Performance

mise x -- echo

mise env

mise hook-env

mise ls

xtasks/test/perf

Uh oh!

jdx commented Apr 26, 2026

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jdx commented Apr 26, 2026 •

edited by cursor Bot

Loading

greptile-apps Bot commented Apr 26, 2026 •

edited

Loading

github-actions Bot commented Apr 26, 2026 •

edited

Loading

`mise x -- echo`

`mise env`

`mise hook-env`

`mise ls`

`xtasks/test/perf`