fix(http): retry transient HTTP failures with backoff and warn on rescue#9414
Conversation
5xx responses (e.g. GitHub's occasional 502 on release downloads) and network-layer drops were failing installs immediately because `http_retries` defaulted to 0 and the wrapped retry layer never fired.

- Turn retries on by default and add a transient-vs-permanent classifier so 4xx still fails fast.
- Cover the chunk-streaming path in `download_file_with_headers` too — the original retry only wrapped the initial GET, so a connection drop mid-tarball would still fail. Mirror the same behavior in the vfox crate's downloads.
- When a retry rescues a request, log a `warn!` with the original error so flaky infrastructure doesn't silently mask itself.
- Tighten the http→https fallback to fire only on connection-level errors, not on HTTP status errors — falling back to https after the server already returned a 4xx makes no sense and was silently double-querying on every non-2xx response.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
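The transient-vs-permanent split can be sketched as a status classifier. This is a minimal sketch of the status-code half only, with an illustrative name; the real classifier in `src/http.rs` also treats network-level failures (connect/timeout/body drops) as transient:

```rust
/// Sketch of the status-code half of the transient classifier: 5xx,
/// 408 (Request Timeout), and 429 (Too Many Requests) are worth
/// retrying; any other 4xx is permanent and should fail fast.
/// (Hypothetical helper name; the real code also classifies
/// connect/timeout/body errors as transient.)
fn is_transient_status(status: u16) -> bool {
    matches!(status, 500..=599 | 408 | 429)
}
```

A 404, for example, will never succeed on retry, so it returns an error immediately instead of burning through the backoff schedule.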
Greptile Summary

This PR enables HTTP retries by default (`http_retries` now defaults to 3). Three P2 gaps noted: partial retry coverage for non-download body reads, missing Retry-After honoring, and correlated jitter in vfox.

Confidence Score: 5/5

Safe to merge; all findings are P2 improvement suggestions with no present defects in the changed paths. All three findings are P2 (non-blocking). None cause incorrect behavior for the primary use case (file downloads), and the new unit tests directly validate the retry/exhaustion/no-retry contracts.

Important Files Changed: `src/http.rs` (body-read retry gap and 429 Retry-After), `crates/vfox/src/http.rs` (jitter quality)
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[HTTP request entry point] --> B{download?}
B -- yes --> C[download_file_with_headers]
B -- no --> D[get_async_with_headers / head_async_with_headers]
C --> E[retry_async loop max http_retries attempts]
E --> F[send_once_with_https_fallback]
F --> G[send_once]
G --> H{status error?}
H -- 5xx/408/429 --> I{is_transient?}
H -- 4xx other --> J[return Err immediately]
I -- yes + retries left --> K[jittered backoff sleep ~200ms/1s/4s/15s]
K --> E
I -- no / retries exhausted --> J
H -- ok --> L[stream chunks to tempfile]
L --> M{chunk error?}
M -- is_body error --> I
M -- ok --> N[persist to path]
D --> O[send_with_https_fallback]
O --> E2[retry_async loop]
E2 --> F2[send_once_with_https_fallback]
F2 --> G2[send_once error_for_status_ref]
G2 --> I2{is_transient?}
I2 -- yes + retries left --> K2[jittered backoff sleep]
K2 --> E2
I2 -- no / exhausted --> J2[return Err]
G2 -- ok --> P[return Response]
P --> Q[resp.bytes / .text / .json NOT retried]
```
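The retry loop at the heart of both flowchart paths can be sketched as a simplified synchronous driver. This is an illustrative stand-in for the async `retry_async` in `src/http.rs` (names assumed); the backoff sleep and `warn!` logging are elided to comments:

```rust
/// Simplified sketch of the retry driver: run the operation, retry
/// transient failures while attempts remain, and return permanent
/// errors (or the final transient error) immediately.
/// `retries` is the number of *retries*, so up to `retries + 1`
/// attempts are made in total.
fn retry_with_backoff<T, E>(
    retries: usize,
    is_transient: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    for attempt in 0..=retries {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < retries && is_transient(&e) => {
                // transient failure with retries left: in the real async
                // code, a warn! and a jittered backoff sleep happen here
            }
            Err(e) => return Err(e), // permanent error, or retries exhausted
        }
    }
    unreachable!("the final attempt always returns")
}
```

With `retries = 0` (the old default) the transient arm can never fire, which is why the plumbing existed but never rescued anything.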
Reviews (7): Last reviewed commit: "chore(deps): regenerate Cargo.lock after..."
Hyperfine Performance

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.4.22 x -- echo` | 24.7 ± 0.9 | 23.5 | 39.9 | 1.00 |
| `mise x -- echo` | 25.5 ± 1.2 | 24.1 | 35.6 | 1.03 ± 0.06 |

mise env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.4.22 env` | 23.5 ± 0.7 | 22.6 | 29.2 | 1.00 |
| `mise env` | 24.1 ± 0.9 | 23.3 | 36.5 | 1.03 ± 0.05 |

mise hook-env

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.4.22 hook-env` | 25.2 ± 0.8 | 23.6 | 28.7 | 1.00 |
| `mise hook-env` | 26.5 ± 1.1 | 24.7 | 44.9 | 1.05 ± 0.06 |

mise ls

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `mise-2026.4.22 ls` | 25.6 ± 0.5 | 24.2 | 29.4 | 1.00 |
| `mise ls` | 27.2 ± 0.4 | 26.1 | 28.9 | 1.06 ± 0.03 |
xtasks/test/perf
| Command | mise-2026.4.22 | mise | Variance |
|---|---|---|---|
| install (cached) | 174ms | 169ms | +2% |
| ls (cached) | 88ms | 86ms | +2% |
| bin-paths (cached) | 87ms | 88ms | -1% |
| task-ls (cached) | 809ms | 798ms | +1% |
- vfox crate now reads `MISE_HTTP_RETRIES` at runtime instead of using a hardcoded retry count, so the documented opt-out works for vfox-driven downloads too.
- Backoff schedule in vfox now matches the main crate (~200ms / ~1s / ~4s / ~15s) rather than a linear 200/400/600ms.
- Replace `while let Some(_) = iter.next()` with `for _ in iter` to satisfy clippy::while_let_on_iterator (only enforced in CI, not local lint).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
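The runtime read of the retry count can be sketched like this. `resolve_http_retries` is a hypothetical helper, and the raw value is passed in as a parameter (standing in for `std::env::var("MISE_HTTP_RETRIES").ok()`) so the resolution logic is visible on its own; the default of 3 matches the new default:

```rust
/// Sketch of resolving the retry count at runtime rather than
/// hardcoding it. Unset or unparseable values fall back to the
/// default of 3; an explicit "0" disables retries entirely.
/// (Illustrative; the real vfox code reads the env var directly.)
fn resolve_http_retries(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.parse().ok()).unwrap_or(3)
}
```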
Thanks for the review. Addressed both issues:

- Issue 2 (vfox ignores `MISE_HTTP_RETRIES`): fixed; the vfox crate now reads the setting at runtime instead of using a hardcoded retry count.
- Issue 1 (5xx retry dead for non-download paths): I think this is a misread; the two unit tests (…) exercise exactly that path. That said, the redundant …

This comment was generated by an AI coding assistant.
The fixed 4-element schedule with `.take(retries)` capped any `MISE_HTTP_RETRIES > 4` at 4 retries — a regression vs. the previous unbounded `ExponentialBackoff::from_millis(10)`. Chain the schedule with a repeat of the longest delay (15s) so any retry count is honored. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
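A sketch of the chained schedule under the stated delays; `backoff_schedule` is a hypothetical name for illustration, not the function in the actual diff:

```rust
use std::{iter, time::Duration};

/// Sketch of the fix: the fixed 4-step schedule extended by repeating
/// its longest delay (15s), so a retry count above 4 is honored
/// instead of being silently capped by `.take(retries)` alone.
fn backoff_schedule(retries: usize) -> Vec<Duration> {
    let base = [
        Duration::from_millis(200),
        Duration::from_secs(1),
        Duration::from_secs(4),
        Duration::from_secs(15),
    ];
    base.into_iter()
        .chain(iter::repeat(Duration::from_secs(15)))
        .take(retries)
        .collect()
}
```

With the repeat in place, `MISE_HTTP_RETRIES=6` yields 200ms / 1s / 4s / 15s / 15s / 15s rather than stopping at four delays.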
Previously a transient failure was silent until the retry sequence either
succeeded (warn on rescue) or exhausted (raw error propagated). With the
new ~15s tail of the backoff schedule, the user would sit through long
delays with no feedback about what was going wrong.
Now every transient failure emits a warn! the moment it happens, with the
attempt number and next-retry delay. Successful rescues still warn
("succeeded on attempt N"), and exhausted retries also warn before
propagating the final error so the attempt count is visible.
Mirror the same behavior in the vfox crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The retry-status arm in vfox's send_with_retry only fires while attempts remain (`attempt + 1 < attempts`), so the final attempt's 5xx falls through to the success arm. With prior transient failures, that arm was warning "succeeded on attempt N" while returning a 5xx response — exactly the silent-mask the warnings were added to prevent. Distinguish real success from "ran out of retries with a bad status" by checking `resp.status().is_success()` and logging "failed after N attempts: HTTP <status>" otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
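The success-vs-exhaustion distinction this commit draws can be sketched as a tiny helper. Names are illustrative, and `status` stands in for `resp.status().as_u16()`; this only models the log-message choice, not the surrounding loop:

```rust
/// Sketch of the fixed post-loop arm: only claim success when the
/// final response actually has a 2xx status; otherwise report that
/// retries ran out with a bad status, so a trailing 5xx is no longer
/// silently masked by a "succeeded" message.
fn final_attempt_log(attempts: usize, status: u16) -> String {
    if (200..300).contains(&status) {
        format!("succeeded on attempt {attempts}")
    } else {
        format!("failed after {attempts} attempts: HTTP {status}")
    }
}
```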
Per-attempt warnings already announce each transient failure as it happens. The post-loop "succeeded on attempt N" / "failed after N attempts" warnings just duplicated information the user already saw or will see via the propagated response/error. Keep the per-attempt warns (they're the live signal), drop the rest. Same simplification across all three retry paths. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The retry refactor swapped get_async_with_headers for the lower-level send_once_with_https_fallback to avoid retry-on-retry, which dropped the offline-mode guard that get_async_with_headers does. Add it back at the top of download_file_with_headers so downloads honor MISE_OFFLINE again. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 2cd5b8f.
tokio_retry's `jitter` returns a value in [0, d), so a 200ms base could become 2ms — effectively no backoff. Replace with an "equal jitter" helper (random in [d/2, d)) matching what the vfox crate already does. This keeps the documented schedule (~200ms / ~1s / ~4s / ~15s) honest at the lower bound. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
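A minimal sketch of the equal-jitter idea described here, with the uniform random sample `r` in [0, 1) passed in as a parameter so the math is visible (an assumption for testability; the real helper draws from an RNG internally):

```rust
use std::time::Duration;

/// "Equal jitter": keep at least half the nominal delay and randomize
/// the rest into [d/2, d), unlike a jitter drawn from [0, d) that can
/// shrink a 200ms backoff to almost nothing.
fn equal_jitter(d: Duration, r: f64) -> Duration {
    let half = d / 2;
    half + Duration::from_secs_f64(half.as_secs_f64() * r)
}
```

This keeps the documented ~200ms / ~1s / ~4s / ~15s schedule honest: a 200ms step can never jitter below 100ms.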
Replaced its `jitter` helper with a local equal-jitter implementation in the previous commit; the crate's `Retry` driver was already removed when the manual loop landed. cargo machete flagged it. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
## Summary

Add `mise.en.dev` to the list of Cloudflare zones purged at the end of `scripts/publish-s3.sh`. Previously only `jdx.dev` and `mise.run` were being purged.

## Why

`install.sh` and `install.sh.minisig` are uploaded to S3 with `cache-control: max-age=86400,s-maxage=86400,public,immutable`. Without an explicit purge per CDN zone, each zone keeps serving the previous release's bytes for up to 24 hours — even after S3 has the new bytes.

Since #9411 made `mise.en.dev` the canonical bootstrap host (used by `mise generate tool-stub --bootstrap` and `mise generate bootstrap`), this manifested as: `mise.en.dev/install.sh` serving the v(N-1) script next to a v(N) `install.sh.minisig`, causing minisign verification to fail. Caught today as recurring CI failures on [#9414](#9414) (e2e-0 / e2e-1).

The other half — that `scripts/update-redirect.sh` was deleted in #9411 — turned out not to be related; that script only updated a `mise-latest-*` redirect rule, not the install.sh path. The real issue is just the missing purge.

## Test plan

- [x] Bash syntax check (`bash -n scripts/publish-s3.sh`)
- [x] Verified the en.dev zone ID `531d003297f1f4ae2415b41f7f5da8fa` matches the value previously used in the now-deleted `scripts/update-redirect.sh` (commit `68075d866`)
- [ ] On the next release, confirm in the workflow logs that all three purges run, and that `curl https://mise.en.dev/install.sh` returns the new version's content within seconds of the deploy completing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

> [!NOTE]
> **Low Risk**
> Low risk: only adjusts post-publish CDN cache purging logic to include an additional zone and reduce duplication; no changes to artifact generation or upload behavior.
>
> **Overview**
> After publishing release artifacts to S3, `scripts/publish-s3.sh` now purges Cloudflare cache for **all** relevant CDN zones via a looped `ZONES` list, adding the missing `en.dev`/`mise.en.dev` zone.
>
> This replaces the two hardcoded purge calls with a single per-zone purge step to prevent mixed-version `install.sh`/signature artifacts being served from different zones under `immutable` caching.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit e083358.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
### 🚀 Features

- **(ls-remote)** add `prereleases` setting and `--prerelease` flag by @jdx in [#9415](#9415)

### 🐛 Bug Fixes

- **(http)** retry transient HTTP failures with backoff and warn on rescue by @jdx in [#9414](#9414)
- **(release)** purge mise.en.dev CDN zone after each S3 publish by @jdx in [#9416](#9416)

### 📚 Documentation

- prefix GitHub star count with ★ glyph by @jdx in [#9417](#9417)
- update intro messaging by @jdx in [#9418](#9418)

Summary
- `http_retries` defaulted to 0, so a one-off 502 from GitHub releases (like the one in run #24963676401) failed installs immediately even though the retry plumbing was already there; retries are now on by default.
- Retry covers the chunk-streaming path in `download_file_with_headers` too — the original retry only wrapped the initial GET, so a connection drop mid-tarball still failed. Mirrored the same coverage in the `vfox` crate's `download()` and lua `download_file`.
- Log a `warn!` whenever a retry rescues a request, with the original transient error included, so flaky infrastructure doesn't silently mask itself.
- The new backoff schedule replaces `ExponentialBackoff::from_millis(10)` (10/100/1000ms — too short for real server outages). Worst-case added latency at default settings stays around ~20s.
- `MISE_HTTP_RETRIES=0` still works as an opt-out.

Test plan

- New unit tests in `src/http.rs` using a tiny in-process TCP server cover: retry-rescues-after-502s, no-retry-on-404, retry-exhaustion-on-persistent-500, and a `MISE_HTTP_RETRIES=0` regression check.
- Unit tests pass (`mise run test:unit`).
- `vfox` crate tests still pass (73 passed).
- `mise run lint` clean.
- Manually verified with `MISE_DEBUG=1 mise install aqua:mvdan/sh@latest` (the originally failing tool).

🤖 Generated with Claude Code
Note
Medium Risk
Changes core HTTP request/download behavior by enabling retries by default and altering retry/backoff and http→https fallback logic, which can affect install/update flows and error surfaces under flaky networks.
Overview
Enables transient HTTP failure retries by default (`http_retries` now defaults to 3) and documents the new behavior in `settings.toml` and `schema/mise.json`.

Replaces `tokio-retry` with a custom `retry_async`/backoff implementation in `src/http.rs`, including transient error classification (5xx/408/429 + network/connect/timeout/body drops), warn logging on each rescued attempt, and a clearer jittered schedule (~200ms/~1s/~4s/~15s). It also restructures downloads to retry the entire streamed body and tightens the http→https fallback to only trigger on connection-level failures.

Mirrors the same retry semantics in the `vfox` crate (including Lua HTTP download paths) and adds focused unit tests validating retry-on-5xx, no-retry-on-404, exhaustion behavior, and retries-disabled handling.

Reviewed by Cursor Bugbot for commit ccb2d06.