[Merged by Bors] - feat: pipeline downloads and decompression in `cache get` by kim-em · Pull Request #32987 · leanprover-community/mathlib4

kim-em · 2025-12-17T00:42:50Z

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

depends on: [Merged by Bors] - perf(Cache): skip decompression for already-unpacked files #34667

Previously `cache get` ran two phases in sequence: it downloaded all files, then decompressed all files. This change pipelines the operations: files are decompressed as they download, overlapping network I/O and disk I/O.

Implementation uses a producer-consumer pattern with an append-only log and a read pointer to avoid race conditions. Downloads append files to the log, while a background task spawns sequential batched leantar calls to decompress files as they arrive.

Expected performance improvement: ~30% reduction in wall-clock time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5
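For readers unfamiliar with the pattern, here is a minimal Python sketch of a producer-consumer pipeline built on an append-only log and a read pointer. It is not the Lean code from this PR: `run_pipeline`, `decompress_batch`, `batch_size` and `poll_interval` are hypothetical names chosen for illustration, and the sketch relies on CPython's thread-safe `list.append`.

```python
import threading
import time


def run_pipeline(downloads, decompress_batch, batch_size=4, poll_interval=0.01):
    """Feed finished downloads to a background consumer in batches.

    `downloads` is any iterable that yields file names as downloads finish.
    `decompress_batch` is a hypothetical callback standing in for one
    batched decompression call.
    """
    log = []                  # append-only: the producer only ever appends
    done = threading.Event()  # set once the producer has no more files
    processed = []            # files handed to decompress_batch, in order

    def consumer():
        read_ptr = 0  # only the consumer advances this index
        while True:
            # Read the flag before the length so a late append is never missed.
            finished = done.is_set()
            end = len(log)
            if read_ptr < end:
                batch = log[read_ptr:min(end, read_ptr + batch_size)]
                decompress_batch(batch)
                processed.extend(batch)
                read_ptr += len(batch)
            elif finished:
                return
            else:
                time.sleep(poll_interval)

    worker = threading.Thread(target=consumer)
    worker.start()
    for name in downloads:    # producer: record each finished download
        log.append(name)
    done.set()
    worker.join()
    return processed
```

Because the producer never rewrites entries and the consumer alone moves the read pointer, the two sides need no lock around the log in this sketch; the batches run one after another, which mirrors the "sequential batched" calls described above.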

github-actions · 2025-12-17T00:43:56Z

PR summary 44faade374

Import changes for modified files

No significant changes to the import graph

Import changes for all files

Files	Import difference

Declarations diff

+ DecompConfig
+ decompressBatch
+ dispatchDecompBatch
+ harvestDecompTask
+ hashFromFileName
+ spawnLeanTarDecompress

You can run this locally as follows

## summary with just the declaration names:
./scripts/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/declarations_diff.sh long <optional_commit>

The doc-module for scripts/declarations_diff.sh contains some details about this script.

No changes to technical debt.

You can run this locally as

./scripts/technical-debt-metrics.sh pr_summary

The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).
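The two definitions above amount to a small piece of arithmetic. The following Python sketch is only an illustration of that description, not the script's actual code; `debt_summary`, `current` and `diffs` are hypothetical names, and every current value is assumed to be non-zero.

```python
def debt_summary(current, diffs):
    """Return (relative, absolute) for per-statistic current values and differences.

    Each difference is weighted by the inverse of the statistic's current value.
    """
    weights = [1 / c for c in current]
    relative = sum(w * d for w, d in zip(weights, diffs))
    absolute = relative / sum(weights)  # weighted average of the differences
    return relative, absolute
```

For example, with current values `[2, 4]` and differences `[1, -2]` the weighted terms are `0.5` and `-0.5`, so both values come out as zero.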

Fix several issues identified in PR review:

1. Fix panic risk: use getLast? instead of getLast! in hashFromFileName
2. Add error logging when hash extraction or module lookup fails
3. Add verification that all queued files were decompressed
4. Use String.dropSuffix for cleaner string manipulation
5. Make ANSI escape codes conditional on TERM environment variable
6. Improve error messages with specific details (exit codes, file counts)

These changes improve robustness and debuggability of the pipelined cache decompression feature.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5

…compression

Previously `lake exe cache get` would pass all cached ltar files to leantar for decompression, even when the files were already unpacked. Leantar would check each trace file and skip extraction, but this still required opening and parsing thousands of files.

This PR adds pre-filtering in the cache script: before calling leantar, we check if each module's trace file exists and has a matching depHash. We compare the Lake depHash from the ltar file header (not the mathlib cache hash) with the trace file's depHash field. Modules that are already correctly unpacked are filtered out, so leantar only processes files that actually need decompression.

Performance improvement:

- Before: ~25 seconds for "No files to download" case
- After: ~5 seconds for same case

Fixes issue discussed at: https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

Co-Authored-By: Claude Opus 4.5
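The pre-filtering idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Lean implementation: the trace file is assumed to be JSON with a top-level `depHash` field, the expected hash is passed in as a string because reading it from the ltar header is not shown here, and `needs_decompression` and `filter_pending` are hypothetical names.

```python
import json
from pathlib import Path


def needs_decompression(trace_path: Path, expected_dep_hash: str) -> bool:
    """Return True unless the trace file exists and records the expected depHash."""
    try:
        trace = json.loads(trace_path.read_text())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed trace file: decompress to be safe.
        return True
    if not isinstance(trace, dict):
        return True
    return str(trace.get("depHash")) != expected_dep_hash


def filter_pending(candidates):
    """Keep only the (ltar_name, trace_path, dep_hash) triples that still need work."""
    return [c for c in candidates if needs_decompression(c[1], c[2])]
```

Treating every read or parse failure as "needs decompression" keeps the filter conservative: it can only cause extra work, never skip a file that is actually stale.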

kim-em · 2026-02-01T04:24:38Z

FYI: I've opened #34667 which addresses a related issue - skipping decompression for files that are already unpacked with matching hashes.

This PR (#32987) pipelines decompression with downloads, which is great for the initial cache fetch. However, #34667 addresses the case where cache get is run repeatedly without any downloads - previously this would still spend 25+ seconds "decompressing" 8000 files that leantar would skip internally.

For the integration between these PRs:

1. Hex parsing functions: Both PRs add hexDigitToNat and parseHexString. We should consolidate these (probably move them to Cache/IO.lean since that's where other shared helpers live, or to Cache/Lean.lean).
2. Skip logic in pipelined path: When #32987 decompresses during download, it should also check if files are already unpacked. Otherwise, re-running cache get on an already-populated cache would still decompress everything. The needsDecompression function from #34667 could be reused here.
3. Dependency: #32987 should probably be rebased on top of #34667 once that merges, to avoid the duplicate code and ensure both paths (pipelined and post-download) benefit from the skip optimization.

joneugster · 2026-02-01T11:04:32Z

Thank you for the PR, I'm going to look at it in detail next Friday during my hours working for the mathlib initiative

joneugster

I've spent a long time understanding all of the changes, but it looks good to me. The logic of batching file decompression during downloading and then "harvesting" the results of these batches makes sense. Writing to and reading from stdin could be a little brittle, but I think this is the only way to get results back from the tasks, and the implementation looks sound to me.

Cache/Requests.lean

- Move hex parsing functions to Cache/Lean.lean near related hex utilities
- Rename to Char.hexDigitToNat? and String.parseHexToUInt64? following conventions
- Make parseHexToUInt64? pure (Option instead of IO)
- Use "decompressed" consistently instead of mixing with "unpacked"
- Improve message when all files already decompressed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
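As an illustration of the "pure, optional result" style described above, here is a Python analogue in which `None` plays the role of Lean's `Option.none`. The function names mirror the Lean ones but are hypothetical Python helpers; accepting uppercase digits and rejecting inputs longer than 16 digits are assumptions of this sketch, not confirmed behaviour of the Lean functions.

```python
from typing import Optional


def hex_digit_to_nat(c: str) -> Optional[int]:
    """Return the value of a single hex digit, or None if `c` is not one."""
    if len(c) != 1:
        return None
    if "0" <= c <= "9":
        return ord(c) - ord("0")
    if "a" <= c <= "f":
        return ord(c) - ord("a") + 10
    if "A" <= c <= "F":
        return ord(c) - ord("A") + 10
    return None


def parse_hex_to_uint64(s: str) -> Optional[int]:
    """Parse a hex string into an integer below 2**64, or None on bad input."""
    if not s or len(s) > 16:  # 16 hex digits is the most a 64-bit value needs
        return None
    value = 0
    for c in s:
        digit = hex_digit_to_nat(c)
        if digit is None:
            return None
        value = value * 16 + digit
    return value
```

Returning an optional value instead of raising keeps the parser free of side effects, so callers decide how to log or count a failure.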

…ression

…comments

Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and:

- Remove duplicate hex parsing functions, use shared Char.hexDigitToNat? and String.parseHexToUInt64? from Cache/Lean.lean
- Use FilePath instead of String for hashFromFileName and mathlibDepPath (call .toString only when needed for JSON)
- Add docstring noting harvestDecompTask returns (successful, failed) tuple
- Fix typo: print stdout not stderr for lake stdout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude

Use nested `FilePath.fileStem` calls instead of string `dropSuffix` for robustness, per review feedback. This correctly handles both `.ltar` and `.ltar.part` files using the platform-aware `FilePath` API.

Co-Authored-By: Claude Opus 4.6
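The nested-stem trick can be illustrated with Python's `pathlib`, whose `stem` likewise drops only the last extension. This is an analogue rather than the Lean code: `hash_from_file_name` is a hypothetical helper, and it assumes the hash part of the file name contains no dots.

```python
from pathlib import PurePath
from typing import Optional


def hash_from_file_name(path: str) -> Optional[str]:
    """Strip up to two extensions so `.ltar` and `.ltar.part` yield the same stem."""
    name = PurePath(path).name
    if not name:
        return None
    # First stem removes ".part" (or ".ltar"); the second removes ".ltar" if
    # it is still there and is a no-op once no extension remains.
    stem = PurePath(PurePath(name).stem).stem
    return stem or None
```

Applying the stem operation twice means a finished `0123abcd.ltar` and an in-progress `0123abcd.ltar.part` both map to `0123abcd` without any string suffix matching.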

joneugster

I've successfully (re-)tested `lake exe cache get` and `get!` on this branch and on a new project requiring mathlib from this branch. Both seem to work as expected.

Since the remaining comments are only about stylistic suggestions, which might not justify delaying this really useful feature any more than necessary, I suggest

maintainer delegate

(EDIT: since I've suggested maintainer-delegate I'm going to remove awaiting-author again. I think otherwise it might not show up on the relevant queues, does it?)

Cache/Requests.lean

github-actions · 2026-02-27T10:44:22Z

🚀 Pull request has been placed on the maintainer queue by joneugster.

Extract `spawnLeanTarDecompress` helper to deduplicate leantar spawning between `decompressBatch` and `unpackCache`. Use `let some ... | ...` pattern instead of nested matches in `monitorCurl`. Fix colon placement in `decompressBatch` signature.

Co-Authored-By: Claude Opus 4.6

Cache/Requests.lean

- Fix non-parallel mode silently skipping decompression by restoring IO.unpackCache fallback in getFiles
- Log decompression errors instead of silently swallowing them
- Return error from harvestDecompTask so callers can log details
- Use finalPath instead of .part path in hashFromFileName
- Count hash/module lookup failures in decompFailed
- Add defaults for isMathlibRoot/mathlibDepPath parameters

Co-Authored-By: Claude Opus 4.6

Cache/Requests.lean

kim-em

Self-review: 6 issues found during code review.

Cache/Requests.lean

Cache/IO.lean

jcommelin

Thanks 🎉

bors merge

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: #34667

mathlib-bors · 2026-03-06T16:19:34Z

Pull request successfully merged into master.

Build succeeded:

The new parallel decompression (#32987) forgets about files which are already present in the `.cache/mathlib` folder. This PR restores the final decompression step, which decompresses only those files that have not already been decompressed.


kim-em changed the title ~~feat: pipeline downloads and decompression in cache get~~ feat: pipeline downloads and decompression in cache get Dec 17, 2025

kim-em and others added 5 commits December 17, 2025 11:47

cleanup

288326d

simplify

e251ffb

revert spurious change

ff26b3e

Merge remote-tracking branch 'upstream/master' into pipeline-cache-de…

c68c88c

…compression

leanprover-community-bot-assistant assigned ocfnash Dec 21, 2025

ocfnash removed their assignment Dec 22, 2025

leanprover-community-bot-assistant assigned joelriou Dec 23, 2025

joelriou removed their assignment Dec 27, 2025

leanprover-community-bot-assistant assigned dagurtomas Dec 28, 2025

dagurtomas removed their assignment Jan 22, 2026

leanprover-community-bot-assistant assigned EtienneC30 Jan 23, 2026

joneugster added the t-meta Tactics, attributes or user commands label Feb 1, 2026

joneugster assigned joneugster and unassigned EtienneC30 Feb 1, 2026

mathlib-dependent-issues bot added the blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) label Feb 6, 2026

joneugster reviewed Feb 6, 2026

joneugster added the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 6, 2026

kim-em and others added 4 commits February 9, 2026 04:04

Merge branch 'skip-already-unpacked-cache' into pipeline-cache-decomp…

9da3c75

…ression

kim-em removed blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) awaiting-author A reviewer has asked the author a question or requested changes. labels Feb 12, 2026

Vierkantor added the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 27, 2026

joneugster approved these changes Feb 27, 2026

mathlib-triage bot added the maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. label Feb 27, 2026

joneugster removed the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 27, 2026

kim-em commented Mar 1, 2026

kim-em commented Mar 2, 2026

kim-em commented Mar 2, 2026

kim-em commented Mar 2, 2026

joneugster approved these changes Mar 2, 2026

jcommelin reviewed Mar 6, 2026

mathlib-triage bot added ready-to-merge This PR has been sent to bors. and removed maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. labels Mar 6, 2026

mathlib-bors bot changed the title ~~feat: pipeline downloads and decompression in cache get~~ [Merged by Bors] - feat: pipeline downloads and decompression in cache get Mar 6, 2026

mathlib-bors bot closed this Mar 6, 2026

bryangingechen mentioned this pull request Mar 8, 2026

[Merged by Bors] - fix: decompress already downloaded files #36367

Closed

kim-em mentioned this pull request Mar 10, 2026

perf: decompress already-cached files concurrently with downloads #36423

Open

9 participants
