[Merged by Bors] - feat: pipeline downloads and decompression in cache get #32987

kim-em wants to merge 15 commits into leanprover-community:master

Conversation
Previously `cache get` downloaded all files sequentially, then decompressed all files sequentially. This change pipelines the operations: files are decompressed as they download, overlapping network I/O and disk I/O. Implementation uses a producer-consumer pattern with an append-only log and read pointer to avoid race conditions. Downloads append files to a log, while a background task spawns sequential batched leantar calls to decompress files as they arrive. Expected performance improvement: ~30% reduction in wall-clock time. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
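The producer-consumer shape described above can be sketched in Lean. This is a hedged illustration of the pattern only — the type and function names here (`DownloadLog`, `appendDownloaded`, `takeBatch`) are hypothetical, not the actual Mathlib cache implementation:

```lean
/-- Shared state: an append-only log of downloaded files plus the index of
the first entry not yet handed to `leantar`. -/
structure DownloadLog where
  files : Array System.FilePath := #[]
  readPtr : Nat := 0

/-- Producer side: called whenever a download finishes. -/
def appendDownloaded (log : IO.Ref DownloadLog) (f : System.FilePath) : IO Unit := do
  log.modify fun s => { s with files := s.files.push f }

/-- Consumer side: atomically take everything past the read pointer as one
batch and advance the pointer, so no file is decompressed twice. -/
def takeBatch (log : IO.Ref DownloadLog) : IO (Array System.FilePath) := do
  log.modifyGet fun s =>
    (s.files.extract s.readPtr s.files.size, { s with readPtr := s.files.size })
```

A log would be created with `let log ← IO.mkRef ({} : DownloadLog)`; because both sides go through a single `IO.Ref` with `modify`/`modifyGet`, the append and the pointer advance each happen as one atomic step, which is what avoids the race between downloads and the background decompression task.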
PR summary 44faade374

Import changes for modified files: no significant changes to the import graph.
Fix several issues identified in PR review:

1. Fix panic risk: use `getLast?` instead of `getLast!` in `hashFromFileName`
2. Add error logging when hash extraction or module lookup fails
3. Add verification that all queued files were decompressed
4. Use `String.dropSuffix` for cleaner string manipulation
5. Make ANSI escape codes conditional on the `TERM` environment variable
6. Improve error messages with specific details (exit codes, file counts)

These changes improve the robustness and debuggability of the pipelined cache decompression feature.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
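The first fix above is the standard `getLast?` pattern: `getLast!` panics on an empty list, while `getLast?` returns an `Option` the caller must handle. A generic illustration (not the actual cache code):

```lean
-- `getLast!` would panic if `splitOn` ever produced `[]`;
-- `getLast?` makes the failure an explicit `none` instead.
def lastComponent? (s : String) : Option String :=
  (s.splitOn "/").getLast?

#eval lastComponent? "a/b/c"  -- some "c"
```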
Previously `lake exe cache get` would pass all cached ltar files to leantar for decompression, even when the files were already unpacked. Leantar would check each trace file and skip extraction, but this still required opening and parsing thousands of files.

This PR adds pre-filtering in the cache script: before calling leantar, we check whether each module's trace file exists and has a matching depHash. We compare the Lake depHash from the ltar file header (not the mathlib cache hash) with the trace file's depHash field. Modules that are already correctly unpacked are filtered out, so leantar only processes files that actually need decompression.

Performance improvement:
- Before: ~25 seconds for the "No files to download" case
- After: ~5 seconds for the same case

Fixes issue discussed at: https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

Co-Authored-By: Claude Opus 4.5 <[email protected]>
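The pre-filter decision described above could look roughly like the following sketch. All names here are illustrative assumptions (the real script's trace parsing is more involved — `readTraceDepHash?` below is a placeholder):

```lean
/-- Placeholder for parsing the depHash field out of a trace file.
The real trace format is richer; this just reads a bare number. -/
def readTraceDepHash? (p : System.FilePath) : IO (Option UInt64) := do
  let contents ← IO.FS.readFile p
  return contents.trim.toNat?.map (·.toUInt64)

/-- A module needs `leantar` only if its trace file is missing, unreadable,
or records a different Lake depHash than the one from the ltar header. -/
def needsDecompress (tracePath : System.FilePath) (ltarDepHash : UInt64) : IO Bool := do
  if !(← tracePath.pathExists) then
    return true
  match ← readTraceDepHash? tracePath with
  | some h => return h != ltarDepHash
  | none   => return true
```

Filtering the module list through a check like this before invoking leantar is what turns the "everything already unpacked" case into a set of cheap stat-and-compare operations instead of thousands of leantar trace parses.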
FYI: I've opened #34667, which addresses a related issue: skipping decompression for files that are already unpacked with matching hashes. This PR (#32987) pipelines decompression with downloads, which is great for the initial cache fetch, while #34667 addresses the case where files are already unpacked.
Thank you for the PR! I'm going to look at it in detail next Friday during my hours working for the Mathlib initiative.
joneugster left a comment
I've spent a long time understanding all of the changes, but it looks good to me. The logic of batching file decompression during downloading and then "harvesting" the results of these batches makes sense. Writing to and reading from stdin could be a little brittle, but I think this is the only way to get results back from the tasks, and the implementation looks sound to me.
- Move hex parsing functions to Cache/Lean.lean near related hex utilities
- Rename to `Char.hexDigitToNat?` and `String.parseHexToUInt64?` following conventions
- Make `parseHexToUInt64?` pure (`Option` instead of `IO`)
- Use "decompressed" consistently instead of mixing with "unpacked"
- Improve the message shown when all files are already decompressed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
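Pure hex parsers with the signatures named in this commit could be sketched as follows. The bodies here are illustrative guesses, not the actual Cache/Lean.lean code:

```lean
/-- Parse one hex digit, upper or lower case. -/
def Char.hexDigitToNat? (c : Char) : Option Nat :=
  if '0' ≤ c ∧ c ≤ '9' then some (c.toNat - '0'.toNat)
  else if 'a' ≤ c ∧ c ≤ 'f' then some (c.toNat - 'a'.toNat + 10)
  else if 'A' ≤ c ∧ c ≤ 'F' then some (c.toNat - 'A'.toNat + 10)
  else none

/-- Parse a hex string into a `UInt64`. Pure (`Option`, not `IO`):
failure is `none`, and strings longer than 16 hex digits are rejected
since they would overflow 64 bits. -/
def String.parseHexToUInt64? (s : String) : Option UInt64 :=
  if s.isEmpty ∨ s.length > 16 then
    none
  else
    s.foldl (fun acc c => do
      let n ← acc
      let d ← c.hexDigitToNat?
      pure (n * 16 + d.toUInt64)) (some 0)
```

Making the parser pure, as the commit does, lets callers chain it with `Option` combinators instead of threading `IO` through code that never touches the file system.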
…comments

Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and:

- Remove duplicate hex parsing functions; use shared `Char.hexDigitToNat?` and `String.parseHexToUInt64?` from Cache/Lean.lean
- Use `FilePath` instead of `String` for `hashFromFileName` and `mathlibDepPath` (call `.toString` only when needed for JSON)
- Add docstring noting `harvestDecompTask` returns a (successful, failed) tuple
- Fix typo: print stdout, not stderr, for lake stdout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Use nested `FilePath.fileStem` calls instead of string `dropSuffix` for robustness, per review feedback. This correctly handles both `.ltar` and `.ltar.part` files using the platform-aware `FilePath` API. Co-Authored-By: Claude Opus 4.6 <[email protected]>
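The nested-`fileStem` idea can be shown in a few lines. The helper name below is hypothetical; only the technique mirrors the review suggestion:

```lean
/-- Reduce both "foo.ltar" and "foo.ltar.part" to "foo": one `fileStem`
strips the last extension (".part" or ".ltar"), and a second strips
".ltar" when it is still present. -/
def ltarBaseName? (p : System.FilePath) : Option String := do
  let stem ← p.fileStem            -- "foo.ltar.part" ↦ "foo.ltar"; "foo.ltar" ↦ "foo"
  let inner := System.FilePath.mk stem
  if inner.extension == some "ltar" then inner.fileStem else some stem
```

Unlike string `dropSuffix`, this goes through the platform-aware `FilePath` API, so extension handling stays consistent with the rest of the code.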
I've successfully (re-)tested `lake exe cache get` and `get!` on this branch and on a new project requiring Mathlib from this branch. Both seem to work as expected.
Since the remaining comments are only about stylistic suggestions, which might not justify delaying this really useful feature any more than necessary, I suggest
maintainer delegate
(EDIT: since I've suggested maintainer-delegate, I'm going to remove awaiting-author again. I think otherwise it might not show up on the relevant queues.)
🚀 Pull request has been placed on the maintainer queue by joneugster.
Extract `spawnLeanTarDecompress` helper to deduplicate leantar spawning between `decompressBatch` and `unpackCache`. Use `let some ... | ...` pattern instead of nested matches in `monitorCurl`. Fix colon placement in `decompressBatch` signature. Co-Authored-By: Claude Opus 4.6 <[email protected]>
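The `let some … | …` pattern mentioned in this commit destructures an `Option` with an early exit on `none`, replacing a nested `match`. A generic illustration (not the actual `monitorCurl` code):

```lean
-- Without this pattern, the happy path would sit inside a `match … with
-- | some w => … | none => …`, one indentation level deeper.
def firstWord (s : String) : IO String := do
  let some w := (s.splitOn " ").head? | throw (IO.userError "no input")
  return w
```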
- Fix non-parallel mode silently skipping decompression by restoring `IO.unpackCache` fallback in `getFiles`
- Log decompression errors instead of silently swallowing them
- Return the error from `harvestDecompTask` so callers can log details
- Use `finalPath` instead of the `.part` path in `hashFromFileName`
- Count hash/module lookup failures in `decompFailed`
- Add defaults for `isMathlibRoot`/`mathlibDepPath` parameters

Co-Authored-By: Claude Opus 4.6 <[email protected]>
kim-em left a comment
Self-review: 6 issues found during code review.
This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: #34667
Pull request successfully merged into master. Build succeeded:
The new parallel decompression (#32987) forgets about files which are already present in the `.cache/mathlib` folder. This PR adds the final decompression step back, which decompresses only those files that have not already been decompressed.