Skip to content

[Merged by Bors] - feat: pipeline downloads and decompression in cache get#32987

Closed
kim-em wants to merge 15 commits intoleanprover-community:masterfrom
kim-em:pipeline-cache-decompression
Closed

[Merged by Bors] - feat: pipeline downloads and decompression in cache get#32987
kim-em wants to merge 15 commits intoleanprover-community:masterfrom
kim-em:pipeline-cache-decompression

Conversation

@kim-em
Copy link
Copy Markdown
Contributor

@kim-em kim-em commented Dec 17, 2025

This PR modifies lake exe cache get to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using curl --parallel, then decompress all files using a single leantar call. Now a background task spawns sequential batched leantar calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

Previously `cache get` downloaded all files sequentially, then decompressed
all files sequentially. This change pipelines the operations: files are
decompressed as they download, overlapping network I/O and disk I/O.

Implementation uses a producer-consumer pattern with an append-only log
and read pointer to avoid race conditions. Downloads append files to a log,
while a background task spawns sequential batched leantar calls to decompress
files as they arrive.

Expected performance improvement: ~30% reduction in wall-clock time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Dec 17, 2025

PR summary 44faade374

Import changes for modified files

No significant changes to the import graph

Import changes for all files
Files Import difference

Declarations diff

+ DecompConfig
+ decompressBatch
+ dispatchDecompBatch
+ harvestDecompTask
+ hashFromFileName
+ spawnLeanTarDecompress

You can run this locally as follows
## summary with just the declaration names:
./scripts/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/declarations_diff.sh long <optional_commit>

The doc-module for scripts/declarations_diff.sh contains some details about this script.


No changes to technical debt.

You can run this locally as

./scripts/technical-debt-metrics.sh pr_summary
  • The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
  • The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).

@kim-em kim-em changed the title feat: pipeline downloads and decompression in cache get feat: pipeline downloads and decompression in cache get Dec 17, 2025
kim-em and others added 5 commits December 17, 2025 11:47
Fix several issues identified in PR review:

1. Fix panic risk: Use getLast? instead of getLast! in hashFromFileName
2. Add error logging when hash extraction or module lookup fails
3. Add verification that all queued files were decompressed
4. Use String.dropSuffix for cleaner string manipulation
5. Make ANSI escape codes conditional on TERM environment variable
6. Improve error messages with specific details (exit codes, file counts)

These changes improve robustness and debuggability of the pipelined
cache decompression feature.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Previously `lake exe cache get` would pass all cached ltar files to
leantar for decompression, even when the files were already unpacked.
Leantar would check each trace file and skip extraction, but this still
required opening and parsing thousands of files.

This PR adds pre-filtering in the cache script: before calling leantar,
we check if each module's trace file exists and has a matching depHash.
We compare the Lake depHash from the ltar file header (not the mathlib
cache hash) with the trace file's depHash field.

Modules that are already correctly unpacked are filtered out, so leantar
only processes files that actually need decompression.

Performance improvement:
- Before: ~25 seconds for "No files to download" case
- After: ~5 seconds for same case

Fixes issue discussed at:
https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@kim-em
Copy link
Copy Markdown
Contributor Author

kim-em commented Feb 1, 2026

FYI: I've opened #34667 which addresses a related issue - skipping decompression for files that are already unpacked with matching hashes.

This PR (#32987) pipelines decompression with downloads, which is great for the initial cache fetch. However, #34667 addresses the case where cache get is run repeatedly without any downloads - previously this would still spend 25+ seconds "decompressing" 8000 files that leantar would skip internally.

For the integration between these PRs:

  1. Hex parsing functions: Both PRs add hexDigitToNat and parseHexString. We should consolidate these (probably move them to Cache/IO.lean since that's where other shared helpers live, or to Cache/Lean.lean).

  2. Skip logic in pipelined path: When [Merged by Bors] - feat: pipeline downloads and decompression in cache get #32987 decompresses during download, it should also check if files are already unpacked. Otherwise, re-running cache get on an already-populated cache would still decompress everything. The needsDecompression function from [Merged by Bors] - perf(Cache): skip decompression for already-unpacked files #34667 could be reused here.

  3. Dependency: [Merged by Bors] - feat: pipeline downloads and decompression in cache get #32987 should probably be rebased on top of [Merged by Bors] - perf(Cache): skip decompression for already-unpacked files #34667 once that merges, to avoid the duplicate code and ensure both paths (pipelined and post-download) benefit from the skip optimization.

@joneugster joneugster added the t-meta Tactics, attributes or user commands label Feb 1, 2026
@joneugster joneugster assigned joneugster and unassigned EtienneC30 Feb 1, 2026
@joneugster
Copy link
Copy Markdown
Contributor

Thank you for the PR, I'm going to look at it in detail next Friday during my hours working for the mathlib initiative

@mathlib-dependent-issues mathlib-dependent-issues bot added the blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) label Feb 6, 2026
Copy link
Copy Markdown
Contributor

@joneugster joneugster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent a long time understanding all of the changes, but it looks good to me. The logic about batching file decompression during downloading and then "harvesting" the result of this batches makes sense. Writing and reading from/to Stdin could be a little brittle, but I think this is the only way to get results from the tasks back and the implementation looks sound to me.

@joneugster joneugster added the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 6, 2026
kim-em and others added 4 commits February 9, 2026 04:04
- Move hex parsing functions to Cache/Lean.lean near related hex utilities
- Rename to Char.hexDigitToNat? and String.parseHexToUInt64? following conventions
- Make parseHexToUInt64? pure (Option instead of IO)
- Use "decompressed" consistently instead of mixing with "unpacked"
- Improve message when all files already decompressed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…comments

Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and:

- Remove duplicate hex parsing functions, use shared Char.hexDigitToNat? and
  String.parseHexToUInt64? from Cache/Lean.lean
- Use FilePath instead of String for hashFromFileName and mathlibDepPath
  (call .toString only when needed for JSON)
- Add docstring noting harvestDecompTask returns (successful, failed) tuple
- Fix typo: print stdout not stderr for lake stdout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Use nested `FilePath.fileStem` calls instead of string `dropSuffix` for
robustness, per review feedback. This correctly handles both `.ltar` and
`.ltar.part` files using the platform-aware `FilePath` API.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@kim-em kim-em removed blocked-by-other-PR This PR depends on another PR (this label is automatically managed by a bot) awaiting-author A reviewer has asked the author a question or requested changes. labels Feb 12, 2026
@Vierkantor Vierkantor added the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 27, 2026
Copy link
Copy Markdown
Contributor

@joneugster joneugster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've successfully (re-) tested lake exe cache get and get! on this branch and on a new project requiring mathlib from this branch. Both seem to work as expected.

Since the remaining comments are only about stylistic suggestions, which might not justify delaying this really useful feature any more than necessary, I suggest

maintainer delegate

(EDIT: since I've suggested maintainer-delegate I'm going to remove awaiting-author again. I think otherwise it might not show up on the relevant queues, does it?)

@github-actions
Copy link
Copy Markdown

🚀 Pull request has been placed on the maintainer queue by joneugster.

@mathlib-triage mathlib-triage bot added the maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. label Feb 27, 2026
@joneugster joneugster removed the awaiting-author A reviewer has asked the author a question or requested changes. label Feb 27, 2026
Extract `spawnLeanTarDecompress` helper to deduplicate leantar spawning
between `decompressBatch` and `unpackCache`. Use `let some ... | ...`
pattern instead of nested matches in `monitorCurl`. Fix colon placement
in `decompressBatch` signature.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Fix non-parallel mode silently skipping decompression by restoring
  IO.unpackCache fallback in getFiles
- Log decompression errors instead of silently swallowing them
- Return error from harvestDecompTask so callers can log details
- Use finalPath instead of .part path in hashFromFileName
- Count hash/module lookup failures in decompFailed
- Add defaults for isMathlibRoot/mathlibDepPath parameters

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Copy Markdown
Contributor Author

@kim-em kim-em left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: 6 issues found during code review.

Copy link
Copy Markdown
Member

@jcommelin jcommelin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 🎉

bors merge

@mathlib-triage mathlib-triage bot added ready-to-merge This PR has been sent to bors. and removed maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. labels Mar 6, 2026
mathlib-bors bot pushed a commit that referenced this pull request Mar 6, 2026
This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: #34667
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Mar 6, 2026

Pull request successfully merged into master.

Build succeeded:

@mathlib-bors mathlib-bors bot changed the title feat: pipeline downloads and decompression in cache get [Merged by Bors] - feat: pipeline downloads and decompression in cache get Mar 6, 2026
@mathlib-bors mathlib-bors bot closed this Mar 6, 2026
mathlib-bors bot pushed a commit that referenced this pull request Mar 8, 2026
The new parallel decompression (#32987) forgets about files which are already present in the `.cache/mathlib` folder.

This PR adds the final decompression step back which only decompresses anything which hasn't already been decompressed prior.
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 8, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 8, 2026
The new parallel decompression (leanprover-community#32987) forgets about files which are already present in the `.cache/mathlib` folder.

This PR adds the final decompression step back which only decompresses anything which hasn't already been decompressed prior.
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 9, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 9, 2026
The new parallel decompression (leanprover-community#32987) forgets about files which are already present in the `.cache/mathlib` folder.

This PR adds the final decompression step back which only decompresses anything which hasn't already been decompressed prior.
seewoo5 pushed a commit to seewoo5/mathlib4 that referenced this pull request Mar 9, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
seewoo5 pushed a commit to seewoo5/mathlib4 that referenced this pull request Mar 9, 2026
The new parallel decompression (leanprover-community#32987) forgets about files which are already present in the `.cache/mathlib` folder.

This PR adds the final decompression step back which only decompresses anything which hasn't already been decompressed prior.
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 10, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 10, 2026
The new parallel decompression (leanprover-community#32987) forgets about files which are already present in the `.cache/mathlib` folder.

This PR adds the final decompression step back which only decompresses anything which hasn't already been decompressed prior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge This PR has been sent to bors. t-meta Tactics, attributes or user commands

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants