[Merged by Bors] - perf(Cache): skip decompression for already-unpacked files#34667
[Merged by Bors] - perf(Cache): skip decompression for already-unpacked files#34667kim-em wants to merge 3 commits intoleanprover-community:masterfrom
Conversation
Previously `lake exe cache get` would pass all cached ltar files to leantar for decompression, even when the files were already unpacked. Leantar would check each trace file and skip extraction, but this still required opening and parsing thousands of files. This PR adds pre-filtering in the cache script: before calling leantar, we check if each module's trace file exists and has a matching depHash. We compare the Lake depHash from the ltar file header (not the mathlib cache hash) with the trace file's depHash field. Modules that are already correctly unpacked are filtered out, so leantar only processes files that actually need decompression. Performance improvement: - Before: ~25 seconds for "No files to download" case - After: ~5 seconds for same case Fixes issue discussed at: https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses Co-Authored-By: Claude Opus 4.5 <[email protected]>
PR summary 5cb27be88dImport changes for modified filesNo significant changes to the import graph Import changes for all files
|
|
In my view, this is a much appreciated QoL improvement. Unfortunately, reviewing it is above my paygrade... |
|
I could review this next Friday, too, if nobody else beats me to it. Btw, please label PRs about |
joneugster
left a comment
There was a problem hiding this comment.
Looks good to me, thank you!
Since my suggestions are only about wording and moving the helper functions around, I suggest
maintainer delegate
|
🚀 Pull request has been placed on the maintainer queue by joneugster. |
- Move hex parsing functions to Cache/Lean.lean near related hex utilities - Rename to Char.hexDigitToNat? and String.parseHexToUInt64? following conventions - Make parseHexToUInt64? pure (Option instead of IO) - Use "decompressed" consistently instead of mixing with "unpacked" - Improve message when all files already decompressed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…comments Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and: - Remove duplicate hex parsing functions, use shared Char.hexDigitToNat? and String.parseHexToUInt64? from Cache/Lean.lean - Use FilePath instead of String for hashFromFileName and mathlibDepPath (call .toString only when needed for JSON) - Add docstring noting harvestDecompTask returns (successful, failed) tuple - Fix typo: print stdout not stderr for lake stdout 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
| let some handle ← try | ||
| some <$> IO.FS.Handle.mk ltarPath .read | ||
| catch _ => pure none | return none |
There was a problem hiding this comment.
Is this the same as
| let some handle ← try | |
| some <$> IO.FS.Handle.mk ltarPath .read | |
| catch _ => pure none | return none | |
| let handle ← try IO.FS.Handle.mk ltarPath .read | |
| catch _ => return none |
?
|
Since it's only one remaining stylistic suggestion above I'll ping this again: maintainer delegate |
|
🚀 Pull request has been placed on the maintainer queue by joneugster. |
|
Let's try and see how it works! bors merge |
This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely. **Performance (tested on Mac Studio with SSD, ~7900 cached files):** | Scenario | Before | After | |----------|--------|-------| | All files already unpacked | ~25s | ~5s | | Partial decompression needed | ~25s | ~5s + decompression time | **Timing breakdown (after this PR):** - Hash computation: ~2.8s - Filter to existing ltar files: ~0.06s - Read ltar headers + trace files, compare hashes: ~2.5s - leantar (if needed): 0s when all files match **What changed:** Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out. Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally. Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses 🤖 Prepared with Claude Code
|
Build failed: |
|
There's an issue with the changes in #34847 which is making it difficult to merge PRs that don't cause new cache files to get uploaded. I've already pinged @marcelolynch; in the meantime we can try to merge this in a batch with some other PRs that do result in uploads and it should go through. |
|
bors r+ |
This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely. **Performance (tested on Mac Studio with SSD, ~7900 cached files):** | Scenario | Before | After | |----------|--------|-------| | All files already unpacked | ~25s | ~5s | | Partial decompression needed | ~25s | ~5s + decompression time | **Timing breakdown (after this PR):** - Hash computation: ~2.8s - Filter to existing ltar files: ~0.06s - Read ltar headers + trace files, compare hashes: ~2.5s - leantar (if needed): 0s when all files match **What changed:** Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out. Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally. Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses 🤖 Prepared with Claude Code
|
Pull request successfully merged into master. Build succeeded: |
…cache is skipped (#35389) Currently some PRs sent to `bors` have ended up failing (#35330, #34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped. Co-authored-by: Bryan Gin-ge Chen <[email protected]>
…cache is skipped (leanprover-community#35389) Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped. Co-authored-by: Bryan Gin-ge Chen <[email protected]>
…cache is skipped (leanprover-community#35389) Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped. Co-authored-by: Bryan Gin-ge Chen <[email protected]>
…r-community#34667) This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely. **Performance (tested on Mac Studio with SSD, ~7900 cached files):** | Scenario | Before | After | |----------|--------|-------| | All files already unpacked | ~25s | ~5s | | Partial decompression needed | ~25s | ~5s + decompression time | **Timing breakdown (after this PR):** - Hash computation: ~2.8s - Filter to existing ltar files: ~0.06s - Read ltar headers + trace files, compare hashes: ~2.5s - leantar (if needed): 0s when all files match **What changed:** Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out. Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally. Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses 🤖 Prepared with Claude Code
…cache is skipped (leanprover-community#35389) Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped. Co-authored-by: Bryan Gin-ge Chen <[email protected]>
…r-community#34667) This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely. **Performance (tested on Mac Studio with SSD, ~7900 cached files):** | Scenario | Before | After | |----------|--------|-------| | All files already unpacked | ~25s | ~5s | | Partial decompression needed | ~25s | ~5s + decompression time | **Timing breakdown (after this PR):** - Hash computation: ~2.8s - Filter to existing ltar files: ~0.06s - Read ltar headers + trace files, compare hashes: ~2.5s - leantar (if needed): 0s when all files match **What changed:** Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out. Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally. Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses 🤖 Prepared with Claude Code
…cache is skipped (leanprover-community#35389) Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped. Co-authored-by: Bryan Gin-ge Chen <[email protected]>
This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: #34667
…-community#32987) This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: leanprover-community#34667
…-community#32987) This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: leanprover-community#34667
…-community#32987) This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: leanprover-community#34667
…-community#32987) This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O. 🤖 Prepared with Claude Code - [x] depends on: leanprover-community#34667
This PR adds pre-filtering to
lake exe cache get: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.Performance (tested on Mac Studio with SSD, ~7900 cached files):
Timing breakdown (after this PR):
What changed:
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's
depHashfield. Files with matching hashes are filtered out.Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.
Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses
🤖 Prepared with Claude Code