[Merged by Bors] - perf(Cache): skip decompression for already-unpacked files by kim-em · Pull Request #34667 · leanprover-community/mathlib4

kim-em · 2026-02-01T04:23:51Z

This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

**Performance (tested on Mac Studio with SSD, ~7900 cached files):**

| Scenario | Before | After |
|----------|--------|-------|
| All files already unpacked | ~25s | ~5s |
| Partial decompression needed | ~25s | ~5s + decompression time |

**Timing breakdown (after this PR):**

- Hash computation: ~2.8s
- Filter to existing ltar files: ~0.06s
- Read ltar headers + trace files, compare hashes: ~2.5s
- leantar (if needed): 0s when all files match

**What changed:**
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.
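The comparison described above can be sketched in Lean roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names `leUInt64At`, `ltarDepHash?` and `alreadyDecompressed` are hypothetical, and the header layout (a 4-byte `LTAR` magic followed by the depHash as a little-endian 64-bit integer) is an assumption that is merely consistent with the "first 12 bytes" mentioned above.

```lean
/-- Hypothetical helper: decode 8 bytes starting at `off` as a little-endian `UInt64`. -/
def leUInt64At (bytes : ByteArray) (off : Nat) : UInt64 := Id.run do
  let mut acc : UInt64 := 0
  for i in [0:8] do
    -- byte `i` contributes bits `8*i .. 8*i+7`
    acc := acc ||| ((bytes.get! (off + i)).toUInt64 <<< (8 * i).toUInt64)
  return acc

/-- Extract the depHash from the first 12 bytes of an ltar file.
Assumed layout (not confirmed here): bytes 0-3 are the ASCII magic `LTAR`,
bytes 4-11 hold the hash. Returns `none` if the header is short or malformed. -/
def ltarDepHash? (header : ByteArray) : Option UInt64 :=
  if header.size < 12 then none
  else if (header.extract 0 4).data != "LTAR".toUTF8.data then none
  else some (leUInt64At header 4)

/-- A module can be skipped only when both hashes could be read and are equal.
Any read or parse failure means the file is handed to leantar as before. -/
def alreadyDecompressed (ltarHash traceHash : Option UInt64) : Bool :=
  match ltarHash, traceHash with
  | some a, some b => a == b
  | _, _ => false
```

Treating every failure as "needs decompression" keeps the pre-filter conservative: in this sketch a missing or unreadable trace file can only cause extra work, never a skipped extraction.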

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code

Previously `lake exe cache get` would pass all cached ltar files to leantar for decompression, even when the files were already unpacked. Leantar would check each trace file and skip extraction, but this still required opening and parsing thousands of files.

This PR adds pre-filtering in the cache script: before calling leantar, we check if each module's trace file exists and has a matching depHash. We compare the Lake depHash from the ltar file header (not the mathlib cache hash) with the trace file's depHash field. Modules that are already correctly unpacked are filtered out, so leantar only processes files that actually need decompression.

Performance improvement:
- Before: ~25 seconds for "No files to download" case
- After: ~5 seconds for same case

Fixes issue discussed at: https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

Co-Authored-By: Claude Opus 4.5 <[email protected]>

github-actions · 2026-02-01T04:24:59Z

PR summary 5cb27be88d

Import changes for modified files

No significant changes to the import graph

Import changes for all files

Files	Import difference

Declarations diff

+ Char.hexDigitToNat?
+ ModuleHashMap.filterNeedsDecompression
+ String.parseHexToUInt64?
+ getTracePath
+ needsDecompression
+ readLtarHash
+ readTraceHash

You can run this locally as follows

## summary with just the declaration names:
./scripts/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/declarations_diff.sh long <optional_commit>

The doc-module for scripts/declarations_diff.sh contains some details about this script.

No changes to technical debt.

You can run this locally as

./scripts/technical-debt-metrics.sh pr_summary

The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).
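Written out, with $v_i$ the current value of statistic $i$ and $d_i$ its difference in this PR (both symbols are introduced here only for illustration), the two definitions above read:

$$
\text{relative} = \sum_i \frac{d_i}{v_i},
\qquad
\text{absolute} = \frac{\sum_i d_i / v_i}{\sum_i 1 / v_i}.
$$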

MichaelStollBayreuth · 2026-02-01T09:27:22Z

In my view, this is a much appreciated QoL improvement. Unfortunately, reviewing it is above my paygrade...

joneugster · 2026-02-01T11:11:32Z

I could review this next Friday, too, if nobody else beats me to it.

Btw, please label PRs about Cache with either t-meta or ci, as they might be missed without a topic label. (This will be done automatically in #34677.)

joneugster

Looks good to me, thank you!

Since my suggestions are only about wording and moving the helper functions around, I suggest

maintainer delegate

github-actions · 2026-02-06T10:01:03Z

🚀 Pull request has been placed on the maintainer queue by joneugster.

- Move hex parsing functions to Cache/Lean.lean near related hex utilities
- Rename to Char.hexDigitToNat? and String.parseHexToUInt64? following conventions
- Make parseHexToUInt64? pure (Option instead of IO)
- Use "decompressed" consistently instead of mixing with "unpacked"
- Improve message when all files already decompressed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
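For readers unfamiliar with the "pure (Option instead of IO)" point, a hex parser of this shape might look like the sketch below. The names `hexDigit?` and `parseHex64?` are hypothetical stand-ins, not the definitions added to Cache/Lean.lean, and the rule that rejects inputs longer than 16 digits is a design choice of this sketch rather than something stated in the PR.

```lean
/-- Hypothetical stand-in for a hex-digit helper: the value of one hex digit. -/
def hexDigit? (c : Char) : Option Nat :=
  if '0' ≤ c ∧ c ≤ '9' then some (c.toNat - '0'.toNat)
  else if 'a' ≤ c ∧ c ≤ 'f' then some (c.toNat - 'a'.toNat + 10)
  else if 'A' ≤ c ∧ c ≤ 'F' then some (c.toNat - 'A'.toNat + 10)
  else none

/-- Hypothetical stand-in for a pure hex parser. Returns `none` on an empty
string, on any non-hex character, or on more than 16 digits (which would
overflow a `UInt64`). -/
def parseHex64? (s : String) : Option UInt64 :=
  if s.isEmpty then none
  else if s.length > 16 then none
  else
    -- fold in the `Option` monad: the first bad digit aborts the whole parse
    s.toList.foldlM
      (fun acc c => (hexDigit? c).map fun d => acc * 16 + d.toUInt64)
      (0 : UInt64)
```

Because the result is an `Option` rather than an `IO` action, a caller can use it inside any pure computation and decide separately how to report a parse failure.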

…comments

Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and:

- Remove duplicate hex parsing functions, use shared Char.hexDigitToNat? and String.parseHexToUInt64? from Cache/Lean.lean
- Use FilePath instead of String for hashFromFileName and mathlibDepPath (call .toString only when needed for JSON)
- Add docstring noting harvestDecompTask returns (successful, failed) tuple
- Fix typo: print stdout not stderr for lake stdout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

plp127 · 2026-02-12T23:24:55Z

Cache/IO.lean

+  let some handle ← try
+      some <$> IO.FS.Handle.mk ltarPath .read
+    catch _ => pure none | return none


Is this the same as

Suggested change

-  let some handle ← try
-      some <$> IO.FS.Handle.mk ltarPath .read
-    catch _ => pure none | return none
+  let handle ← try IO.FS.Handle.mk ltarPath .read
+    catch _ => return none

?

joneugster · 2026-02-13T09:03:31Z

Since it's only one remaining stylistic suggestion above I'll ping this again:

maintainer delegate

github-actions · 2026-02-13T09:03:58Z

🚀 Pull request has been placed on the maintainer queue by joneugster.

adomani · 2026-02-15T22:42:40Z

Let's try and see how it works!

bors merge

mathlib-bors · 2026-02-15T22:55:52Z

Build failed:

ci (staging) / Post-Build Step

bryangingechen · 2026-02-15T23:09:18Z

There's an issue with the changes in #34847 which is making it difficult to merge PRs that don't cause new cache files to get uploaded. I've already pinged @marcelolynch; in the meantime we can try to merge this in a batch with some other PRs that do result in uploads and it should go through.

bryangingechen · 2026-02-16T01:13:47Z

bors r+

mathlib-bors · 2026-02-16T01:26:03Z

Pull request successfully merged into master.

Build succeeded:

…cache is skipped (#35389)

Currently some PRs sent to `bors` have ended up failing (#35330, #34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first. Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: #34667

kim-em mentioned this pull request Feb 1, 2026

[Merged by Bors] - feat: pipeline downloads and decompression in cache get #32987

joneugster added the t-meta Tactics, attributes or user commands label Feb 1, 2026

grunweg added the CI Modifies the continuous integration setup or other automation label Feb 1, 2026

grunweg assigned grunweg and joneugster and unassigned grunweg Feb 1, 2026

joneugster approved these changes Feb 6, 2026

mathlib-triage bot added the maintainer-merge A reviewer has approved the changes; awaiting maintainer approval. label Feb 6, 2026

plp127 reviewed Feb 6, 2026

plp127 reviewed Feb 9, 2026

chore: simplify readTraceHash error handling

7da8c43

Co-Authored-By: Claude Opus 4.6 <[email protected]>

plp127 reviewed Feb 12, 2026

mathlib-triage bot added ready-to-merge This PR has been sent to bors. and removed maintainer-merge A reviewer has approved the changes; awaiting maintainer approval. labels Feb 15, 2026

mathlib-bors bot changed the title ~~perf(Cache): skip decompression for already-unpacked files~~ [Merged by Bors] - perf(Cache): skip decompression for already-unpacked files Feb 16, 2026

mathlib-bors bot closed this Feb 16, 2026

bryangingechen mentioned this pull request Feb 16, 2026

[Merged by Bors] - ci: Update the workflow so the chain doesn’t get skipped when upload_cache is skipped #35389
