Skip to content

[Merged by Bors] - perf(Cache): skip decompression for already-unpacked files#34667

Closed
kim-em wants to merge 3 commits intoleanprover-community:masterfrom
kim-em:skip-already-unpacked-cache
Closed

[Merged by Bors] - perf(Cache): skip decompression for already-unpacked files#34667
kim-em wants to merge 3 commits intoleanprover-community:masterfrom
kim-em:skip-already-unpacked-cache

Conversation

@kim-em
Copy link
Copy Markdown
Contributor

@kim-em kim-em commented Feb 1, 2026

This PR adds pre-filtering to lake exe cache get: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

Performance (tested on Mac Studio with SSD, ~7900 cached files):

Scenario Before After
All files already unpacked ~25s ~5s
Partial decompression needed ~25s ~5s + decompression time

Timing breakdown (after this PR):

  • Hash computation: ~2.8s
  • Filter to existing ltar files: ~0.06s
  • Read ltar headers + trace files, compare hashes: ~2.5s
  • leantar (if needed): 0s when all files match

What changed:
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's depHash field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code

Previously `lake exe cache get` would pass all cached ltar files to
leantar for decompression, even when the files were already unpacked.
Leantar would check each trace file and skip extraction, but this still
required opening and parsing thousands of files.

This PR adds pre-filtering in the cache script: before calling leantar,
we check if each module's trace file exists and has a matching depHash.
We compare the Lake depHash from the ltar file header (not the mathlib
cache hash) with the trace file's depHash field.

Modules that are already correctly unpacked are filtered out, so leantar
only processes files that actually need decompression.

Performance improvement:
- Before: ~25 seconds for "No files to download" case
- After: ~5 seconds for same case

Fixes issue discussed at:
https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 1, 2026

PR summary 5cb27be88d

Import changes for modified files

No significant changes to the import graph

Import changes for all files
Files Import difference

Declarations diff

+ Char.hexDigitToNat?
+ ModuleHashMap.filterNeedsDecompression
+ String.parseHexToUInt64?
+ getTracePath
+ needsDecompression
+ readLtarHash
+ readTraceHash

You can run this locally as follows
## summary with just the declaration names:
./scripts/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/declarations_diff.sh long <optional_commit>

The doc-module for script/declarations_diff.sh contains some details about this script.


No changes to technical debt.

You can run this locally as

./scripts/technical-debt-metrics.sh pr_summary
  • The relative value is the weighted sum of the differences with weight given by the inverse of the current value of the statistic.
  • The absolute value is the relative value divided by the total sum of the inverses of the current values (i.e. the weighted average of the differences).

@MichaelStollBayreuth
Copy link
Copy Markdown
Contributor

In my view, this is a much appreciated QoL improvement. Unfortunately, reviewing it is above my paygrade...

@joneugster joneugster added the t-meta Tactics, attributes or user commands label Feb 1, 2026
@joneugster
Copy link
Copy Markdown
Contributor

joneugster commented Feb 1, 2026

I could review this next Friday, too, if nobody else beats me to it.

Btw, please label PRs about Cache with either t-meta or ci as they might be missed without topic label. (this will be done automatically in #34677)

@grunweg grunweg added the CI Modifies the continuous integration setup or other automation label Feb 1, 2026
@grunweg grunweg assigned grunweg and joneugster and unassigned grunweg Feb 1, 2026
Copy link
Copy Markdown
Contributor

@joneugster joneugster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

Since my suggestions are only about wording and moving the helper functions around, I suggest

maintainer delegate

@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 6, 2026

🚀 Pull request has been placed on the maintainer queue by joneugster.

@mathlib-triage mathlib-triage bot added the maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. label Feb 6, 2026
- Move hex parsing functions to Cache/Lean.lean near related hex utilities
- Rename to Char.hexDigitToNat? and String.parseHexToUInt64? following conventions
- Make parseHexToUInt64? pure (Option instead of IO)
- Use "decompressed" consistently instead of mixing with "unpacked"
- Improve message when all files already decompressed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
kim-em added a commit to kim-em/mathlib4 that referenced this pull request Feb 9, 2026
…comments

Merge skip-already-unpacked-cache (leanprover-community#34667) into pipeline-cache-decompression (leanprover-community#32987) and:

- Remove duplicate hex parsing functions, use shared Char.hexDigitToNat? and
  String.parseHexToUInt64? from Cache/Lean.lean
- Use FilePath instead of String for hashFromFileName and mathlibDepPath
  (call .toString only when needed for JSON)
- Add docstring noting harvestDecompTask returns (successful, failed) tuple
- Fix typo: print stdout not stderr for lake stdout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Comment on lines +414 to +416
let some handle ← try
some <$> IO.FS.Handle.mk ltarPath .read
catch _ => pure none | return none
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as

Suggested change
let some handle ← try
some <$> IO.FS.Handle.mk ltarPath .read
catch _ => pure none | return none
let handle ← try IO.FS.Handle.mk ltarPath .read
catch _ => return none

?

@joneugster
Copy link
Copy Markdown
Contributor

joneugster commented Feb 13, 2026

Since it's only one remaining stylistic suggestion above I'll ping this again:

maintainer delegate

@github-actions
Copy link
Copy Markdown

🚀 Pull request has been placed on the maintainer queue by joneugster.

@adomani
Copy link
Copy Markdown
Contributor

adomani commented Feb 15, 2026

Let's try and see how it works!

bors merge

@mathlib-triage mathlib-triage bot added ready-to-merge This PR has been sent to bors. and removed maintainer-merge A reviewer has approved the changed; awaiting maintainer approval. labels Feb 15, 2026
mathlib-bors bot pushed a commit that referenced this pull request Feb 15, 2026
This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

**Performance (tested on Mac Studio with SSD, ~7900 cached files):**

| Scenario | Before | After |
|----------|--------|-------|
| All files already unpacked | ~25s | ~5s |
| Partial decompression needed | ~25s | ~5s + decompression time |

**Timing breakdown (after this PR):**
- Hash computation: ~2.8s
- Filter to existing ltar files: ~0.06s  
- Read ltar headers + trace files, compare hashes: ~2.5s
- leantar (if needed): 0s when all files match

**What changed:**
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Feb 15, 2026

Build failed:

@bryangingechen
Copy link
Copy Markdown
Contributor

There's an issue with the changes in #34847 which is making it difficult to merge PRs that don't cause new cache files to get uploaded. I've already pinged @marcelolynch; in the meantime we can try to merge this in a batch with some other PRs that do result in uploads and it should go through.

@bryangingechen
Copy link
Copy Markdown
Contributor

bors r+

mathlib-bors bot pushed a commit that referenced this pull request Feb 16, 2026
This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

**Performance (tested on Mac Studio with SSD, ~7900 cached files):**

| Scenario | Before | After |
|----------|--------|-------|
| All files already unpacked | ~25s | ~5s |
| Partial decompression needed | ~25s | ~5s + decompression time |

**Timing breakdown (after this PR):**
- Hash computation: ~2.8s
- Filter to existing ltar files: ~0.06s  
- Read ltar headers + trace files, compare hashes: ~2.5s
- leantar (if needed): 0s when all files match

**What changed:**
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code
@mathlib-bors
Copy link
Copy Markdown
Contributor

mathlib-bors bot commented Feb 16, 2026

Pull request successfully merged into master.

Build succeeded:

@mathlib-bors mathlib-bors bot changed the title perf(Cache): skip decompression for already-unpacked files [Merged by Bors] - perf(Cache): skip decompression for already-unpacked files Feb 16, 2026
@mathlib-bors mathlib-bors bot closed this Feb 16, 2026
mathlib-bors bot pushed a commit that referenced this pull request Feb 16, 2026
…cache is skipped (#35389)

Currently some PRs sent to `bors` have ended up failing (#35330, #34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>
kim-em pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Feb 17, 2026
…cache is skipped (leanprover-community#35389)

Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>
kim-em pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Feb 17, 2026
…cache is skipped (leanprover-community#35389)

Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>
Maldooor pushed a commit to Maldooor/mathlib4 that referenced this pull request Feb 25, 2026
…r-community#34667)

This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

**Performance (tested on Mac Studio with SSD, ~7900 cached files):**

| Scenario | Before | After |
|----------|--------|-------|
| All files already unpacked | ~25s | ~5s |
| Partial decompression needed | ~25s | ~5s + decompression time |

**Timing breakdown (after this PR):**
- Hash computation: ~2.8s
- Filter to existing ltar files: ~0.06s  
- Read ltar headers + trace files, compare hashes: ~2.5s
- leantar (if needed): 0s when all files match

**What changed:**
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code
Maldooor pushed a commit to Maldooor/mathlib4 that referenced this pull request Feb 25, 2026
…cache is skipped (leanprover-community#35389)

Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>
pfaffelh pushed a commit to pfaffelh/mathlib4 that referenced this pull request Mar 2, 2026
…r-community#34667)

This PR adds pre-filtering to `lake exe cache get`: before calling leantar, we check if each module's trace file exists and has a matching Lake depHash. Modules that are already correctly unpacked are skipped entirely.

**Performance (tested on Mac Studio with SSD, ~7900 cached files):**

| Scenario | Before | After |
|----------|--------|-------|
| All files already unpacked | ~25s | ~5s |
| Partial decompression needed | ~25s | ~5s + decompression time |

**Timing breakdown (after this PR):**
- Hash computation: ~2.8s
- Filter to existing ltar files: ~0.06s  
- Read ltar headers + trace files, compare hashes: ~2.5s
- leantar (if needed): 0s when all files match

**What changed:**
Before calling leantar, we now read the Lake depHash from each ltar file header (first 12 bytes) and compare with the trace file's `depHash` field. Files with matching hashes are filtered out.

Note: The mathlib cache hash (ltar filename) is different from the Lake depHash (stored in ltar header and trace file). This PR compares the Lake depHashes, which is what leantar uses internally.

Fixes https://leanprover.zulipchat.com/#narrow/channel/536994-ecosystem-infrastructure/topic/lake.20exe.20cache.20get.20always.20decompresses

🤖 Prepared with Claude Code
pfaffelh pushed a commit to pfaffelh/mathlib4 that referenced this pull request Mar 2, 2026
…cache is skipped (leanprover-community#35389)

Currently some PRs sent to `bors` have ended up failing (leanprover-community#35330, leanprover-community#34667) because the upload_cache step was skipped, causing the required Post-Build and Post-CI jobs to get skipped as well. This PR adjusts the conditionals to ensure that those steps still run even if the cache upload gets skipped.

Co-authored-by: Bryan Gin-ge Chen <[email protected]>
mathlib-bors bot pushed a commit that referenced this pull request Mar 6, 2026
This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: #34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 8, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 9, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
seewoo5 pushed a commit to seewoo5/mathlib4 that referenced this pull request Mar 9, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
xroblot pushed a commit to xroblot/mathlib4 that referenced this pull request Mar 10, 2026
…-community#32987)

This PR modifies `lake exe cache get` to decompress files as they download, rather than waiting for all downloads to complete first.

Previously the cache system had two sequential phases: download all files using `curl --parallel`, then decompress all files using a single `leantar` call. Now a background task spawns sequential batched `leantar` calls to decompress files as downloads complete, pipelining network I/O and disk I/O.

🤖 Prepared with Claude Code

- [x] depends on: leanprover-community#34667
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Modifies the continuous integration setup or other automation ready-to-merge This PR has been sent to bors. t-meta Tactics, attributes or user commands

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants