Skip to content

Solver: Cache Concretization Results#48198

Merged
kwryankrattiger merged 37 commits intospack:developfrom
johnwparent:concretization-caching
Feb 28, 2025
Merged

Solver: Cache Concretization Results#48198
kwryankrattiger merged 37 commits intospack:developfrom
johnwparent:concretization-caching

Conversation

@johnwparent
Copy link
Copy Markdown
Contributor

@johnwparent johnwparent commented Dec 18, 2024

Spack Concretization Cache

This PR introduces an initial attempt to cache concretization results. I'm sure the design/function can be improved, this is PR is intended to be an initial implementation to be built from.

This PR:

  • Adds two config options, one to provide a location for the concretization cache on disk (default is ~/.spack/concretization), the other assigns a limit to how many cache entries we allow (default is 1k)

  • Adds the concretization cache to spack clean as a standalone clean operation as well as part of spack clean -a

  • Will prune the cache during each solver run to ensure the cache does not grow unchecked

    • Cache entries are pruned based on age of the entry, this is subject to change, input encouraged.
    • Cache entries are allowed to accumulate to max concretizer cache entries, and once that limit is reached, the oldest 10% are pruned automatically.
  • Makes a number of changes to SpackSolverSetup and related classes to ensure output of solver setup is deterministic

    • Adds a test to validate this

Caching approach:

Cache hashes are the sha256 hash of the string representation of:

  • All .lp asp files involved in a given concretization run
  • The output of SpackSolverSetup.setup split by newline, sorted, and rejoined into a single string

This hash determines if a cache hit has occurred.
Cache entries are stored as a file named by the given hash under a directory with a name based on the first two characters of the cache hash (similar to the source cache).

The contents of a cache entry are spack.solver.asp.Result objects serialized to disk via json, when cache hits are encountered, these result json representations are read from disk and recomposed into spack.solver.asp.Result objects, allowing the reproduction of a solver run exactly as if it had just occurred with no loss of context or statistics.

Benchmarks

3 specs were selected for bench-marking, readline, hdf5+mpi build_system=cmake and paraview+mpi+python.
These packages were selected as they represent a spectrum of DAG sizes, and range from simple to including virtuals, and splicing.
Benchmarks were performed on an M1 Mac

readline

Without a cache:

    setup          3.518s
    cache-check     0.041s
    ordering       0.034s
    load           0.648s
    ground         2.362s
    solve          1.174s
    construct_specs     0.329s
    total          8.093s

With a cache:

    setup          3.459s
    cache-check     0.041s
    ordering       0.034s
    total          3.510s

hdf5+mpi build_system=cmake

Without a cache:

    setup          3.523s
    cache-check     0.041s
    ordering       0.035s
    load           0.642s
    ground         2.401s
    solve          1.510s
    construct_specs     0.426s
    total          8.557s

With a cache:

    setup          3.523s
    cache-check     0.063s
    ordering       0.034s
    total          3.595s

paraview+mpi+python

Without a cache:

    setup          7.983s
    cache-check     0.091s
    ordering       0.076s
    load           1.473s
    ground         5.570s
    solve          5.335s
    construct_specs     1.104s
    total         21.566s

With a cache:

    setup         7.985ms
    cache-check  0.121s
    ordering     0.0898s
    total      8.196s

@spackbot-app spackbot-app bot added commands core PR affects Spack core functionality shell-support tests General test capability(ies) labels Dec 18, 2024
@johnwparent johnwparent force-pushed the concretization-caching branch from 33c066d to 193bd00 Compare December 18, 2024 21:49
@johnwparent
Copy link
Copy Markdown
Contributor Author

cc @tgamblin @becker33

@johnwparent
Copy link
Copy Markdown
Contributor Author

johnwparent commented Dec 19, 2024

Addressing the import check failure will require moving around some additional blocks of code into new modules, pushing up a pass at that now, but very open to alternative suggestions

I've had to move significantly more code than planned. The spack.solver module either needs a substantive re-org or the concretization cache should be left out of the spack.caches module, I'll hold on general consensus on how we want to handle that.

@haampie
Copy link
Copy Markdown
Member

haampie commented Dec 19, 2024

concretization cache should be left out of the spack.caches module

Go for it

@haampie
Copy link
Copy Markdown
Member

haampie commented Dec 19, 2024

I only quickly glanced at this, but what I've seen does not look process safe.

You could do file locking per cache entry / dir, but probably easier is to do atomic adds (write to safe tempfile in target dir and os.rename atomically) and reads (just an open call), minimize tree traversal, and assume that any time you traverse most operations can raise.

Also write in tell don't ask style, any if is_dir test may not be the case anymore the next line.

@johnwparent
Copy link
Copy Markdown
Contributor Author

johnwparent commented Dec 20, 2024

I only quickly glanced at this, but what I've seen does not look process safe.

You're right, Windows is single-spack only at the moment, so I often forget this, will address, thanks.

@johnwparent johnwparent force-pushed the concretization-caching branch from ef537ee to e5003fc Compare January 8, 2025 20:17
@johnwparent johnwparent force-pushed the concretization-caching branch 2 times, most recently from 9b4f1a8 to 1a77c8a Compare January 16, 2025 23:53
@johnwparent
Copy link
Copy Markdown
Contributor Author

Looks like this still needs to handle splicing a bit better, will investigate further monday

@johnwparent
Copy link
Copy Markdown
Contributor Author

Gitlab failures are currently rouge pipeline failures.

@spackbot-app spackbot-app bot added the documentation Improvements or additions to documentation label Feb 27, 2025
@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 28, 2025

Did anybody try any performance benchmark on this PR? Otherwise I can try to compare with develop on a cold and hot cache.

@kwryankrattiger
Copy link
Copy Markdown
Contributor

@johnwparent had shared some numbers previously.

I agree it would be nice to have a record of that in the PR too.

Copy link
Copy Markdown
Contributor

@kwryankrattiger kwryankrattiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving now, will merge after performance numbers are posted.

@johnwparent
Copy link
Copy Markdown
Contributor Author

@kwryankrattiger @alalazo benchmarks have been added to the PR description

@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 28, 2025

Do we have a lockfile per cache entry:

$ ls -larth ~/.spack/cache/concretization/12
total 1,1M
-rwxrwxr-x 1 culpo culpo    0 feb 28 18:36 .128f2dc2834e4e295aca1f50be39c5003340cbcaaff2db8e94ee565d92c3bea1.lock
drwxrwxr-x 3 culpo culpo 4,0K feb 28 18:36 ..
-rw-rw-r-- 1 culpo culpo 1,1M feb 28 18:36 128f2dc2834e4e295aca1f50be39c5003340cbcaaff2db8e94ee565d92c3bea1
drwxrwxr-x 2 culpo culpo 4,0K feb 28 18:36 .

?

@johnwparent
Copy link
Copy Markdown
Contributor Author

johnwparent commented Feb 28, 2025

Do we have a lockfile per cache entry:

$ ls -larth ~/.spack/cache/concretization/12
total 1,1M
-rwxrwxr-x 1 culpo culpo    0 feb 28 18:36 .128f2dc2834e4e295aca1f50be39c5003340cbcaaff2db8e94ee565d92c3bea1.lock
drwxrwxr-x 3 culpo culpo 4,0K feb 28 18:36 ..
-rw-rw-r-- 1 culpo culpo 1,1M feb 28 18:36 128f2dc2834e4e295aca1f50be39c5003340cbcaaff2db8e94ee565d92c3bea1
drwxrwxr-x 2 culpo culpo 4,0K feb 28 18:36 .

?

Yes, each entry is locked individually as per the behavior of the underlying FileCache used to manage the cache entries.
We get two benefits with this approach, the entry locks are more available than if they were managed by a single, central lock, and using FileCache allows us to switch out the caching backend in the future to allow for remote caches.

@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 28, 2025

I guess we can probably optimize in a later PR to use a single lock file per the entire cache, like we do for store and stages (and lock different bytes based on the entry, iirc).

@kwryankrattiger kwryankrattiger merged commit d234df6 into spack:develop Feb 28, 2025
36 of 37 checks passed
mtaillefumier pushed a commit to mtaillefumier/spack that referenced this pull request Mar 2, 2025
Concretizer caching for reusing solver results
white238 pushed a commit that referenced this pull request Mar 3, 2025
Concretizer caching for reusing solver results
@haampie
Copy link
Copy Markdown
Member

haampie commented Mar 11, 2025

@kwryankrattiger why did you merge this? Did you try it out? Please check with other reviewers before merging.

I would suggest to disable this feature, because getting cache hits reliably for even the simplest spec is impossible:

concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at 18/18607911a1944d36e1a49f3728783e0a957a0a21c390138f22044afc1715ba4e
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at 28/286c42d84b58dbbc59d6e9197fb31560b1fafe6d1044a69b04f546e4f6e02f6d
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache hit at 28/286c42d84b58dbbc59d6e9197fb31560b1fafe6d1044a69b04f546e4f6e02f6d
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache hit at 18/18607911a1944d36e1a49f3728783e0a957a0a21c390138f22044afc1715ba4e
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at de/de4b86777b02dbba18b370a56ce6d40213ab03d775432dd888909622b9687cff
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at dc/dcdd9c18c0480908ae5676b4ec026a091f5c999e4d1f1bb366c0dd4e990f3f79
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at 4f/4f998c4b067724eab7de84ebcc43ba2bb3450648819d75fe0a14b816e7d24a0e
concretization $ spack -d  spec zlib |& grep -E '(hit|miss)'
==> Concretization cache miss at 43/4301e1911df4c87eaab53d3991c22986c8d64948ce6129564aaed2cfad570028

2 of 7 hits, that's really bad.

Also don't understand why the config option was put under config:concretization_cache instead of concretizer:cache. Also why is config:concretization_cache:url the key if it's a path, it doesn't accept arbitrary urls and does not even accept file urls.

@kwryankrattiger
Copy link
Copy Markdown
Contributor

I had tried it out and we had discussed this PR in another meeting with @tgamblin and @alalazo and it was fine to merge.

The issues causing non determinism were found post merge. The issues are known and being resolved by @johnwparent now. It isn't breaking workflows so no one jumped to revert.

@kwryankrattiger
Copy link
Copy Markdown
Contributor

kwryankrattiger commented Mar 11, 2025

The key is URL is because it is going to accept URLs in the near term for CI sharing of the cache.

The location of the config was chosen to match the other misc cache configs. We don't put mirror misc cache configs in mirrors.yaml for example.

@haampie
Copy link
Copy Markdown
Member

haampie commented Mar 11, 2025

We don't put mirror misc cache configs in mirrors.yaml for example.

Not the best example given that misc_cache is a combination of repo metadata, compiler output and build cache indices.

I still think it's better to put it in concretizer given it has various sub-options.

haampie added a commit that referenced this pull request Mar 19, 2025
haampie added a commit that referenced this pull request Mar 19, 2025
tgamblin pushed a commit that referenced this pull request Jun 6, 2025
#48198 Introduces the concretization cache, and associated manifest. However,
it only added testing for the cache itself, not the manifest or cleanup procedures.

This PR:

- [x] Adds tests for the manifest and cleanup
- [x] Refactors the manifest to update correctly
- [x] Refactors the cache to store base32 hashes like the rest of Spack's hashes
- [x] gzips the cache entries for a smaller memory footprint

`FileCache` class was refactored into an abstract base `Cache` class which provides
the general cache read/write transaction interface, and a `FileCache` and
`DirectoryCache` child classes, `FileCache` which should be functionally identical to
the `FileCache` class pre refactor, deals with caches at the individual file level, and
`DirectoryCache`, which is mean to provide a directory level or "bucket" caching level.
tgamblin added a commit that referenced this pull request Oct 18, 2025
Third try to enable concretization caching in `develop`. Past attempts were reverted due to cache
management issues and race conditions.

See #48198 for the introduction of the concretization cache, manifest, etc. That PR only
added testing for the cache itself, not the manifest or cleanup procedures.

This PR:

- [x] Adds tests for cleanup
- [x] Refactors the cache to store base32 hashes like the rest of Spack's hashes
- [x] gzips cache entries for a smaller memory footprint
- [x] Removes redundant stat calls 
- [x] Less aggressive/error prone cleanup phase
- [x] Adds an entry limit to the cache, removes the size limit
- [x] Rewrite pruning to use LRU instead of FIFO
- [x] Canonicalizes and strips `clingo` input before hashing, for greater hit rate

This PR simplifies the locking around the concretization cache. Entries are stored by
the base32 hash of their inputs, and there is a single lock file (as we do in many places in
spack) for all such entries. We use byte range locks to manage all the entries.

Cleanup has changed to be less expensive overall. There is no more size limit; there is only
an entry limit, and specs in the cache are compressed. Specs are ~35k, and the initial limit
is 1000 cache entries, or ~35 MB.  This limit is customizable in configuration. To check if the
cache needs cleanup, we need only list files in the cache directory. To clean up, we stat the files
and clean up the least recently used concretizations.  We remove half the cache at a time so that
we do not pay for cleanup on every concretization when we're close to the cache limit.

The concretization cache is stored in the `misc_cache` by default, so `spack clean -m` is all
that is needed to get rid of it.  The location of the concretization cache can be specified
independently (with a `url` config option) but we will document and expand on that use later
as we integrate it in CI.

---------

Signed-off-by: John Parent <[email protected]>
Signed-off-by: Todd Gamblin <[email protected]>
Co-authored-by: Todd Gamblin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

commands core PR affects Spack core functionality documentation Improvements or additions to documentation shell-support tests General test capability(ies) utilities

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants