improve performance of clingo program generation #51365

@cosmicexplorer

Description

I have been looking at #51245 (also see #51338). Despite the special case of infinite looping in certain conditional-variant situations, package solving itself generally takes very little time (< 2 seconds of the 50-second run profiled below). Instead, our issue is that we have to load and execute a vast amount of python code just to get to the solver.

I found that enabling the concretization cache did reduce the runtime significantly (50 => 33 seconds), but it did not significantly change the structure of the profile graph. As we will see, it is completely orthogonal to the changes I'm proposing, and I believe the two methodologies will work very well together. Both profiles are shown below.

Highlights

The results from the bigger graphs (the first one, without the concretization cache, and the last one at the bottom, with the cache) are of forensic utility here. Both the cached and uncached profile outputs are useful for different parts. Playing with --node-thres from gprof2dot is very useful.

Starting from the simplest ones:

  1. isinstance is very hot (~6%, 4-5 seconds regardless of the concretization cache).
    • This is about 1/3rd from core:_id and 2/3rds from different parts of version_types.py.
    • This can probably be optimized pretty readily into type equality checking.
  2. spec parsing for versions has a very mild hotspot (0.5%, 1.2 seconds).
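
For the isinstance hotspot, replacing isinstance() with an exact-type check is the kind of micro-optimization that can pay off at this call volume, since a type() identity test skips the MRO walk and __instancecheck__ machinery. A minimal sketch of the idea (the Version class here is a hypothetical stand-in, not the one from version_types.py); it is only a safe swap where subclass instances never need to pass the check:

```python
# Sketch: exact-type check vs. isinstance(), assuming no subclass
# instances need to pass the check. Version is a hypothetical stand-in.
class Version:
    def __init__(self, string):
        self.string = string

v = Version("1.2.3")

def is_version_isinstance(obj):
    # isinstance() walks the MRO and honors __instancecheck__.
    return isinstance(obj, Version)

def is_version_exact(obj):
    # type() identity is a single pointer comparison.
    return type(obj) is Version

assert is_version_isinstance(v) == is_version_exact(v) == True
```

The caveat is that `type(obj) is Version` returns False for subclass instances, so this would need an audit of whether version_types.py ever relies on subclassing at those call sites.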

Some of the difficulty in attributing a specific slowdown comes from laziness, which is a general category of issue:

  3. spec hashing and equality is lazy.
    • hard to attribute a direct number, but sorted() is (10%, 5 seconds). guessing this is ~20%.
    • the hash should almost definitely be cacheable, even for abstract specs. we may need to produce some abstraction to invalidate the cached hash upon modification.
    • failing that, we'll need to do what I did with the python packaging library and pip more generally, by separating mutable and immutable specs.
    • this also relates to stringification, see (5).
  4. duplicating a spec (from __init__()) performs an eager deep copy (2.8%, 1.4 seconds).
    • this ends up recursively calling __init__().
    • this is worthy of note because it rules out structurally sharing cached results from hashing and stringification, see (5).
  5. spec stringification is extremely hot (16%, 7.9 seconds).
    • in particular, since most dependencies are shared, we should be able to make this much faster.
    • format_attribute() is a particular hot spot, taking (4%, 2 seconds) at its own top level. this is also attributable to versions in particular (see (2)).
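
The cached-hash-with-invalidation abstraction for spec hashing could look something like the following sketch (MiniSpec is a hypothetical stand-in for spack.spec.Spec; the real class would need to invalidate from every mutating method, which is the hard part):

```python
# Sketch: cache a spec's hash and invalidate it on mutation.
# MiniSpec is a hypothetical stand-in for spack.spec.Spec.
class MiniSpec:
    def __init__(self, name, versions=()):
        self._name = name
        self._versions = tuple(versions)
        self._cached_hash = None  # cleared on every mutation

    def _invalidate(self):
        self._cached_hash = None

    def add_version(self, v):
        # Every mutating method must invalidate the cached hash.
        self._versions = self._versions + (v,)
        self._invalidate()

    def __hash__(self):
        if self._cached_hash is None:
            # The expensive traversal runs once per mutation epoch,
            # instead of on every hash/equality check.
            self._cached_hash = hash((self._name, self._versions))
        return self._cached_hash

    def __eq__(self, other):
        return (type(other) is MiniSpec
                and self._name == other._name
                and self._versions == other._versions)
```

Failing a reliable invalidation story, the mutable/immutable split mentioned above sidesteps the problem entirely: only the frozen variant caches its hash.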

These are the more involved issues, but they too are solvable:

  6. is_allowed_on_this_platform() is (12%, 6 seconds) despite being memoized in-memory.
    • i'm not sure of the guarantees we provide for ~/.spack/cache, but we could key this by host platform and cpu.
    • i think we should also be able to share cached results to some degree across packages, but calculating whether the requirement applies is not the problem (see (7)).
  7. we should never have to calculate any package compatibility at all if the package definitions haven't changed since the last run.
    • in particular, we shouldn't need to import any package.py until we get to the fetch/configure/build/install part.
    • more on this in the next section.
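
Persisting the is_allowed_on_this_platform() memoization across runs, keyed by host platform and cpu, could look like this sketch (all names are hypothetical, not spack internals; the real cache would live under ~/.spack/cache with whatever guarantees that implies):

```python
# Sketch: persist platform-compatibility results across runs,
# keyed by (platform, machine, requirement). All names hypothetical.
import json
import os
import platform
import tempfile

CACHE_PATH = os.path.join(tempfile.gettempdir(), "platform_allowed.json")

def _cache_key(requirement):
    # Keying by host platform and machine means a cache copied to a
    # different host is simply ignored rather than giving wrong answers.
    return "|".join((platform.system(), platform.machine(), requirement))

def load_cache():
    try:
        with open(CACHE_PATH) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def save_cache(cache):
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)

def is_allowed_cached(requirement, compute):
    # compute() is the expensive check; it only runs on a cache miss.
    cache = load_cache()
    key = _cache_key(requirement)
    if key not in cache:
        cache[key] = compute(requirement)
        save_cache(cache)
    return cache[key]
```

Because the key embeds the requirement string rather than the package name, results are naturally shared across packages that state the same requirement.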

Architectural discussion

Spack performs at least two distinct classes of task with package.py files:

  1. specify compatibility with other packages as input to the concretizer.
  2. specify how to fetch, configure, build, install the package.

It is unparalleled at both. One impressive thing we have achieved and continue to maintain is the correspondence between the clingo solver and the python-level spack.spec.Spec class, so that package writers don't need to be experts in clingo to get the job done. But in using clingo for logic, we have also created a serialization boundary -- and serialization boundaries can be cached. This is the discussion I'd like to open up.

Background: pip analogy

As background, I've done some work on pip in this regard: pypa/pip#12921. That workstream revolves around metadata in general, but the most appropriate analogy is pypa/pip#12258 (which corresponds very closely to our config:concretization_cache:enabled:true). I have more work on pip in another branch which fully separates package finding from evaluating compatibility for a given python platform.

Like spack, pip also currently intersperses fetching the description of a package with evaluating its compatibility for a given input. In pip's case, the package definitions are stored remotely, so require a network call (which can still be cached: pypa/pip#12257). Spack instead has the problem of maintaining all of pypi locally. However, unlike pypi, we have a much richer representation of dependencies that allows and encourages package authors to deduplicate constraints. This is why it's still at all feasible to run spack without a remote server, which is an extremely impressive feat.

With pip, the network request and parsing can be cached, and then there are two variants of additional improvement:

  1. filtering by python platform: each instance of a package can be yes/no filtered against the python platform, which can be cached (see (6)).
  2. asynchronous subprocesses: the data for each package name can be loaded in parallel.

I'll finally sketch out what we can do for spack.

Implementation

In essence, we want to serialize the result of evaluating all the specs attached to a package, then retrieve the results of that instead of calculating it afresh upon each spack run. The existing concretization cache is very coarse-grained, and it does not cache the problem generation step. This means that the pkg_rules() and possible_dependencies() calculations take the exact same amount of time both with and without the concretization cache.

Less good: checksum all the things all the time

There is an immediate idea that comes to mind: take a checksum of the entire package repo at once, because if nothing at all has changed, a cached serialization could always be returned. But the same problem arises here as with pypi: most packages are not involved in most solves, yet you would need to eagerly compute a checksum for every single package to make this work at all. Parallelism would help, but ~9000 checksums is still no laughing matter, and it has difficult UX concerns of its own (open fd count, intermittent filesystem errors).

I'll note that the canonical way build systems like pants solve this sort of problem is with a persistent daemon process and a file watcher, which recomputes a chain of outputs upon file changes. This would actually work, but having implemented such a daemon twice, in two programming languages, with much less stringent portability requirements than spack has, I can comfortably say it is to be avoided if at all possible.

We could avoid some of the compatibility issues by creating an IPC protocol for the daemon, but it would end up codifying most of the same goals as the next approach, which also does not require radically rearchitecting package finding.

Plausible: lazily checksum, lazily expand

This approach leverages a few assumptions:

  1. A package.py file is the single entry point describing a package's dependencies on other packages.
    • A when(...) directive can make transitive queries, but those don't add new dependencies to a dependency.
  2. A package.py file's module contents upon import are described fully by its checksum.
    • So no importing other local files.
  3. The contents of a package.py do not change over the course of a spack invocation.
    • So we can import them in any order, and we can trust a checksum not to change.
  4. Dependencies can be added only by spec roots (e.g. from the cli) and transitive dependencies of those roots.
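
Under assumptions (2) and (3), a per-package checksum can be computed lazily, only for the packages a given solve actually touches, and memoized for the rest of the invocation. A minimal sketch (hashlib-based; the function name is hypothetical):

```python
# Sketch: lazily checksum package.py files as a solve first touches
# them. The memo dict relies on assumption (3): file contents are
# stable for the duration of one spack invocation.
import hashlib

_checksum_memo = {}

def package_checksum(package_py_path):
    if package_py_path not in _checksum_memo:
        with open(package_py_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        _checksum_memo[package_py_path] = digest
    return _checksum_memo[package_py_path]
```

This avoids the eager whole-repo scan from the previous section: a solve touching 100 packages checksums 100 files, not ~9000.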

The idea here is to expand our concept of roots from the cli with roots from the package repository (checksummed package.py files). As we build up the tree of dependencies to recurse through from the cli roots in one direction, we also build up a tree from the repo roots in the other direction, which can be persisted to disk on a per-package level.

If our clingo logic represented the entirety of our spec.satisfies() logic, then we wouldn't need to build up such a tree, and could instead cache all the rules generated by a package.py based upon the package.py's checksum alone. And after reviewing the code, it seems it mostly does, which would make this easier than expected.

The main point is that, across concretizer runs where the universe of package.py files doesn't change, we should essentially never need to import a package class or generate package-specific rules, since neither changes between runs. The cache "tree" is only necessary for intersections of two packages. It seems that the trigger and effect rules codify what those intersections might be, and they are also a very small portion of the runtime.
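
If it holds that package-specific rules depend only on the package.py contents, caching them by checksum could look something like this sketch (RuleCache and its json-on-disk format are hypothetical, not spack internals):

```python
# Sketch: cache package-specific asp rules keyed by the package.py
# checksum, so unchanged packages never re-import or regenerate rules.
# All names are hypothetical stand-ins for spack internals.
import json
import os

class RuleCache:
    def __init__(self, path):
        self._path = path
        try:
            with open(path) as f:
                self._entries = json.load(f)
        except (OSError, ValueError):
            self._entries = {}

    def rules_for(self, pkg_name, checksum, generate):
        # Hit only when the package.py content is byte-identical;
        # a changed file produces a new key and regenerates.
        key = "{0}@{1}".format(pkg_name, checksum)
        if key not in self._entries:
            # Expensive path: imports package.py and emits its rules.
            self._entries[key] = generate(pkg_name)
        return self._entries[key]

    def save(self):
        with open(self._path, "w") as f:
            json.dump(self._entries, f)
```

The cross-package "tree" discussed above would sit on top of this, invalidating an intersection entry when either side's checksum changes.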

Summary of next steps

I need to do more research on how packages get converted into rules. Here are the things I think are immediately viable:

  • Individual spec string and hash caching as first described in the profile analysis.
  • Caching possible_dependencies() (for possible_virtuals()) should be easier than most and will play into the existing input_analysis paradigm.
  • Caching define_version_constraints() will be an important first step towards caching the rest.

Oh, and you can use parallelism, particularly since the asp program is already delimited by headers. This could be worth doing on its own separate from actually caching the output.
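
A sketch of that header-delimited parallelism (emit_section/emit_program are hypothetical; threads are shown for simplicity, though real speedups on CPU-bound rule generation would need subprocesses or free-threaded python, per the discussion of parallelism below):

```python
# Sketch: generate header-delimited sections of the asp program in
# parallel workers and concatenate them in a deterministic order.
from concurrent.futures import ThreadPoolExecutor

def emit_section(name, generate_rules):
    # Each section is delimited by a header comment, mirroring the
    # existing asp program layout.
    return "%% Section: {0}\n{1}\n".format(name, generate_rules(name))

def emit_program(sections, generate_rules):
    with ThreadPoolExecutor() as pool:
        bodies = pool.map(lambda n: emit_section(n, generate_rules),
                          sections)
    # Executor.map() preserves input order, so the assembled program
    # is byte-identical to a serial run.
    return "".join(bodies)
```

Because the sections are independent, this composes with the caching above: a worker can return either a cached section or a freshly generated one.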

Profile repro steps

This corresponds to a normal spack spec run that took 50 seconds despite having no changed state compared to the same command run immediately beforehand. This is not to express frustration but just to explain why I think this should be a solvable problem.

speedscope.py (with speedscope disabled since it's not useful here and takes a while to generate):

# import speedscope

from spack.concretize import concretize_one
from spack.spec import Spec

# with speedscope.track('spack.py-speedscope.json'):
#     concretize_one(Spec('icu4c cxxstd=17'))

print(concretize_one(Spec('icu4c cxxstd=17')))

The command line (profile requires gprof2dot, dot, and xdot):

# Execute spack:
; PYTHONPROFILEIMPORTTIME=1 PYTHONMALLOC=mimalloc PYTHON_GIL=0 PYTHONOPTIMIZE=1 PYTHONPATH="$(pwd)/lib/spack:${PYTHONPATH}" \
  spack -d python -m cProfile -o cprof.pstats ./speedscope.py \
  &>import-time.txt
# Generate graph drawing which demonstrates hot functions:
; gprof2dot -f pstats cprof.pstats --show-samples --color-nodes-by-selftime --skew=0.005 --node-thres=1 --edge-thres=0.1 --node-label={self,total}-time{,-percentage} --time-format='%.3f sec' --root='speedscope:*:<module>' \
  | tee cprof.dot | tee >(dot -Ln10 -Tpng -s200 -o cprof.png) \
  | dot -Txdot -Ln30 -s200 -o cprof.xdot && 2>/dev/null xdot --no-filter --hide-toolbar cprof.xdot

This profiled run produced two output files, import-time.txt and cprof.png, reproduced below:

import-time.txt

[image]

Here is another cprof.png, generated with --node-thres=5:

Image

After setting config:concretization_cache:enable:true (33 seconds), I produced the below graph drawing, also with --node-thres=5:

Image

Here's with --node-thres=1 and concretization cache enabled (this is probably the one worth staring at the longest):

Image

Feature branch

This was technically made on the branch concretize-inputs-faster, but that's not important at the moment.

Discussion of parallelism

The branch could become important if we employ parallelism, since the event loop could be a useful way to wait upon parallel inputs. But the main theory of parallelism here would just be the ability to avoid serially blocking on file imports, which just requires multiprocessing and not a more granular event loop.

If --disable-gil gets much further along, then we could conceivably be able to avoid the cache proposed above in favor of multithreaded imports. But I don't believe that's likely to work anytime soon.
