
Experimental ASP-based concretizer #19501

Merged
tgamblin merged 105 commits into develop from features/solver-rebased
Nov 17, 2020

Conversation

@alalazo
Member

@alalazo alalazo commented Oct 23, 2020

This adds an experimental concretizer to Spack that makes use of Answer Set Programming. The new solver uses clingo, the combined grounder (gringo) and solver (clasp) from Potassco, instead of attempting to do the entire solve in Python as we did before.

Problem

The current concretizer is not exhaustive and is not guaranteed to find a solution if one exists. It is greedy, does not do backtracking, and as a result the errors it gives can be confusing. In many cases it can resolve a constraint too early -- it assigns final values to nodes as it builds the graph, and it may assign a version, variant, or other value to a node without considering the entire problem. This can result in the concretizer finding conflicts where solutions exist.

Extending the current concretizer is difficult. There are many moving pieces, and rewriting any of them can be a lot of work. Making major changes to concretization semantics is difficult, and increasingly we're unable to add features that users need.

The concretization problem that Spack solves is NP-complete (as is regular old dependency resolution with just versions), so we have decided to start using more powerful solvers to address it. Using Clingo is a step in this direction.

How it works

Answer Set Programming looks like Prolog but has the advantage that, rather than trying to be a full programming language, it reduces logic problems to CDCL SAT with optimization. ASP has the nice property that solves will always terminate (unlike Prolog programs), and it follows the stable model semantics, which makes it easier to model some of the default-driven behavior in Spack. The constraints in Spack are essentially SAT clauses, and our preferences (like packages.yaml) add optimization weights to them. We don't just want a solution -- we want the best one, with preferred variants, compilers, etc.

The prior concretizer iteratively expanded nodes in the graph and attempted to unify them greedily as they were expanded. It used a fixed-point algorithm, iterating and expanding optional dependencies, replacing virtuals with providers, setting variants, versions, and compilers, etc., until there were no more changes to the DAG.

The new concretizer is less dynamic. It works like this:

  1. Look at the input specs and read in all possible dependencies
  2. Translate constraints in Spack's package DSL to rules and facts in ASP. e.g., zlib has some facts like this:
    version_declared("zlib","1.2.11",0).
    version_declared("zlib","1.2.8",1).
    version_declared("zlib","1.2.3",2).
    
    variant("zlib","optimize").
    variant_single_value("zlib","optimize").
    variant_default_value_from_package_py("zlib","optimize","True").
    variant_possible_value("zlib","optimize","False").
    variant_possible_value("zlib","optimize","True").
    These come directly from version() and variant() directives in the zlib/package.py file.
  3. Combine these facts with the logic program in lib/spack/spack/solver/concretize.lp. This defines all the rules and optimization criteria used in concretization in Spack.
  4. Send the whole program off to Clingo
  5. Read back the optimal stable model found by the solver, and build a concrete Spec from it.
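
The fact generation in step 2 can be sketched as a straightforward mapping from directive data to ASP strings. This is an illustrative toy, not the actual API in lib/spack/spack/solver/asp.py:

```python
# Illustrative sketch only -- not the real asp.py fact generator.
def version_facts(pkg, versions):
    """Map a package's declared versions to version_declared facts.

    The index doubles as a preference weight: earlier (more
    preferred) versions get lower weights for minimization.
    """
    return [
        'version_declared("{0}","{1}",{2}).'.format(pkg, v, i)
        for i, v in enumerate(versions)
    ]

for fact in version_facts("zlib", ["1.2.11", "1.2.8", "1.2.3"]):
    print(fact)
```

Run on zlib's versions, this produces the same version_declared facts shown above.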

One nice thing about this is that it's declarative; we can add extra constraints and features more easily now, without having to reengineer the entire algorithm. Another is that we can solve much more complex problems than we could before. Finally, it will be easier going forward to implement many of the features we have planned (compilers as dependencies, optimizing for installed packages, etc.) using a real solver.

Error reporting

We are using Clingo's support for finding unsatisfiable cores to generate error messages. If something fails to concretize, the solver will tell you why by reporting a set of facts and rules that cannot be resolved together. This can be very useful, as often it points directly to conflicts and can be used, with a little staring, to debug issues. In other cases the core can still be quite large, and thus hard to interpret. Cores are not guaranteed to be minimal by Clingo. Improving these messages and converting them to understandable sentences is on our list of follow-on activities.

Alternatives

There are more details on alternatives in this FOSDEM talk. So far, Clingo works well, but we are also investigating Z3, an SMT solver. Z3 is more widely used and has a richer API, but its input language is in some sense more primitive, in that it doesn't accept or automatically ground first-order logic like Clingo. We are investigating which solver will be easiest to include (vendor) with Spack so that we can use the new concretizer everywhere.

Using the new concretizer

There is a new section in config.yaml:

  # The concretization algorithm to use in Spack. Options are:
  #
  #   'original': Spack's original greedy, fixed-point concretizer. This
  #       algorithm can make decisions too early and will not backtrack
  #       sufficiently for many specs.
  #
  #   'clingo': Uses a logic solver under the hood to solve DAGs with full
  #       backtracking and optimization for user preferences.
  #
  # 'clingo' currently requires the clingo ASP solver to be installed and
  # built with python bindings. 'original' is built in.
  concretizer: original

  1. You'll need to set concretizer: to clingo to use clingo everywhere.
  2. You will also need to spack install clingo@master and get clingo on your PATH to use the new concretizer.

Even if the new concretizer is not enabled, you can use the spack solve command as you would normally use spack spec -- it will try to use the new concretizer. spack solve will eventually go away but for now we have two commands.

Bugs fixed

fixes #267
fixes #303
fixes #1781
fixes #2310
fixes #2632
fixes #3628
fixes #4635
fixes #5651
fixes #7339
fixes #8082
fixes #9025
fixes #9414
fixes #9696
fixes #9744
fixes #9753
fixes #9911
fixes #12431
fixes #19764

@adamjstewart
Member

Will spack solve be merged into spack spec? I guess I don't see the need for both if there is an option in config.yaml to choose which algo to use.

@alalazo alalazo requested a review from tgamblin October 23, 2020 21:06
@tgamblin tgamblin changed the title Clingo based solver Experimental ASP-based concretizer Oct 23, 2020
@tgamblin
Member

tgamblin commented Oct 23, 2020

Will spack solve be merged into spack spec? I guess I don't see the need for both if there is an option in config.yaml to choose which algo to use.

Eventually we'll replace the old solver with the new one, but we at least need the old one to bootstrap an install of clingo at this point. The plan is to put this into 0.16.0 and look at other options like Z3 for a full switchover in 0.17.0.

@alalazo alalazo force-pushed the features/solver-rebased branch 3 times, most recently from 6a7da1c to 9585d74 on October 27, 2020 07:17
@alalazo alalazo marked this pull request as ready for review October 27, 2020 08:24

Comment on lines +836 to +847
for i, cspec in enumerate(reversed(matches)):
    self.gen.fact(fn.node_compiler_preference(
-       pkg.name, cspec.name, cspec.version, i))
+       pkg.name, cspec.name, cspec.version, -i * 100
+   ))
Member Author


@tgamblin This is a bit of magic I had to do to make the concretizer pass a test on compiler preferences. The gist of it is that with multiple compilers we need to ensure that the weight of a preference for a single package is "negative enough" to be selected for minimization. Not sure this is the best formula though. Maybe:

-(i + 1) * 100

would be better as a weight, or maybe we should just change the target of minimization.
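
The difference between the two formulas can be sketched with toy helpers (hypothetical names, not code from asp.py):

```python
# Toy comparison of the two compiler-preference weighting schemes.
def current_weight(i, scale=100):
    # Current formula: the entry at i == 0 gets weight 0, i.e. no
    # advantage over an unpreferred compiler in the minimization.
    return -i * scale

def proposed_weight(i, scale=100):
    # Proposed alternative: every preferred compiler gets a strictly
    # negative weight, so even the first entry beats "no preference".
    return -(i + 1) * scale

for i in range(3):
    print(i, current_weight(i), proposed_weight(i))
```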


@alalazo
Member Author

alalazo commented Oct 28, 2020

Dependencies that are conditional on another dependency should now be treated correctly. I added a simple test to check for that case as well.

@aweits
Contributor

aweits commented Oct 28, 2020

Dependencies that are conditional on another dependency should now be treated correctly. I added a simple test to check for that case as well.

Just repulled and am testing a few things. Have we considered something like this BTW?

diff --git a/lib/spack/spack/solver/asp.py b/lib/spack/spack/solver/asp.py
index 691b0ac..c2f8c91 100644
--- a/lib/spack/spack/solver/asp.py
+++ b/lib/spack/spack/solver/asp.py
@@ -602,6 +602,7 @@ def solve(
         # Initialize the control object for the solver
         self.control = clingo.Control()
         self.control.configuration.solve.models = nmodels
+        self.control.configuration.solve.parallel_mode = spack.config.get('config:build_jobs')
         self.control.configuration.configuration = 'tweety'
         self.control.configuration.solver.opt_strategy = "bb,dec"


@alalazo
Member Author

alalazo commented Oct 28, 2020

@aweits No, we didn't discuss any optimization yet. I think it's worth adding even though right now the bottleneck is setting up the problem:

$ spack solve --timers hdf5~mpi
Time:
    setup:         31.0345
    load:          0.0040
    ground:        4.4527
    solve:         14.9748
Total: 50.7482

With 4 cores at it with parallel_mode:

$ spack solve --timers hdf5~mpi
Time:
    setup:         29.2326
    load:          0.0039
    ground:        4.2661
    solve:         5.8168
Total: 39.5874

My guess is that #19539 and package inspection are still the biggest targets to improve speed of the ASP solver.

@aweits
Contributor

aweits commented Oct 28, 2020

@aweits No, we didn't discuss any optimization yet. I think it's worth adding even though right now the bottleneck is setting up the problem:

I see this:

single-thread:

[aweits@skl-a-00 spack]$ spack solve --timers py-tensorflow
Time:
    setup:         21.1258
    load:          0.0015
    ground:        3.5706
    solve:         339.2976
Total: 364.6276

16 threads:

[aweits@skl-a-00 spack]$ spack solve --timers py-tensorflow
Time:
    setup:         20.8383
    load:          0.0015
    ground:        3.6184
    solve:         28.3805
Total: 53.5030


@alalazo alalazo force-pushed the features/solver-rebased branch from 391122b to 7bf3a49 on October 28, 2020 19:52

@alalazo
Member Author

alalazo commented Oct 29, 2020

@bvanessen Thanks for the report. This seems to point to the following conflicts in the opencv package:

# TODO For Cuda >= 10, make sure 'dynlink_nvcuvid.h' or 'nvcuvid.h'                                                                           
# exists, otherwise build will fail                                                                                                           
# See https://github.com/opencv/opencv_contrib/issues/1786                                                                                    
conflicts('cuda@10:', when='+cudacodec')
conflicts('cuda', when='~contrib', msg='cuda support requires +contrib')

At a first glance both conflicts seem to be wrong, as they should start with a ^ sigil:

# TODO For Cuda >= 10, make sure 'dynlink_nvcuvid.h' or 'nvcuvid.h'                                                                           
# exists, otherwise build will fail                                                                                                           
# See https://github.com/opencv/opencv_contrib/issues/1786                                                                                    
conflicts('^cuda@10:', when='+cudacodec')
conflicts('^cuda', when='~contrib', msg='cuda support requires +contrib')

This poses the question of what to do in cases where we have a typo in one of the conflict directives. Should we error out as we do now, maybe with a better error message, or should we disregard the conflict and possibly print a warning to screen?

alalazo and others added 19 commits November 17, 2020 14:11
As reported, conflicts with compiler ranges were not treated
correctly. This commit adds tests to verify the expected behavior
for the new concretizer.

The new rules to enforce a correct behavior involve:
- Adding a rule to prefer the compiler selected for
  the root package, if no other preference is set
- Give a strong negative weight to compiler preferences
  expressed in packages.yaml
- Maximize on compiler AND compiler version match
TODO: Investigate the need to remove
memoization on Spec.patches (infinite
recursion when testing `__contains__`)
If the version list passed to one_of_iff is empty, it still generates a
rule like this:

    node_compiler_version_satisfies("fujitsu-mpi", "arm", ":") :- 1 {  } 1.
    1 {  } 1 :- node_compiler_version_satisfies("fujitsu-mpi", "arm", ":").

The cardinality rules on the right and left above are never
satisfiable, and these rules do nothing.

- [x] Skip generating any rules at all for empty version lists.
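
The guard described above can be sketched like this (a simplified toy, not the actual one_of_iff method):

```python
# Sketch of the fix: skip rule generation entirely for empty lists,
# since a "1 { } 1" cardinality body can never be satisfied.
def one_of_iff(head, choices):
    """Emit 'head iff exactly one of choices' cardinality rules."""
    if not choices:
        return []  # nothing to generate for an empty version list
    body = "; ".join(choices)
    return [
        "{0} :- 1 {{ {1} }} 1.".format(head, body),
        "1 {{ {1} }} 1 :- {0}.".format(head, body),
    ]

print(one_of_iff('node("zlib")', ['version("zlib","1.2.11")']))
print(one_of_iff('node("zlib")', []))
```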
…e bodies.

`node_compiler_hard()` means that something explicitly asked for a node's
compiler to be set -- i.e., it's not inherited, it's required. We're
generating this in spec_clauses even for specs in rule bodies, which
results in conditions like this for optional dependencies:

In py-torch/package.py:

    depends_on('llvm-openmp', when='%apple-clang +openmp')

In the generated ASP:

    declared_dependency("py-torch","llvm-openmp","build")
      :- node("py-torch"),
         variant_value("py-torch","openmp","True"),
         node_compiler("py-torch","apple-clang"),
         node_compiler_hard("py-torch","apple-clang"),
         node_compiler_version_satisfies("py-torch","apple-clang",":").

The `node_compiler_hard` there means we would have to *explicitly* set
py-torch's compiler to trigger the llvm-openmp dependency, rather than
just letting it be set by preferences. This is wrong; the dependency
should be there regardless of how the compiler was set.

- [x] remove fn.node_compiler_hard() call from spec_clauses when
      generating rule body clauses.
…ernal

This commit introduces a new rule:

real_node(Package) :- not external(Package), node(Package).

that permits distinguishing between an external node and a
real node whose dependencies shouldn't be trimmed. It solves
the case of concretizing ninja with an external Python.
…variant

If the default of a multi-valued variant is set to
multiple values, either in package.py or in packages.yaml,
we need to ensure that all the values are present in the
concretized spec.

Since each default value has a weight of 0 and the
variant value is set implicitly by the concretizer
we need to add a rule to maximize on the number of
default values that are used.
This commit addresses the case of concretizing a root spec with a
transitive conditional dependency on a virtual package, provided
by an external. Before these modifications, default variant values
for the dependency bringing in the virtual package were not
respected, and the external package providing the virtual was added
to the DAG.

The issue stems from two facts:
- Selecting a provider has higher precedence than selecting default variants
- To ensure that an external is preferred, we used a negative weight

To solve it we shift all the provider weights so that:
- External providers have a weight of 0
- Non-external providers have a weight of 10 or more

Using a weight of zero for external providers means that having
an external provider, if present, or having no provider at all
has the same effect on the higher-priority minimization.

Also fixed a few minor bugs in concretize.lp, that were causing
spurious entries in the final answer set.

Cleaned concretize.lp from leftover rules.
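
The weight shift described in that commit can be illustrated with a toy function (hypothetical names, not code from asp.py):

```python
# Sketch of the shifted provider weights: externals weigh 0, so
# using one costs the same as having no provider at all, while
# non-external providers start at 10 and grow with preference rank.
def provider_weight(is_external, preference_rank=0):
    return 0 if is_external else 10 + preference_rank

print(provider_weight(True))        # external provider
print(provider_weight(False))       # most-preferred real provider
print(provider_weight(False, 3))    # less-preferred real provider
```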
@alalazo alalazo force-pushed the features/solver-rebased branch from bb3782b to 623ed6a on November 17, 2020 13:13
Member

@tgamblin tgamblin left a comment


Ok, we've stomped out most of the lingering bugs here, tests are passing, and thanks to @aweits and everyone else for finding bugs and making some major speed improvements.
Thanks to @alalazo for pushing this over the finish line and stomping out all the corner cases!

This is ready to merge as experimental, and getting it in develop will enable more people to take a stab at using it.

We'll do docs in a follow-up PR so that we don't have to keep waiting on CI here.

Other things that really need doing before this is the default concretizer:

  • Better error reporting (we need to add error messages to facts and rules and print them in a more understandable way)
  • Figure out how best to vendor a solver. We may use Z3 in the end for this, or we may figure out how to vendor Clingo.
  • More in-depth performance analysis, and a performance comparison. @aweits has played with some tactics here, but there are a few other things we can do to make this better.

Notably, when I run r-condop through the Clingo Python API, it takes 24 seconds, and performance stats look like this:

Time:
    setup:         6.6440
    load:          0.0037
    ground:        7.8134
    solve:         8.2264
Total: 25.3642

If, however, I run:

$ spack solve --show asp r-condop > r-condop.lp
...
$ clingo ./r-condop.lp
...
SATISFIABLE
Models       : 1
Calls        : 1
Time         : 1.374s (Solving: 0.01s 1st Model: 0.00s Unsat: 0.01s)
CPU Time     : 1.362s

It spits out a model in 1.5s -- so there's a lot of room to improve here.

