Skip to content

specs: move to new spec.json format with build provenance#22845

Merged
tgamblin merged 89 commits intospack:developfrom
nhanford:features/build-spec-json
Sep 9, 2021
Merged

specs: move to new spec.json format with build provenance#22845
tgamblin merged 89 commits intospack:developfrom
nhanford:features/build-spec-json

Conversation

@nhanford
Copy link
Copy Markdown
Contributor

@nhanford nhanford commented Apr 7, 2021

This is a major rework of Spack's core core spec.yaml metadata format. It moves from spec.yaml to spec.json for speed, and it changes the format in several ways. Specifically:

  1. The spec format now has a _meta section with a version (now set to version 2). This will simplify major changes like this one in the future.
  2. The node list in spec dictionaries is no longer keyed by name. Instead, it is a list of records with no required key. The name, hash, etc. are fields in the dictionary records like any other.
  3. Dependencies can be keyed by any hash (hash, full_hash, build_hash).
  4. build_spec provenance from New splice method in class Spec. #20262 is included in the spec format. This means that, for spliced specs, we preserve the full provenance of how to build, and we can reproduce a spliced spec from the original builds that produced it.

NOTE: Because we have switched the spec format, this PR changes Spack's hashing algorithm. This means that after this commit, Spack will think a lot of things need rebuilds.

There are two major benefits this PR provides:

  • The switch to JSON format speeds up Spack significantly, as Python's builtin JSON implementation is orders of magnitude faster than YAML.
  • The new Spec format will soon allow us to represent DAGs with potentially multiple versions of the same dependency -- e.g., for build dependencies or for compilers-as-dependencies. This PR lays the necessary groundwork for those features.

The old spec.yaml format continues to be supported, but is now considered a legacy format, and Spack will opportunistically convert these to the new spec.json format.

@nhanford nhanford added the yaml label Apr 7, 2021
@nhanford nhanford requested review from becker33 and tgamblin April 7, 2021 17:05
@nhanford nhanford marked this pull request as ready for review May 20, 2021 22:06
@nhanford nhanford force-pushed the features/build-spec-json branch 2 times, most recently from 8e293f5 to 8003bf0 Compare May 24, 2021 15:39
return hash(lang.tuplify(self._cmp_iter))

def __reduce__(self):
return _spec_from_dict, (self.to_dict(hash=ht.build_hash),)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be wrong, at least in develop, in the sense that using the default means using the DAG hash. The information returned here should be enough to reconstruct an equivalent Spec object when we unpickle (and I think pickle caches these results, though I can look for a definitive confirmation of that).

Since we serialize to pass the information to a spawned process on macOS and we don't seek persistence, I felt like a change in the recipe was ruled out and build hash was the right granularity. Using the dag hash will make any spec with different build deps equivalent to each other.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the build hash makes sense. However, this PR subtly changes to_dict so that when the caller specifies a hash, it writes and "indexes by" (checks node uniqueness by) that hash. I was trying to avoid an implementation that looks at all the nodes and sees if they all have a full hash, then a build hash, then just go by DAG hash, but that might be necessary.

Comment on lines +1850 to +1858
if 'name' in node.keys():
# New format
name = node['name']
else:
# Old format
name = next(iter(node))
node = node[name]
for h in ht.hashes:
setattr(spec, h.attr, node.get(h.name, None))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not asking for a refactor here, but in general I wonder if when we upgrade formats we should have separate functions to read different formats.

@tgamblin tgamblin changed the title Reformat spec.yaml/spec.json to accommodate build provenance. specs: move to new spec.json format with build provenance Sep 9, 2021
@tgamblin tgamblin merged commit d83f711 into spack:develop Sep 9, 2021
@alalazo
Copy link
Copy Markdown
Member

alalazo commented Sep 9, 2021

🥳

@haampie
Copy link
Copy Markdown
Member

haampie commented Sep 9, 2021

FWIW, test speeds using the data from #25507 and the merge commit from this pr (there's been some new tests, prs, features in the meantime of course):

name before after
unittests (2.7, original) 2230.26s 1524.15s (-31.7%)
unittests (2.7, clingo) 2863.46s 1800.83s (-37.1%)
unittests (3.5, original) 1223.45s 1298.06s ( 6.1%)
unittests (3.5, clingo) 2236.52s 1493.50s (-33.2%)
unittests (3.6, original) 1399.90s 1112.28s (-20.5%)
unittests (3.6, clingo) 1637.72s 1499.34s ( -8.4%)
unittests (3.7, original) 1542.14s 1141.43s (-26.0%)
unittests (3.7, clingo) 1891.40s 1873.94s ( -0.9%)
unittests (3.8, original) 1507.68s 1190.16s (-21.1%)
unittests (3.8, clingo) 1926.32s 1423.82s (-26.1%)
unittests (3.9, original) 1517.87s 1447.19s ( -4.7%)
unittests (3.9, clingo) 2315.39s 1520.50s (-34.3%)
build (3.8) 2569.17s 3152.42s ( 22.7%)

Total time is -17.6%

Not sure what's up on macos, seems like it regressed before this pr

tgamblin pushed a commit that referenced this pull request Sep 13, 2021
#22845 revealed a long-standing bug that had never been triggered before, because the
hashing algorithm had been stable for multiple years while the bug was in production. The
bug was that when reading a concretized environment, Spack did not properly read in the
build hashes associated with the specs in the environment. Those hashes were recomputed
(and as long as we didn't change the algorithm, were recomputed identically). Spack's
policy, though, is never to recompute a hash. Once something is installed, we respect its
metadata hash forever -- even if internally Spack changes the hashing method. Put
differently, once something is concretized, it has a concrete hash, and that's it -- forever.

When we changed the hashing algorithm for performance in #22845 we exposed the bug.
This PR fixes the bug at its source, but properly reading in the cached build hash attributes
associated with the specs. I've also renamed some variables in the Environment class
methods to make a mistake of this sort more difficult to make in the future.

* ensure environment build hashes are never recomputed
* add comment clarifying reattachment of env build hashes
* bump lockfile version and include specfile version in env meta
* Fix unit-test for v1 to v2 conversion

Co-authored-by: Massimiliano Culpo <[email protected]>
if 'build_spec' in node.keys():
node_hash = node[hash_type]
else:
node_hash = node[hash_type]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alalazo Here's where the block we were discussing in #28504 was added, maybe @nhanford recalls what the intent was. But likely not given the size of this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this is vestigial from when I was experimenting with forcing a certain hash type for spliced specs, and then I must have quickly done a "replace all" to fix it. Please feel free to remove the unnecessary conditional or let me know and I can do it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nhanford I have removed the conditional in the linked PR above, we just wanted to make sure we weren't covering up the trail for someone diagnosing some potential future issue.

tgamblin pushed a commit that referenced this pull request Mar 10, 2022
)

Change the internal representation of `Spec` to allow for multiple dependencies or
dependents stemming from the same package. This change permits to represent cases
which are frequent in cross compiled environments or to bootstrap compilers.

Modifications:
- [x] Substitute `DependencyMap` with `_EdgeMap`. The main differences are that the
      latter does not support direct item assignment and can be modified only through its
      API. It also provides a `select_by` method to query items.
- [x] Reworked a few public APIs of `Spec` to get list of dependencies or related edges.
- [x] Added unit tests to prevent regression on #11983 and prove the synthetic construction
      of specs with multiple deps from the same package. 

Since #22845 went in first, this PR reuses that format and thus it should not change hashes.
The same package may be present multiple times in the list of dependencies with different
associated specs (each with its own hash).
@haampie haampie added the hash-change things that will change a lot of hashes label Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binary-packages commands environments hash-change things that will change a lot of hashes tests General test capability(ies) utilities yaml

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants