parser: refactor with coarser token granularity #34151

tgamblin merged 20 commits into spack:develop from
Conversation
Hi @alalazo! I noticed that the following package(s) don't yet have maintainers:
Are you interested in adopting any of these package(s)? If so, simply add the following to the package class: `maintainers = ["alalazo"]`

If not, could you contact the developers of this package and see if they are interested? You can quickly see who has worked on a package with `$ spack blame bridger`.

Thank you for your help! Please don't add maintainers without their consent. You don't have to be a Spack expert or package developer in order to be a "maintainer"; it just gives us a list of users willing to review PRs or debug issues relating to this package. A package can have multiple maintainers; just add a list of GitHub handles of anyone who wants to volunteer.
@tomstitt can you review this PR? This PR modifies the following package(s), for which you are listed as a maintainer:
Force-pushed from f5db17d to 4093239
tgamblin left a comment:
This is looking great and it's clearly way faster. Some simplifying suggestions below.
RE:
We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser (i.e. with fewer nested conditionals):
I don't think the performance win here is actually the nesting, though it's related. It's that you're compiling the regexes and delegating much more parsing responsibility to the regex engine. So where the prior parser would loop several times in Python for each token, this parser is leveraging fast regex parsing to grab a bigger chunk on the C side and get it all unpacked in one iteration on the Python side. Makes a lot of sense and I think the regexes are very clear.
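A minimal sketch of that idea (token names and patterns here are invented for illustration, not Spack's actual ones): one compiled alternation of named groups lets the regex engine grab a whole semantic unit per iteration of the Python loop.

```python
import re

# Illustrative one-pass lexer: each alternative matches a whole semantic
# unit (sigil included), so the Python loop runs once per token instead of
# once per character. Token names and patterns are hypothetical.
TOKEN_PATTERNS = [
    ("COMPILER", r"%[a-zA-Z_][a-zA-Z0-9_.\-]*"),
    ("VERSION", r"@[a-zA-Z0-9_][a-zA-Z0-9_.\-:,]*"),
    ("KEY_VALUE", r"[a-zA-Z_][a-zA-Z0-9_]*=\S+"),
    ("BOOL_VARIANT", r"[~+][a-zA-Z_][a-zA-Z0-9_]*"),
    ("DEPENDENCY", r"\^"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_.\-]*"),
    ("WS", r"\s+"),
]
# Compile once; the named group that fired tells us the token type.
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS))

def tokenize(text):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text) if m.lastgroup != "WS"]
```

The ordering of the alternatives matters: `KEY_VALUE` must come before `IDENTIFIER`, otherwise `foo=bar` would lex as an identifier followed by garbage.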
RE: differences:
- No spaces allowed after a sigil (e.g. foo @ 1.2.3 is invalid while foo @1.2.3 is valid)
I don't see what the point is of enforcing this, and I don't know many language parsers that don't have optional whitespace between operators. I think this is going to break private package repositories and scripts built on top of Spack, for not a lot of benefit. So can you just add \s* to the relevant regexes? I think with the simplifications I suggest below it'll still be readable.
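For illustration, a hypothetical version token with `\s*` after the sigil (invented for this example, not the actual pattern in `parser.py`) accepts both spellings:

```python
import re

# Hypothetical version token: `\s*` after the "@" sigil makes whitespace
# between the sigil and the version optional.
VERSION_AFTER_SIGIL = re.compile(r"@\s*([a-zA-Z0-9_][a-zA-Z0-9_.\-]*)")

def version_of(text):
    """Return the version following an '@' sigil, if any."""
    m = VERSION_AFTER_SIGIL.search(text)
    return m.group(1) if m else None
```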
- / @1.2.3 can be parsed as a concrete spec followed by an anonymous spec (before was invalid)
The reason I implemented it this way initially is for the CLI use case:
```console
(spackbook):spack> spack uninstall python@3.8
==> Error: python@3.8 matches multiple packages:
-- darwin-bigsur-skylake / [email protected] -------------------
wx7u4y4 [email protected]    x24x4ec [email protected]
==> Error: You can either:
    a) use a more specific spec, or
    b) specify the spec by its hash (e.g. `spack uninstall /hash`), or
    c) use `spack uninstall --all` to uninstall ALL matching specs.

# Now you can hit the up arrow and just add the hash
> spack uninstall python/wx7u4y4 @3.8
```
Though now that I've written that I realize they could just do:
> spack uninstall python @3.8 /wx7u4y4
However, it still doesn't really sit well with me that these would be different:
```
python @3.8 /wx7u4y4   # one spec
python /wx7u4y4 @3.8   # two specs
```
I think that is pretty confusing. There is no other place in the Spec grammar where anything but the initial absence of an identifier creates an anonymous spec, and the only way you could initiate a new spec on the CLI was to type a name or a ^ for a dependency constraint. Now, hashes are special and always terminate the current spec parse, while no other sigil does that.
I think it's better without that inconsistency.
Force-pushed from 4093239 to c3e8f07
@tgamblin I tried adding this back, but it would complicate the parsing due to version ranges, so I'd be for keeping the parser as it is in this PR. It's true it may break private packages, but in a way that is fixable and backward compatible. Also, as an example where spaces could be allowed, but are not:

```console
>>> def foo :
  File "<stdin>", line 1
    def foo :
            ^
SyntaxError: invalid syntax
```
Our parser grew to be quite complex, with a 2-state lexer and logic in the parser that has up to 5 levels of nested conditionals. To simplify the code, here we add a new parser with a different token granularity. Each token represents a lexeme with a semantic effect on the associated spec, so both the lexer and the parser logic can be greatly simplified.

Differences with the previous parser:
- no spaces allowed after a sigil (e.g. "foo @ 1.2.3" is invalid while "foo @1.2.3" is valid)
- "/<hash> @1.2.3" can be parsed as two specs (before it was invalid)
Force-pushed from 0b94c97 to d40379a
Restored the old behavior for hashes in 624cb43
Revert "… parser": this reverts commit 21edacc.
@alalazo thanks for getting this to 100% compatibility! It looks great to me.
## Motivation

Our parser grew to be quite complex, with a 2-state lexer and logic in the parser that has up to 5 levels of nested conditionals (see `lib/spack/spack/spec.py`, lines 5085 to 5108 at 066ec31). In the future, to turn compilers into proper dependencies, we'll have to increase the complexity further, as we foresee the need to add:

1. Edge attributes
2. Spec nesting

to the spec syntax (see spack/seps#5 for an initial discussion of those changes). The main aim here is thus to _simplify the existing code_ before we start extending it later. We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser, i.e. one with fewer nested conditionals:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L270-L357

and a near-trivial lexer:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L134-L158

There are places, namely in `VERSION`, where we have to use negative lookahead judiciously to avoid ambiguity. Specifically, this parse is ambiguous without `(?!\s*=)` in `VERSION_RANGE` and an extra final `\b` in `VERSION`:

```
@ 1.2.3 : develop      # This is the version range 1.2.3:develop
@ 1.2.3 : develop=foo  # This is the version range 1.2.3: followed by a key-value pair
```

## Differences with the previous parser

~There are currently 2 known differences with the previous parser, which have been added on purpose:~

- ~No spaces allowed after a sigil (e.g. `foo @ 1.2.3` is invalid while `foo @1.2.3` is valid)~
- ~`/<hash> @1.2.3` can be parsed as a concrete spec followed by an anonymous spec (before it was invalid)~

~We can recover the previous behavior for both but, especially for the second one, the behavior in this PR seems more consistent.~

The parser is currently 100% backward compatible.

## Error handling

Being based on more complex regexes, we can possibly improve error handling by adding regexes for common issues and hinting users based on those. I'll leave that for a follow-up PR, but there's a stub for this approach here:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L155-L158

## Performance

To be sure we don't add any performance penalty with this new encoding, I measured:

```console
$ spack python -m timeit -s "import spack.spec" -c "spack.spec.Spec(<spec>)"
```

for different specs on my machine:

* **Spack:** 0.20.0.dev0 (c9db4e5)
* **Python:** 3.8.10
* **Platform:** linux-ubuntu20.04-icelake
* **Concretizer:** clingo

The results are:

| Spec | develop | this PR |
| ------------- | ------------- | ------- |
| `trilinos` | 28.9 usec | 13.1 usec |
| `trilinos @1.2.10:1.4.20,2.0.1` | 131 usec | 120 usec |
| `trilinos %gcc` | 44.9 usec | 20.9 usec |
| `trilinos +foo` | 44.1 usec | 21.3 usec |
| `trilinos foo=bar` | 59.5 usec | 25.6 usec |
| `trilinos foo=bar ^ mpich foo=baz` | 120 usec | 82.1 usec |

so the new parser seems to be consistently faster than the previous one.

## Modifications

In this PR we just substituted the Spec parser, which means:

- [x] Deleted the `SpecParser` and `SpecLexer` classes in `spec.py`, and deleted `spack/parse.py`
- [x] Added a new parser in `spack/parser.py`
- [x] Hooked the new parser into all the places the previous one was used
- [x] Adapted the unit tests in `test/spec_syntax.py`

## Possible future improvements

Random thoughts while working on the PR:

- Currently we transform hashes and files into specs during parsing. I think we might want to introduce an additional step and parse special objects like a `FileSpec` etc. in between parsing and concretization.
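The negative-lookahead disambiguation described in the Motivation section can be demonstrated in isolation. The patterns below are a simplified sketch, not the exact ones from `spack/parser.py`:

```python
import re

# Simplified sketch of the VERSION / VERSION_RANGE tokens discussed above
# (the real patterns in spack/parser.py are more involved). The negative
# lookahead (?!\s*=) keeps the upper bound of a range from swallowing the
# key of a following key-value pair: without it, "develop" in
# "@ 1.2.3 : develop=foo" would match as the end of the range, leaving a
# dangling "=foo".
VERSION = r"[a-zA-Z0-9_][a-zA-Z0-9_.\-]*"
VERSION_RANGE = rf"(?:{VERSION})?\s*:\s*(?:{VERSION}(?!\s*=)\b)?"

def lex_version_range(text):
    """Return the text matched as a version range after '@', or None."""
    m = re.match(rf"@\s*({VERSION_RANGE})", text)
    return m.group(1).strip() if m else None
```

With the lookahead, `@ 1.2.3 : develop` consumes the whole range, while `@ 1.2.3 : develop=foo` stops after the colon and leaves `develop=foo` to be lexed as a key-value pair.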
fixes #13752