
parser: refactor with coarser token granularity#34151

Merged
tgamblin merged 20 commits into spack:develop from alalazo:refactors/parser
Dec 7, 2022
Conversation

@alalazo (Member) commented Nov 28, 2022

fixes #13752

Motivation

Our parser grew to be quite complex, with a 2-state lexer and logic in the parser that has up to 5 levels of nested conditionals:

spack/lib/spack/spack/spec.py, lines 5085 to 5108 in 066ec31:

```python
elif self.accept(DEP):
    if not specs:
        # We're parsing an anonymous spec beginning with a
        # dependency. Push the token to recover after creating
        # anonymous spec
        self.push_tokens([self.token])
        specs.append(self.spec(None))
    else:
        dep = None
        if self.accept(FILE):
            # this may return None, in which case we backtrack
            dep = self.spec_from_file()
        if not dep and self.accept(HASH):
            # We're finding a dependency by hash for an
            # anonymous spec
            dep = self.spec_by_hash()
            dep = dep.copy(deps=("link", "run"))
        if not dep:
            # We're adding a dependency to the last spec
            if self.accept(ID):
                self.previous = self.token
                if self.accept(EQ):
```

In the future, to turn compilers into proper dependencies, we'll have to increase the complexity further as we foresee the need to add:

  1. Edge attributes
  2. Spec nesting

to the spec syntax (see spack/seps#5 for an initial discussion of those changes). The main attempt here is thus to simplify the existing code before we start extending it later. We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser (i.e. with fewer nested conditionals):

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L270-L357

and a trivial lexer:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L134-L158
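To illustrate the idea, here is a minimal, self-contained sketch of a coarse-token, regex-driven lexer. The token names and patterns below are simplified illustrations, not the actual definitions in `spack/parser.py`: each token is a whole lexeme with a semantic effect on the spec, rather than a single character or sigil.

```python
import re

# Hypothetical coarse-grained token patterns (illustrative only, not the
# actual ones in spack/parser.py). Order matters: more specific
# alternatives come before the catch-all ID pattern.
TOKEN_REGEX = re.compile(
    "|".join(
        f"(?P<{name}>{pattern})"
        for name, pattern in [
            ("VERSION", r"@[\w.,:\-]+"),
            ("COMPILER", r"%[\w\-]+(?:@[\w.]+)?"),
            ("BOOL_VARIANT", r"[~+][\w\-]+"),
            ("KEY_VALUE", r"[\w\-]+=\S+"),
            ("DEPENDENCY", r"\^"),
            ("ID", r"[\w.\-]+"),
            ("WS", r"\s+"),
        ]
    )
)

def tokenize(text):
    """A trivial lexer: one finditer loop, no state machine."""
    for match in TOKEN_REGEX.finditer(text):
        if match.lastgroup != "WS":
            yield match.lastgroup, match.group()

print(list(tokenize("trilinos @1.2:2.0 +shared %gcc ^mpich foo=bar")))
```

Because each alternative grabs a whole lexeme in one step, the parser only has to dispatch on the token kind instead of looping over characters and maintaining lexer state.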

Differences with the previous parser

There are currently 2 known differences with the previous parser, which have been added on purpose:

  • No spaces allowed after a sigil (e.g. foo @ 1.2.3 is invalid while foo @1.2.3 is valid)
  • /<hash> @1.2.3 can be parsed as a concrete spec followed by an anonymous spec (before was invalid)

We can recover the previous behavior in both cases but, especially for the second one, the behavior in this PR seems more consistent.

The parser is currently 100% backward compatible.

Error handling

Being based on more complex regexes, we can possibly improve error handling by adding regexes for common issues and giving users hints accordingly. I'll leave that for a follow-up PR, but there's a stub for this approach here:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L155-L158
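As a rough illustration of that approach (the `ERROR_HINTS` table and `hint_for` helper below are hypothetical, not Spack's actual API), one could pair "common mistake" regexes with hints:

```python
import re

# Hypothetical patterns for common user mistakes, each paired with a
# human-readable hint. Illustrative only, not Spack's error-handling code.
ERROR_HINTS = [
    (re.compile(r"@\s+\S"), "remove the space after '@': write 'foo@1.2.3', not 'foo @ 1.2.3'"),
    (re.compile(r"%\s+\S"), "remove the space after '%': write 'foo %gcc', not 'foo % gcc'"),
]

def hint_for(text):
    """Return a hint for the first recognized mistake, or None."""
    for pattern, hint in ERROR_HINTS:
        if pattern.search(text):
            return hint
    return None

print(hint_for("foo @ 1.2.3"))  # suggests removing the space after '@'
print(hint_for("foo@1.2.3"))    # no hint for a valid spec
```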

Performance

To be sure we don't add any performance penalty with this new encoding, I measured:

$ spack python -m timeit -s "import spack.spec" -c "spack.spec.Spec(<spec>)"

for different specs on my machine:

  • Spack: 0.20.0.dev0 (c9db4e5)
  • Python: 3.8.10
  • Platform: linux-ubuntu20.04-icelake
  • Concretizer: clingo

results are:

| Spec | develop | this PR |
| --- | --- | --- |
| `trilinos` | 28.9 usec | 13.1 usec |
| `trilinos @1.2.10:1.4.20,2.0.1` | 131 usec | 120 usec |
| `trilinos %gcc` | 44.9 usec | 20.9 usec |
| `trilinos +foo` | 44.1 usec | 21.3 usec |
| `trilinos foo=bar` | 59.5 usec | 25.6 usec |
| `trilinos foo=bar ^ mpich foo=baz` | 120 usec | 82.1 usec |

so this new parser seems to be consistently faster than the previous one.

Modifications

In this PR we just substituted the Spec parser, which means:

  • Deleted the SpecParser and SpecLexer classes from spec.py, and deleted spack/parse.py
  • Added a new parser in spack/parser.py
  • Hooked the new parser in all the places the previous one was used
  • Adapted unit tests in test/spec_syntax.py

Possible future improvements

Random thoughts while working on the PR:

  • Currently we transform hashes and files into specs during parsing. I think we might want to introduce an additional step and parse special objects like a FileSpec etc. in-between parsing and concretization.

@spackbot-app (bot) commented Nov 28, 2022

Hi @alalazo! I noticed that the following package(s) don't yet have maintainers:

  • bridger
  • exabayes
  • miniqmc
  • mumps

Are you interested in adopting any of these package(s)? If so, simply add the following to the package class:

    maintainers = ["alalazo"]

If not, could you contact the developers of this package and see if they are interested? You can quickly see who has worked on a package with spack blame:

$ spack blame bridger

Thank you for your help! Please don't add maintainers without their consent.

You don't have to be a Spack expert or package developer to be a "maintainer"; it just gives us a list of users willing to review PRs or debug issues relating to this package. A package can have multiple maintainers; just add a list of GitHub handles of anyone who wants to volunteer.

@spackbot-app spackbot-app bot requested a review from michaelkuhn November 28, 2022 10:17
@spackbot-app (bot) commented Nov 28, 2022

@tomstitt can you review this PR?

This PR modifies the following package(s), for which you are listed as a maintainer:

  • gcc
  • xeus

@spackbot-app spackbot-app bot added the "documentation" label Nov 28, 2022
@alalazo alalazo marked this pull request as ready for review November 28, 2022 14:07
@alalazo alalazo requested a review from tgamblin November 28, 2022 14:32
@tgamblin (Member) left a comment


This is looking great and it's clearly way faster. Some simplifying suggestions below.

RE:

We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser (i.e. with fewer nested conditionals):

I don't think the performance win here is actually the nesting, though it's related. It's that you're compiling the regexes and delegating much more parsing responsibility to the regex engine. So where the prior parser would loop several times in Python for each token, this parser is leveraging fast regex parsing to grab a bigger chunk on the C side and get it all unpacked in one iteration on the Python side. Makes a lot of sense and I think the regexes are very clear.

RE: differences:

  • No spaces allowed after a sigil (e.g. foo @ 1.2.3 is invalid while foo @1.2.3 is valid)

I don't see what the point is of enforcing this, and I don't know many language parsers that don't have optional whitespace between operators. I think this is going to break private package repositories and scripts built on top of Spack, for not a lot of benefit. So can you just add \s* to the relevant regexes? I think with the simplifications I suggest below it'll still be readable.

  • /<hash> @1.2.3 can be parsed as a concrete spec followed by an anonymous spec (before was invalid)

The reason I implemented it this way initially is for the CLI use case:

```console
(spackbook):spack> spack uninstall python@3.8
==> Error: python@3.8 matches multiple packages:

    -- darwin-bigsur-skylake -----------------------------------
    wx7u4y4 python@3.8  x24x4ec python@3.8

==> Error: You can either:
    a) use a more specific spec, or
    b) specify the spec by its hash (e.g. `spack uninstall /hash`), or
    c) use `spack uninstall --all` to uninstall ALL matching specs.

# Now you can hit the up arrow and just add the hash
> spack uninstall python/wx7u4y4 @3.8
```

Though now that I've written that I realize they could just do:

> spack uninstall python @3.8 /wx7u4y4

However, it still doesn't really sit well with me that these would be different:

```
python @3.8 /wx7u4y4  # one spec
python /wx7u4y4 @3.8  # two specs
```

I think that is pretty confusing. There is no other place in the Spec grammar where anything but the initial absence of an identifier creates an anonymous spec, and the only way you could initiate a new spec on the CLI was to type a name or a ^ for a dependency constraint. Now, hashes are special and always terminate the current spec parse, while no other sigil does that.

I think it's better without that inconsistency.

@alalazo (Member, Author) commented Dec 6, 2022

I don't see what the point is of enforcing this, and I don't know many language parsers that don't have optional whitespace between operators.

@tgamblin I tried adding this back, but it complicates parsing due to version ranges:

```
@ 1.2.3     :        develop      # This is a version range 1.2.3:develop
@ 1.2.3     :        develop=foo  # This is a version range 1.2.3: followed by a key-value pair
```

so I would keep the parser as it is in this PR; it's true it may break private packages, but in a way that is fixable and backward compatible. Also, I would say that @ (or + or ~ etc.) is not an "operator" per se, since it doesn't combine two homogeneous factors¹. To me it's the entire @1.2.3:develop that is a "version symbol".

Example where spaces could be allowed, but are not

```
>>> def foo      :
  File "<stdin>", line 1
    def foo      :
                 ^
SyntaxError: invalid syntax
```

Footnotes

  1. Not sure I am expressing myself in a formally correct way, but I mean that we don't have rules like expr = expr @ term. Rather we have an optional "version symbol" that might appear in each spec node.

Our parser grew to be quite complex, with a 2-state
lexer and logic in the parser that has up to 5 levels
of nested conditionals.

To simplify the code, here we add a new parser with
a different token granularity. Each token represents
a lexeme with a semantic effect on the associated
spec, so both the lexer and parser logic can be
greatly simplified.

Differences with the previous parser:
- no spaces allowed after a sigil (e.g. "foo @ 1.2.3" is invalid
  while "foo @1.2.3" is valid)
- "/<hash> @1.2.3" can be parsed as two specs (before was invalid)
@alalazo (Member, Author) commented Dec 6, 2022

There is no other place in the Spec grammar where anything but the initial absence of an identifier creates an anonymous spec, and the only way you could initiate a new spec on the CLI was to type a name or a ^ for a dependency constraint. Now, hashes are special and always terminate the current spec parse, while no other sigil does that. I think it's better without that inconsistency.

Restored the old behavior for hashes in 624cb43

@tgamblin tgamblin merged commit ab6499c into spack:develop Dec 7, 2022
@tgamblin tgamblin changed the title Rework the parser to have a different token granularity parser: refactor with coarser token granularity Dec 7, 2022
@tgamblin (Member) commented Dec 7, 2022

@alalazo thanks for getting this to 100% compatibility! It looks great to me.

@alalazo alalazo deleted the refactors/parser branch December 8, 2022 08:21
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023
## Motivation

Our parser grew to be quite complex, with a 2-state lexer and logic in the parser
that has up to 5 levels of nested conditionals. In the future, to turn compilers into
proper dependencies, we'll have to increase the complexity further as we foresee
the need to add:
1. Edge attributes
2. Spec nesting

to the spec syntax (see spack/seps#5 for an initial discussion of
those changes).  The main attempt here is thus to _simplify the existing code_ before
we start extending it later. We try to do that by adopting a different token granularity,
and by using more complex regexes for tokenization. This allows us to have a "flatter"
encoding for the parser, i.e., it has fewer nested conditionals and a near-trivial lexer.

There are places, namely in `VERSION`, where we have to use negative lookahead judiciously
to avoid ambiguity.  Specifically, this parse is ambiguous without `(?!\s*=)` in `VERSION_RANGE`
and an extra final `\b` in `VERSION`:

```
@ 1.2.3     :        develop  # This is a version range 1.2.3:develop
@ 1.2.3     :        develop=foo  # This is a version range 1.2.3: followed by a key-value pair
```
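A simplified sketch of how the lookahead resolves that ambiguity (the pattern below is an illustration, not the exact `VERSION_RANGE` regex in `spack/parser.py`):

```python
import re

# Simplified stand-in for VERSION_RANGE (illustration only). Without the
# negative lookahead (?!\s*=) and the final \b, "1.2.3 : develop=foo"
# would greedily consume (part of) "develop" as the upper end of the
# range instead of leaving it as the key of a key-value pair.
VERSION_RANGE = re.compile(r"[\w.]+\s*:\s*(?:[\w.]+(?!\s*=)\b)?")

def parse_version_range(text):
    """Return the version-range lexeme with internal whitespace removed."""
    return VERSION_RANGE.match(text).group().replace(" ", "")

print(parse_version_range("1.2.3 : develop"))      # version range 1.2.3:develop
print(parse_version_range("1.2.3 : develop=foo"))  # version range 1.2.3:, "develop=foo" left over
```

The lookahead rejects an upper bound that is immediately followed by `=`, and the word boundary stops the regex engine from backtracking to a shorter prefix such as `develo`.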

## Differences with the previous parser

~There are currently 2 known differences with the previous parser, which have been added on purpose:~

- ~No spaces allowed after a sigil (e.g. `foo @ 1.2.3` is invalid while `foo @1.2.3` is valid)~
- ~`/<hash> @1.2.3` can be parsed as a concrete spec followed by an anonymous spec (before was invalid)~

~We can recover the previous behavior on both ones but, especially for the second one, it seems the current behavior in the PR is more consistent.~

The parser is currently 100% backward compatible.

## Error handling

Being based on more complex regexes, we can possibly improve error
handling by adding regexes for common issues and giving users hints accordingly.
I'll leave that for a follow-up PR, but there's a stub for this approach in the PR.

## Performance

To be sure we don't add any performance penalty with this new encoding, I measured:
```console
$ spack python -m timeit -s "import spack.spec" -c "spack.spec.Spec(<spec>)"
```
for different specs on my machine:

* **Spack:** 0.20.0.dev0 (c9db4e5)
* **Python:** 3.8.10
* **Platform:** linux-ubuntu20.04-icelake
* **Concretizer:** clingo

results are:

| Spec          | develop       | this PR |
| ------------- | ------------- | ------- |
| `trilinos`  |  28.9 usec | 13.1 usec |
| `trilinos @1.2.10:1.4.20,2.0.1`  | 131 usec  | 120 usec |
| `trilinos %gcc`  | 44.9 usec  | 20.9 usec |
| `trilinos +foo`  | 44.1 usec  | 21.3 usec |
| `trilinos foo=bar`  | 59.5 usec  | 25.6 usec |
| `trilinos foo=bar ^ mpich foo=baz`  | 120 usec  | 82.1 usec |

so this new parser seems to be consistently faster than the previous one.

## Modifications

In this PR we just substituted the Spec parser, which means:
- [x] Deleted in `spec.py` the `SpecParser` and `SpecLexer` classes. deleted `spack/parse.py`
- [x] Added a new parser in `spack/parser.py`
- [x] Hooked the new parser in all the places the previous one was used
- [x] Adapted unit tests in `test/spec_syntax.py`


## Possible future improvements

Random thoughts while working on the PR:
- Currently we transform hashes and files into specs during parsing. I think
we might want to introduce an additional step and parse special objects like
a `FileSpec` etc. in-between parsing and concretization.

Labels

commands, conflicts, core (PR affects Spack core functionality), dependencies, directives, documentation (Improvements or additions to documentation), patch, tests (General test capability(ies)), update-package, versions, virtual-dependencies


Development

Successfully merging this pull request may close these issues.

Spec parsing broken for key=value pairs
