
parser: refactor with coarser token granularity#34151

Merged
tgamblin merged 20 commits into spack:develop from alalazo:refactors/parser
Dec 7, 2022
Conversation

@alalazo (Member) commented Nov 28, 2022

fixes #13752

Motivation

Our parser grew to be quite complex, with a 2-state lexer and logic in the parser that has up to 5 levels of nested conditionals:

spack/lib/spack/spack/spec.py, lines 5085 to 5108 in 066ec31:

```python
elif self.accept(DEP):
    if not specs:
        # We're parsing an anonymous spec beginning with a
        # dependency. Push the token to recover after creating
        # anonymous spec
        self.push_tokens([self.token])
        specs.append(self.spec(None))
    else:
        dep = None
        if self.accept(FILE):
            # this may return None, in which case we backtrack
            dep = self.spec_from_file()
        if not dep and self.accept(HASH):
            # We're finding a dependency by hash for an
            # anonymous spec
            dep = self.spec_by_hash()
            dep = dep.copy(deps=("link", "run"))
        if not dep:
            # We're adding a dependency to the last spec
            if self.accept(ID):
                self.previous = self.token
                if self.accept(EQ):
```

In the future, to turn compilers into proper dependencies, we'll have to increase the complexity further as we foresee the need to add:

  1. Edge attributes
  2. Spec nesting

to the spec syntax (see spack/seps#5 for an initial discussion of those changes). The main attempt here is thus to simplify the existing code before we start extending it later. We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser (i.e. with fewer nested conditionals):

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L270-L357

and a trivial lexer:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L134-L158
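To illustrate the idea, here is a minimal, self-contained sketch of a coarse-token, regex-driven lexer. The token names and patterns below are simplified illustrations, not the actual definitions in `spack/parser.py`: each token is a whole lexeme with a semantic effect on the spec, rather than a single character or sigil.

```python
import re

# Hypothetical coarse-grained token patterns (illustrative only, not the
# actual ones in spack/parser.py). Order matters: more specific
# alternatives come before the catch-all ID pattern.
TOKEN_REGEX = re.compile(
    "|".join(
        f"(?P<{name}>{pattern})"
        for name, pattern in [
            ("VERSION", r"@[\w.,:\-]+"),
            ("COMPILER", r"%[\w\-]+(?:@[\w.]+)?"),
            ("BOOL_VARIANT", r"[~+][\w\-]+"),
            ("KEY_VALUE", r"[\w\-]+=\S+"),
            ("DEPENDENCY", r"\^"),
            ("ID", r"[\w.\-]+"),
            ("WS", r"\s+"),
        ]
    )
)

def tokenize(text):
    """A trivial lexer: one finditer loop, no state machine."""
    for match in TOKEN_REGEX.finditer(text):
        if match.lastgroup != "WS":
            yield match.lastgroup, match.group()

print(list(tokenize("trilinos @1.2:2.0 +shared %gcc ^mpich foo=bar")))
```

Because each alternative grabs a whole lexeme in one step, the parser only has to dispatch on the token kind instead of looping over characters and maintaining lexer state.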

Differences with the previous parser

There are currently 2 known differences with the previous parser, which have been added on purpose:

  • No spaces allowed after a sigil (e.g. foo @ 1.2.3 is invalid while foo @1.2.3 is valid)
  • /<hash> @1.2.3 can be parsed as a concrete spec followed by an anonymous spec (before was invalid)

We can recover the previous behavior in both cases but, especially for the second one, the behavior in this PR seems more consistent.

The parser is currently 100% backward compatible.

Error handling

Being based on more complex regexes, we can possibly improve error handling by adding regexes for common issues and giving users hints accordingly. I'll leave that for a follow-up PR, but there's a stub for this approach here:

https://github.com/alalazo/spack/blob/ee99ded647691a355a31b548da667ffa3ec7318c/lib/spack/spack/parser.py#L155-L158
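As a rough illustration of that approach (the `ERROR_HINTS` table and `hint_for` helper below are hypothetical, not Spack's actual API), one could pair "common mistake" regexes with hints:

```python
import re

# Hypothetical patterns for common user mistakes, each paired with a
# human-readable hint. Illustrative only, not Spack's error-handling code.
ERROR_HINTS = [
    (re.compile(r"@\s+\S"), "remove the space after '@': write 'foo@1.2.3', not 'foo @ 1.2.3'"),
    (re.compile(r"%\s+\S"), "remove the space after '%': write 'foo %gcc', not 'foo % gcc'"),
]

def hint_for(text):
    """Return a hint for the first recognized mistake, or None."""
    for pattern, hint in ERROR_HINTS:
        if pattern.search(text):
            return hint
    return None

print(hint_for("foo @ 1.2.3"))  # suggests removing the space after '@'
print(hint_for("foo@1.2.3"))    # no hint for a valid spec
```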

Performance

To be sure we don't add any performance penalty with this new encoding, I measured:

$ spack python -m timeit -s "import spack.spec" -c "spack.spec.Spec(<spec>)"

for different specs on my machine:

  • Spack: 0.20.0.dev0 (c9db4e5)
  • Python: 3.8.10
  • Platform: linux-ubuntu20.04-icelake
  • Concretizer: clingo

results are:

| Spec | develop | this PR |
| --- | --- | --- |
| `trilinos` | 28.9 usec | 13.1 usec |
| `trilinos @1.2.10:1.4.20,2.0.1` | 131 usec | 120 usec |
| `trilinos %gcc` | 44.9 usec | 20.9 usec |
| `trilinos +foo` | 44.1 usec | 21.3 usec |
| `trilinos foo=bar` | 59.5 usec | 25.6 usec |
| `trilinos foo=bar ^ mpich foo=baz` | 120 usec | 82.1 usec |

so this new parser seems to be consistently faster than the previous one.

Modifications

In this PR we just substituted the Spec parser, which means:

  • Deleted the SpecParser and SpecLexer classes from spec.py, and deleted spack/parse.py
  • Added a new parser in spack/parser.py
  • Hooked the new parser in all the places the previous one was used
  • Adapted unit tests in test/spec_syntax.py

Possible future improvements

Random thoughts while working on the PR:

  • Currently we transform hashes and files into specs during parsing. I think we might want to introduce an additional step and parse special objects like a FileSpec etc. in-between parsing and concretization.

@spackbot-app (bot) commented Nov 28, 2022

Hi @alalazo! I noticed that the following package(s) don't yet have maintainers:

  • bridger
  • exabayes
  • miniqmc
  • mumps

Are you interested in adopting any of these package(s)? If so, simply add the following to the package class:

    maintainers = ["alalazo"]

If not, could you contact the developers of this package and see if they are interested? You can quickly see who has worked on a package with spack blame:

$ spack blame bridger

Thank you for your help! Please don't add maintainers without their consent.

You don't have to be a Spack expert or package developer to be a "maintainer"; it just gives us a list of users willing to review PRs or debug issues relating to this package. A package can have multiple maintainers; just add a list of GitHub handles of anyone who wants to volunteer.

@spackbot-app spackbot-app bot requested a review from michaelkuhn November 28, 2022 10:17
@spackbot-app (bot) commented Nov 28, 2022

@tomstitt can you review this PR?

This PR modifies the following package(s), for which you are listed as a maintainer:

  • gcc
  • xeus

@spackbot-app spackbot-app bot added the "documentation" label Nov 28, 2022
@alalazo alalazo marked this pull request as ready for review November 28, 2022 14:07
@alalazo alalazo requested a review from tgamblin November 28, 2022 14:32
@tgamblin (Member) left a comment


This is looking great and it's clearly way faster. Some simplifying suggestions below.

RE:

We try to do that by adopting a different token granularity, and by using more complex regexes for tokenization. This allows us to have a "flatter" encoding for the parser (i.e. with fewer nested conditionals):

I don't think the performance win here is actually the nesting, though it's related. It's that you're compiling the regexes and delegating much more parsing responsibility to the regex engine. So where the prior parser would loop several times in Python for each token, this parser is leveraging fast regex parsing to grab a bigger chunk on the C side and get it all unpacked in one iteration on the Python side. Makes a lot of sense and I think the regexes are very clear.

RE: differences:

  • No spaces allowed after a sigil (e.g. foo @ 1.2.3 is invalid while foo @1.2.3 is valid)

I don't see what the point is of enforcing this, and I don't know many language parsers that don't have optional whitespace between operators. I think this is going to break private package repositories and scripts built on top of Spack, for not a lot of benefit. So can you just add \s* to the relevant regexes? I think with the simplifications I suggest below it'll still be readable.

  • /<hash> @1.2.3 can be parsed as a concrete spec followed by an anonymous spec (before was invalid)

The reason I implemented it this way initially is for the CLI use case:

```console
(spackbook):spack> spack uninstall python@3.8
==> Error: python@3.8 matches multiple packages:

    -- darwin-bigsur-skylake -----------------------------------
    wx7u4y4 python@3.8  x24x4ec python@3.8

==> Error: You can either:
    a) use a more specific spec, or
    b) specify the spec by its hash (e.g. `spack uninstall /hash`), or
    c) use `spack uninstall --all` to uninstall ALL matching specs.

# Now you can hit the up arrow and just add the hash
> spack uninstall python/wx7u4y4 @3.8
```

Though now that I've written that I realize they could just do:

> spack uninstall python @3.8 /wx7u4y4

However, it still doesn't really sit well with me that these would be different:

```
python @3.8 /wx7u4y4  # one spec
python /wx7u4y4 @3.8  # two specs
```

I think that is pretty confusing. There is no other place in the Spec grammar where anything but the initial absence of an identifier creates an anonymous spec, and the only way you could initiate a new spec on the CLI was to type a name or a ^ for a dependency constraint. Now, hashes are special and always terminate the current spec parse, while no other sigil does that.

I think it's better without that inconsistency.

@alalazo (Member, Author) commented Dec 6, 2022

I don't see what the point is of enforcing this, and I don't know many language parsers that don't have optional whitespace between operators.

@tgamblin I tried adding this back, but it complicates parsing due to version ranges:

```
@ 1.2.3     :        develop      # This is a version range 1.2.3:develop
@ 1.2.3     :        develop=foo  # This is a version range 1.2.3: followed by a key-value pair
```

so I would keep the parser as it is in this PR; it's true it may break private packages, but in a way that is fixable and backward compatible. Also, I would say that @ (or + or ~ etc.) is not an "operator" per se, since it doesn't combine two homogeneous factors¹. To me it's the entire @1.2.3:develop that is a "version symbol".

Example where spaces could be allowed, but are not

```
>>> def foo      :
  File "<stdin>", line 1
    def foo      :
                 ^
SyntaxError: invalid syntax
```

Footnotes

  1. Not sure I am expressing myself in a formally correct way, but I mean that we don't have rules like expr = expr @ term. Rather we have an optional "version symbol" that might appear in each spec node.

Our parser grew to be quite complex, with a 2-state
lexer and logic in the parser that has up to 5 levels
of nested conditionals.

To simplify the code, here we add a new parser with
a different token granularity. Each token represents
a lexeme with a semantic effect on the associated
spec, so both the lexer and parser logic can be
greatly simplified.

Differences with the previous parser:
- no spaces allowed after a sigil (e.g. "foo @ 1.2.3" is invalid
  while "foo @1.2.3" is valid)
- "/<hash> @1.2.3" can be parsed as two specs (before was invalid)
@alalazo (Member, Author) commented Dec 6, 2022

There is no other place in the Spec grammar where anything but the initial absence of an identifier creates an anonymous spec, and the only way you could initiate a new spec on the CLI was to type a name or a ^ for a dependency constraint. Now, hashes are special and always terminate the current spec parse, while no other sigil does that. I think it's better without that inconsistency.

Restored the old behavior for hashes in 624cb43

@tgamblin tgamblin merged commit ab6499c into spack:develop Dec 7, 2022
@tgamblin tgamblin changed the title Rework the parser to have a different token granularity parser: refactor with coarser token granularity Dec 7, 2022
@tgamblin (Member) commented Dec 7, 2022

@alalazo thanks for getting this to 100% compatibility! It looks great to me.

@alalazo alalazo deleted the refactors/parser branch December 8, 2022 08:21
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023
## Motivation

Our parser grew to be quite complex, with a 2-state lexer and logic in the parser
that has up to 5 levels of nested conditionals. In the future, to turn compilers into
proper dependencies, we'll have to increase the complexity further as we foresee
the need to add:
1. Edge attributes
2. Spec nesting

to the spec syntax (see spack/seps#5 for an initial discussion of
those changes).  The main attempt here is thus to _simplify the existing code_ before
we start extending it later. We try to do that by adopting a different token granularity,
and by using more complex regexes for tokenization. This allows us to have a "flatter"
encoding for the parser, i.e., it has fewer nested conditionals and a near-trivial lexer.

There are places, namely in `VERSION`, where we have to use negative lookahead judiciously
to avoid ambiguity.  Specifically, this parse is ambiguous without `(?!\s*=)` in `VERSION_RANGE`
and an extra final `\b` in `VERSION`:

```
@ 1.2.3     :        develop  # This is a version range 1.2.3:develop
@ 1.2.3     :        develop=foo  # This is a version range 1.2.3: followed by a key-value pair
```
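A simplified sketch of how the lookahead resolves that ambiguity (the pattern below is an illustration, not the exact `VERSION_RANGE` regex in `spack/parser.py`):

```python
import re

# Simplified stand-in for VERSION_RANGE (illustration only). Without the
# negative lookahead (?!\s*=) and the final \b, "1.2.3 : develop=foo"
# would greedily consume (part of) "develop" as the upper end of the
# range instead of leaving it as the key of a key-value pair.
VERSION_RANGE = re.compile(r"[\w.]+\s*:\s*(?:[\w.]+(?!\s*=)\b)?")

def parse_version_range(text):
    """Return the version-range lexeme with internal whitespace removed."""
    return VERSION_RANGE.match(text).group().replace(" ", "")

print(parse_version_range("1.2.3 : develop"))      # version range 1.2.3:develop
print(parse_version_range("1.2.3 : develop=foo"))  # version range 1.2.3:, "develop=foo" left over
```

The lookahead rejects an upper bound that is immediately followed by `=`, and the word boundary stops the regex engine from backtracking to a shorter prefix such as `develo`.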

## Differences with the previous parser

~There are currently 2 known differences with the previous parser, which have been added on purpose:~

- ~No spaces allowed after a sigil (e.g. `foo @ 1.2.3` is invalid while `foo @1.2.3` is valid)~
- ~`/<hash> @1.2.3` can be parsed as a concrete spec followed by an anonymous spec (before was invalid)~

~We can recover the previous behavior on both ones but, especially for the second one, it seems the current behavior in the PR is more consistent.~

The parser is currently 100% backward compatible.

## Error handling

Being based on more complex regexes, we can possibly improve error
handling by adding regexes for common issues and giving users hints accordingly.
I'll leave that for a follow-up PR, but there's a stub for this approach in the PR.

## Performance

To be sure we don't add any performance penalty with this new encoding, I measured:
```console
$ spack python -m timeit -s "import spack.spec" -c "spack.spec.Spec(<spec>)"
```
for different specs on my machine:

* **Spack:** 0.20.0.dev0 (c9db4e5)
* **Python:** 3.8.10
* **Platform:** linux-ubuntu20.04-icelake
* **Concretizer:** clingo

results are:

| Spec          | develop       | this PR |
| ------------- | ------------- | ------- |
| `trilinos`  |  28.9 usec | 13.1 usec |
| `trilinos @1.2.10:1.4.20,2.0.1`  | 131 usec  | 120 usec |
| `trilinos %gcc`  | 44.9 usec  | 20.9 usec |
| `trilinos +foo`  | 44.1 usec  | 21.3 usec |
| `trilinos foo=bar`  | 59.5 usec  | 25.6 usec |
| `trilinos foo=bar ^ mpich foo=baz`  | 120 usec  | 82.1 usec |

so this new parser seems to be consistently faster than the previous one.

## Modifications

In this PR we just substituted the Spec parser, which means:
- [x] Deleted in `spec.py` the `SpecParser` and `SpecLexer` classes. deleted `spack/parse.py`
- [x] Added a new parser in `spack/parser.py`
- [x] Hooked the new parser in all the places the previous one was used
- [x] Adapted unit tests in `test/spec_syntax.py`


## Possible future improvements

Random thoughts while working on the PR:
- Currently we transform hashes and files into specs during parsing. I think
we might want to introduce an additional step and parse special objects like
a `FileSpec` etc. in-between parsing and concretization.

Labels

commands, conflicts, core (PR affects Spack core functionality), dependencies, directives, documentation (Improvements or additions to documentation), patch, tests (General test capability(ies)), update-package, versions, virtual-dependencies


Development

Successfully merging this pull request may close these issues.

Spec parsing broken for key=value pairs
