remove shlex.split() preprocessing from the spec lexer/parser by cosmicexplorer · Pull Request #33016 · spack/spack

cosmicexplorer · 2022-10-05T14:53:46Z

This change is on top of #29093; see the diff against it at https://github.com/cosmicexplorer/spack/compare/arbitrary-lexer-modes...remove-shlex-lex-preprocessing?expand=1.

Problem

See discussion at #29093 (comment). We currently call shlex.split() before lexing spec strings, and this is the main reason why calling str() on a spec instance often produces an unparseable string. We would like to be able to call e.g. Spec(str(Spec('a b="c d"'))) and get the same spec. Unfortunately, on develop that produces the following:

>>> Spec(str(Spec('a b="c d"')))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/cosmicexplorer/tools/s1/lib/spack/spack/spec.py", line 1262, in __init__
    raise ValueError("More than one spec in string: " + spec_like)
ValueError: More than one spec in string: a b=c d

Solution

Build on the lexer refactoring from cleanup spack.parse.Lexer #29093 to define many new lexer modes to handle every part of the input string without relying on shlex.split().
Remove the call to shlex.split() in our spec parsing entry point, and move the work of interpreting quotes to the lexer.

Result

>>> Spec(str(Spec('a b="c d"')))
a b='c d'

As discussed in #29093 (comment), this change induces a minor performance regression (this PR even more so than #29093). The regression seems very slight, so I'd like to ignore it, but we'd need to get consensus that this change is enough of a UX improvement to justify that.

It also may be possible to optimize this further; I have not focused on optimizing this approach at all yet.

cosmicexplorer · 2022-10-05T14:59:10Z

cc @psakievich @trws: would love your input on any further annoyances with spec parsing that I may have missed. I kinda punted on trying to interpret inner quotes, so any examples of that would be useful as well. My goal is to demonstrate that this approach can be extended to solve lingering UX issues with spec parsing.

cosmicexplorer · 2022-10-05T15:40:43Z

@trws noted that we still join specs from the command line with " ".join() in cmd/spec.py, and proposed some inputs that we'd like to ensure don't get clobbered from the initial spec parsing:

on the command line: umpire%[email protected] cxxflags="-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"
on the command line: 'umpire%[email protected] cxxflags="-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"'
those two should be equivalent
on the command line: umpire%[email protected] "cxxflags=-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"
that one too
those should work the same if specified as a string inside spack as well
and should be equivalent to ["umpire%[email protected]", "cxxflags=-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"]

I'll make test cases out of these.

psakievich · 2022-10-05T16:30:44Z

@cosmicexplorer I'm a little late at getting up to speed on all your recent work. Looks good so far from what I can tell. I see you've absorbed the COLON token into the version regex. IIRC the COMMA token may be eligible for the same treatment, unless a new use case has been added through these refactors. I was playing around with this in #32257 and at one point I had removed both tokens (see).

It seemed to work with a single string (passed the tests for spec_syntax.py, but definitely broke with shlex lexing. Since you are removing that it could be back in play

cosmicexplorer · 2022-10-05T16:43:35Z

@psakievich thank you! I had actually removed the COMMA earlier (and indeed, it is not used when parsing versions now), but brought it back to cover multiple-valued variants which may or may not be quoted (e.g. a='b',c). That would be the new use case you mentioned:

unless a new use case has been added through these refactors

There has definitely been an explosion of lexer modes in this PR (e.g. four modes for different types of single and double quoting), and it would probably improve readability and performance if we could slim those down somewhat, but I think I'm focusing on completing the functionality first (see Tom's spec inputs above).

psakievich · 2022-10-05T16:47:13Z

That makes sense. I'll try to give this a closer look later today and see if I can come up with any more suggestions. Personally, I need a quiet room and a security blanket close at hand when dealing with Spack's parser. 😁

psakievich

I really like these changes, but I am finding myself getting a little lost in the design. A couple of thoughts to improve readability and testability of the code here.

psakievich · 2022-10-13T11:43:55Z

lib/spack/spack/parse.py

+    __slots__ = "scanners", "original_mode", "mode", "switchbook"
+
+    def __init__(self, lexicon_and_mode_switches):
+        # type: (List[Tuple[str, _Lexicon, Dict[str, List[Any]]]]) -> None


Is there any reason to not make these commented type definitions into actual types? From a code readability stand point trying to follow the construction of this object, and the implementation in spec.py would be simplified if this conglomeration of types were members inside a class.

We could also test each phase individually in spec.py if we instantiated the objects outside the Lexer.

psakievich · 2022-10-13T12:18:27Z

lib/spack/spack/spec.py


        try:
            while self.next:
+                if self.accept(SINGLE_QUOTE):


I've always found this section of the code really difficult to follow because of all the nested if/else's.

I'd suggest any new additions should be functions so we get a better description of what is going on through the name. Maybe something like:

def starts_with_quotes(self, quote_type, specs): # same code for single and double def process_initial_quotes(self, specs): # replace both new if statements with this function starts_with_quotes(SINGLE_QUOTE, specs) starts_with_quotes(DOUBLE_QUOTE, specs)

alalazo · 2023-02-21T06:43:25Z

Superseded by #34151

cosmicexplorer added 15 commits October 5, 2022 10:29

Make it possible to have more than one lexer mode at once!

5c6f61c

Enable named instead of numeric lexer modes!

224aad8

add types to Lexer

7df9999

make parser operate on a token stream

10e51cb

remove extraneous self.token(...) calls in SpecLexer.__init__()

4eec16b

use re.escape() for most tokens instead of manual escapes

a17b09c

add back missing VER token and run black

a994d1b

break out confusing lines of code into their own variables and fix bug

d49b1a1

fix py2 compat weirdness with type decl

0a4f972

add a lot more documentation of how we construct/execute the scanner

98c9386

fix weird mypy error

b2df9df

remove shlex.split from spec parsing!!!

531747f

make spec serialization idempotent (!!!!)

d61d454

add test to demonstrate the specific benefit this change provides

f3256b0

make style pass

70ad424

cosmicexplorer requested a review from trws October 5, 2022 14:55

spackbot-app bot added core PR affects Spack core functionality tests General test capability(ies) versions labels Oct 5, 2022

cosmicexplorer mentioned this pull request Oct 5, 2022

cleanup spack.parse.Lexer #29093

Closed

tldahlgren assigned psakievich Oct 10, 2022

psakievich reviewed Oct 13, 2022

View reviewed changes

alalazo closed this Feb 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove shlex.split() preprocessing from the spec lexer/parser#33016

remove shlex.split() preprocessing from the spec lexer/parser#33016
cosmicexplorer wants to merge 15 commits intospack:developfrom
cosmicexplorer:remove-shlex-lex-preprocessing

cosmicexplorer commented Oct 5, 2022

Uh oh!

cosmicexplorer commented Oct 5, 2022

Uh oh!

cosmicexplorer commented Oct 5, 2022

Uh oh!

psakievich commented Oct 5, 2022

Uh oh!

cosmicexplorer commented Oct 5, 2022 •

edited

Loading

Uh oh!

psakievich commented Oct 5, 2022

Uh oh!

psakievich left a comment

Uh oh!

psakievich Oct 13, 2022

Uh oh!

psakievich Oct 13, 2022

Uh oh!

alalazo commented Feb 21, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cosmicexplorer commented Oct 5, 2022

Problem

Solution

Result

Uh oh!

cosmicexplorer commented Oct 5, 2022

Uh oh!

cosmicexplorer commented Oct 5, 2022

Uh oh!

psakievich commented Oct 5, 2022

Uh oh!

cosmicexplorer commented Oct 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

psakievich commented Oct 5, 2022

Uh oh!

psakievich left a comment

Choose a reason for hiding this comment

Uh oh!

psakievich Oct 13, 2022

Choose a reason for hiding this comment

Uh oh!

psakievich Oct 13, 2022

Choose a reason for hiding this comment

Uh oh!

alalazo commented Feb 21, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cosmicexplorer commented Oct 5, 2022 •

edited

Loading