Skip to content

remove shlex.split() preprocessing from the spec lexer/parser#33016

Closed
cosmicexplorer wants to merge 15 commits intospack:developfrom
cosmicexplorer:remove-shlex-lex-preprocessing
Closed

remove shlex.split() preprocessing from the spec lexer/parser#33016
cosmicexplorer wants to merge 15 commits intospack:developfrom
cosmicexplorer:remove-shlex-lex-preprocessing

Conversation

@cosmicexplorer
Copy link
Copy Markdown
Contributor

This change is on top of #29093; see the diff against it at https://github.com/cosmicexplorer/spack/compare/arbitrary-lexer-modes...remove-shlex-lex-preprocessing?expand=1.

Problem

See discussion at #29093 (comment). We currently call shlex.split() before lexing spec strings, and this is the main reason why calling str() on a spec instance often produces an unparseable string. We would like to be able to call e.g. Spec(str(Spec('a b="c d"'))) and get the same spec. Unfortunately, on develop that produces the following:

>>> Spec(str(Spec('a b="c d"')))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/cosmicexplorer/tools/s1/lib/spack/spack/spec.py", line 1262, in __init__
    raise ValueError("More than one spec in string: " + spec_like)
ValueError: More than one spec in string: a b=c d

Solution

  • Build on the lexer refactoring from cleanup spack.parse.Lexer #29093 to define many new lexer modes to handle every part of the input string without relying on shlex.split().
  • Remove the call to shlex.split() in our spec parsing entry point, and move the work of interpreting quotes to the lexer.

Result

>>> Spec(str(Spec('a b="c d"')))
a b='c d'

As discussed in #29093 (comment), this change induces a minor performance regression (this PR even more so than #29093). The regression seems very slight, so I'd like to ignore it, but we'd need to get consensus that this change is enough of a UX improvement to justify that.

It also may be possible to optimize this further; I have not focused on optimizing this approach at all yet.

@cosmicexplorer cosmicexplorer requested a review from trws October 5, 2022 14:55
@cosmicexplorer
Copy link
Copy Markdown
Contributor Author

cc @psakievich @trws: would love your input on any further annoyances with spec parsing that I may have missed. I kinda punted on trying to interpret inner quotes, so any examples of that would be useful as well. My goal is to demonstrate that this approach can be extended to solve lingering UX issues with spec parsing.

@spackbot-app spackbot-app bot added core PR affects Spack core functionality tests General test capability(ies) versions labels Oct 5, 2022
@cosmicexplorer
Copy link
Copy Markdown
Contributor Author

@trws noted that we still join specs from the command line with " ".join() in cmd/spec.py, and proposed some inputs that we'd like to ensure don't get clobbered from the initial spec parsing:

on the command line: umpire%[email protected] cxxflags="-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"
on the command line: 'umpire%[email protected] cxxflags="-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"'
those two should be equivalent
on the command line: umpire%[email protected] "cxxflags=-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"
that one too
those should work the same if specified as a string inside spack as well
and should be equivalent to ["umpire%[email protected]", "cxxflags=-stdlib=libc++ -DGTEST_HAS_CXXABI_H_=0"]

I'll make test cases out of these.

@psakievich
Copy link
Copy Markdown
Contributor

@cosmicexplorer I'm a little late at getting up to speed on all your recent work. Looks good so far from what I can tell. I see you've absorbed the COLON token into the version regex. IIRC the COMMA token may be eligible for the same treatment, unless a new use case has been added through these refactors. I was playing around with this in #32257 and at one point I had removed both tokens (see).

It seemed to work with a single string (passed the tests for spec_syntax.py, but definitely broke with shlex lexing. Since you are removing that it could be back in play

@cosmicexplorer
Copy link
Copy Markdown
Contributor Author

cosmicexplorer commented Oct 5, 2022

@psakievich thank you! I had actually removed the COMMA earlier (and indeed, it is not used when parsing versions now), but brought it back to cover multiple-valued variants which may or may not be quoted (e.g. a='b',c). That would be the new use case you mentioned:

unless a new use case has been added through these refactors

There has definitely been an explosion of lexer modes in this PR (e.g. four modes for different types of single and double quoting), and it would probably improve readability and performance if we could slim those down somewhat, but I think I'm focusing on completing the functionality first (see Tom's spec inputs above).

@psakievich
Copy link
Copy Markdown
Contributor

That makes sense. I'll try to give this a closer look later today and see if I can come up with any more suggestions. Personally, I need a quiet room and a security blanket close at hand when dealing with Spack's parser. 😁

Copy link
Copy Markdown
Contributor

@psakievich psakievich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like these changes, but I am finding myself getting a little lost in the design. A couple of thoughts to improve readability and testability of the code here.

__slots__ = "scanners", "original_mode", "mode", "switchbook"

def __init__(self, lexicon_and_mode_switches):
# type: (List[Tuple[str, _Lexicon, Dict[str, List[Any]]]]) -> None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to not make these commented type definitions into actual types? From a code readability stand point trying to follow the construction of this object, and the implementation in spec.py would be simplified if this conglomeration of types were members inside a class.

We could also test each phase individually in spec.py if we instantiated the objects outside the Lexer.


try:
while self.next:
if self.accept(SINGLE_QUOTE):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always found this section of the code really difficult to follow because of all the nested if/else's.

I'd suggest any new additions should be functions so we get a better description of what is going on through the name. Maybe something like:

def starts_with_quotes(self, quote_type, specs):
   # same code for single and double

def process_initial_quotes(self, specs):
   # replace both new if statements with this function
   starts_with_quotes(SINGLE_QUOTE, specs)
   starts_with_quotes(DOUBLE_QUOTE, specs)

@alalazo
Copy link
Copy Markdown
Member

alalazo commented Feb 21, 2023

Superseded by #34151

@alalazo alalazo closed this Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core PR affects Spack core functionality tests General test capability(ies) versions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants