feat(regex_parser): Implement RegExp parser #3824
regexpp for OXC
CodSpeed Performance Report: Merging #3824 will not alter performance.
|
Now that we have some of the implementation working, we should think about how to support the regex eslint rules 🤔 |
regexpp, RegExp parser
|
@leaysgur Hello, I'm currently working on a smaller version of regex groups; maybe you'll find some useful snippets here: Interesting method:
|
|
This is awesome, I'm looking forward to this PR😍 I always had a theory that there are only 5 people on StackOverflow who write all the regex examples and everyone else just copies them into production. If that theory is correct I bet you'd be the 6th after this😆 |
|
That's the truth. 😅 |
|
I hope it becomes an independent crate package. |
|
@Boshen ^ What do you think? (Maybe also related to #4242 (comment)) |
|
Hello @leaysgur for See: Did you consider this use case for escaped backslashes? |
|
@Sysix Thanks for your comment!
The current AST for backref already holds
Yes, but as a RegExp parser, I do not specifically address backslash escaping (rather escape sequences). For now, the treatment of pattern
My understanding may be wrong, and I'm not sure how the OXC parser handles these escapes. 😅
Nope…
For this reason, we may need to add a new flag and implement a lexer layer to check
Or just leave it to userland to be called
I'm beginning to think about this. 🤔
Hmmm, not so sure. I think I'll wait for @Boshen's advice.
This is a summary of what I need to ask:
|
219fe7e to e95c600
|
Part of #1164

## Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`.

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
- [x] Keep `enum` size small by `Box<'a, T>`
- [x] Rework AST shapes
- [x] Split body and flags w/ validating literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
- [x] Parse `BoundaryAssertion`
- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
- [x] Parse `Atom` > `PatternCharacter`
- [x] Parse `Atom` > `.`
- [x] Parse `Atom` > `\AtomEscape`
- [x] Parse `\AtomEscape` > `DecimalEscape`
- [x] Parse `\AtomEscape` > `CharacterClassEscape`
- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
- [x] Parse `\AtomEscape` > `CharacterEscape`
- [x] Parse `CharacterEscape` > `ControlEscape`
- [x] Parse `CharacterEscape` > `c AsciiLetter`
- [x] Parse `CharacterEscape` > `0`
- [x] Parse `CharacterEscape` > `HexEscapeSequence`
- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
- [x] Parse `CharacterEscape` > `IdentityEscape`
- [x] Parse `\AtomEscape` > `kGroupName`
- [x] Parse `Atom` > `[CharacterClass]`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
- [x] Parse `ClassSetExpression` > `ClassUnion`
- [x] Parse `ClassSetExpression` > `ClassIntersection`
- [x] Parse `ClassSetExpression` > `ClassSubtraction`
- [x] Parse `ClassSetExpression` > `ClassSetOperand`
- [x] Parse `ClassSetExpression` > `ClassSetRange`
- [x] Parse `ClassSetExpression` > `ClassSetCharacter`
- [x] Parse `Atom` > `(GroupSpecifier)`
- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
- [x] Parse `QuantifiableAssertion`
- [x] Parse `ExtendedAtom`
- [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
- [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
- [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
- [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
- [x] Pattern :: Disjunction(1/2)
- [x] Pattern :: Disjunction(2/2)
- [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
- [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
- [x] AtomEscape :: k GroupName
- [x] AtomEscape :: DecimalEscape
- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
- [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
- [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
- [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
- [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
- [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
- [x] CharacterClass :: [^ ClassContents ]
- [x] NestedClass :: [^ ClassContents ]
- [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
- [x] `Reader#peek()` should avoid `iter.next()` equivalent
- [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
- [x] ~~Try 1(+1) loop parsing for capturing groups?~~

## Follow up

- [x] @Boshen Test suite > #4242
- [x] Investigate CI errors...
- Next...
- Support ES2025 Duplicate named capturing groups?
- Support ES20XX Stage3 Modifiers?
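
To make the `Reader` items in the TODO list more concrete, here is a minimal sketch of a reader that exposes either Unicode code points or UTF-16 code units as `u32` values, with a `peek()` that only indexes into a buffer instead of advancing an iterator. This is not the code from this PR: the struct layout and the method names `new`, `eat`, and `offset` are assumptions made for illustration.

```rust
/// Hypothetical reader over a pattern's units (not the actual oxc `Reader`).
pub struct Reader {
    /// Code points in unicode mode, UTF-16 code units otherwise, both widened to u32.
    units: Vec<u32>,
    /// Index of the next unit to read.
    index: usize,
}

impl Reader {
    /// Pre-collect the units once so that later lookups are plain indexing.
    pub fn new(source: &str, unicode_mode: bool) -> Self {
        let units: Vec<u32> = if unicode_mode {
            source.chars().map(u32::from).collect()
        } else {
            source.encode_utf16().map(u32::from).collect()
        };
        Self { units, index: 0 }
    }

    /// Look at the current unit without consuming it and without touching an iterator.
    pub fn peek(&self) -> Option<u32> {
        self.units.get(self.index).copied()
    }

    /// Consume the current unit only if it equals `ch`.
    pub fn eat(&mut self, ch: char) -> bool {
        if self.peek() == Some(u32::from(ch)) {
            self.index += 1;
            true
        } else {
            false
        }
    }

    /// Offset in units from the start of the pattern (not a byte offset).
    pub fn offset(&self) -> usize {
        self.index
    }
}
```

With this shape, a character outside the Basic Multilingual Plane is one unit in unicode mode and two surrogate units otherwise, which is why the offset is counted in units and would still need converting to a global `Span`.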
|
@Sysix Sorry to bother you on an already closed PR. I finally found that we do not need to care about the escaped backslash issue you mentioned. Please see But you may still need to wait a little longer to use this in the linter. #1164 (comment) |
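
For context, the escaped-backslash question raised earlier in this thread comes from the gap between the raw source text of a string passed to `new RegExp(pat)` and the pattern the regex engine actually receives. The sketch below illustrates that gap only. The helper name `cook_string_body` is hypothetical, it is not part of this PR, and it deliberately handles just a few simple escapes.

```rust
/// Hypothetical helper: convert the raw source text between the quotes of a
/// JS string literal into the string value that `RegExp` would receive.
/// Simplification: only `\n`, `\t`, `\r` are decoded; for any other character
/// the backslash is dropped. Hex, unicode, and octal escapes and line
/// continuations are NOT handled here.
pub fn cook_string_body(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // A backslash was seen: decide based on the character that follows.
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            // Covers `\\`, escaped quotes, and other non-escape characters.
            Some(other) => out.push(other),
            // A trailing lone backslash is not valid JS; it is dropped here.
            None => {}
        }
    }
    out
}
```

Under these assumptions, the source text `\\d+` inside a string literal yields the pattern `\d+`, while `\d+` yields just `d+`, so a regex literal and a string argument with the same visible text can describe different patterns.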