Skip to content

Conversation

@stephentoub
Copy link
Member

I spent some time exploring combining all of the TryFindXx and TryMatchXx methods in RegexCompiler / source generator into just a single Scan. It doesn't yet appear to be worth doing, but along the way I tweaked a bunch of things, and am submitting that as a PR separately.

Functional:

  • We were throwing NotSupportedException when RegexOptions.NonBacktracking was used in combination with other supported options, but for existing incompatible options we throw ArgumentOutOfRangeException, so we should be doing the same there. We should also do it in the helper where we validate options rather than in the NonBacktracking factory.

Performance:

  • Given a pattern like "\w*-", we'll use our literal-after-loop find optimization where we search for the '-' and then walk backwards through the atomic loop looking for where it starts. However, we then record that start-of-loop location as the starting position, and try to match from there, which means we end up then rematching that whole loop again. If the loop is long, that can be non-trivial wasted work. We now store the position after the loop into a field and use that from the matching logic to skip the loop as part of the match.
  • We now emit specialized code in Scan for a few common structures, e.g. if the pattern begins with a beginning anchor, we don't need a scan loop at all.

Most everything else is for code readability / maintenance.

@stephentoub stephentoub added this to the 7.0.0 milestone Mar 14, 2022
@stephentoub stephentoub requested a review from joperezr March 14, 2022 02:46
@ghost ghost assigned stephentoub Mar 14, 2022
@ghost
Copy link

ghost commented Mar 14, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

I spent some time exploring combining all of the TryFindXx and TryMatchXx methods in RegexCompiler / source generator into just a single Scan. It doesn't yet appear to be worth doing, but along the way I tweaked a bunch of things, and am submitting that as a PR separately.

Functional:

  • We were throwing NotSupportedException when RegexOptions.NonBacktracking was used in combination with other supported options, but for existing incompatible options we throw ArgumentOutOfRangeException, so we should be doing the same there. We should also do it in the helper where we validate options rather than in the NonBacktracking factory.

Performance:

  • Given a pattern like "\w*-", we'll use our literal-after-loop find optimization where we search for the '-' and then walk backwards through the atomic loop looking for where it starts. However, we then record that start-of-loop location as the starting position, and try to match from there, which means we end up then rematching that whole loop again. If the loop is long, that can be non-trivial wasted work. We now store the position after the loop into a field and use that from the matching logic to skip the loop as part of the match.
  • We now emit specialized code in Scan for a few common structures, e.g. if the pattern begins with a beginning anchor, we don't need a scan loop at all.

Most everything else is for code readability / maintenance.

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

Copy link
Contributor

@deeprobin deeprobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks overall good to me - Good job!
👍🏼

Comment on lines +140 to +141
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, it would be useful to add a comment here explaining this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment why supressing the _regexTree nullable warning would be imo useful.
How do we know that _regexTree is not null?

Comment on lines +2361 to +2362
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment why supressing the _regexTree nullable warning would be imo useful.
How do we know that _regexTree is not null?

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some small questions but this looks good otherwise. I like the simplified codegen this produces for most common cases.

@stephentoub
Copy link
Member Author

for most common cases

Yeah, this covers ~1/3rd of the cases in our corpus, so that's decent.

@stephentoub stephentoub merged commit 734362c into dotnet:main Mar 15, 2022
@stephentoub stephentoub deleted the fixemittedcode branch March 15, 2022 00:27
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
* Remove extra blank comment line

EmitScope's usage was resulting in a "//" line being output.  We can just delete all remaining usage of EmitScope; it's no longer helpful.

* Fix some blocks not using EmitBlock

* Minor tweak to add some uint casts to eliminate a few bounds checks

* Avoid rematching loops already matched with the literal-after-loop find optimization

Store the end of the loop position into the unused runtrackpos, and use that to jump right to that position instead of the normal loop handling code.

* Remove unnecessary writer argument to the SliceInputSpan local function

* Make sure we have some RightToLeft culture tests

* Use a more canonical form of loop in EmitLiteralAfterAtomicLoop

* Avoid unnecessary loop in literal-after-atomic-loop for no min bound

* Add more emitted comments to TryFindNextPossibleStartingPosition

* Invert some anchor checks to make the code cleaner

* Remove stray "global::"

* Rename original_pos to matchStart

* Add an additional reduction for concats containing nothings

* Fix exception type thrown for invalid NonBacktracking options combination

* Add special-cased code emitting to EmitScan

* Avoid label in TryFindNextPossibleStartingPosition when it's unnecessary
@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants