Skip to content

Conversation

@stephentoub
Copy link
Member

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic. The optimization is meant to use IndexOf{Any} in a situation like <[^>]*?> where a lazy loop is consuming input until it sees some character. In this example, we'd want to IndexOf('>'), but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up. In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic.  The optimization is meant to use IndexOf{Any} in a situation like `<[^>]*?>` where a lazy loop is consuming input until it sees some character.  In this example, we'd want to `IndexOf('>')`, but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up.  In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).
@stephentoub stephentoub added this to the 7.0.0 milestone Apr 22, 2022
@stephentoub stephentoub requested a review from joperezr April 22, 2022 17:03
@ghost ghost assigned stephentoub Apr 22, 2022
@ghost
Copy link

ghost commented Apr 22, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic. The optimization is meant to use IndexOf{Any} in a situation like <[^>]*?> where a lazy loop is consuming input until it sees some character. In this example, we'd want to IndexOf('>'), but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up. In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we still have #63300 open and I plan to work on it soon, but would it be worth adding some tests on this PR to ensure some pattern examples are using IndexOf as expected? If you don't want to do that now, let's just add a note on that issue so that I remember to add a test case for this PR along the rest of the tests.

@stephentoub
Copy link
Member Author

let's just add a note on that issue so that I remember to add a test case for this PR along the rest of the tests.

Will do. Thanks.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Apr 28, 2022

@ghost ghost locked as resolved and limited conversation to collaborators May 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants