Skip to content

Conversation

@stephentoub
Copy link
Member

We have a special-code path that exists to optimize a singleline .*?, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.

We have a special-code path that exists to optimize a singleline `.*?`, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a performance bug in the regex compiler's optimization for lazy quantifiers followed by literals. The bug occurred when handling patterns like .*? (singleline lazy dot-star) followed by a literal, where the compiler was incorrectly passing the wrong node to the IndexOf emission helper.

  • Corrects the node parameter passed to EmitIndexOf from the loop node to the literal node
  • Fixes performance degradation caused by inefficient IndexOfAnyInRange(0, \uFFFF) calls
  • Maintains functional correctness while improving performance for this optimization path

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Did you spot this in some benchmark given it's compiler-only?

@stephentoub stephentoub merged commit 0686df9 into dotnet:main Aug 5, 2025
80 of 82 checks passed
@stephentoub
Copy link
Member Author

Nice. Did you spot this in some benchmark given it's compiler-only?

Yup, the numbers I was getting out made no sense.

@stephentoub stephentoub deleted the fixanysearch branch August 5, 2025 11:01
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Aug 5, 2025
We have a special-code path that exists to optimize a singleline `.*?`, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.
@github-actions github-actions bot locked and limited conversation to collaborators Sep 5, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants