-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Fix perf bug in RegexCompiler when handling .*? #118373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We have a special-code path that exists to optimize a singleline `.*?`, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a performance bug in the regex compiler's optimization for lazy quantifiers followed by literals. The bug occurred when handling patterns like .*? (singleline lazy dot-star) followed by a literal, where the compiler was incorrectly passing the wrong node to the IndexOf emission helper.
- Corrects the node parameter passed to
EmitIndexOffrom the loop node to the literal node - Fixes performance degradation caused by inefficient
IndexOfAnyInRange(0, \uFFFF)calls - Maintains functional correctness while improving performance for this optimization path
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
MihaZupan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Did you spot this in some benchmark given it's compiler-only?
Yup, the numbers I was getting out made no sense. |
We have a special-code path that exists to optimize a singleline `.*?`, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.
We have a special-code path that exists to optimize a singleline
.*?, in which case we can just search for what comes after the loop in the pattern because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the EmitIndexOf helper that emits that search. We should be passing the node which represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an IndexOfAnyInRange(0, \uFFFF) call. This is functionally ok, but perf tanks because we end up needing to do non-trivial work for every character that matches the loop.