-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Improve RegexCharClass.Analyze for sets with subtraction #72328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Character classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze. This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set `[a-z-[aeio]` would still produce a fall back path for non-ASCII, even though the ranges highlight that the only accepted values are ASCII... with this change, that fallback won't be needed. Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the range of accepted characters is no larger than that.
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsCharacter classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze. This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set
|
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Outdated
Show resolved
Hide resolved
joperezr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM. Should we add a unit test that uses subtraction and calls analyze to validate the result?
Also improve Analyze to handle a few more cases
34031e0 to
d57ba6c
Compare
Yup. Added. |
Character classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze (at least for non-negated sets). This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set [a-z-[aeio] would still produce a fall back path for non-ASCII, even though the ranges highlight that the only accepted values are ASCII... with this change, that fallback won't be needed. Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the range of accepted characters is no larger than that.