Skip to content

Conversation

@stephentoub
Copy link
Member

@stephentoub stephentoub commented Jul 17, 2022

Character classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze (at least for non-negated sets). This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set [a-z-[aeio] would still produce a fall back path for non-ASCII, even though the ranges highlight that the only accepted values are ASCII... with this change, that fallback won't be needed. Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the range of accepted characters is no larger than that.

Character classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction.  Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match).  But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze.  This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze.  For example, today the set `[a-z-[aeio]` would still produce a fall back path for non-ASCII, even though the ranges highlight that the only accepted values are ASCII... with this change, that fallback won't be needed.  Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the range of accepted characters is no larger than that.
@stephentoub stephentoub added this to the 7.0.0 milestone Jul 17, 2022
@ghost ghost assigned stephentoub Jul 17, 2022
@ghost
Copy link

ghost commented Jul 17, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Character classes containing subtraction are currently skipped in RegexCharClass.Analyze as it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most of the calls to CanEasilyEnumerateSetContents can't deal with subtraction as they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze. This is useful because RegexCompiler and source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set [a-z-[aeio] would still produce a fall back path for non-ASCII, even though the ranges highlight that the only accepted values are ASCII... with this change, that fallback won't be needed. Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the range of accepted characters is no larger than that.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@stephentoub stephentoub requested a review from joperezr July 20, 2022 16:37
Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM. Should we add a unit test that uses subtraction and calls analyze to validate the result?

Also improve Analyze to handle a few more cases
@stephentoub stephentoub force-pushed the analysissubtraction branch from 34031e0 to d57ba6c Compare July 21, 2022 03:26
@stephentoub
Copy link
Member Author

Should we add a unit test that uses subtraction and calls analyze to validate the result?

Yup. Added.

@stephentoub stephentoub merged commit c2f07f6 into dotnet:main Jul 21, 2022
@stephentoub stephentoub deleted the analysissubtraction branch July 21, 2022 18:11
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants