Improve RegexCharClass.Analyze for sets with subtraction #72328

stephentoub · 2022-07-17T03:56:50Z

Character classes containing subtraction are currently skipped by RegexCharClass.Analyze because it depends on CanEasilyEnumerateSetContents, which in turn bails for sets with subtraction. Most callers of CanEasilyEnumerateSetContents can't deal with subtraction because they require an exact answer (e.g. GetSetChars needs to enumerate the ranges to yield those and only those characters that match). But Analyze is fine producing an overestimate, and since subtraction can only ever narrow the set of what's accepted, we can simply ignore subtraction in Analyze (at least for non-negated sets). This is useful because RegexCompiler and the source generator have multiple optimizations that kick in based on the results of Analyze. For example, today the set `[a-z-[aeio]]` would still produce a fallback path for non-ASCII, even though the ranges show that the only accepted values are ASCII; with this change, that fallback won't be needed. Similarly, a set with subtraction but only Unicode ranges could now end up satisfying various optimizations, like using a 64-bit lookup table if the accepted characters span a range of at most 64 values.
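To illustrate why an overestimate is safe here, the following is a minimal sketch that uses only the public `Regex` API. It is not how `Analyze` is implemented: the `Measure` helper and the `SubtractionOverestimateDemo` class are hypothetical names made up for this example, and they brute-force every UTF-16 code unit rather than inspecting the set's ranges. The sketch compares the exact bounds of `[a-z-[aeio]]` with the bounds of `[a-z]`, which is the same set with the subtraction ignored.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubtractionOverestimateDemo
{
    // Hypothetical helper (not part of RegexCharClass): brute-forces every UTF-16
    // code unit and reports the lowest and highest char the single-set pattern
    // accepts, plus how many chars it accepts in total.
    public static (int Min, int Max, int Count) Measure(string setPattern)
    {
        var set = new Regex(setPattern);
        int min = -1, max = -1, count = 0;
        for (int c = char.MinValue; c <= char.MaxValue; c++)
        {
            // A one-char input matches a lone set pattern only if the char is in the set.
            if (set.IsMatch(((char)c).ToString()))
            {
                if (min < 0) min = c;
                max = c;
                count++;
            }
        }
        return (min, max, count);
    }

    public static void Main()
    {
        var exact = Measure("[a-z-[aeio]]"); // set with subtraction
        var over = Measure("[a-z]");         // same set with the subtraction ignored

        // Ignoring subtraction can only widen the set, so the overestimated
        // bounds enclose the exact ones.
        Console.WriteLine(over.Min <= exact.Min && exact.Max <= over.Max);

        // The overestimate alone is enough to show the set is ASCII-only...
        Console.WriteLine(over.Max < 128);

        // ...and that its span would fit in a 64-bit bitmap.
        Console.WriteLine(over.Max - over.Min + 1 <= 64);
    }
}
```

Because every property checked in `Main` is an upper bound on what the set can accept, establishing it for the wider set also establishes it for the narrower one. The same reasoning would not carry over to callers that need the exact contents of the set.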

ghost · 2022-07-17T03:57:03Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

joperezr

Code changes LGTM. Should we add a unit test that uses subtraction and calls analyze to validate the result?

stephentoub · 2022-07-21T03:26:39Z

> Should we add a unit test that uses subtraction and calls analyze to validate the result?

Yup. Added.

stephentoub added the `area-System.Text.RegularExpressions` and `tenet-performance` labels Jul 17, 2022

stephentoub added this to the 7.0.0 milestone Jul 17, 2022

ghost assigned stephentoub Jul 17, 2022

bradmarder reviewed Jul 17, 2022

stephentoub requested a review from joperezr July 20, 2022 16:37

joperezr approved these changes Jul 20, 2022

Add RegexCharClass.Analyze unit tests

d57ba6c

Also improve Analyze to handle a few more cases

stephentoub force-pushed the analysissubtraction branch from 34031e0 to d57ba6c July 21, 2022 03:26

joperezr approved these changes Jul 21, 2022

stephentoub merged commit c2f07f6 into dotnet:main Jul 21, 2022

stephentoub deleted the analysissubtraction branch July 21, 2022 18:11

ghost locked as resolved and limited conversation to collaborators Aug 20, 2022
