Use more char.Is helpers from RegexCompiler / source generator #68924

stephentoub · 2022-05-05T17:43:51Z

This PR makes the regex compiler and source generator recognize additional character classes that correspond to sets for which `char` already has `IsXx` methods, and emit calls to those methods instead, e.g. `char.IsControl`, `char.IsLetter`, etc.

Example:

[RegexGenerator(@"\p{C}\P{C}\p{L}\P{L}[\p{L}\d][^\p{L}\d]\p{Ll}\P{Ll}\p{Lu}\P{Lu}\p{N}\P{N}\p{P}\P{P}\p{Z}\P{Z}\p{S}\P{S}")]

previously resulted in:

if ((uint)slice.Length < 18 ||
    (char.GetUnicodeCategory(slice[0]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => false, _ => true }) || // Match a character in the set [\p{C}].
    (char.GetUnicodeCategory(slice[1]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => true, _ => false }) || // Match a character in the set [\P{C}].
    (char.GetUnicodeCategory(slice[2]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => false, _ => true }) || // Match a character in the set [\p{L}].
    (char.GetUnicodeCategory(slice[3]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => true, _ => false }) || // Match a character in the set [\P{L}].
    (char.GetUnicodeCategory(slice[4]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => false, _ => true }) || // Match a character in the set [\p{L}\d].
    (char.GetUnicodeCategory(slice[5]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => true, _ => false }) || // Match a character in the set [^\p{L}\d].
    (char.GetUnicodeCategory(slice[6]) != UnicodeCategory.LowercaseLetter) || // Match a character in the set [\p{Ll}].
    (char.GetUnicodeCategory(slice[7]) == UnicodeCategory.LowercaseLetter) || // Match a character in the set [\P{Ll}].
    (char.GetUnicodeCategory(slice[8]) != UnicodeCategory.UppercaseLetter) || // Match a character in the set [\p{Lu}].
    (char.GetUnicodeCategory(slice[9]) == UnicodeCategory.UppercaseLetter) || // Match a character in the set [\P{Lu}].
    (char.GetUnicodeCategory(slice[10]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => false, _ => true }) || // Match a character in the set [\p{N}].
    (char.GetUnicodeCategory(slice[11]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => true, _ => false }) || // Match a character in the set [\P{N}].
    (char.GetUnicodeCategory(slice[12]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => false, _ => true }) || // Match a character in the set [\p{P}].
    (char.GetUnicodeCategory(slice[13]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => true, _ => false }) || // Match a character in the set [\P{P}].
    (char.GetUnicodeCategory(slice[14]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => false, _ => true }) || // Match a character in the set [\p{Z}].
    (char.GetUnicodeCategory(slice[15]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => true, _ => false }) || // Match a character in the set [\P{Z}].
    (char.GetUnicodeCategory(slice[16]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => false, _ => true }) || // Match a character in the set [\p{S}].
    (char.GetUnicodeCategory(slice[17]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => true, _ => false })) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}

and now results in:

if ((uint)slice.Length < 18 ||
    !char.IsControl(slice[0]) || // Match a character in the set [\p{C}].
    char.IsControl(slice[1]) || // Match a character in the set [\P{C}].
    !char.IsLetter(slice[2]) || // Match a character in the set [\p{L}].
    char.IsLetter(slice[3]) || // Match a character in the set [\P{L}].
    !char.IsLetterOrDigit(slice[4]) || // Match a character in the set [\p{L}\d].
    char.IsLetterOrDigit(slice[5]) || // Match a character in the set [^\p{L}\d].
    !char.IsLower(slice[6]) || // Match a character in the set [\p{Ll}].
    char.IsLower(slice[7]) || // Match a character in the set [\P{Ll}].
    !char.IsUpper(slice[8]) || // Match a character in the set [\p{Lu}].
    char.IsUpper(slice[9]) || // Match a character in the set [\P{Lu}].
    !char.IsNumber(slice[10]) || // Match a character in the set [\p{N}].
    char.IsNumber(slice[11]) || // Match a character in the set [\P{N}].
    !char.IsPunctuation(slice[12]) || // Match a character in the set [\p{P}].
    char.IsPunctuation(slice[13]) || // Match a character in the set [\P{P}].
    !char.IsSeparator(slice[14]) || // Match a character in the set [\p{Z}].
    char.IsSeparator(slice[15]) || // Match a character in the set [\P{Z}].
    !char.IsSymbol(slice[16]) || // Match a character in the set [\p{S}].
    char.IsSymbol(slice[17])) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}
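
The rewrite above relies on each `char.IsXx` helper agreeing with the `UnicodeCategory` check it replaces. Below is a minimal sketch of how that agreement could be checked for several of the helpers by scanning every UTF-16 code unit. The class `CharHelperEquivalence` and the method `AgreeOnAllChars` are hypothetical names made up for this illustration. They are not part of the PR or of the regex implementation.

```csharp
using System;
using System.Globalization;

public static class CharHelperEquivalence
{
    // Hypothetical helper (not from the PR): returns true when `helper` and
    // `byCategory` give the same answer for every UTF-16 code unit.
    public static bool AgreeOnAllChars(Func<char, bool> helper, Func<UnicodeCategory, bool> byCategory)
    {
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (helper(c) != byCategory(char.GetUnicodeCategory(c)))
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        // Each line compares one helper against the category list that the
        // previously generated code spelled out for the same set.
        Console.WriteLine("IsLetter:      " + AgreeOnAllChars(char.IsLetter, cat => cat is
            UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or
            UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or
            UnicodeCategory.UppercaseLetter));

        Console.WriteLine("IsLower:       " + AgreeOnAllChars(char.IsLower, cat =>
            cat == UnicodeCategory.LowercaseLetter));

        Console.WriteLine("IsUpper:       " + AgreeOnAllChars(char.IsUpper, cat =>
            cat == UnicodeCategory.UppercaseLetter));

        Console.WriteLine("IsNumber:      " + AgreeOnAllChars(char.IsNumber, cat => cat is
            UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or
            UnicodeCategory.OtherNumber));

        Console.WriteLine("IsPunctuation: " + AgreeOnAllChars(char.IsPunctuation, cat => cat is
            UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or
            UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or
            UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or
            UnicodeCategory.InitialQuotePunctuation));

        Console.WriteLine("IsSeparator:   " + AgreeOnAllChars(char.IsSeparator, cat => cat is
            UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or
            UnicodeCategory.SpaceSeparator));

        Console.WriteLine("IsSymbol:      " + AgreeOnAllChars(char.IsSymbol, cat => cat is
            UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or
            UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol));
    }
}
```

Each row should report agreement if the helpers are purely category-based, as their documentation describes. The sketch only exercises the `char` overloads and does not touch the regex engines themselves.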

ghost · 2022-05-05T17:43:58Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR causes regex to now specially-recognize additional categories that map to sets char already has IsXx methods for and call them, e.g. char.IsControl, char.IsLetter, etc.

Example:

[RegexGenerator(@"\p{C}\P{C}\p{L}\P{L}[\p{L}\d][^\p{L}\d]\p{Ll}\P{Ll}\p{Lu}\P{Lu}\p{N}\P{N}\p{P}\P{P}\p{Z}\P{Z}\p{S}\P{S}")]

previously resulted in:

if ((uint)slice.Length < 18 ||
    (char.GetUnicodeCategory(slice[0]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => false, _ => true }) || // Match a character in the set [\p{C}].
    (char.GetUnicodeCategory(slice[1]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => true, _ => false }) || // Match a character in the set [\P{C}].
    (char.GetUnicodeCategory(slice[2]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => false, _ => true }) || // Match a character in the set [\p{L}].
    (char.GetUnicodeCategory(slice[3]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => true, _ => false }) || // Match a character in the set [\P{L}].
    (char.GetUnicodeCategory(slice[4]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => false, _ => true }) || // Match a character in the set [\p{L}\d].
    (char.GetUnicodeCategory(slice[5]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => true, _ => false }) || // Match a character in the set [^\p{L}\d].
    (char.GetUnicodeCategory(slice[6]) != UnicodeCategory.LowercaseLetter) || // Match a character in the set [\p{Ll}].
    (char.GetUnicodeCategory(slice[7]) == UnicodeCategory.LowercaseLetter) || // Match a character in the set [\P{Ll}].
    (char.GetUnicodeCategory(slice[8]) != UnicodeCategory.UppercaseLetter) || // Match a character in the set [\p{Lu}].
    (char.GetUnicodeCategory(slice[9]) == UnicodeCategory.UppercaseLetter) || // Match a character in the set [\P{Lu}].
    (char.GetUnicodeCategory(slice[10]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => false, _ => true }) || // Match a character in the set [\p{N}].
    (char.GetUnicodeCategory(slice[11]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => true, _ => false }) || // Match a character in the set [\P{N}].
    (char.GetUnicodeCategory(slice[12]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => false, _ => true }) || // Match a character in the set [\p{P}].
    (char.GetUnicodeCategory(slice[13]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => true, _ => false }) || // Match a character in the set [\P{P}].
    (char.GetUnicodeCategory(slice[14]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => false, _ => true }) || // Match a character in the set [\p{Z}].
    (char.GetUnicodeCategory(slice[15]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => true, _ => false }) || // Match a character in the set [\P{Z}].
    (char.GetUnicodeCategory(slice[16]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => false, _ => true }) || // Match a character in the set [\p{S}].
    (char.GetUnicodeCategory(slice[17]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => true, _ => false })) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}

and now results in:

if ((uint)slice.Length < 18 ||
    !char.IsControl(slice[0]) || // Match a character in the set [\p{C}].
    char.IsControl(slice[1]) || // Match a character in the set [\P{C}].
    !char.IsLetter(slice[2]) || // Match a character in the set [\p{L}].
    char.IsLetter(slice[3]) || // Match a character in the set [\P{L}].
    !char.IsLetterOrDigit(slice[4]) || // Match a character in the set [\p{L}\d].
    char.IsLetterOrDigit(slice[5]) || // Match a character in the set [^\p{L}\d].
    !char.IsLower(slice[6]) || // Match a character in the set [\p{Ll}].
    char.IsLower(slice[7]) || // Match a character in the set [\P{Ll}].
    !char.IsUpper(slice[8]) || // Match a character in the set [\p{Lu}].
    char.IsUpper(slice[9]) || // Match a character in the set [\P{Lu}].
    !char.IsNumber(slice[10]) || // Match a character in the set [\p{N}].
    char.IsNumber(slice[11]) || // Match a character in the set [\P{N}].
    !char.IsPunctuation(slice[12]) || // Match a character in the set [\p{P}].
    char.IsPunctuation(slice[13]) || // Match a character in the set [\P{P}].
    !char.IsSeparator(slice[14]) || // Match a character in the set [\p{Z}].
    char.IsSeparator(slice[15]) || // Match a character in the set [\P{Z}].
    !char.IsSymbol(slice[16]) || // Match a character in the set [\p{S}].
    char.IsSymbol(slice[17])) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

stephentoub · 2022-05-05T17:47:18Z

We'll subsequently want to use any new char.IsXx methods introduced based on #68868.


joperezr

LGTM. I assume we already have tests for all of these constructs for the compiled and source generated engines?

stephentoub · 2022-05-05T21:52:41Z

Yes, though several are outerloop.

stephentoub added the area-System.Text.RegularExpressions and tenet-performance labels May 5, 2022

stephentoub added this to the 7.0.0 milestone May 5, 2022

stephentoub requested a review from joperezr May 5, 2022 17:43

ghost assigned stephentoub May 5, 2022

Use more char.Is helpers from RegexCompiler / source generator

9a6f647

This PR causes regex to now specially-recognize additional categories that map to sets `char` already has `IsXx` methods for and call them, e.g. `char.IsControl`, `char.IsLetter`, etc.

stephentoub force-pushed the morecharcalls branch from 500ca1d to 9a6f647 May 5, 2022 17:49

joperezr approved these changes May 5, 2022


stephentoub merged commit 475c8a8 into dotnet:main May 5, 2022

stephentoub deleted the morecharcalls branch May 5, 2022 21:52

DrewScoggins mentioned this pull request May 10, 2022

[Perf] Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock #69140

Closed

ghost locked as resolved and limited conversation to collaborators Jun 5, 2022
