@stephentoub
Member

@stephentoub stephentoub commented May 5, 2022

This PR makes regex specially recognize additional categories that map to sets for which char already has IsXx methods (e.g. char.IsControl, char.IsLetter) and emit calls to those methods.

Example:

[RegexGenerator(@"\p{C}\P{C}\p{L}\P{L}[\p{L}\d][^\p{L}\d]\p{Ll}\P{Ll}\p{Lu}\P{Lu}\p{N}\P{N}\p{P}\P{P}\p{Z}\P{Z}\p{S}\P{S}")]

previously resulted in:

if ((uint)slice.Length < 18 ||
    (char.GetUnicodeCategory(slice[0]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => false, _ => true }) || // Match a character in the set [\p{C}].
    (char.GetUnicodeCategory(slice[1]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => true, _ => false }) || // Match a character in the set [\P{C}].
    (char.GetUnicodeCategory(slice[2]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => false, _ => true }) || // Match a character in the set [\p{L}].
    (char.GetUnicodeCategory(slice[3]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => true, _ => false }) || // Match a character in the set [\P{L}].
    (char.GetUnicodeCategory(slice[4]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => false, _ => true }) || // Match a character in the set [\p{L}\d].
    (char.GetUnicodeCategory(slice[5]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => true, _ => false }) || // Match a character in the set [^\p{L}\d].
    (char.GetUnicodeCategory(slice[6]) != UnicodeCategory.LowercaseLetter) || // Match a character in the set [\p{Ll}].
    (char.GetUnicodeCategory(slice[7]) == UnicodeCategory.LowercaseLetter) || // Match a character in the set [\P{Ll}].
    (char.GetUnicodeCategory(slice[8]) != UnicodeCategory.UppercaseLetter) || // Match a character in the set [\p{Lu}].
    (char.GetUnicodeCategory(slice[9]) == UnicodeCategory.UppercaseLetter) || // Match a character in the set [\P{Lu}].
    (char.GetUnicodeCategory(slice[10]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => false, _ => true }) || // Match a character in the set [\p{N}].
    (char.GetUnicodeCategory(slice[11]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => true, _ => false }) || // Match a character in the set [\P{N}].
    (char.GetUnicodeCategory(slice[12]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => false, _ => true }) || // Match a character in the set [\p{P}].
    (char.GetUnicodeCategory(slice[13]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => true, _ => false }) || // Match a character in the set [\P{P}].
    (char.GetUnicodeCategory(slice[14]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => false, _ => true }) || // Match a character in the set [\p{Z}].
    (char.GetUnicodeCategory(slice[15]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => true, _ => false }) || // Match a character in the set [\P{Z}].
    (char.GetUnicodeCategory(slice[16]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => false, _ => true }) || // Match a character in the set [\p{S}].
    (char.GetUnicodeCategory(slice[17]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => true, _ => false })) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}

and now results in:

if ((uint)slice.Length < 18 ||
    !char.IsControl(slice[0]) || // Match a character in the set [\p{C}].
    char.IsControl(slice[1]) || // Match a character in the set [\P{C}].
    !char.IsLetter(slice[2]) || // Match a character in the set [\p{L}].
    char.IsLetter(slice[3]) || // Match a character in the set [\P{L}].
    !char.IsLetterOrDigit(slice[4]) || // Match a character in the set [\p{L}\d].
    char.IsLetterOrDigit(slice[5]) || // Match a character in the set [^\p{L}\d].
    !char.IsLower(slice[6]) || // Match a character in the set [\p{Ll}].
    char.IsLower(slice[7]) || // Match a character in the set [\P{Ll}].
    !char.IsUpper(slice[8]) || // Match a character in the set [\p{Lu}].
    char.IsUpper(slice[9]) || // Match a character in the set [\P{Lu}].
    !char.IsNumber(slice[10]) || // Match a character in the set [\p{N}].
    char.IsNumber(slice[11]) || // Match a character in the set [\P{N}].
    !char.IsPunctuation(slice[12]) || // Match a character in the set [\p{P}].
    char.IsPunctuation(slice[13]) || // Match a character in the set [\P{P}].
    !char.IsSeparator(slice[14]) || // Match a character in the set [\p{Z}].
    char.IsSeparator(slice[15]) || // Match a character in the set [\P{Z}].
    !char.IsSymbol(slice[16]) || // Match a character in the set [\p{S}].
    char.IsSymbol(slice[17])) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}
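
As a rough sanity check on the idea above, here is a minimal sketch (not code from this PR) that compares a few of the char.IsXx methods against the equivalent UnicodeCategory checks for every UTF-16 code unit. The `checks` table and its field names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical table for illustration: each entry pairs a char.IsXx call with
// the category-based check that the old generated code spelled out by hand.
var checks = new (string Name, Func<char, bool> IsXx, Func<UnicodeCategory, bool> ByCategory)[]
{
    ("IsLetter", c => char.IsLetter(c),
        cat => cat is UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or
               UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or
               UnicodeCategory.UppercaseLetter),
    ("IsLower", c => char.IsLower(c), cat => cat == UnicodeCategory.LowercaseLetter),
    ("IsUpper", c => char.IsUpper(c), cat => cat == UnicodeCategory.UppercaseLetter),
    ("IsNumber", c => char.IsNumber(c),
        cat => cat is UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or
               UnicodeCategory.OtherNumber),
    ("IsSeparator", c => char.IsSeparator(c),
        cat => cat is UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or
               UnicodeCategory.SpaceSeparator),
    ("IsSymbol", c => char.IsSymbol(c),
        cat => cat is UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or
               UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol),
};

int mismatches = 0;
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    char c = (char)i;
    UnicodeCategory cat = char.GetUnicodeCategory(c);
    foreach (var check in checks)
    {
        // Count any code unit where the helper and the category check disagree.
        if (check.IsXx(c) != check.ByCategory(cat))
        {
            mismatches++;
            Console.WriteLine($"{check.Name} disagrees at U+{i:X4}");
        }
    }
}

Console.WriteLine($"Mismatches: {mismatches}");
```

A brute-force loop is practical here because a char has only 65,536 possible values, so the whole comparison is cheap to run.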

@stephentoub stephentoub added this to the 7.0.0 milestone May 5, 2022
@stephentoub stephentoub requested a review from joperezr May 5, 2022 17:43
@ghost ghost assigned stephentoub May 5, 2022
@ghost

ghost commented May 5, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR makes regex specially recognize additional categories that map to sets for which char already has IsXx methods (e.g. char.IsControl, char.IsLetter) and emit calls to those methods.

Example:

[RegexGenerator(@"\p{C}\P{C}\p{L}\P{L}[\p{L}\d][^\p{L}\d]\p{Ll}\P{Ll}\p{Lu}\P{Lu}\p{N}\P{N}\p{P}\P{P}\p{Z}\P{Z}\p{S}\P{S}")]

previously resulted in:

if ((uint)slice.Length < 18 ||
    (char.GetUnicodeCategory(slice[0]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => false, _ => true }) || // Match a character in the set [\p{C}].
    (char.GetUnicodeCategory(slice[1]) switch { UnicodeCategory.Control or UnicodeCategory.Format or UnicodeCategory.OtherNotAssigned or UnicodeCategory.PrivateUse or UnicodeCategory.Surrogate => true, _ => false }) || // Match a character in the set [\P{C}].
    (char.GetUnicodeCategory(slice[2]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => false, _ => true }) || // Match a character in the set [\p{L}].
    (char.GetUnicodeCategory(slice[3]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter => true, _ => false }) || // Match a character in the set [\P{L}].
    (char.GetUnicodeCategory(slice[4]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => false, _ => true }) || // Match a character in the set [\p{L}\d].
    (char.GetUnicodeCategory(slice[5]) switch { UnicodeCategory.LowercaseLetter or UnicodeCategory.ModifierLetter or UnicodeCategory.OtherLetter or UnicodeCategory.TitlecaseLetter or UnicodeCategory.UppercaseLetter or UnicodeCategory.DecimalDigitNumber => true, _ => false }) || // Match a character in the set [^\p{L}\d].
    (char.GetUnicodeCategory(slice[6]) != UnicodeCategory.LowercaseLetter) || // Match a character in the set [\p{Ll}].
    (char.GetUnicodeCategory(slice[7]) == UnicodeCategory.LowercaseLetter) || // Match a character in the set [\P{Ll}].
    (char.GetUnicodeCategory(slice[8]) != UnicodeCategory.UppercaseLetter) || // Match a character in the set [\p{Lu}].
    (char.GetUnicodeCategory(slice[9]) == UnicodeCategory.UppercaseLetter) || // Match a character in the set [\P{Lu}].
    (char.GetUnicodeCategory(slice[10]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => false, _ => true }) || // Match a character in the set [\p{N}].
    (char.GetUnicodeCategory(slice[11]) switch { UnicodeCategory.DecimalDigitNumber or UnicodeCategory.LetterNumber or UnicodeCategory.OtherNumber => true, _ => false }) || // Match a character in the set [\P{N}].
    (char.GetUnicodeCategory(slice[12]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => false, _ => true }) || // Match a character in the set [\p{P}].
    (char.GetUnicodeCategory(slice[13]) switch { UnicodeCategory.ConnectorPunctuation or UnicodeCategory.DashPunctuation or UnicodeCategory.ClosePunctuation or UnicodeCategory.OtherPunctuation or UnicodeCategory.OpenPunctuation or UnicodeCategory.FinalQuotePunctuation or UnicodeCategory.InitialQuotePunctuation => true, _ => false }) || // Match a character in the set [\P{P}].
    (char.GetUnicodeCategory(slice[14]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => false, _ => true }) || // Match a character in the set [\p{Z}].
    (char.GetUnicodeCategory(slice[15]) switch { UnicodeCategory.LineSeparator or UnicodeCategory.ParagraphSeparator or UnicodeCategory.SpaceSeparator => true, _ => false }) || // Match a character in the set [\P{Z}].
    (char.GetUnicodeCategory(slice[16]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => false, _ => true }) || // Match a character in the set [\p{S}].
    (char.GetUnicodeCategory(slice[17]) switch { UnicodeCategory.CurrencySymbol or UnicodeCategory.ModifierSymbol or UnicodeCategory.MathSymbol or UnicodeCategory.OtherSymbol => true, _ => false })) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}

and now results in:

if ((uint)slice.Length < 18 ||
    !char.IsControl(slice[0]) || // Match a character in the set [\p{C}].
    char.IsControl(slice[1]) || // Match a character in the set [\P{C}].
    !char.IsLetter(slice[2]) || // Match a character in the set [\p{L}].
    char.IsLetter(slice[3]) || // Match a character in the set [\P{L}].
    !char.IsLetterOrDigit(slice[4]) || // Match a character in the set [\p{L}\d].
    char.IsLetterOrDigit(slice[5]) || // Match a character in the set [^\p{L}\d].
    !char.IsLower(slice[6]) || // Match a character in the set [\p{Ll}].
    char.IsLower(slice[7]) || // Match a character in the set [\P{Ll}].
    !char.IsUpper(slice[8]) || // Match a character in the set [\p{Lu}].
    char.IsUpper(slice[9]) || // Match a character in the set [\P{Lu}].
    !char.IsNumber(slice[10]) || // Match a character in the set [\p{N}].
    char.IsNumber(slice[11]) || // Match a character in the set [\P{N}].
    !char.IsPunctuation(slice[12]) || // Match a character in the set [\p{P}].
    char.IsPunctuation(slice[13]) || // Match a character in the set [\P{P}].
    !char.IsSeparator(slice[14]) || // Match a character in the set [\p{Z}].
    char.IsSeparator(slice[15]) || // Match a character in the set [\P{Z}].
    !char.IsSymbol(slice[16]) || // Match a character in the set [\p{S}].
    char.IsSymbol(slice[17])) // Match a character in the set [\P{S}].
{
    return false; // The input didn't match.
}
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@stephentoub
Member Author

We'll subsequently want to use any new char.IsXx methods introduced based on #68868.

Member

@joperezr joperezr left a comment

LGTM. I assume we already have tests for all of these constructs for the compiled and source generated engines?

@stephentoub
Member Author

stephentoub commented May 5, 2022

Yes, though several are outerloop.
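
For illustration only, a cross-engine check of the kind being asked about might look roughly like the sketch below. This is not the test code in dotnet/runtime; the pattern list and variable names are assumptions, and the source-generated engine is left out because it needs a build-time step. The sketch runs each pattern against every single-char input under the interpreter and under RegexOptions.Compiled and counts disagreements.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample of the constructs from the PR description.
string[] patterns =
{
    @"\p{L}", @"\P{L}", @"[\p{L}\d]", @"[^\p{L}\d]",
    @"\p{Ll}", @"\P{Ll}", @"\p{Lu}", @"\P{Lu}",
    @"\p{N}", @"\P{N}", @"\p{P}", @"\P{P}",
    @"\p{Z}", @"\P{Z}", @"\p{S}", @"\P{S}",
};

int disagreements = 0;
foreach (string pattern in patterns)
{
    var interpreted = new Regex(pattern, RegexOptions.None);
    var compiled = new Regex(pattern, RegexOptions.Compiled);

    for (int i = char.MinValue; i <= char.MaxValue; i++)
    {
        string input = ((char)i).ToString();

        // Both engines should give the same answer for every input.
        if (interpreted.IsMatch(input) != compiled.IsMatch(input))
        {
            disagreements++;
            Console.WriteLine($"{pattern} disagrees at U+{i:X4}");
        }
    }
}

Console.WriteLine($"Disagreements: {disagreements}");
```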

@stephentoub stephentoub merged commit 475c8a8 into dotnet:main May 5, 2022
@stephentoub stephentoub deleted the morecharcalls branch May 5, 2022 21:52
@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2022