Skip to content

Conversation

@stephentoub
Copy link
Member

In almost every place we use {Last}IndexOf{Any} for a given character or set of characters, we can use {Last}IndexOfAnyExcept for the negated counterpart. This leads to cleaner generated code, and for longer searches, benefits from vectorization. This includes:

  • In set loops of a small number of negated chars, finding the end of the loop with IndexOfAnyExcept.
  • In loops followed by a set of a small number of negated chars, using LastIndexOfAnyExcept to find the next location to which to backtrack.
  • In Singleline .*? lazy loops followed by a set of a small number of negated chars, using IndexOfAnyExcept to find the next location to backtrack to.
  • In unbounded atomic one and set loops, using IndexOfAnyExcept to find the end of the loop

The two places we don't currently use IndexOf* but don't use the *Except variants are:

  • As part of TryFindNextPossibleStartingPosition. There appear to be relatively few patterns in the wild this would benefit, and we'd need custom logic for finding the set of negated starting characters.
  • As part of backtracking in a non-Singleline lazy loop. It needs to search for the target plus a \n, which works well when the target is a non-negated char (as it just adds the char to the set of things to search for) but doesn't when working with a negated char (as then different elements of the search have different polarities).

In almost every place we use {Last}IndexOf{Any} for a given character or set of characters, we can use {Last}IndexOfAnyExcept for the negated counterpart.  This leads to cleaner generated code, and for longer searches, benefits from vectorization.  This includes:
- In set loops of a small number of negated chars, finding the end of the loop with IndexOfAnyExcept.
- In loops followed by a set of a small number of negated chars, using LastIndexOfAnyExcept to find the next location to which to backtrack.
- In Singleline .*? lazy loops followed by a set of a small number of negated chars, using IndexOfAnyExcept to find the next location to backtrack to.
- In unbounded atomic one and set loops, using IndexOfAnyExcept to find the end of the loop

The two places we don't currently use IndexOf* but don't use the *Except variants are:
- As part of TryFindNextPossibleStartingPosition.  There appear to be relatively few patterns in the wild this would benefit, and we'd need custom logic for finding the set of negated starting characters.
- As part of backtracking in a non-Singleline lazy loop.  It needs to search for the target plus a \n, which works well when the target is a non-negated char (as it just adds the char to the set of things to search for) but doesn't when working with a negated char (as then different elements of the search have different polarities).
@ghost
Copy link

ghost commented Aug 14, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

In almost every place we use {Last}IndexOf{Any} for a given character or set of characters, we can use {Last}IndexOfAnyExcept for the negated counterpart. This leads to cleaner generated code, and for longer searches, benefits from vectorization. This includes:

  • In set loops of a small number of negated chars, finding the end of the loop with IndexOfAnyExcept.
  • In loops followed by a set of a small number of negated chars, using LastIndexOfAnyExcept to find the next location to which to backtrack.
  • In Singleline .*? lazy loops followed by a set of a small number of negated chars, using IndexOfAnyExcept to find the next location to backtrack to.
  • In unbounded atomic one and set loops, using IndexOfAnyExcept to find the end of the loop

The two places we don't currently use IndexOf* but don't use the *Except variants are:

  • As part of TryFindNextPossibleStartingPosition. There appear to be relatively few patterns in the wild this would benefit, and we'd need custom logic for finding the set of negated starting characters.
  • As part of backtracking in a non-Singleline lazy loop. It needs to search for the target plus a \n, which works well when the target is a non-negated char (as it just adds the char to the set of things to search for) but doesn't when working with a negated char (as then different elements of the search have different polarities).
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@ghost ghost assigned stephentoub Aug 14, 2022
Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions but looks good otherwise

@stephentoub stephentoub merged commit e4f1b75 into dotnet:main Aug 15, 2022
@stephentoub stephentoub deleted the regexindexofanyexcept branch August 15, 2022 18:47
@ghost ghost locked as resolved and limited conversation to collaborators Sep 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants