Skip to content

Improve Regex's knowledge of captures and atomicity in the tree #62451

@stephentoub

Description

@stephentoub

Handling captures involves additional code, e.g. uncapturing after a failed match attempt. For both RegexOptions.Compiled and the source generator, we output code to both track where we need to uncapture to and to then perform that uncapture if it's possible there may be captures involved. But today the analysis for this is relatively limited: we mark each node in the tree as to whether it contains captures, and that allows us to avoid the boilerplate uncapture code after failures in child nodes. But we don't analyze/track whether any nodes after a given node involve captures, and that's relevant for backtracking, e.g. today for the expression (a*)b*[bc] the codegen for that b* loop will involve the uncapture boilerplate because it'll see that the whole expression contains captures, even though if backtracking occurs due to the [bc] not matching, there is provably nothing that will need to be uncaptured. We should change how we annotate the tree and include this idea of "is a node followed by a capture". This could be done, for example, by walking the tree backwards, tracking whether we've seen a capture, and if we have, marking every future node we see.

We have a similar issue for atomic. Code gen can be much better for a variety of constructs if they're atomic. Today we walk up the tree from a node to see if anything in its parent hierarchy makes it atomic, but worst case that could be an O(N^2) algorithm (as part of construction, not matching). We could instead do a single O(N) pass over the tree to compute this for every node.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions