Improved handling of Antimirov mode in Nonbacktracking Regex engine #65637

veanes · 2022-02-20T21:20:52Z

Refactored how current state is represented by encapsulating it in a new struct CurrentState that is passed by reference in the Delta and TakeTransition methods and hides whether the state is DfaMatchingState or a set of NFA states that have now their own transition function _antimirovDelta maintained by the SymbolicRegexBuilder. Use of CurrentState should not incur any overhead over how DfaMatchingState was used before. Further improvements in the implementation of CurrentState might be possible using SparseIntMap, currently List and HashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods from SymbolicRegexMatcher to SymbolicRegexBuilder where they belong more naturally. Removed ITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover Watchdog, that has become dormant for seemingly similar reasons.)

Some code of CreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduce Or any more in many cases but rather OrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) in CreateNewNfaTransition
is currently never triggered and all OrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid of Or, or rather renaming OrderedOr to Or (and getting rid of the unordered semantics, the current mixture of Or and OrderedOr is a temporary headache). This will also make the SymbolicRegexNode AST more lightweight

Fixes #60918

ghost · 2022-02-20T21:21:02Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Refactored how current state is represented by encapsulating it in a new struct CurrentState that is passed by reference in the Delta and TakeTransition methods and hides whether the state is DfaMatchingState or a set of NFA states that have now their own transition function _antimirovDelta maintained by the SymbolicRegexBuilder. Use of CurrentState should not incur any overhead over how DfaMatchingState was used before. Further improvements in the implementation of CurrentState might be possible using SparseIntMap, currently List and HashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods from SymbolicRegexMatcher to SymbolicRegexBuilder where they belong more naturally. Removed ITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover Watchdog, that has become dormant for seemingly similar reasons.)

Some code of CreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduce Or any more in many cases but rather OrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) in CreateNewNfaTransition
is currently never triggered and all OrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid of Or, or rather renaming OrderedOr to Or (and getting rid of the unordered semantics, the current mixture of Or and OrderedOr is a temporary headache). This will also make the SymbolicRegexNode AST more lightweight

Author:	veanes
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

Wraith2 · 2022-02-20T23:15:40Z

Why is it called _antimirovDelta and is there something more generic it could be called?

veanes · 2022-02-21T01:09:00Z

The naming _antimirovDelta, _antimirovStatearray, _antimirovStatearrayInverse as well as the boolen flag _antimirov reflect that these mirror the values of NFA in _antimirov mode (NFA mode) as opposed to the default DFA (_delta and _statearray) in Brzozowski mode. The names could also be _nfaDelta, _nfaStatearray, and _nfaStatearrayInverse, but the names reflect that the intent that the concerned algorithms use Antimirov-style derivatives (leading to NFA) vs Brzozowski-style derivatives (leading to DFA). We may end up renaming these names as some later point, not sure. In general the name delta is often used for a transition function of an automaton. So long story short, the name _antimirovDelta is really _nfaTransitionFunction.

veanes · 2022-02-21T01:19:01Z

The failing tests are unrelated to the changes in this pr (or so it looks to me).

Wraith2 · 2022-02-21T02:12:25Z

The algorithm names and the distinction seems like the sort of thing that would live happily in a complex comment that explains the difference for users. The variable names being nfa or dfa would be both clear and more approachable than using the algorithm authors' names to me.

veanes · 2022-02-21T02:52:12Z

Point taken, we will probably end up doing a separate phase of renamings addressing the terminology issue in general, as a separate PR for this, e.g., renaming _antimirov to _nfaMode, etc., there are other similar cases in the code. We have been discussing this on and off for a while.

stephentoub · 2022-02-22T00:54:03Z

@veanes, I rebased your commits on main and fix up all the merge conflicts. I also added a commit that cleans up a few things, including avoiding the ToArray call in the NFA mode on every transition.

Before doing further work locally in your branch, you'll want to sync to these changes, which you can do with:

git fetch --all
git reset --hard origin/antimirov

If you've already done further work locally, don't do the above until you've made a backup of that work, e.g. by creating a new branch from it:

git checkout -b antimirovtemp
git commit -a -m "fix todos"

then doing the earlier commands, and then cherry-picking your last commit.

stephentoub · 2022-02-22T01:26:03Z

FYI, in its current state this ends up really regressing perf. I see a few obvious things I'll fix...

veanes · 2022-02-22T01:55:15Z

Thanks a lot for the rebase! I don't have any pending work other than a unit test case I was debugging the watchdog issue. I'll just stash this change and apply the first option above. I think perf could be addressed with SparseIntMap (@olsaarik know best how to apply it)

stephentoub · 2022-02-22T02:05:13Z

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I don't have any pending work

There are a variety of TODOs in the code, e.g. for OrderedOr. What's the plan for those? We're at a point where I'd like to avoid checking in lots of new open issues. If they're of the form "we could consider using XYZ to improve perf", that's fine, and we should have an issue tracking it. But if they're of the form "I'm not sure if this is functionally correct; we should investigate whether we need to be doing something different here", I'd like to address that as part of the change.

veanes · 2022-02-22T02:18:55Z

The plan for OrderedOr:
We get rid of Or and only have OrderedOr. The unordered Or case could be realized by ordering the elements in the OrderedOr collection by using hashcodes (if this would be needed, but I think not, since we want to now follow the exact (or best effort) same semantics for matches as in backtracking, in which case Or needs to be ordered always). In particular this would essentially get rid of SymbolicRegexSet used for _alts (conjunction case can be added back later as a form of And-collection using binary representation as in OrderedOr but not relevant now)

veanes · 2022-02-22T02:27:23Z

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode, the intent was that this should not happen very often, but I may have been wrong here (which is why I did not think allowing null initializing them to null would make much difference)

stephentoub · 2022-02-22T02:28:13Z

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I pushed a commit to fix some of the perf regressions. In particular, the change was allocating multiple lists/sets every time a new CurrentSet was created, even when in DFA mode, which was causing allocations to skyrocket and having a very measurable impact on throughput.

Even with that, though, there are some substantial perf regressions. Here's what the "sherlock" perf tests:
https://github.com/dotnet/performance/blob/996fb443bf081a976a2f92a0fc14d5f1c212f6f7/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L127-L169
look like for me when I run them locally:

Method	Toolchain	Pattern	Options	Mean	Ratio
Count	\main\corerun.exe	(?i)Holmes	NonBacktracking	659.04 us	1.00
Count	\pr\corerun.exe	(?i)Holmes	NonBacktracking	678.86 us	1.03

Count	\main\corerun.exe	(?i)Sher[a-z]+\|Hol[a-z]+	NonBacktracking	1,215.72 us	1.00
Count	\pr\corerun.exe	(?i)Sher[a-z]+\|Hol[a-z]+	NonBacktracking	1,252.77 us	1.03

Count	\main\corerun.exe	(?i)Sherlock	NonBacktracking	141.34 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock	NonBacktracking	154.75 us	1.10

Count	\main\corerun.exe	(?i)Sherlock Holmes	NonBacktracking	149.50 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock Holmes	NonBacktracking	161.83 us	1.08

Count	\main\corerun.exe	(?i)Sherlock\|Holmes\|Watson	NonBacktracking	3,546.32 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock\|Holmes\|Watson	NonBacktracking	3,690.29 us	1.04

Count	\main\corerun.exe	(?i)Sherlock\|(...)er\|John\|Baker [49]	NonBacktracking	4,896.03 us	1.00
Count	\pr\corerun.exe	(?i)Sherlock\|(...)er\|John\|Baker [49]	NonBacktracking	5,061.19 us	1.03

Count	\main\corerun.exe	(?i)the	NonBacktracking	1,534.91 us	1.00
Count	\pr\corerun.exe	(?i)the	NonBacktracking	1,894.79 us	1.24

Count	\main\corerun.exe	(?m)^Sherlock(...)rlock Holmes$ [37]	NonBacktracking	66.29 us	1.00
Count	\pr\corerun.exe	(?m)^Sherlock(...)rlock Holmes$ [37]	NonBacktracking	72.95 us	1.10

Count	\main\corerun.exe	(?s).*	NonBacktracking	3,287.32 us	1.00
Count	\pr\corerun.exe	(?s).*	NonBacktracking	4,450.44 us	1.35

Count	\main\corerun.exe	.*	NonBacktracking	4,374.97 us	1.00
Count	\pr\corerun.exe	.*	NonBacktracking	6,046.40 us	1.38

Count	\main\corerun.exe	Holmes	NonBacktracking	105.21 us	1.00
Count	\pr\corerun.exe	Holmes	NonBacktracking	135.95 us	1.29

Count	\main\corerun.exe	Holmes.{0,25}(...).{0,25}Holmes [39]	NonBacktracking	187.00 us	1.00
Count	\pr\corerun.exe	Holmes.{0,25}(...).{0,25}Holmes [39]	NonBacktracking	241.87 us	1.29

Count	\main\corerun.exe	Sher[a-z]+\|Hol[a-z]+	NonBacktracking	187.05 us	1.00
Count	\pr\corerun.exe	Sher[a-z]+\|Hol[a-z]+	NonBacktracking	235.15 us	1.26

Count	\main\corerun.exe	Sherlock	NonBacktracking	56.30 us	1.00
Count	\pr\corerun.exe	Sherlock	NonBacktracking	65.92 us	1.17

Count	\main\corerun.exe	Sherlock Holmes	NonBacktracking	65.58 us	1.00
Count	\pr\corerun.exe	Sherlock Holmes	NonBacktracking	76.54 us	1.17

Count	\main\corerun.exe	Sherlock\s+Holmes	NonBacktracking	79.20 us	1.00
Count	\pr\corerun.exe	Sherlock\s+Holmes	NonBacktracking	93.03 us	1.18

Count	\main\corerun.exe	Sherlock\|Holmes	NonBacktracking	191.97 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holmes	NonBacktracking	250.97 us	1.31

Count	\main\corerun.exe	Sherlock\|Holmes\|Watson	NonBacktracking	250.09 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holmes\|Watson	NonBacktracking	315.37 us	1.26

Count	\main\corerun.exe	Sherlock\|Holm(...)er\|John\|Baker [45]	NonBacktracking	2,949.07 us	1.00
Count	\pr\corerun.exe	Sherlock\|Holm(...)er\|John\|Baker [45]	NonBacktracking	3,011.03 us	1.02

Count	\main\corerun.exe	Sherlock\|Street	NonBacktracking	89.18 us	1.00
Count	\pr\corerun.exe	Sherlock\|Street	NonBacktracking	108.13 us	1.21

Count	\main\corerun.exe	The	NonBacktracking	128.29 us	1.00
Count	\pr\corerun.exe	The	NonBacktracking	151.60 us	1.20

Count	\main\corerun.exe	[^\n]*	NonBacktracking	4,304.93 us	1.00
Count	\pr\corerun.exe	[^\n]*	NonBacktracking	6,143.43 us	1.43

Count	\main\corerun.exe	[a-q][^u-z]{13}x	NonBacktracking	103.27 us	1.00
Count	\pr\corerun.exe	[a-q][^u-z]{13}x	NonBacktracking	119.98 us	1.16

Count	\main\corerun.exe	[a-zA-Z]+ing	NonBacktracking	6,801.28 us	1.00
Count	\pr\corerun.exe	[a-zA-Z]+ing	NonBacktracking	9,346.86 us	1.37

Count	\main\corerun.exe	\b\w+n\b	NonBacktracking	8,425.09 us	1.00
Count	\pr\corerun.exe	\b\w+n\b	NonBacktracking	11,394.81 us	1.36

Count	\main\corerun.exe	\p{Ll}	NonBacktracking	29,176.29 us	1.00
Count	\pr\corerun.exe	\p{Ll}	NonBacktracking	44,094.94 us	1.51

Count	\main\corerun.exe	\p{Lu}	NonBacktracking	1,909.84 us	1.00
Count	\pr\corerun.exe	\p{Lu}	NonBacktracking	2,445.12 us	1.28

Count	\main\corerun.exe	\p{L}	NonBacktracking	30,398.30 us	1.00
Count	\pr\corerun.exe	\p{L}	NonBacktracking	45,196.70 us	1.49

Count	\main\corerun.exe	\s[a-zA-Z]{0,12}ing\s	NonBacktracking	4,454.26 us	1.00
Count	\pr\corerun.exe	\s[a-zA-Z]{0,12}ing\s	NonBacktracking	6,538.71 us	1.47

Count	\main\corerun.exe	\w+	NonBacktracking	13,048.35 us	1.00
Count	\pr\corerun.exe	\w+	NonBacktracking	19,288.41 us	1.48

Count	\main\corerun.exe	\w+\s+Holmes	NonBacktracking	3,883.77 us	1.00
Count	\pr\corerun.exe	\w+\s+Holmes	NonBacktracking	6,294.95 us	1.61

Count	\main\corerun.exe	\w+\s+Holmes\s+\w+	NonBacktracking	3,823.51 us	1.00
Count	\pr\corerun.exe	\w+\s+Holmes\s+\w+	NonBacktracking	6,066.32 us	1.59

Count	\main\corerun.exe	aei	NonBacktracking	51.11 us	1.00
Count	\pr\corerun.exe	aei	NonBacktracking	51.54 us	1.01

Count	\main\corerun.exe	aqj	NonBacktracking	39.11 us	1.00
Count	\pr\corerun.exe	aqj	NonBacktracking	38.40 us	0.99

Count	\main\corerun.exe	the	NonBacktracking	824.52 us	1.00
Count	\pr\corerun.exe	the	NonBacktracking	1,160.05 us	1.41

Count	\main\corerun.exe	the\s+\w+	NonBacktracking	1,320.44 us	1.00
Count	\pr\corerun.exe	the\s+\w+	NonBacktracking	1,879.80 us	1.42

Count	\main\corerun.exe	zqj	NonBacktracking	37.05 us	1.00
Count	\pr\corerun.exe	zqj	NonBacktracking	37.06 us	1.00

I don't think any of those are exercising the NFA mode (?), so this is a hit to the DFA perf.

stephentoub · 2022-02-22T02:29:37Z

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode

It happens multiple times per match operation.

veanes · 2022-02-22T02:41:58Z

I am puzzled by this, as CurrentState is passed by reference in Delta and TakeTransition and should only be created once (unless NFA mode kicks in, at least this was the intent).
Ohhh, I see, the issue is that it should be reused when new initial (DfaMatchingState) is created, as opposed to be recreated in FindEndPosition, FindStartPosition etc. CurrentState should be passed into those methods as parameter. -- my bad.

veanes · 2022-02-22T02:46:36Z

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

veanes · 2022-02-22T02:49:32Z

I'm fixing it right now.

stephentoub · 2022-02-22T02:49:45Z

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

That will still create the lists/sets once per match operation. The fix I have here creates them once per runner, allowing those lists/sets to be reused across matches, which is something we'll want to maintain.

If we can go further and pass the current state by ref into those helpers instead of also passing in PerThreadState and creating a new CurrentState in each of those, great. But I'm not clear how that relates to the argument being passed in to CurrentState in each of those locations... it's already going to be in the right state?

I'm not currently editing this (and don't plan to this evening). Please feel free to pull it down and tweak it according to your recommendation.

Thanks.

veanes · 2022-02-22T02:57:05Z

CurrentState is really a ref for holding the actual state, when passing it in it initially it will passed say as the first argument to
FindFinalStatePosition,

s =new CurrentState<TSetType>(_dotstarredInitialStates[GetCharKind(input, i - 1)])
FindFinalStatePosition(s, ... old arguments)

Well, I'll fix this and push the fix later this evening.

veanes · 2022-02-22T05:15:44Z

Oh,, I forgot to pull your edits @stephentoub .. so now I'm getting just collisions, we were doing similar things in slightly different ways ... essentially perThreadData vs passing the ref to CurrentState (that is per thread obviously due to being created in the method call), not sure what the best option is here exactly, but I think using CurrentState as it was intended to (i.e., fixing my bug, just creating it once and passing it by ref is still the right thing to do). At the same time, I think that the Boolean _antimirov should be per thread data also, currently it is not.

veanes · 2022-02-22T06:03:21Z

I merged the conflicts, kept both edits, so passing the perThreadData, I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

stephentoub · 2022-02-22T13:09:01Z

I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

I don't understand how _antimirov is supposed to work. The code is flipping it back to false for phase 2/3; doesn't that mean the next match performed will start off creating new nodes as DFA even if it already exceeded the threshold?

Why do we need this flag at all? e.g. why not just decide when lazily creating a new transition whether it should be DFA or NFA, based on the current size of the graph (or based on whether we've initialized the NFA state lists/sets), rather than based on a persisted bool?

stephentoub · 2022-02-23T18:02:15Z

I've got the regression mostly addressed. But I'm trying to figure out now why the we're never ending up with antimirov set to true. I don't think it's from my changes, so maybe from something in Margus' initial changes here. I'll keep looking at it and push a new commit soon.

olsaarik · 2022-02-23T18:28:14Z

Here's my idea fleshed out in code: https://github.com/olsaarik/runtime/blob/antimirov-fix/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L797-L900 . One change that seemed necessary was to not pass in the ICurrentState as ref, since an assignment to that at the end would necessarily do the boxing (although only once per match, so might be fine).

stephentoub · 2022-02-23T18:32:39Z

What I have in my local changes is a CurrentState struct that's just a discriminated union of DfaMatchingState and an NfaMatchingState. We pass that around by ref, and we do the same static interface constraint we currently do in main, with DfaStateHandler and NfaStateHandler types implementing those interfaces, which operate over the ref CurrentState passed in.

olsaarik · 2022-02-23T18:34:51Z

I think I like the sound of your solution better. The recursive call into the same function with a different type that I had to do is a bit surprising.

stephentoub · 2022-02-23T18:37:58Z

Funny, I was thinking the same about yours. :-) I'll experiment with a hybrid.

olsaarik · 2022-02-23T18:46:43Z

By the way in my solution I think the check for antimirov should actually happen at the start of the outer loop rather than at the end to avoid new match calls temporarily being in DFA mode when they should be in NFA already.

veanes · 2022-02-23T19:23:07Z

Thanks for helping out with this this!
Looks like there is definite progress.
I'm busy today with a presentation I have to give tomorrow so had to switch context.

olsaarik · 2022-02-23T19:41:00Z

So here's a fun "alternative" approach: https://github.com/olsaarik/runtime/blob/0d51dc4174e0ef726ce6f75734af79719a5ef04f/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L866-L878 DfaMatchingState.Next throws that NoMoreDfaException when it would create a new state over the antimirov bound and then the whole thing switches over to NFA mode. It only happens once per match call and due to the "zero cost" nature of exceptions shouldn't slow down DFA mode at all.

Not sure this is a serious proposal, but I love using exceptions for control flow :D

stephentoub · 2022-02-24T03:56:54Z

@olsaarik, @veanes, @joperezr, I pushed a new set of changes. Please take a look. From the commit description:

Overhaul SymbolicRegexMatcher to avoid DFA perf regressions from NFA changes
This does a few things:

The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.

Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.

Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.

Avoids inlining huge amounts of NFA code that can result in worse DFA code

Renamed Brzozowski and Antimirov to DFA and NFA.

Added a lot of comments to SymbolicRegexMatcher

Fixes "watchdog" issue with captures

A few other renames / cleanups.

I still see some variance in the perf tests, but nothing like before. Most of the tests are now +/- 10% from where they were before. A few tests get better by upwards of 20% while a few get worse by a similar margin. There's also still room for microoptimization on the fast path, but I'd like to look at that separately.

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs

veanes · 2022-02-25T19:26:28Z

I wanted to add that I went through all the new changes involving use of IStateHandler. It looks really clean to me and I'm happy to see the use of SparseIntMap. Definitely looks very solid and clear, I'm really happy this is converging. I do not have very strong viewpoint on the special case of one state. But I know one thing for sure. Having single state in NFA transitions is very common (e.g. the current test cases are like that, I debugged and checked), most if not all nondeterminism is because of the set not because of the NFA state also possibly being nondeterministic. So special casing could make sense if there is even a small gain.

veanes · 2022-02-25T19:33:07Z

I was actually thinking of the List of states in the NFA transition function ... Still having a single state in the set itself may also occur, as the state becomes "deteriministic" temporarily, and could continue to be so for while, this obviously depends on a combination of circumstances triggered by a particular input that makes the behavior deterministic for that particular case (e.g. returning to the "initial" state that just loops around in it, if the initial state is special cased and not split into a set --- an optimization we discussed with @olsaarik )

veanes · 2022-02-25T20:31:34Z

Concretely, one optimization would be in NfaStateHandler:
public static bool IsInitialState(ref CurrentState state) =>

            if  state.NfaState!.NfaStateSet.Count == 1 then 
            builder.GetCoreState(nfaState.NfaState!.NfaStateSet[0].IsInitialState)
            else false

This would allow the top-level loop to know that we are still in initial state and take advantage of the related optimizations used in DFA mode (initial prefix search etc)

Obviously, this calls for a proper NFA mode set of benchmarks we currently don't have.

stephentoub · 2022-02-25T21:36:37Z

Concretely, one optimization would be in NfaStateHandler

Thanks. Let's look at that subsequently. I didn't apply it for now as that would also require changing the main loop that assumes NFA states are never initial and thus only looks to see whether the current state is a DFA initial state.

veanes · 2022-02-25T21:38:37Z

Sounds good, just wanted to point this out, to keep this in mind.

…changes This does a few things: - The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path. - Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint. - Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop. - Avoids inlining huge amounts of NFA code that can result in worse DFA code - Renamed Brzozowski and Antimirov to DFA and NFA. - Added a lot of comments to SymbolicRegexMatcher - Fixes "watchdog" issue with captures - A few other renames / cleanups.

veanes requested review from danmoseley, joperezr, olsaarik and stephentoub February 20, 2022 21:20

ghost added the area-System.Text.RegularExpressions label Feb 20, 2022

ghost assigned veanes Feb 20, 2022

veanes mentioned this pull request Feb 21, 2022

Optimize symbolic regex Antimirov mode to avoid so much allocation #60918

Closed

stephentoub force-pushed the antimirov branch from 9aa8771 to 2426b0b Compare February 22, 2022 00:51

stephentoub force-pushed the antimirov branch 2 times, most recently from a6fd2c5 to 61805ba Compare February 24, 2022 03:55

stephentoub force-pushed the antimirov branch 2 times, most recently from d829db0 to 0c63092 Compare February 25, 2022 03:34

olsaarik reviewed Feb 25, 2022

View reviewed changes

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs Outdated Show resolved Hide resolved

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs Outdated Show resolved Hide resolved

stephentoub force-pushed the antimirov branch from 0c63092 to 9c97d46 Compare February 25, 2022 21:34

olsaarik approved these changes Feb 25, 2022

View reviewed changes

veanes and others added 2 commits February 25, 2022 19:38

Improvements to Antimirov mode and misc small formatting edits

770d300

stephentoub force-pushed the antimirov branch from 9c97d46 to b086223 Compare February 26, 2022 00:38

stephentoub merged commit 83e5cbc into dotnet:main Feb 26, 2022

ghost locked as resolved and limited conversation to collaborators Mar 28, 2022

Improved handling of Antimirov mode in Nonbacktracking Regex engine #65637

Improved handling of Antimirov mode in Nonbacktracking Regex engine #65637

Uh oh!

Conversation

veanes commented Feb 20, 2022 • edited by stephentoub Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Feb 20, 2022

Uh oh!

Wraith2 commented Feb 20, 2022

Uh oh!

veanes commented Feb 21, 2022

Uh oh!

veanes commented Feb 21, 2022

Uh oh!

Wraith2 commented Feb 21, 2022

Uh oh!

veanes commented Feb 21, 2022

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

veanes commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 22, 2022

Uh oh!

stephentoub commented Feb 23, 2022

Uh oh!

olsaarik commented Feb 23, 2022

Uh oh!

stephentoub commented Feb 23, 2022

Uh oh!

olsaarik commented Feb 23, 2022

Uh oh!

stephentoub commented Feb 23, 2022

Uh oh!

olsaarik commented Feb 23, 2022

Uh oh!

veanes commented Feb 23, 2022

Uh oh!

olsaarik commented Feb 23, 2022

Uh oh!

stephentoub commented Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

veanes commented Feb 25, 2022

Uh oh!

veanes commented Feb 25, 2022

Uh oh!

veanes commented Feb 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

veanes commented Feb 20, 2022 •

edited by stephentoub

Loading

stephentoub commented Feb 22, 2022 •

edited

Loading

stephentoub commented Feb 24, 2022 •

edited

Loading

veanes commented Feb 25, 2022 •

edited

Loading