Skip to content

Conversation

@veanes
Copy link
Contributor

@veanes veanes commented Feb 20, 2022

Refactored how current state is represented by encapsulating it in a new struct CurrentState that is passed by reference in the Delta and TakeTransition methods and hides whether the state is DfaMatchingState or a set of NFA states that have now their own transition function _antimirovDelta maintained by the SymbolicRegexBuilder. Use of CurrentState should not incur any overhead over how DfaMatchingState was used before. Further improvements in the implementation of CurrentState might be possible using SparseIntMap, currently List and HashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods from SymbolicRegexMatcher to SymbolicRegexBuilder where they belong more naturally. Removed ITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover Watchdog, that has become dormant for seemingly similar reasons.)

Some code of CreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduce Or any more in many cases but rather OrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) in CreateNewNfaTransition
is currently never triggered and all OrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid of Or, or rather renaming OrderedOr to Or (and getting rid of the unordered semantics, the current mixture of Or and OrderedOr is a temporary headache). This will also make the SymbolicRegexNode AST more lightweight

Fixes #60918

@ghost
Copy link

ghost commented Feb 20, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Refactored how current state is represented by encapsulating it in a new struct CurrentState that is passed by reference in the Delta and TakeTransition methods and hides whether the state is DfaMatchingState or a set of NFA states that have now their own transition function _antimirovDelta maintained by the SymbolicRegexBuilder. Use of CurrentState should not incur any overhead over how DfaMatchingState was used before. Further improvements in the implementation of CurrentState might be possible using SparseIntMap, currently List and HashSet are being used.

The update should improve Antimorov mode performance, but was also useful for code cleanup in general.

Moved several methods from SymbolicRegexMatcher to SymbolicRegexBuilder where they belong more naturally. Removed ITransition interface and the associated Brzozowski and Antimirov structs as they are no longer needed.

Updated some unit tests to actually cause Antimirov mode to be triggered. Some new optimizations had caused the mode not to be triggered any more (e.g. for the regex a.{20}$ where the index was advanced immediately to the 20th character from the end). (I will add some unit test to cover Watchdog, that has become dormant for seemingly similar reasons.)

Some code of CreateNewNfaTransition concering nondeterministic behavior from a single NFA state is currently not detected because derivatives do not introduce Or any more in many cases but rather OrderedOr so the important case
if (coretarget.Node.Kind == SymbolicRegexKind.Or) in CreateNewNfaTransition
is currently never triggered and all OrderedOr nodes are treated as single states thus sidestepping the Antimirov mode.

TODO: Getting rid of Or, or rather renaming OrderedOr to Or (and getting rid of the unordered semantics, the current mixture of Or and OrderedOr is a temporary headache). This will also make the SymbolicRegexNode AST more lightweight

Author: veanes
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@Wraith2
Copy link
Contributor

Wraith2 commented Feb 20, 2022

Why is it called _antimirovDelta and is there something more generic it could be called?

@veanes
Copy link
Contributor Author

veanes commented Feb 21, 2022

The naming _antimirovDelta, _antimirovStatearray, _antimirovStatearrayInverse as well as the boolen flag _antimirov reflect that these mirror the values of NFA in _antimirov mode (NFA mode) as opposed to the default DFA (_delta and _statearray) in Brzozowski mode. The names could also be _nfaDelta, _nfaStatearray, and _nfaStatearrayInverse, but the names reflect that the intent that the concerned algorithms use Antimirov-style derivatives (leading to NFA) vs Brzozowski-style derivatives (leading to DFA). We may end up renaming these names as some later point, not sure. In general the name delta is often used for a transition function of an automaton. So long story short, the name _antimirovDelta is really _nfaTransitionFunction.

@veanes
Copy link
Contributor Author

veanes commented Feb 21, 2022

The failing tests are unrelated to the changes in this pr (or so it looks to me).

@Wraith2
Copy link
Contributor

Wraith2 commented Feb 21, 2022

The algorithm names and the distinction seems like the sort of thing that would live happily in a complex comment that explains the difference for users. The variable names being nfa or dfa would be both clear and more approachable than using the algorithm authors' names to me.

@veanes
Copy link
Contributor Author

veanes commented Feb 21, 2022

Point taken, we will probably end up doing a separate phase of renamings addressing the terminology issue in general, as a separate PR for this, e.g., renaming _antimirov to _nfaMode, etc., there are other similar cases in the code. We have been discussing this on and off for a while.

@stephentoub
Copy link
Member

@veanes, I rebased your commits on main and fix up all the merge conflicts. I also added a commit that cleans up a few things, including avoiding the ToArray call in the NFA mode on every transition.

Before doing further work locally in your branch, you'll want to sync to these changes, which you can do with:

git fetch --all
git reset --hard origin/antimirov

If you've already done further work locally, don't do the above until you've made a backup of that work, e.g. by creating a new branch from it:

git checkout -b antimirovtemp
git commit -a -m "fix todos"

then doing the earlier commands, and then cherry-picking your last commit.

@stephentoub
Copy link
Member

FYI, in its current state this ends up really regressing perf. I see a few obvious things I'll fix...

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

Thanks a lot for the rebase! I don't have any pending work other than a unit test case I was debugging the watchdog issue. I'll just stash this change and apply the first option above. I think perf could be addressed with SparseIntMap (@olsaarik know best how to apply it)

@stephentoub
Copy link
Member

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I don't have any pending work

There are a variety of TODOs in the code, e.g. for OrderedOr. What's the plan for those? We're at a point where I'd like to avoid checking in lots of new open issues. If they're of the form "we could consider using XYZ to improve perf", that's fine, and we should have an issue tracking it. But if they're of the form "I'm not sure if this is functionally correct; we should investigate whether we need to be doing something different here", I'd like to address that as part of the change.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

The plan for OrderedOr:
We get rid of Or and only have OrderedOr. The unordered Or case could be realized by ordering the elements in the OrderedOr collection by using hashcodes (if this would be needed, but I think not, since we want to now follow the exact (or best effort) same semantics for matches as in backtracking, in which case Or needs to be ordered always). In particular this would essentially get rid of SymbolicRegexSet used for _alts (conjunction case can be added back later as a form of And-collection using binary representation as in OrderedOr but not relevant now)

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode, the intent was that this should not happen very often, but I may have been wrong here (which is why I did not think allowing null initializing them to null would make much difference)

@stephentoub
Copy link
Member

stephentoub commented Feb 22, 2022

I think perf could be addressed with SparseIntMap

That's not the issue I'm fixing. Every time CurrentState is being created, it's creating new lists and sets... even when in DFA mode. I'll push a new commit here shortly.

I pushed a commit to fix some of the perf regressions. In particular, the change was allocating multiple lists/sets every time a new CurrentSet was created, even when in DFA mode, which was causing allocations to skyrocket and having a very measurable impact on throughput.

Even with that, though, there are some substantial perf regressions. Here's what the "sherlock" perf tests:
https://github.com/dotnet/performance/blob/996fb443bf081a976a2f92a0fc14d5f1c212f6f7/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L127-L169
look like for me when I run them locally:

Method Toolchain Pattern Options Mean Ratio
Count \main\corerun.exe (?i)Holmes NonBacktracking 659.04 us 1.00
Count \pr\corerun.exe (?i)Holmes NonBacktracking 678.86 us 1.03
Count \main\corerun.exe (?i)Sher[a-z]+|Hol[a-z]+ NonBacktracking 1,215.72 us 1.00
Count \pr\corerun.exe (?i)Sher[a-z]+|Hol[a-z]+ NonBacktracking 1,252.77 us 1.03
Count \main\corerun.exe (?i)Sherlock NonBacktracking 141.34 us 1.00
Count \pr\corerun.exe (?i)Sherlock NonBacktracking 154.75 us 1.10
Count \main\corerun.exe (?i)Sherlock Holmes NonBacktracking 149.50 us 1.00
Count \pr\corerun.exe (?i)Sherlock Holmes NonBacktracking 161.83 us 1.08
Count \main\corerun.exe (?i)Sherlock|Holmes|Watson NonBacktracking 3,546.32 us 1.00
Count \pr\corerun.exe (?i)Sherlock|Holmes|Watson NonBacktracking 3,690.29 us 1.04
Count \main\corerun.exe (?i)Sherlock|(...)er|John|Baker [49] NonBacktracking 4,896.03 us 1.00
Count \pr\corerun.exe (?i)Sherlock|(...)er|John|Baker [49] NonBacktracking 5,061.19 us 1.03
Count \main\corerun.exe (?i)the NonBacktracking 1,534.91 us 1.00
Count \pr\corerun.exe (?i)the NonBacktracking 1,894.79 us 1.24
Count \main\corerun.exe (?m)^Sherlock(...)rlock Holmes$ [37] NonBacktracking 66.29 us 1.00
Count \pr\corerun.exe (?m)^Sherlock(...)rlock Holmes$ [37] NonBacktracking 72.95 us 1.10
Count \main\corerun.exe (?s).* NonBacktracking 3,287.32 us 1.00
Count \pr\corerun.exe (?s).* NonBacktracking 4,450.44 us 1.35
Count \main\corerun.exe .* NonBacktracking 4,374.97 us 1.00
Count \pr\corerun.exe .* NonBacktracking 6,046.40 us 1.38
Count \main\corerun.exe Holmes NonBacktracking 105.21 us 1.00
Count \pr\corerun.exe Holmes NonBacktracking 135.95 us 1.29
Count \main\corerun.exe Holmes.{0,25}(...).{0,25}Holmes [39] NonBacktracking 187.00 us 1.00
Count \pr\corerun.exe Holmes.{0,25}(...).{0,25}Holmes [39] NonBacktracking 241.87 us 1.29
Count \main\corerun.exe Sher[a-z]+|Hol[a-z]+ NonBacktracking 187.05 us 1.00
Count \pr\corerun.exe Sher[a-z]+|Hol[a-z]+ NonBacktracking 235.15 us 1.26
Count \main\corerun.exe Sherlock NonBacktracking 56.30 us 1.00
Count \pr\corerun.exe Sherlock NonBacktracking 65.92 us 1.17
Count \main\corerun.exe Sherlock Holmes NonBacktracking 65.58 us 1.00
Count \pr\corerun.exe Sherlock Holmes NonBacktracking 76.54 us 1.17
Count \main\corerun.exe Sherlock\s+Holmes NonBacktracking 79.20 us 1.00
Count \pr\corerun.exe Sherlock\s+Holmes NonBacktracking 93.03 us 1.18
Count \main\corerun.exe Sherlock|Holmes NonBacktracking 191.97 us 1.00
Count \pr\corerun.exe Sherlock|Holmes NonBacktracking 250.97 us 1.31
Count \main\corerun.exe Sherlock|Holmes|Watson NonBacktracking 250.09 us 1.00
Count \pr\corerun.exe Sherlock|Holmes|Watson NonBacktracking 315.37 us 1.26
Count \main\corerun.exe Sherlock|Holm(...)er|John|Baker [45] NonBacktracking 2,949.07 us 1.00
Count \pr\corerun.exe Sherlock|Holm(...)er|John|Baker [45] NonBacktracking 3,011.03 us 1.02
Count \main\corerun.exe Sherlock|Street NonBacktracking 89.18 us 1.00
Count \pr\corerun.exe Sherlock|Street NonBacktracking 108.13 us 1.21
Count \main\corerun.exe The NonBacktracking 128.29 us 1.00
Count \pr\corerun.exe The NonBacktracking 151.60 us 1.20
Count \main\corerun.exe [^\n]* NonBacktracking 4,304.93 us 1.00
Count \pr\corerun.exe [^\n]* NonBacktracking 6,143.43 us 1.43
Count \main\corerun.exe [a-q][^u-z]{13}x NonBacktracking 103.27 us 1.00
Count \pr\corerun.exe [a-q][^u-z]{13}x NonBacktracking 119.98 us 1.16
Count \main\corerun.exe [a-zA-Z]+ing NonBacktracking 6,801.28 us 1.00
Count \pr\corerun.exe [a-zA-Z]+ing NonBacktracking 9,346.86 us 1.37
Count \main\corerun.exe \b\w+n\b NonBacktracking 8,425.09 us 1.00
Count \pr\corerun.exe \b\w+n\b NonBacktracking 11,394.81 us 1.36
Count \main\corerun.exe \p{Ll} NonBacktracking 29,176.29 us 1.00
Count \pr\corerun.exe \p{Ll} NonBacktracking 44,094.94 us 1.51
Count \main\corerun.exe \p{Lu} NonBacktracking 1,909.84 us 1.00
Count \pr\corerun.exe \p{Lu} NonBacktracking 2,445.12 us 1.28
Count \main\corerun.exe \p{L} NonBacktracking 30,398.30 us 1.00
Count \pr\corerun.exe \p{L} NonBacktracking 45,196.70 us 1.49
Count \main\corerun.exe \s[a-zA-Z]{0,12}ing\s NonBacktracking 4,454.26 us 1.00
Count \pr\corerun.exe \s[a-zA-Z]{0,12}ing\s NonBacktracking 6,538.71 us 1.47
Count \main\corerun.exe \w+ NonBacktracking 13,048.35 us 1.00
Count \pr\corerun.exe \w+ NonBacktracking 19,288.41 us 1.48
Count \main\corerun.exe \w+\s+Holmes NonBacktracking 3,883.77 us 1.00
Count \pr\corerun.exe \w+\s+Holmes NonBacktracking 6,294.95 us 1.61
Count \main\corerun.exe \w+\s+Holmes\s+\w+ NonBacktracking 3,823.51 us 1.00
Count \pr\corerun.exe \w+\s+Holmes\s+\w+ NonBacktracking 6,066.32 us 1.59
Count \main\corerun.exe aei NonBacktracking 51.11 us 1.00
Count \pr\corerun.exe aei NonBacktracking 51.54 us 1.01
Count \main\corerun.exe aqj NonBacktracking 39.11 us 1.00
Count \pr\corerun.exe aqj NonBacktracking 38.40 us 0.99
Count \main\corerun.exe the NonBacktracking 824.52 us 1.00
Count \pr\corerun.exe the NonBacktracking 1,160.05 us 1.41
Count \main\corerun.exe the\s+\w+ NonBacktracking 1,320.44 us 1.00
Count \pr\corerun.exe the\s+\w+ NonBacktracking 1,879.80 us 1.42
Count \main\corerun.exe zqj NonBacktracking 37.05 us 1.00
Count \pr\corerun.exe zqj NonBacktracking 37.06 us 1.00

I don't think any of those are exercising the NFA mode (?), so this is a hit to the DFA perf.

@stephentoub
Copy link
Member

Regarding the continuous creation of CurrentState, this happens only when a shift happens from DFA to NFA mode

It happens multiple times per match operation.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

I am puzzled by this, as CurrentState is passed by reference in Delta and TakeTransition and should only be created once (unless NFA mode kicks in, at least this was the intent).
Ohhh, I see, the issue is that it should be reused when new initial (DfaMatchingState) is created, as opposed to be recreated in FindEndPosition, FindStartPosition etc. CurrentState should be passed into those methods as parameter. -- my bad.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

I'm fixing it right now.

@stephentoub
Copy link
Member

Unless you are editing this right now I would suggest a fix this by passing the ref to CurrentState in FindEndPosition and FindStartPosition and FindFinalStatePosition and creating it in FindMatch once.

That will still create the lists/sets once per match operation. The fix I have here creates them once per runner, allowing those lists/sets to be reused across matches, which is something we'll want to maintain.

If we can go further and pass the current state by ref into those helpers instead of also passing in PerThreadState and creating a new CurrentState in each of those, great. But I'm not clear how that relates to the argument being passed in to CurrentState in each of those locations... it's already going to be in the right state?

I'm not currently editing this (and don't plan to this evening). Please feel free to pull it down and tweak it according to your recommendation.

Thanks.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

CurrentState is really a ref for holding the actual state, when passing it in it initially it will passed say as the first argument to
FindFinalStatePosition,

s =new CurrentState<TSetType>(_dotstarredInitialStates[GetCharKind(input, i - 1)])
FindFinalStatePosition(s, ... old arguments)

Well, I'll fix this and push the fix later this evening.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

Oh,, I forgot to pull your edits @stephentoub .. so now I'm getting just collisions, we were doing similar things in slightly different ways ... essentially perThreadData vs passing the ref to CurrentState (that is per thread obviously due to being created in the method call), not sure what the best option is here exactly, but I think using CurrentState as it was intended to (i.e., fixing my bug, just creating it once and passing it by ref is still the right thing to do). At the same time, I think that the Boolean _antimirov should be per thread data also, currently it is not.

@veanes
Copy link
Contributor Author

veanes commented Feb 22, 2022

I merged the conflicts, kept both edits, so passing the perThreadData, I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

@stephentoub
Copy link
Member

I think _antimirov should be a flag in perThreadData (currently it is not, it is in the builder).

I don't understand how _antimirov is supposed to work. The code is flipping it back to false for phase 2/3; doesn't that mean the next match performed will start off creating new nodes as DFA even if it already exceeded the threshold?

Why do we need this flag at all? e.g. why not just decide when lazily creating a new transition whether it should be DFA or NFA, based on the current size of the graph (or based on whether we've initialized the NFA state lists/sets), rather than based on a persisted bool?

@stephentoub
Copy link
Member

I've got the regression mostly addressed. But I'm trying to figure out now why the we're never ending up with antimirov set to true. I don't think it's from my changes, so maybe from something in Margus' initial changes here. I'll keep looking at it and push a new commit soon.

@olsaarik
Copy link
Contributor

Here's my idea fleshed out in code: https://github.com/olsaarik/runtime/blob/antimirov-fix/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L797-L900 . One change that seemed necessary was to not pass in the ICurrentState as ref, since an assignment to that at the end would necessarily do the boxing (although only once per match, so might be fine).

@stephentoub
Copy link
Member

What I have in my local changes is a CurrentState struct that's just a discriminated union of DfaMatchingState and an NfaMatchingState. We pass that around by ref, and we do the same static interface constraint we currently do in main, with DfaStateHandler and NfaStateHandler types implementing those interfaces, which operate over the ref CurrentState passed in.

@olsaarik
Copy link
Contributor

I think I like the sound of your solution better. The recursive call into the same function with a different type that I had to do is a bit surprising.

@stephentoub
Copy link
Member

Funny, I was thinking the same about yours. :-) I'll experiment with a hybrid.

@olsaarik
Copy link
Contributor

By the way in my solution I think the check for antimirov should actually happen at the start of the outer loop rather than at the end to avoid new match calls temporarily being in DFA mode when they should be in NFA already.

@veanes
Copy link
Contributor Author

veanes commented Feb 23, 2022

Thanks for helping out with this this!
Looks like there is definite progress.
I'm busy today with a presentation I have to give tomorrow so had to switch context.

@olsaarik
Copy link
Contributor

So here's a fun "alternative" approach: https://github.com/olsaarik/runtime/blob/0d51dc4174e0ef726ce6f75734af79719a5ef04f/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs#L866-L878 DfaMatchingState.Next throws that NoMoreDfaException when it would create a new state over the antimirov bound and then the whole thing switches over to NFA mode. It only happens once per match call and due to the "zero cost" nature of exceptions shouldn't slow down DFA mode at all.

Not sure this is a serious proposal, but I love using exceptions for control flow :D

@stephentoub stephentoub force-pushed the antimirov branch 2 times, most recently from a6fd2c5 to 61805ba Compare February 24, 2022 03:55
@stephentoub
Copy link
Member

stephentoub commented Feb 24, 2022

@olsaarik, @veanes, @joperezr, I pushed a new set of changes. Please take a look. From the commit description:

Overhaul SymbolicRegexMatcher to avoid DFA perf regressions from NFA changes
This does a few things:

  • The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states. Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode. This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing. The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.
  • Overhauled how we switch from DFA mode to NFA mode. Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode. We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.
  • Overhauled how we handle the outer/inner loop logic. Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size. Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.
  • Avoids inlining huge amounts of NFA code that can result in worse DFA code
  • Renamed Brzozowski and Antimirov to DFA and NFA.
  • Added a lot of comments to SymbolicRegexMatcher
  • Fixes "watchdog" issue with captures
  • A few other renames / cleanups.

I still see some variance in the perf tests, but nothing like before. Most of the tests are now +/- 10% from where they were before. A few tests get better by upwards of 20% while a few get worse by a similar margin. There's also still room for microoptimization on the fast path, but I'd like to look at that separately.

@stephentoub stephentoub force-pushed the antimirov branch 2 times, most recently from d829db0 to 0c63092 Compare February 25, 2022 03:34
@veanes
Copy link
Contributor Author

veanes commented Feb 25, 2022

I wanted to add that I went through all the new changes involving use of IStateHandler. It looks really clean to me and I'm happy to see the use of SparseIntMap. Definitely looks very solid and clear, I'm really happy this is converging. I do not have very strong viewpoint on the special case of one state. But I know one thing for sure. Having single state in NFA transitions is very common (e.g. the current test cases are like that, I debugged and checked), most if not all nondeterminism is because of the set not because of the NFA state also possibly being nondeterministic. So special casing could make sense if there is even a small gain.

@veanes
Copy link
Contributor Author

veanes commented Feb 25, 2022

I was actually thinking of the List of states in the NFA transition function ... Still having a single state in the set itself may also occur, as the state becomes "deteriministic" temporarily, and could continue to be so for while, this obviously depends on a combination of circumstances triggered by a particular input that makes the behavior deterministic for that particular case (e.g. returning to the "initial" state that just loops around in it, if the initial state is special cased and not split into a set --- an optimization we discussed with @olsaarik )

@veanes
Copy link
Contributor Author

veanes commented Feb 25, 2022

Concretely, one optimization would be in NfaStateHandler:
public static bool IsInitialState(ref CurrentState state) =>

            if  state.NfaState!.NfaStateSet.Count == 1 then 
            builder.GetCoreState(nfaState.NfaState!.NfaStateSet[0].IsInitialState)
            else false

This would allow the top-level loop to know that we are still in initial state and take advantage of the related optimizations used in DFA mode (initial prefix search etc)

Obviously, this calls for a proper NFA mode set of benchmarks we currently don't have.

@stephentoub
Copy link
Member

Concretely, one optimization would be in NfaStateHandler

Thanks. Let's look at that subsequently. I didn't apply it for now as that would also require changing the main loop that assumes NFA states are never initial and thus only looks to see whether the current state is a DFA initial state.

@veanes
Copy link
Contributor Author

veanes commented Feb 25, 2022

Sounds good, just wanted to point this out, to keep this in mind.

veanes and others added 2 commits February 25, 2022 19:38
…changes

This does a few things:
- The previous changes for Antimirov changed the representation from being represented as a DfaMatchingState to instead being a list of states.  Unfortunately, this came at the expense of a significant increase in overhead for the DFA mode.  This commit melds the old and new design, by still using a CurrentState struct that can be either a DFA state or an NFA state, but with a simplified representation that lacks any processing.  The processing is left to two separate structs that implement static abstract interface methods, as was previously done, using that to provide streamlined, generic-specialized processing for the DFA path.
- Overhauled how we switch from DFA mode to NFA mode.  Previously, the builder would store whether we were in NFA mode, and any matching after it had been switched over immediately started in NFA mode.  We now always start in DFA, and only switch over to NFA if we try to create a new state but doing so would exceed the graph size constraint.
- Overhauled how we handle the outer/inner loop logic.  Previously we wanted to remain in a tight inner loop and so would use some "leeway" to allow us to go beyond the graph size constraint in order to avoid having to check too frequently whether we'd exceeded that size.  Now, the transition function itself returns whether we need to exit out to the outer loop in order to upgrade from DFA to NFA, and until/unless that happens, we can stay in the inner loop.
- Avoids inlining huge amounts of NFA code that can result in worse DFA code
- Renamed Brzozowski and Antimirov to DFA and NFA.
- Added a lot of comments to SymbolicRegexMatcher
- Fixes "watchdog" issue with captures
- A few other renames / cleanups.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize symbolic regex Antimirov mode to avoid so much allocation

4 participants