-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Several improvements to regex source generator #59660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
For strings of length 2 we weren't using the optimized path of reading/comparing a single uint.
When we have a sequence like `abc[def]ghi`, today we'll emit a length check of 3 for the `abc` multi, a length check of 1 for the `[def]` set, and a length check of 3 for the `ghi`, but we can instead emit a single length check for 7. This also fixes a small bug where I used a > instead of a >= in a length comparison, the net effect of which was using two comparisons instead of getting away with one for a two character string. And added some debug-only output that helped me diagnosing why things were working the way they were.
For an expression like "abc", if FindFirstChar returns true, it's found and fully validated the match. There's no benefit to comparing it again in Go; instead, Go can simply perform the capture.
|
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsWith the regex source generator spitting out C#, various things pop as to places we can make simple improvements. Commits:
protected override bool FindFirstChar()
{
string runtext = base.runtext!;
int runtextpos = base.runtextpos;
int runtextend = base.runtextend;
int ch;
// Minimum required length check
if (runtextpos <= runtextend - 6)
{
// Boyer-Moore prefix matching
runtextpos += 5;
int offset = 0;
goto Start;
DefaultAdvance:
offset = 6;
Advance:
runtextpos += offset;
Start:
if (runtextpos >= runtextend)
{
goto ReturnFalse;
}
ch = runtext[runtextpos];
if (ch == 'f')
{
goto PartialMatch;
}
ch -= 97;
if ((uint)ch > 5)
{
goto DefaultAdvance;
}
offset = "\u0005\u0004\u0003\u0002\u0001\0"[ch];
goto Advance;
PartialMatch:
int test = runtextpos;
if (runtext[--test] == 'e')
{
goto L0;
}
L1:
offset = 1;
goto Advance;
L0:
if (runtext[--test] != 'd')
{
goto L1;
}
if (runtext[--test] != 'c')
{
goto L1;
}
if (runtext[--test] != 'b')
{
goto L1;
}
if (runtext[--test] != 'a')
{
goto L1;
}
base.runtextpos = test;
return true;
}
// No match
ReturnFalse:
base.runtextpos = runtextend;
return false;
}After: protected override bool FindFirstChar()
{
string runtext = base.runtext!;
int runtextpos = base.runtextpos;
int runtextend = base.runtextend;
int ch;
// Minimum required length check
if (runtextpos <= runtextend - 6)
{
// Boyer-Moore prefix matching
runtextpos += 5;
while (runtextpos < runtextend)
{
ch = runtext[runtextpos];
if (ch != 'f')
{
ch -= 'a';
if ((uint)ch > ('f' - 'a'))
{
runtextpos += 6;
continue;
}
runtextpos += "\u0005\u0004\u0003\u0002\u0001\0"[ch];
continue;
}
int test = runtextpos;
if (runtext[--test] != 'e' ||
runtext[--test] != 'd' ||
runtext[--test] != 'c' ||
runtext[--test] != 'b' ||
runtext[--test] != 'a')
{
runtextpos++;
continue;
}
base.runtextpos = test;
return true;
}
}
// No match
ReturnFalse:
base.runtextpos = runtextend;
return false;
}
|
|
@eerhardt, mind reviewing? |
eerhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
| node = node.Child(0); | ||
|
|
||
| // In some limited cases, FindFirstChar will only return true if it successfully matched the whole thing. | ||
| // This is the case, in particular, for strings. We can special case these to do essentially nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does
This is the case, in particular, for strings.
mean? Did you mean to describe the kind of strings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have used the word "multis" if that would be clearer (it maps to the term used in the code), or "multiple characters in a sequence", e.g. in [ab][cd]efghi[jk], the "efghi" sequence of multiple characters next to each other.
|
Thanks for taking a look. |
With the regex source generator spitting out C#, various things pop as to places we can make simple improvements.
Commits:
Just fixes a simple comparison that was emitted with ">" when it should have been ">=". The effect of not using ">=" was we doing more comparisons than strictly necessary in some situations when comparing multiple characters in a regex pattern.
Given a pattern like
abc[def]ghi, the RegexOptions.Compiled engine and source generator would emit a length check for the "abc", a length check for the "[def]", and a length check for the "ghi", but they can trivially be combined into a single length check that covers all three portions. After fixing this in the source generator, I ported it back to Compiled.For simple regexes that are just looking for a string, e.g. "abc", we currently end up doing the full comparison to find it as part of FindFirstChar, but then we re-compare it again in Go. That's wasteful; in such a situation, Go can simply perform the capture and doesn't need to compare anything. After fixing this in the source generator, I ported it back to Compiled.
The Boyer-Moore code we spit out in FindFirstChar is full of labels and gotos. For the source generator, we can make it easier to read. e.g. for the pattern "abcdef":
Before:
After: