Several improvements to regex source generator #59660

stephentoub · 2021-09-27T18:02:28Z

With the regex source generator spitting out C#, various things pop as to places we can make simple improvements.

Commits:

Just fixes a simple comparison that was emitted with ">" when it should have been ">=". The effect of not using ">=" was we doing more comparisons than strictly necessary in some situations when comparing multiple characters in a regex pattern.
Given a pattern like abc[def]ghi, the RegexOptions.Compiled engine and source generator would emit a length check for the "abc", a length check for the "[def]", and a length check for the "ghi", but they can trivially be combined into a single length check that covers all three portions. After fixing this in the source generator, I ported it back to Compiled.
For simple regexes that are just looking for a string, e.g. "abc", we currently end up doing the full comparison to find it as part of FindFirstChar, but then we re-compare it again in Go. That's wasteful; in such a situation, Go can simply perform the capture and doesn't need to compare anything. After fixing this in the source generator, I ported it back to Compiled.
The Boyer-Moore code we spit out in FindFirstChar is full of labels and gotos. For the source generator, we can make it easier to read. e.g. for the pattern "abcdef":
Before:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;
                    
    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        int offset = 0;
        goto Start;
                        
        DefaultAdvance:
        offset = 6;
                        
        Advance:
        runtextpos += offset;
                        
        Start:
        if (runtextpos >= runtextend)
        {
            goto ReturnFalse;
        }
                        
        ch = runtext[runtextpos];
        if (ch == 'f')
        {
            goto PartialMatch;
        }
        ch -= 97;
        if ((uint)ch > 5)
        {
            goto DefaultAdvance;
        }
        offset = "\u0005\u0004\u0003\u0002\u0001\0"[ch];
        goto Advance;
                        
        PartialMatch:
        int test = runtextpos;
        if (runtext[--test] == 'e')
        {
            goto L0;
        }
                        
        L1:
        offset = 1;
        goto Advance;
                        
        L0:
        if (runtext[--test] != 'd')
        {
            goto L1;
        }
        if (runtext[--test] != 'c')
        {
            goto L1;
        }
        if (runtext[--test] != 'b')
        {
            goto L1;
        }
        if (runtext[--test] != 'a')
        {
            goto L1;
        }
                        
        base.runtextpos = test;
        return true;
    }
                    
    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

After:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;

    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        while (runtextpos < runtextend)
        {
            ch = runtext[runtextpos];
            if (ch != 'f')
            {
                ch -= 'a';
                if ((uint)ch > ('f' - 'a'))
                {
                    runtextpos += 6;
                    continue;
                }
                runtextpos += "\u0005\u0004\u0003\u0002\u0001\0"[ch];
                continue;
            }

            int test = runtextpos;

            if (runtext[--test] != 'e' ||
                runtext[--test] != 'd' ||
                runtext[--test] != 'c' ||
                runtext[--test] != 'b' ||
                runtext[--test] != 'a')
            {
                runtextpos++;
                continue;
            }

            base.runtextpos = test;
            return true;
        }
    }

    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

For strings of length 2 we weren't using the optimized path of reading/comparing a single uint.

When we have a sequence like `abc[def]ghi`, today we'll emit a length check of 3 for the `abc` multi, a length check of 1 for the `[def]` set, and a length check of 3 for the `ghi`, but we can instead emit a single length check for 7. This also fixes a small bug where I used a > instead of a >= in a length comparison, the net effect of which was using two comparisons instead of getting away with one for a two character string. And added some debug-only output that helped me diagnosing why things were working the way they were.

For an expression like "abc", if FindFirstChar returns true, it's found and fully validated the match. There's no benefit to comparing it again in Go; instead, Go can simply perform the capture.

ghost · 2021-09-27T18:02:36Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

With the regex source generator spitting out C#, various things pop as to places we can make simple improvements.

Commits:

Just fixes a simple comparison that was emitted with ">" when it should have been ">=". The effect of not using ">=" was we doing more comparisons than strictly necessary in some situations when comparing multiple characters in a regex pattern.
Given a pattern like abc[def]ghi, the RegexOptions.Compiled engine and source generator would emit a length check for the "abc", a length check for the "[def]", and a length check for the "ghi", but they can trivially be combined into a single length check that covers all three portions. After fixing this in the source generator, I ported it back to Compiled.
For simple regexes that are just looking for a string, e.g. "abc", we currently end up doing the full comparison to find it as part of FindFirstChar, but then we re-compare it again in Go. That's wasteful; in such a situation, Go can simply perform the capture and doesn't need to compare anything. After fixing this in the source generator, I ported it back to Compiled.
The Boyer-Moore code we spit out in FindFirstChar is full of labels and gotos. For the source generator, we can make it easier to read. e.g. for the pattern "abcdef":
Before:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;
                    
    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        int offset = 0;
        goto Start;
                        
        DefaultAdvance:
        offset = 6;
                        
        Advance:
        runtextpos += offset;
                        
        Start:
        if (runtextpos >= runtextend)
        {
            goto ReturnFalse;
        }
                        
        ch = runtext[runtextpos];
        if (ch == 'f')
        {
            goto PartialMatch;
        }
        ch -= 97;
        if ((uint)ch > 5)
        {
            goto DefaultAdvance;
        }
        offset = "\u0005\u0004\u0003\u0002\u0001\0"[ch];
        goto Advance;
                        
        PartialMatch:
        int test = runtextpos;
        if (runtext[--test] == 'e')
        {
            goto L0;
        }
                        
        L1:
        offset = 1;
        goto Advance;
                        
        L0:
        if (runtext[--test] != 'd')
        {
            goto L1;
        }
        if (runtext[--test] != 'c')
        {
            goto L1;
        }
        if (runtext[--test] != 'b')
        {
            goto L1;
        }
        if (runtext[--test] != 'a')
        {
            goto L1;
        }
                        
        base.runtextpos = test;
        return true;
    }
                    
    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

After:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;

    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        while (runtextpos < runtextend)
        {
            ch = runtext[runtextpos];
            if (ch != 'f')
            {
                ch -= 'a';
                if ((uint)ch > ('f' - 'a'))
                {
                    runtextpos += 6;
                    continue;
                }
                runtextpos += "\u0005\u0004\u0003\u0002\u0001\0"[ch];
                continue;
            }

            int test = runtextpos;

            if (runtext[--test] != 'e' ||
                runtext[--test] != 'd' ||
                runtext[--test] != 'c' ||
                runtext[--test] != 'b' ||
                runtext[--test] != 'a')
            {
                runtextpos++;
                continue;
            }

            base.runtextpos = test;
            return true;
        }
    }

    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

stephentoub · 2021-09-30T16:08:24Z

@eerhardt, mind reviewing?

eerhardt

Looks fine to me.

eerhardt · 2021-09-30T16:47:39Z

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

            node = node.Child(0);

+            // In some limited cases, FindFirstChar will only return true if it successfully matched the whole thing.
+            // This is the case, in particular, for strings.  We can special case these to do essentially nothing


what does

This is the case, in particular, for strings.

mean? Did you mean to describe the kind of strings here?

I could have used the word "multis" if that would be clearer (it maps to the term used in the code), or "multiple characters in a sequence", e.g. in [ab][cd]efghi[jk], the "efghi" sequence of multiple characters next to each other.

stephentoub · 2021-09-30T16:54:01Z

Thanks for taking a look.

stephentoub added 4 commits September 27, 2021 12:59

Fix comparison in EmitMultiChar

5e0c182

For strings of length 2 we weren't using the optimized path of reading/comparing a single uint.

Special-case regexes where Go can be a nop

3549a69

For an expression like "abc", if FindFirstChar returns true, it's found and fully validated the match. There's no benefit to comparing it again in Go; instead, Go can simply perform the capture.

Clean up generated Boyer-Moore code

0e06dc6

ghost added the area-System.Text.RegularExpressions label Sep 27, 2021

stephentoub added this to the 7.0.0 milestone Sep 27, 2021

eerhardt approved these changes Sep 30, 2021

View reviewed changes

stephentoub merged commit 909cddf into dotnet:main Sep 30, 2021

stephentoub deleted the regextweaks branch September 30, 2021 16:54

stephentoub mentioned this pull request Oct 5, 2021

Improve regex reductions and code gen for some alternations #59903

Merged

ghost locked as resolved and limited conversation to collaborators Nov 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Several improvements to regex source generator #59660

Several improvements to regex source generator #59660

Uh oh!

stephentoub commented Sep 27, 2021

Uh oh!

ghost commented Sep 27, 2021

Uh oh!

stephentoub commented Sep 30, 2021

Uh oh!

eerhardt left a comment

Uh oh!

eerhardt Sep 30, 2021

Uh oh!

stephentoub Sep 30, 2021 •

edited

Loading

Uh oh!

stephentoub commented Sep 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Several improvements to regex source generator #59660

Several improvements to regex source generator #59660

Uh oh!

Conversation

stephentoub commented Sep 27, 2021

Uh oh!

ghost commented Sep 27, 2021

Uh oh!

stephentoub commented Sep 30, 2021

Uh oh!

eerhardt left a comment

Choose a reason for hiding this comment

Uh oh!

eerhardt Sep 30, 2021

Choose a reason for hiding this comment

Uh oh!

stephentoub Sep 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentoub commented Sep 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stephentoub Sep 30, 2021 •

edited

Loading