Skip to content

Conversation

@stephentoub
Copy link
Member

With the regex source generator spitting out C#, various things pop as to places we can make simple improvements.

Commits:

  1. Just fixes a simple comparison that was emitted with ">" when it should have been ">=". The effect of not using ">=" was we doing more comparisons than strictly necessary in some situations when comparing multiple characters in a regex pattern.

  2. Given a pattern like abc[def]ghi, the RegexOptions.Compiled engine and source generator would emit a length check for the "abc", a length check for the "[def]", and a length check for the "ghi", but they can trivially be combined into a single length check that covers all three portions. After fixing this in the source generator, I ported it back to Compiled.

  3. For simple regexes that are just looking for a string, e.g. "abc", we currently end up doing the full comparison to find it as part of FindFirstChar, but then we re-compare it again in Go. That's wasteful; in such a situation, Go can simply perform the capture and doesn't need to compare anything. After fixing this in the source generator, I ported it back to Compiled.

  4. The Boyer-Moore code we spit out in FindFirstChar is full of labels and gotos. For the source generator, we can make it easier to read. e.g. for the pattern "abcdef":
    Before:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;
                    
    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        int offset = 0;
        goto Start;
                        
        DefaultAdvance:
        offset = 6;
                        
        Advance:
        runtextpos += offset;
                        
        Start:
        if (runtextpos >= runtextend)
        {
            goto ReturnFalse;
        }
                        
        ch = runtext[runtextpos];
        if (ch == 'f')
        {
            goto PartialMatch;
        }
        ch -= 97;
        if ((uint)ch > 5)
        {
            goto DefaultAdvance;
        }
        offset = "\u0005\u0004\u0003\u0002\u0001\0"[ch];
        goto Advance;
                        
        PartialMatch:
        int test = runtextpos;
        if (runtext[--test] == 'e')
        {
            goto L0;
        }
                        
        L1:
        offset = 1;
        goto Advance;
                        
        L0:
        if (runtext[--test] != 'd')
        {
            goto L1;
        }
        if (runtext[--test] != 'c')
        {
            goto L1;
        }
        if (runtext[--test] != 'b')
        {
            goto L1;
        }
        if (runtext[--test] != 'a')
        {
            goto L1;
        }
                        
        base.runtextpos = test;
        return true;
    }
                    
    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

After:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;

    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        while (runtextpos < runtextend)
        {
            ch = runtext[runtextpos];
            if (ch != 'f')
            {
                ch -= 'a';
                if ((uint)ch > ('f' - 'a'))
                {
                    runtextpos += 6;
                    continue;
                }
                runtextpos += "\u0005\u0004\u0003\u0002\u0001\0"[ch];
                continue;
            }

            int test = runtextpos;

            if (runtext[--test] != 'e' ||
                runtext[--test] != 'd' ||
                runtext[--test] != 'c' ||
                runtext[--test] != 'b' ||
                runtext[--test] != 'a')
            {
                runtextpos++;
                continue;
            }

            base.runtextpos = test;
            return true;
        }
    }

    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

For strings of length 2 we weren't using the optimized path of reading/comparing a single uint.
When we have a sequence like `abc[def]ghi`, today we'll emit a length check of 3 for the `abc` multi, a length check of 1 for the `[def]` set, and a length check of 3 for the `ghi`, but we can instead emit a single length check for 7.

This also fixes a small bug where I used a > instead of a >= in a length comparison, the net effect of which was using two comparisons instead of getting away with one for a two character string.

And added some debug-only output that helped me diagnosing why things were working the way they were.
For an expression like "abc", if FindFirstChar returns true, it's found and fully validated the match.  There's no benefit to comparing it again in Go; instead, Go can simply perform the capture.
@ghost
Copy link

ghost commented Sep 27, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

With the regex source generator spitting out C#, various things pop as to places we can make simple improvements.

Commits:

  1. Just fixes a simple comparison that was emitted with ">" when it should have been ">=". The effect of not using ">=" was we doing more comparisons than strictly necessary in some situations when comparing multiple characters in a regex pattern.

  2. Given a pattern like abc[def]ghi, the RegexOptions.Compiled engine and source generator would emit a length check for the "abc", a length check for the "[def]", and a length check for the "ghi", but they can trivially be combined into a single length check that covers all three portions. After fixing this in the source generator, I ported it back to Compiled.

  3. For simple regexes that are just looking for a string, e.g. "abc", we currently end up doing the full comparison to find it as part of FindFirstChar, but then we re-compare it again in Go. That's wasteful; in such a situation, Go can simply perform the capture and doesn't need to compare anything. After fixing this in the source generator, I ported it back to Compiled.

  4. The Boyer-Moore code we spit out in FindFirstChar is full of labels and gotos. For the source generator, we can make it easier to read. e.g. for the pattern "abcdef":
    Before:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;
                    
    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        int offset = 0;
        goto Start;
                        
        DefaultAdvance:
        offset = 6;
                        
        Advance:
        runtextpos += offset;
                        
        Start:
        if (runtextpos >= runtextend)
        {
            goto ReturnFalse;
        }
                        
        ch = runtext[runtextpos];
        if (ch == 'f')
        {
            goto PartialMatch;
        }
        ch -= 97;
        if ((uint)ch > 5)
        {
            goto DefaultAdvance;
        }
        offset = "\u0005\u0004\u0003\u0002\u0001\0"[ch];
        goto Advance;
                        
        PartialMatch:
        int test = runtextpos;
        if (runtext[--test] == 'e')
        {
            goto L0;
        }
                        
        L1:
        offset = 1;
        goto Advance;
                        
        L0:
        if (runtext[--test] != 'd')
        {
            goto L1;
        }
        if (runtext[--test] != 'c')
        {
            goto L1;
        }
        if (runtext[--test] != 'b')
        {
            goto L1;
        }
        if (runtext[--test] != 'a')
        {
            goto L1;
        }
                        
        base.runtextpos = test;
        return true;
    }
                    
    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}

After:

protected override bool FindFirstChar()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int ch;

    // Minimum required length check
    if (runtextpos <= runtextend - 6)
    {
        // Boyer-Moore prefix matching
        runtextpos += 5;
        while (runtextpos < runtextend)
        {
            ch = runtext[runtextpos];
            if (ch != 'f')
            {
                ch -= 'a';
                if ((uint)ch > ('f' - 'a'))
                {
                    runtextpos += 6;
                    continue;
                }
                runtextpos += "\u0005\u0004\u0003\u0002\u0001\0"[ch];
                continue;
            }

            int test = runtextpos;

            if (runtext[--test] != 'e' ||
                runtext[--test] != 'd' ||
                runtext[--test] != 'c' ||
                runtext[--test] != 'b' ||
                runtext[--test] != 'a')
            {
                runtextpos++;
                continue;
            }

            base.runtextpos = test;
            return true;
        }
    }

    // No match
    ReturnFalse:
    base.runtextpos = runtextend;
    return false;
}
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub stephentoub added this to the 7.0.0 milestone Sep 27, 2021
@stephentoub
Copy link
Member Author

@eerhardt, mind reviewing?

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.

node = node.Child(0);

// In some limited cases, FindFirstChar will only return true if it successfully matched the whole thing.
// This is the case, in particular, for strings. We can special case these to do essentially nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does

This is the case, in particular, for strings.

mean? Did you mean to describe the kind of strings here?

Copy link
Member Author

@stephentoub stephentoub Sep 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have used the word "multis" if that would be clearer (it maps to the term used in the code), or "multiple characters in a sequence", e.g. in [ab][cd]efghi[jk], the "efghi" sequence of multiple characters next to each other.

@stephentoub
Copy link
Member Author

Thanks for taking a look.

@stephentoub stephentoub merged commit 909cddf into dotnet:main Sep 30, 2021
@stephentoub stephentoub deleted the regextweaks branch September 30, 2021 16:54
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants