-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
Today MemoryExtensions.IndexOf(..., ReadOnlySpan<char>) is implemented essentially as a loop around an IndexOf for the first character in the pattern and then a SequenceEqual at that position for that pattern:
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Lines 17 to 57 in c742923
| public static int IndexOf(ref char searchSpace, int searchSpaceLength, ref char value, int valueLength) | |
| { | |
| Debug.Assert(searchSpaceLength >= 0); | |
| Debug.Assert(valueLength >= 0); | |
| if (valueLength == 0) | |
| return 0; // A zero-length sequence is always treated as "found" at the start of the search space. | |
| char valueHead = value; | |
| ref char valueTail = ref Unsafe.Add(ref value, 1); | |
| int valueTailLength = valueLength - 1; | |
| int remainingSearchSpaceLength = searchSpaceLength - valueTailLength; | |
| int index = 0; | |
| while (remainingSearchSpaceLength > 0) | |
| { | |
| // Do a quick search for the first element of "value". | |
| int relativeIndex = IndexOf(ref Unsafe.Add(ref searchSpace, index), valueHead, remainingSearchSpaceLength); | |
| if (relativeIndex == -1) | |
| break; | |
| remainingSearchSpaceLength -= relativeIndex; | |
| index += relativeIndex; | |
| if (remainingSearchSpaceLength <= 0) | |
| break; // The unsearched portion is now shorter than the sequence we're looking for. So it can't be there. | |
| // Found the first element of "value". See if the tail matches. | |
| if (SequenceEqual( | |
| ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, index + 1)), | |
| ref Unsafe.As<char, byte>(ref valueTail), | |
| (nuint)(uint)valueTailLength * 2)) | |
| { | |
| return index; // The tail matched. Return a successful find. | |
| } | |
| remainingSearchSpaceLength--; | |
| index++; | |
| } | |
| return -1; | |
| } |
This is good but not great. In particular it ends up breaking out of the vectorized search loop potentially much more frequently than is necessary and paying the startup overhead for the SequenceEqual each time.
There are better options. For example, the simple algorithm described in Algorithm 1: Generic SIMD searches concurrently for two characters rather than just one, which makes it much less likely to break out of the vectorized loop to early (this is what the Rust sliceslice crate uses; it's also what the Rust memchr crate uses (which is what's now used by its regex engine), though that memchr doesn't just use the first and last characters but instead picks two characters expected to have the lowest frequency of occurrence).
We could also special-case various sizes. The aforementioned memchr, for example, uses Rabin-Karp for small input strings and Two-Way for larger search strings.