Improve vectorization of MemoryExtensions.IndexOf(..., span)

Today `MemoryExtensions.IndexOf(..., ReadOnlySpan<char>)` is implemented essentially as a loop around an IndexOf for the first character in the pattern and then a SequenceEqual at that position for that pattern:
https://github.com/dotnet/runtime/blob/c742923991c68808d55a9f7cfcd7796644fea44f/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L17-L57
This is good but not great.  In particular it ends up breaking out of the vectorized search loop potentially much more frequently than is necessary and paying the startup overhead for the SequenceEqual each time.

There are better options.  For example, the simple algorithm described in [Algorithm 1: Generic SIMD](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd) searches concurrently for two characters rather than just one, which makes it much less likely to break out of the vectorized loop to early (this is what the Rust sliceslice crate uses; it's also what the Rust memchr crate uses (which is what's now used by its regex engine), though that memchr doesn't just use the first and last characters but instead picks two characters expected to have the lowest frequency of occurrence).

We could also special-case various sizes.  The aforementioned memchr, for example,  uses Rabin-Karp for small input strings and Two-Way for larger search strings. 

cc: @tannergooding, @GrabYourPitchforks 

	public static int IndexOf(ref char searchSpace, int searchSpaceLength, ref char value, int valueLength)
	{
	Debug.Assert(searchSpaceLength >= 0);
	Debug.Assert(valueLength >= 0);

	if (valueLength == 0)
	return 0; // A zero-length sequence is always treated as "found" at the start of the search space.

	char valueHead = value;
	ref char valueTail = ref Unsafe.Add(ref value, 1);
	int valueTailLength = valueLength - 1;
	int remainingSearchSpaceLength = searchSpaceLength - valueTailLength;

	int index = 0;
	while (remainingSearchSpaceLength > 0)
	{
	// Do a quick search for the first element of "value".
	int relativeIndex = IndexOf(ref Unsafe.Add(ref searchSpace, index), valueHead, remainingSearchSpaceLength);
	if (relativeIndex == -1)
	break;

	remainingSearchSpaceLength -= relativeIndex;
	index += relativeIndex;

	if (remainingSearchSpaceLength <= 0)
	break; // The unsearched portion is now shorter than the sequence we're looking for. So it can't be there.

	// Found the first element of "value". See if the tail matches.
	if (SequenceEqual(
	ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, index + 1)),
	ref Unsafe.As<char, byte>(ref valueTail),
	(nuint)(uint)valueTailLength * 2))
	{
	return index; // The tail matched. Return a successful find.
	}

	remainingSearchSpaceLength--;
	index++;
	}
	return -1;
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve vectorization of MemoryExtensions.IndexOf(..., span) #60866

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve vectorization of MemoryExtensions.IndexOf(..., span) #60866

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions