Faster IndexOf for arm64 #67811

EgorBo · 2022-04-09T18:11:52Z

Use a slightly faster "no matches" check in IndexOf and IndexOfAny.

Benchmark:

public class Benchmarks
{
    public IEnumerable<object[]> TestData()
    {
        // Small inputs which are handled via SIMD 
        yield return new object[] { "12345678", '1' };
        yield return new object[] { "123456789", '9' };
        yield return new object[] { "1234567812345678", '1' };
        yield return new object[] { "1234567812345678", '0' };
        yield return new object[] { "Console WriteLine Hello World", 'o' };
        yield return new object[] { "Console WriteLine Hello World", 'd' };

        // Large inputs 
        yield return new object[] { new string('x', 64), 'y' };
        yield return new object[] { new string('x', 200), 'y' };
        yield return new object[] { new string('x', 1000), 'y' };
        yield return new object[] { new string('x', 1000000), 'y' };
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public int IndexOf_byte(string str, char c)
    {
        return MemoryMarshal.Cast<char,byte>(str.AsSpan()).IndexOf((byte)c);
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public int IndexOf_char(string str, char c)
    {
        return str.AsSpan().IndexOf(c);
    }
}

IndexOf_byte

|  Method |               Toolchain |                  str | c |          Mean |      Error |     StdDev | Ratio |
|-------- |------------------------ |--------------------- |-- |--------------:|-----------:|-----------:|------:|
| IndexOf |      /Core_Root/corerun |             12345678 | 1 |      1.692 ns |  0.0022 ns |  0.0020 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |             12345678 | 1 |      1.686 ns |  0.0017 ns |  0.0013 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     10.262 ns |  0.0127 ns |  0.0119 ns |  1.78 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |      5.763 ns |  0.0030 ns |  0.0025 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 1 |      1.676 ns |  0.0032 ns |  0.0030 ns |  1.01 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 1 |      1.665 ns |  0.0022 ns |  0.0020 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |            123456789 | 9 |      4.197 ns |  0.0089 ns |  0.0083 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |            123456789 | 9 |      4.193 ns |  0.0032 ns |  0.0029 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | d |      7.010 ns |  0.0049 ns |  0.0043 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | d |      7.014 ns |  0.0018 ns |  0.0017 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | o |      1.679 ns |  0.0006 ns |  0.0006 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | o |      1.678 ns |  0.0008 ns |  0.0007 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxxx(...)xxxxx [64] | y |      7.322 ns |  0.0028 ns |  0.0024 ns |  0.80 |
| IndexOf | /Core_Root_base/corerun | xxxxx(...)xxxxx [64] | y |      9.204 ns |  0.0032 ns |  0.0029 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xxxx(...)xxxx [200] | y |     13.269 ns |  0.0024 ns |  0.0022 ns |  0.62 |
| IndexOf | /Core_Root_base/corerun |  xxxx(...)xxxx [200] | y |     21.317 ns |  0.2285 ns |  0.2138 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxx(...)xxxx [1000] | y |     58.107 ns |  0.0458 ns |  0.0429 ns |  0.65 |
| IndexOf | /Core_Root_base/corerun | xxxx(...)xxxx [1000] | y |     89.510 ns |  0.0272 ns |  0.0241 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xx(...)xx [1000000] | y | 44,056.624 ns |  9.0990 ns |  8.5112 ns |  0.56 |
| IndexOf | /Core_Root_base/corerun |  xx(...)xx [1000000] | y | 78,269.573 ns | 26.4061 ns | 22.0503 ns |  1.00 |

IndexOf_char

|  Method |               Toolchain |                  str | c |          Mean |      Error |     StdDev | Ratio |
|-------- |------------------------ |--------------------- |-- |--------------:|-----------:|-----------:|------:|
| IndexOf |      /Core_Root/corerun |             12345678 | 1 |      1.382 ns |  0.0020 ns |  0.0018 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |             12345678 | 1 |      1.382 ns |  0.0023 ns |  0.0021 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     11.378 ns |  0.0141 ns |  0.0132 ns |  1.07 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |     10.634 ns |  0.0240 ns |  0.0187 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 1 |      1.375 ns |  0.0016 ns |  0.0014 ns |  1.01 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 1 |      1.366 ns |  0.0034 ns |  0.0030 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |            123456789 | 9 |      2.612 ns |  0.0014 ns |  0.0011 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |            123456789 | 9 |      2.615 ns |  0.0029 ns |  0.0027 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | d |      5.784 ns |  0.0043 ns |  0.0038 ns |  0.91 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | d |      6.389 ns |  0.0059 ns |  0.0052 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | o |      1.689 ns |  0.0016 ns |  0.0014 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | o |      1.694 ns |  0.0007 ns |  0.0007 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxxx(...)xxxxx [64] | y |      7.009 ns |  0.0025 ns |  0.0023 ns |  0.76 |
| IndexOf | /Core_Root_base/corerun | xxxxx(...)xxxxx [64] | y |      9.204 ns |  0.0025 ns |  0.0023 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xxxx(...)xxxx [200] | y |     14.192 ns |  0.0070 ns |  0.0062 ns |  0.67 |
| IndexOf | /Core_Root_base/corerun |  xxxx(...)xxxx [200] | y |     21.267 ns |  0.2695 ns |  0.2521 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxx(...)xxxx [1000] | y |     64.790 ns |  0.2031 ns |  0.1800 ns |  0.72 |
| IndexOf | /Core_Root_base/corerun | xxxx(...)xxxx [1000] | y |     89.485 ns |  0.0196 ns |  0.0174 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xx(...)xx [1000000] | y | 60,098.282 ns |  3.1361 ns |  2.7801 ns |  0.77 |
| IndexOf | /Core_Root_base/corerun |  xx(...)xx [1000000] | y | 78,264.715 ns | 20.9694 ns | 19.6148 ns |  1.00 |

up to +40% faster for large inputs can 0.5-1ns regress in the worst case - if input is found in the very first vector.

There are weird regressions

| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     10.262 ns |  0.0127 ns |  0.0119 ns |  1.78 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |      5.763 ns |  0.0030 ns |  0.0025 ns |  1.00 |

| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     11.378 ns |  0.0141 ns |  0.0132 ns |  1.07 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |     10.634 ns |  0.0240 ns |  0.0187 ns |  1.00 |

but it goes away in FullPGO mode (and the PR becomes faster than the base as expected in this case) - it means the loop is not properly aligned by default. So the regression will be fixed once we update *.mibc for corelib (IndexOf definitely participates in the trainings)

ghost · 2022-04-09T18:12:00Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Use a slightly faster "no matches" check in IndexOf and IndexOfAny.

Benchmark:

public class Benchmarks
{
    public IEnumerable<object[]> TestData()
    {
        // Small inputs which are handled via SIMD 
        yield return new object[] { "12345678", '1' };
        yield return new object[] { "123456789", '9' };
        yield return new object[] { "1234567812345678", '1' };
        yield return new object[] { "1234567812345678", '0' };
        yield return new object[] { "Console WriteLine Hello World", 'o' };
        yield return new object[] { "Console WriteLine Hello World", 'd' };

        // Large inputs 
        yield return new object[] { new string('x', 64), 'y' };
        yield return new object[] { new string('x', 200), 'y' };
        yield return new object[] { new string('x', 1000), 'y' };
        yield return new object[] { new string('x', 1000000), 'y' };
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public int IndexOf_byte(string str, char c)
    {
        return MemoryMarshal.Cast<char,byte>(str.AsSpan()).IndexOf((byte)c);
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public int IndexOf_char(string str, char c)
    {
        return str.AsSpan().IndexOf(c);
    }
}

IndexOf_byte

|  Method |               Toolchain |                  str | c |          Mean |      Error |     StdDev | Ratio |
|-------- |------------------------ |--------------------- |-- |--------------:|-----------:|-----------:|------:|
| IndexOf |      /Core_Root/corerun |             12345678 | 1 |      1.692 ns |  0.0022 ns |  0.0020 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |             12345678 | 1 |      1.686 ns |  0.0017 ns |  0.0013 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     10.262 ns |  0.0127 ns |  0.0119 ns |  1.78 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |      5.763 ns |  0.0030 ns |  0.0025 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 1 |      1.676 ns |  0.0032 ns |  0.0030 ns |  1.01 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 1 |      1.665 ns |  0.0022 ns |  0.0020 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |            123456789 | 9 |      4.197 ns |  0.0089 ns |  0.0083 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |            123456789 | 9 |      4.193 ns |  0.0032 ns |  0.0029 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | d |      7.010 ns |  0.0049 ns |  0.0043 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | d |      7.014 ns |  0.0018 ns |  0.0017 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | o |      1.679 ns |  0.0006 ns |  0.0006 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | o |      1.678 ns |  0.0008 ns |  0.0007 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxxx(...)xxxxx [64] | y |      7.322 ns |  0.0028 ns |  0.0024 ns |  0.80 |
| IndexOf | /Core_Root_base/corerun | xxxxx(...)xxxxx [64] | y |      9.204 ns |  0.0032 ns |  0.0029 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xxxx(...)xxxx [200] | y |     13.269 ns |  0.0024 ns |  0.0022 ns |  0.62 |
| IndexOf | /Core_Root_base/corerun |  xxxx(...)xxxx [200] | y |     21.317 ns |  0.2285 ns |  0.2138 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxx(...)xxxx [1000] | y |     58.107 ns |  0.0458 ns |  0.0429 ns |  0.65 |
| IndexOf | /Core_Root_base/corerun | xxxx(...)xxxx [1000] | y |     89.510 ns |  0.0272 ns |  0.0241 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xx(...)xx [1000000] | y | 44,056.624 ns |  9.0990 ns |  8.5112 ns |  0.56 |
| IndexOf | /Core_Root_base/corerun |  xx(...)xx [1000000] | y | 78,269.573 ns | 26.4061 ns | 22.0503 ns |  1.00 |

IndexOf_char

|  Method |               Toolchain |                  str | c |          Mean |      Error |     StdDev | Ratio |
|-------- |------------------------ |--------------------- |-- |--------------:|-----------:|-----------:|------:|
| IndexOf |      /Core_Root/corerun |             12345678 | 1 |      1.382 ns |  0.0020 ns |  0.0018 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |             12345678 | 1 |      1.382 ns |  0.0023 ns |  0.0021 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 0 |     11.378 ns |  0.0141 ns |  0.0132 ns |  1.07 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 0 |     10.634 ns |  0.0240 ns |  0.0187 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |     1234567812345678 | 1 |      1.375 ns |  0.0016 ns |  0.0014 ns |  1.01 |
| IndexOf | /Core_Root_base/corerun |     1234567812345678 | 1 |      1.366 ns |  0.0034 ns |  0.0030 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |            123456789 | 9 |      2.612 ns |  0.0014 ns |  0.0011 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun |            123456789 | 9 |      2.615 ns |  0.0029 ns |  0.0027 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | d |      5.784 ns |  0.0043 ns |  0.0038 ns |  0.91 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | d |      6.389 ns |  0.0059 ns |  0.0052 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | Conso(...)World [29] | o |      1.689 ns |  0.0016 ns |  0.0014 ns |  1.00 |
| IndexOf | /Core_Root_base/corerun | Conso(...)World [29] | o |      1.694 ns |  0.0007 ns |  0.0007 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxxx(...)xxxxx [64] | y |      7.009 ns |  0.0025 ns |  0.0023 ns |  0.76 |
| IndexOf | /Core_Root_base/corerun | xxxxx(...)xxxxx [64] | y |      9.204 ns |  0.0025 ns |  0.0023 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xxxx(...)xxxx [200] | y |     14.192 ns |  0.0070 ns |  0.0062 ns |  0.67 |
| IndexOf | /Core_Root_base/corerun |  xxxx(...)xxxx [200] | y |     21.267 ns |  0.2695 ns |  0.2521 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun | xxxx(...)xxxx [1000] | y |     64.790 ns |  0.2031 ns |  0.1800 ns |  0.72 |
| IndexOf | /Core_Root_base/corerun | xxxx(...)xxxx [1000] | y |     89.485 ns |  0.0196 ns |  0.0174 ns |  1.00 |
|         |                         |                      |   |               |            |            |       |
| IndexOf |      /Core_Root/corerun |  xx(...)xx [1000000] | y | 60,098.282 ns |  3.1361 ns |  2.7801 ns |  0.77 |
| IndexOf | /Core_Root_base/corerun |  xx(...)xx [1000000] | y | 78,264.715 ns | 20.9694 ns | 19.6148 ns |  1.00 |

up +40% faster for large inputs can 0.5-1ns regress in the worst case - if input is found in the very first vector.

Author:	EgorBo
Assignees:	EgorBo
Labels:	`area-System.Memory`
Milestone:	-

kunalspathak

LGTM

dakersnar · 2022-04-21T20:57:43Z

Performance improvement on Windows arm64: dotnet/perf-autofiling-issues#4732

EgorBo added 2 commits April 9, 2022 19:58

Use faster "not found" checks on arm64

27ebdc1

Move mask to FindFirstMatchedLane

8ab651d

ghost assigned EgorBo Apr 9, 2022

ghost added the area-System.Memory label Apr 9, 2022

EgorBo force-pushed the arm-spanhelpers-fast branch from 9e59caa to 8ab651d Compare April 9, 2022 20:13

kunalspathak approved these changes Apr 11, 2022

View reviewed changes

EgorBo merged commit a4503d6 into dotnet:main Apr 12, 2022

adamsitnik added the tenet-performance Performance related issue label Apr 14, 2022

AndyAyersMS mentioned this pull request Apr 22, 2022

[Perf] Changes at 4/12/2022 3:59:36 PM dotnet/perf-autofiling-issues#4752

Closed

ghost locked as resolved and limited conversation to collaborators May 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Faster IndexOf for arm64 #67811

Faster IndexOf for arm64 #67811

Uh oh!

EgorBo commented Apr 9, 2022 •

edited

Loading

Uh oh!

ghost commented Apr 9, 2022

IndexOf_byte

IndexOf_char

Uh oh!

kunalspathak left a comment

Uh oh!

dakersnar commented Apr 21, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Faster IndexOf for arm64 #67811

Faster IndexOf for arm64 #67811

Uh oh!

Conversation

EgorBo commented Apr 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

IndexOf_byte

IndexOf_char

Uh oh!

ghost commented Apr 9, 2022

IndexOf_byte

IndexOf_char

Uh oh!

kunalspathak left a comment

Choose a reason for hiding this comment

Uh oh!

dakersnar commented Apr 21, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

EgorBo commented Apr 9, 2022 •

edited

Loading