OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

@EgorBo

It is quite common to compare a string against a predefined constant using Ordinal or OrdinalIgnoreCase comparison (I found plenty of cases in the dotnet/aspnetcore and dotnet/runtime repos), e.g. patterns like:

* string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* str.Equals("cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* StringComparer.OrdinalIgnoreCase.Equals(str, "cns"); // or just Ordinal

Roslyn, ILLink, or the JIT could optimize such comparisons by inlining a specialized check. The JIT is probably the right place: it can handle more cases after inlining, and Roslyn/ILLink know nothing about the target architecture. For example, here is what we can do for small strings (1-4 chars), keeping in mind that strings are 8-byte aligned (for simplicity I only considered AMD64):

[Benchmark(Baseline = true)]
[Arguments("true")]
[Arguments("TRUE")]
public bool StringEquals(string str)
{
    return string.Equals(str, "True", StringComparison.OrdinalIgnoreCase);
}

[Benchmark]
[Arguments("true")]
[Arguments("TRUE")]
public bool Inlined(string str)
{
    return object.ReferenceEquals(str, "True") || 
           (str.Length == 4 &&

            // The string's content fits into a 64-bit register. The following code looks awful,
            // but it is a simple 'or + cmp' in the codegen.
            (Unsafe.ReadUnaligned<ulong>(ref Unsafe.As<char, byte>(
                   ref MemoryMarshal.GetReference(str.AsSpan()))) | 0x0020002000200020 /* 'to lower': valid because the const arg contains only [a-zA-Z] */) == 0x0065007500720074);
}

|       Method |  str |      Mean |     Error |    StdDev | Ratio |
|------------- |----- |----------:|----------:|----------:|------:|
| StringEquals | TRUE | 4.1723 ns | 0.0023 ns | 0.0020 ns |  1.00 |
|      Inlined | TRUE | 0.3416 ns | 0.0024 ns | 0.0022 ns |  0.08 |
|              |      |           |           |           |       |
| StringEquals | true | 4.1718 ns | 0.0023 ns | 0.0021 ns |  1.00 |
|      Inlined | true | 0.3406 ns | 0.0019 ns | 0.0016 ns |  0.08 |
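
To make the origin of the two magic constants above explicit, here is a minimal sketch that derives them at run time instead of hard-coding them. The class and method names (`SmallAsciiCompare`, `Load4`, `CaseMask4`, `EqualsIgnoreCase4`) are hypothetical and not part of the proposal. It assumes the constant is exactly four ASCII chars and already lowercased. Unlike the benchmark code, it sets the `0x20` bit only in lanes where the constant holds a letter, so constants containing digits or punctuation are still compared exactly.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SmallAsciiCompare
{
    // Reads exactly four UTF-16 chars (8 bytes) as one native-endian 64-bit value.
    private static ulong Load4(ReadOnlySpan<char> fourChars) =>
        MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(fourChars));

    // Builds the OR mask: 0x20 in each 16-bit lane whose expected char is an
    // ASCII letter, 0 in all other lanes (those must match exactly).
    private static ulong CaseMask4(ReadOnlySpan<char> lowerConst)
    {
        Span<char> mask = stackalloc char[4];
        for (int i = 0; i < 4; i++)
            mask[i] = lowerConst[i] is >= 'a' and <= 'z' ? (char)0x20 : (char)0;
        return Load4(mask);
    }

    // Assumption: lowerConst is exactly four ASCII chars, already lowercased.
    public static bool EqualsIgnoreCase4(string str, string lowerConst)
    {
        if (lowerConst.Length != 4)
            throw new ArgumentException("Expected a 4-char constant.", nameof(lowerConst));
        if (str is null || str.Length != 4)
            return false;

        // Setting bit 5 maps 'A'..'Z' onto 'a'..'z' and leaves lowercase letters unchanged.
        return (Load4(str) | CaseMask4(lowerConst)) == Load4(lowerConst);
    }
}
```

For the constant "true" on a little-endian machine, `CaseMask4` yields `0x0020002000200020` and `Load4` yields `0x0065007500720074`, the same values used in the `Inlined` benchmark. A JIT implementation would fold both into immediates, so this sketch only illustrates the arithmetic and says nothing about the measured performance.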

We can also inline SIMD code for longer strings. E.g. here I check that an HTTP header is "Proxy-Authenticate" (18 chars) using two AVX2 vectors: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44

|           Method |         headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY-AUTHENTICATE | 10.296 ns | 0.0087 ns | 0.0078 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY-AUTHENTICATE |  2.560 ns | 0.0042 ns | 0.0038 ns |  0.25 |     - |     - |     - |         - |
|                  |                    |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy-authenticate | 10.298 ns | 0.0078 ns | 0.0065 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy-authenticate |  2.563 ns | 0.0071 ns | 0.0066 ns |  0.25 |     - |     - |     - |         - |

When the input string is, say, 30 chars long, the results are even better:

|           Method |           headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |--------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY(...)WORLD [30] | 15.396 ns | 0.0141 ns | 0.0117 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY(...)WORLD [30] |  2.558 ns | 0.0010 ns | 0.0008 ns |  0.17 |     - |     - |     - |         - |
|                  |                      |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy(...)World [30] | 14.997 ns | 0.0089 ns | 0.0079 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy(...)World [30] |  2.550 ns | 0.0038 ns | 0.0030 ns |  0.17 |     - |     - |     - |         - |
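
The two-vector check can be sketched as follows. This is not the code from the linked gist: it swaps the raw AVX2 intrinsics for the cross-platform `Vector256` helpers, which I am assuming are available (.NET 7 or later), and the names `ProxyAuthenticateMatcher`, `Load`, `LetterMask`, and `Matches` are hypothetical. Because 18 chars do not fill two 16-lane vectors, the second load overlaps the first and covers the last 16 chars. The OR mask is zero in the '-' lane, so a carriage return (0x0D, which becomes 0x2D when bit 5 is set) cannot produce a false match.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class ProxyAuthenticateMatcher
{
    // Assumption: the constant is ASCII-only and stored lowercased (18 chars).
    private const string Expected = "proxy-authenticate";

    private static readonly Vector256<ushort> ExpectedHead = Load(Expected.AsSpan(0, 16));
    private static readonly Vector256<ushort> ExpectedTail = Load(Expected.AsSpan(Expected.Length - 16));
    private static readonly Vector256<ushort> MaskHead = LetterMask(Expected.AsSpan(0, 16));
    private static readonly Vector256<ushort> MaskTail = LetterMask(Expected.AsSpan(Expected.Length - 16));

    // Loads exactly 16 UTF-16 chars into one 256-bit vector.
    private static Vector256<ushort> Load(ReadOnlySpan<char> sixteenChars) =>
        Vector256.Create<ushort>(MemoryMarshal.Cast<char, ushort>(sixteenChars));

    // 0x20 in lanes holding an ASCII letter, 0 in lanes that must match exactly.
    private static Vector256<ushort> LetterMask(ReadOnlySpan<char> sixteenChars)
    {
        Span<ushort> lanes = stackalloc ushort[16];
        for (int i = 0; i < 16; i++)
            lanes[i] = sixteenChars[i] is >= 'a' and <= 'z' ? (ushort)0x20 : (ushort)0;
        return Vector256.Create<ushort>(lanes);
    }

    public static bool Matches(string headerName)
    {
        if (headerName is null || headerName.Length != Expected.Length)
            return false;

        ReadOnlySpan<char> s = headerName;
        Vector256<ushort> head = Load(s.Slice(0, 16));            // chars 0..15
        Vector256<ushort> tail = Load(s.Slice(s.Length - 16));    // chars 2..17, overlapping

        return Vector256.EqualsAll(Vector256.BitwiseOr(head, MaskHead), ExpectedHead)
            && Vector256.EqualsAll(Vector256.BitwiseOr(tail, MaskTail), ExpectedTail);
    }
}
```

The overlap means chars 2..15 are compared twice, which costs nothing extra and avoids a scalar tail loop. On hardware without 256-bit support the helpers fall back to slower software paths, so the timings in the tables above should not be assumed to carry over to this sketch.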

So for [0..32] chars (string.Length) we can emit an inlined, very fast comparison:

* [0..4]: using a single 64-bit GP register
* [5..8]: using two 64-bit GP registers
* [9..16]: using two 128-bit vectors
* [17..32]: using two 256-bit vectors
* [33...]: leave as is.
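
As an illustration of the [5..8] bucket, here is a minimal sketch of the "two 64-bit registers" idea using two overlapping 4-char loads, one over the first four chars and one over the last four. The names `MediumAsciiCompare`, `Load4`, `Mask4`, and `EqualsIgnoreCase5To8` are hypothetical, and the constant is assumed to be ASCII-only and already lowercased. Overlapping loads are one way to cover lengths that are not a multiple of four; the proposal itself does not specify how the JIT would do it.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MediumAsciiCompare
{
    // Reads four UTF-16 chars starting at 'start' as one native-endian 64-bit value.
    private static ulong Load4(ReadOnlySpan<char> s, int start) =>
        MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(s.Slice(start, 4)));

    // 0x20 in each 16-bit lane whose expected char is an ASCII letter, 0 elsewhere.
    private static ulong Mask4(ReadOnlySpan<char> lowerConst, int start)
    {
        Span<char> mask = stackalloc char[4];
        for (int i = 0; i < 4; i++)
            mask[i] = lowerConst[start + i] is >= 'a' and <= 'z' ? (char)0x20 : (char)0;
        return Load4(mask, 0);
    }

    // Assumption: lowerConst is 5 to 8 ASCII chars, already lowercased.
    public static bool EqualsIgnoreCase5To8(string str, string lowerConst)
    {
        int n = lowerConst.Length;
        if (n < 5 || n > 8)
            throw new ArgumentOutOfRangeException(nameof(lowerConst));
        if (str is null || str.Length != n)
            return false;

        int tail = n - 4; // the second load covers the last four chars and may overlap the first
        return (Load4(str, 0) | Mask4(lowerConst, 0)) == Load4(lowerConst, 0)
            && (Load4(str, tail) | Mask4(lowerConst, tail)) == Load4(lowerConst, tail);
    }
}
```

For example, `MediumAsciiCompare.EqualsIgnoreCase5To8("FALSE", "false")` compares chars 0..3 and chars 1..4. In a JIT implementation the masks and expected values would be compile-time constants, leaving two loads, two ORs, and two compares.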

/cc @GrabYourPitchforks @benaadams @jkotas @stephentoub

Labels: area-CodeGen-coreclr, tenet-performance