Skip to content

Conversation

@benaadams
Copy link
Member

@benaadams benaadams commented Jan 17, 2020

Uses dotnet/runtime#1644 for the non-instrinsic path to eliminate range checks/branches

            Method | Branch |     Mean |     Error |         Op/s | Allocated | Precent |
------------------ |--------|---------:|----------:|-------------:|----------:|--------:|
 ConcatAsHexSuffix | master | 21.18 ns | 0.1040 ns | 47,221,986.9 |  34.33 MB |         |
 ConcatAsHexSuffix |     pr | 16.98 ns | 0.0905 ns | 58,881,989.3 |  34.33 MB | 124.7 % |

buffer[i + 3] = (char)hexEncodeMap[(number >> 20) & 0xF];
buffer[i + 2] = (char)hexEncodeMap[(number >> 24) & 0xF];
buffer[i + 1] = (char)hexEncodeMap[(number >> 28) & 0xF];
buffer[i] = tupleSeparator;
Copy link
Member

@EgorBo EgorBo Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be vectorized I think, load tupleNumber into an 256bit vector, do some shuffle magic and dump to a string (ideally - to FastAllocateString)

Copy link
Member

@gfoidl gfoidl Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FastAllocateString is internal so it can't be used here in ASP.NET Core?

256bit vec is too large, 128bit should work, as there get 9 elements written to buffer. So with Vector128<ushor> it's 8 elements + the remaining 1.

Preferable this should be done in a different PR -- but nice idea!


Nitpicking: uint and little-endian 😉 (but that's irrelevant to the idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfoidl yeah I meant Vector128 :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Makes it sense to add int -> hex to runtime? I think it's quite common to convert a number to hex. Therefore is int.ToString("X"), but it allocates a string. So there should be something like int.ToString(Span<char>, Format) (I know the API needs work, maybe on a better type like a special formatter) and at this central place all kind of optimization could take place (SIMD or SWAR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes it sense to add int -> hex to runtime?

That conversation has been going on for years https://github.com/dotnet/corefx/issues/10013

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gowan then

            Method | Branch |     Mean |     Error |         Op/s | Allocated | Precent |
------------------ |--------|---------:|----------:|-------------:|----------:|--------:|
 ConcatAsHexSuffix | master | 21.18 ns | 0.1040 ns | 47,221,986.9 |  34.33 MB |         |
 ConcatAsHexSuffix |     pr | 17.26 ns | 0.0844 ns | 57,953,720.1 |  34.33 MB |   123 % |

Copy link
Member

@gfoidl gfoidl Jan 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to search within my brain, but I believe there is a SWAR variant that needs less instructions than this SIMD variant. I'll give an update here...(later today).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That conversation has been going on for years

Maybe split the hex-part out of PopulateSpanWithHexSuffix to have such a method at least here in aspnetcore?

@benaadams benaadams changed the title Eliminate range checks from ConcatAsHexSuffix Eliminate range checks from ConcatAsHexSuffix + Instrinsics Jan 18, 2020
return string.Create(length, (str, separator, number), s_populateSpanWithHexSuffix);
}

private static void PopulateSpanWithHexSuffix(Span<char> buffer, (string str, char separator, uint number) tuple)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll start a new comment to keep it clean.


I have a variant with SWAR

SWAR code
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Number2HexSwar(int number, ref char dest)
{
    Debug.Assert(Bmi2.X64.IsSupported);

    ulong input = (ulong)(uint)number;      // cast to uint avoids movsxd and uses mov
    ulong isolatedNibbles = Bmi2.X64.ParallelBitDeposit(input, 0x0F0F0F0F_0F0F0F0Ful);

    Debug.Assert((isolatedNibbles & 0xF0F0F0F0_F0F0F0F0ul) == 0);

    // JIT will treat these as constants
    ulong packedAscii0 = Packed('0');
    ulong packedCorrection = Packed(0x80 - ('9' - '0' + 1));
    ulong packed0x80 = Packed(0x80);
    ulong packedHexShiftToAF = Packed('A' - '0' - 10);

    ulong asciiHexNumbers = isolatedNibbles + packedAscii0;
    ulong asciiHexChars = isolatedNibbles + packedCorrection;
    ulong flag = asciiHexChars & packed0x80;
    ulong hexAFMask = flag - (flag >> 7);

    ulong hex = asciiHexNumbers + (hexAFMask & packedHexShiftToAF);
    hex = BinaryPrimitives.ReverseEndianness(hex);

    ulong hex0 = hex;
    ulong hex1 = hex >> 32;

    Unsafe.Add(ref Unsafe.As<char, ulong>(ref dest), 0) = Bmi2.X64.ParallelBitDeposit(hex0, 0x00FF00FF_00FF00FFul);
    Unsafe.Add(ref Unsafe.As<char, ulong>(ref dest), 1) = Bmi2.X64.ParallelBitDeposit(hex1, 0x00FF00FF_00FF00FFul);

    static ulong Packed(ulong a) => a * 0x01010101_01010101ul;
}
In the microbenchmarks it is ~20...30% slower than the SIMD-variant. But it has no memory access, so maybe it is a good choice for real-world scenarios. @benaadams can you please run your benchmarks with this? I don't have a setup for this here.

@benaadams
Copy link
Member Author

Improved with feedback

            Method |   Branch |     Mean |     Error |         Op/s | Allocated | Precent |
------------------ |----------|---------:|----------:|-------------:|----------:|--------:|
 ConcatAsHexSuffix |   master | 21.18 ns | 0.1040 ns | 47,221,986.9 |  34.33 MB |         |
 ConcatAsHexSuffix | original | 17.26 ns | 0.0844 ns | 57,953,720.1 |  34.33 MB | 122.7 % |
 ConcatAsHexSuffix | feedback | 16.98 ns | 0.0905 ns | 58,881,989.3 |  34.33 MB | 124.7 % |

@benaadams
Copy link
Member Author

With the swar

            Method |   Branch |     Mean |     Error |         Op/s | Allocated | Precent |
------------------ |----------|---------:|----------:|-------------:|----------:|--------:|
 ConcatAsHexSuffix |   master | 21.18 ns | 0.1040 ns | 47,221,986.9 |  34.33 MB |         |
 ConcatAsHexSuffix | original | 17.26 ns | 0.0844 ns | 57,953,720.1 |  34.33 MB | 122.7 % |
 ConcatAsHexSuffix | feedback | 16.98 ns | 0.0905 ns | 58,881,989.3 |  34.33 MB | 124.7 % |
 ConcatAsHexSuffix |     swar | 17.03 ns | 0.0382 ns | 58,730,769.9 |  34.33 MB | 124.4 % |

Though PDEP is pretty slow on Rzyen/AMD (don't have to test) as used in the SWAR version

image

@gfoidl
Copy link
Member

gfoidl commented Jan 18, 2020

Thanks for the test. So let's leave the SWAR variant out.

Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM 😃

Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@BrennanConroy BrennanConroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit manipulation looks good 👍

Anyone else want to take a look?

// This must be explicity typed as ReadOnlySpan<byte>
// This then becomes a non-allocating mapping to the data section of the assembly.
// If it is a var, Span<byte> or byte[], it allocates the byte array per call.
ReadOnlySpan<byte> hexEncodeMap = new byte[] { (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F' };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is the same as the asciiUpperCaseData above, so we can just lift it outside the if statement.

@analogrelay
Copy link
Contributor

@halter73 let's get this in soon.

@analogrelay analogrelay added this to the 5.0.0-preview2 milestone Feb 26, 2020
@analogrelay analogrelay merged commit 07ce586 into dotnet:master Feb 26, 2020
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants