
Bmi2 MultiplyNoFlags2 #44926

@echesakov

Background and Motivation

In .NET Core 3.1 we introduced the following intrinsics to expose the x86 mulx instruction:

  • uint MultiplyNoFlags(uint left, uint right, uint* low)
  • ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)

where low is a pointer through which the lower 32 (or 64) bits of the 64-bit (or 128-bit) product left * right are returned, while the return value holds the upper 32 (or 64) bits.
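To make the high/low split concrete, here is a minimal portable sketch of the 32-bit overload's semantics. The class and method names (MulxReference, MultiplyHighLow) are hypothetical and not part of any .NET API, and the sketch uses a C# out parameter in place of the intrinsic's uint* so that it compiles without unsafe code.

```csharp
using System;

static class MulxReference
{
    // Hypothetical model of uint MultiplyNoFlags(uint, uint, uint*):
    // returns the upper 32 bits of left * right and writes the lower 32 bits to `low`.
    public static uint MultiplyHighLow(uint left, uint right, out uint low)
    {
        // Widen before multiplying so the full 64-bit product is kept.
        ulong product = (ulong)left * right;

        low = unchecked((uint)product);          // bits 0..31
        return unchecked((uint)(product >> 32)); // bits 32..63
    }
}
```

For example, MultiplyHighLow(0xFFFFFFFF, 2, out low) returns 1 and sets low to 0xFFFFFFFE, because the full product is 0x1FFFFFFFE.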

When these intrinsics are used, the JIT produces suboptimal code because the local whose address is passed as low is marked "address-taken" and is therefore kept on the stack instead of in a register.

For example, the following C# methods

static unsafe uint mulx(uint a, uint b)
{
    uint r;
    return Bmi2.MultiplyNoFlags(a, b, &r) + r;
}

static unsafe ulong mulx_64(ulong a, ulong b)
{
    ulong r;
    return Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}

are compiled to the following code by the current implementation of the JIT:
mulx

G_M48748_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       89442404             mov      dword ptr [rsp+04H], eax
       89542418             mov      dword ptr [rsp+18H], edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG02:              ;; offset=000BH
       488D442404           lea      rax, bword ptr [rsp+04H]
       448B442418           mov      r8d, dword ptr [rsp+18H]
       8BD1                 mov      edx, ecx
       C4C233F6D0           mulx     edx, r9d, r8d
       448908               mov      dword ptr [rax], r9d
       8BC2                 mov      eax, edx
       03442404             add      eax, dword ptr [rsp+04H]
                                                ;; bbWeight=1    PerfScore 7.00
G_M48748_IG03:              ;; offset=0025H
       4883C408             add      rsp, 8
       C3                   ret

mulx_64

G_M55976_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       48890424             mov      qword ptr [rsp], rax
       4889542418           mov      qword ptr [rsp+18H], rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG02:              ;; offset=000CH
       488D0424             lea      rax, bword ptr [rsp]
       4C8B442418           mov      r8, qword ptr [rsp+18H]
       488BD1               mov      rdx, rcx
       C4C2B3F6D0           mulx     rdx, r9, r8
       4C8908               mov      qword ptr [rax], r9
       488BC2               mov      rax, rdx
       48030424             add      rax, qword ptr [rsp]
                                                ;; bbWeight=1    PerfScore 7.00
G_M55976_IG03:              ;; offset=0027H
       4883C408             add      rsp, 8
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.25

However, if Bmi2.MultiplyNoFlags were instead implemented as

public static unsafe uint MultiplyNoFlags(uint left, uint right, uint* low)
{
    var result = MultiplyNoFlags2(left, right);
    *low = result.Item1;
    return result.Item2;
}

public static unsafe ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)
{
    var result = MultiplyNoFlags2(left, right);
    *low = result.Item1;
    return result.Item2;
}

the JIT, with the changes from #37928, would inline MultiplyNoFlags and could remove the address-taken attribute from the local corresponding to low:
mulx

G_M48748_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M48748_IG02:              ;; offset=0000H
       C4E27BF6D1           mulx     edx, eax, ecx
       03C2                 add      eax, edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG03:              ;; offset=0007H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

mulx_64

G_M55976_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M55976_IG02:              ;; offset=0000H
       C4E2FBF6D1           mulx     rdx, rax, rcx
       4803C2               add      rax, rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG03:              ;; offset=0008H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Bmi2 : X86Base
    {
        public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);

        public abstract class X64 : X86Base.X64
        {
            public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
        }
    }
}
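As a rough illustration of how call sites might look with the proposed shape, the sketch below rewrites the mulx and mulx_64 methods from the examples above using tuple deconstruction. The MultiplyNoFlags2 methods here are hypothetical software stand-ins, not the proposed intrinsics: they only mimic the (Lower, Upper) return shape and will not emit a mulx instruction. The 64-bit stand-in assumes .NET 5 or later for Math.BigMul(ulong, ulong, out ulong).

```csharp
using System;

static class MultiplyNoFlags2Sketch
{
    // Hypothetical stand-in for the proposed 32-bit API (software only).
    public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right)
    {
        ulong product = (ulong)left * right;
        return (unchecked((uint)product), unchecked((uint)(product >> 32)));
    }

    // Hypothetical stand-in for the proposed 64-bit API (software only).
    // Math.BigMul returns the upper 64 bits and writes the lower 64 bits to `lower`.
    public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right)
    {
        ulong upper = Math.BigMul(left, right, out ulong lower);
        return (lower, upper);
    }

    // Equivalent of the mulx example: no pointer, no unsafe, no address-taken local.
    public static uint Mulx(uint a, uint b)
    {
        var (lower, upper) = MultiplyNoFlags2(a, b);
        return unchecked(upper + lower);
    }

    // Equivalent of the mulx_64 example.
    public static ulong Mulx64(ulong a, ulong b)
    {
        var (lower, upper) = MultiplyNoFlags2(a, b);
        return unchecked(upper + lower);
    }
}
```

Because both halves come back by value, the caller never takes the address of a local, which is the property the proposal relies on to let the JIT keep both results in registers.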

Based on work Carol did in #37928

cc @CarolEidt @tannergooding
