Conversation

@saucecontrol (Member) commented Jan 27, 2025

Resolves #85207 (Finish Avx512 specific lightup for Vector128/256/512&lt;T&gt;)

  • Replaces the SSE4.1 fallback for long vector multiply with a faster SSE2 version and removes the restrictions on the op_Multiply and MultiplyAddEstimate intrinsics, since these can always be accelerated now.
  • Removes the AVX2 requirement for Vector256.Sum to be treated as intrinsic (only AVX instructions are used).
  • Removes the restrictions on byte and long element types so that Dot can be treated as intrinsic for all types (see the usage sketch below).
  • Adds Vector512.Dot as an intrinsic.

Diffs look good. The only regressions are due to inlining or the slightly larger (but faster) SSE2 multiply code.
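
For a sense of what this unlocks, here's a minimal usage sketch of the Dot calls that can now be treated as intrinsic for long elements, including the new Vector512.Dot path (illustrative values only, not code from the PR):

using System;
using System.Runtime.Intrinsics;

// Illustrative only: with this change, Dot over long lanes can be treated as
// intrinsic, and Vector512.Dot is recognized as an intrinsic as well.
Vector128<long> a = Vector128.Create(2L, 3L);
Vector128<long> b = Vector128.Create(5L, 7L);
Console.WriteLine(Vector128.Dot(a, b));   // 2*5 + 3*7 = 31

Vector512<long> c = Vector512.Create(1L); // broadcast 1 to all 8 lanes
Vector512<long> d = Vector512.Create(4L);
Console.WriteLine(Vector512.Dot(c, d));   // 8 lanes of 1*4 = 32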

@ghost added the area-CodeGen-coreclr label Jan 27, 2025
@dotnet-policy-service bot added the community-contribution label Jan 27, 2025
@saucecontrol marked this pull request as ready for review January 27, 2025 20:27
@saucecontrol (Member, Author) commented Jan 27, 2025

I was just looking at the SSE4.1/AVX2 fallback for long multiply, and I think we should just replace it with the SSE2 one I added here (extended to AVX2 as well, ofc).

The current fallback has only two multiplications, compared to three for the new one, but one of those is a pmulld, which is slow on Intel (10 cycles vs 5 for pmuludq), and it also needs a phaddd and 2x pshufd, which can bottleneck on older Intel as well.
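
For reference, both sequences compute the same per-lane result; the scalar shape of the split into 32-bit halves looks roughly like this (my paraphrase, not code from the PR):

// Low 64 bits of a 64x64 multiply, built from 32-bit halves:
//   a * b == aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)   (mod 2^64)
static ulong MulViaHalves(ulong a, ulong b)
{
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;
    return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
}

In the disasm below, the SSE4.1 version computes both cross products with a single pmulld on dword-swapped inputs, while the SSE2 version spends two extra pmuludq instead; the per-instruction costs above are what make the three-multiply version win.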

Quick benchmark, tested on main vs local with the SSE4.1 fallback removed:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

using BenchmarkDotNet.Attributes;

[SimpleJob, DisassemblyDiagnoser]
public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 16);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res += Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }
}

Skylake

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-SDELDU | \core_root_main\corerun.exe | 618.0 ns | 9.43 ns | 8.36 ns |  1.00 |    0.02 |      82 B |
| Multiply | Job-UZJSFT | \core_root_vdot\corerun.exe | 486.0 ns | 7.40 ns | 6.18 ns |  0.79 |    0.01 |      89 B |

Meteor Lake

| Method   | Job        | Toolchain                   |     Mean |    Error |   StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|---------:|---------:|------:|--------:|----------:|
| Multiply | Job-IBHRLY | \core_root_main\corerun.exe | 539.4 ns | 20.20 ns | 59.55 ns |  1.01 |    0.16 |      82 B |
| Multiply | Job-WVQFRW | \core_root_vdot\corerun.exe | 455.4 ns |  8.86 ns | 17.69 ns |  0.85 |    0.10 |      89 B |

Turns out the SSE2-only version is faster on AMD as well.

Zen 5

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-NUXFIL | \core_root_main\corerun.exe | 309.7 ns | 0.99 ns | 0.93 ns |  1.00 |    0.00 |      82 B |
| Multiply | Job-PQZTMG | \core_root_vdot\corerun.exe | 243.9 ns | 0.83 ns | 0.78 ns |  0.79 |    0.00 |      89 B |

Here's the disasm for SSE4.1

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       pshufd    xmm2,xmm2,0B1
       pmulld    xmm1,xmm2
       xorps     xmm2,xmm2
       phaddd    xmm1,xmm2
       pshufd    xmm1,xmm1,73
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 82

And here's the SSE2 replacement

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       movaps    xmm4,xmm2
       psrlq     xmm4,20
       pmuludq   xmm4,xmm1
       psrlq     xmm1,20
       pmuludq   xmm1,xmm2
       paddq     xmm1,xmm4
       psllq     xmm1,20
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 89
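
For anyone mapping the instructions back to managed code, here's a rough C# intrinsics rendering of that SSE2 sequence (my own sketch from the disassembly, not the JIT's actual expansion):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Sse2LongMultiplySketch
{
    public static Vector128<long> Multiply(Vector128<long> a, Vector128<long> b)
    {
        // pmuludq: lo(a) * lo(b), one 64-bit product per lane
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());

        // psrlq 32 + pmuludq: the two cross products lo(a)*hi(b) and hi(a)*lo(b)
        Vector128<ulong> cross1 = Sse2.Multiply(Sse2.ShiftRightLogical(b.AsUInt64(), 32).AsUInt32(), a.AsUInt32());
        Vector128<ulong> cross2 = Sse2.Multiply(Sse2.ShiftRightLogical(a.AsUInt64(), 32).AsUInt32(), b.AsUInt32());

        // paddq + psllq 32: fold the cross terms into the upper 32 bits, then add the low product
        Vector128<ulong> high = Sse2.ShiftLeftLogical(Sse2.Add(cross1, cross2), 32);
        return Sse2.Add(low, high).AsInt64();
    }
}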

@saucecontrol (Member, Author) commented:

@EgorBot -amd -intel --envvars DOTNET_EnableAVX512F:0

using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 64);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply128()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector256<long> Multiply256()
    {
        long* ptr = data, end = ptr + nitems - Vector256<long>.Count;
        var res = Vector256<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<long>.Count);
            ptr += Vector256<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector<long> MultiplyVectorT()
    {
        long* ptr = data, end = ptr + nitems - Vector<long>.Count;
        var res = Vector<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector.Load(ptr) * Vector.Load(ptr + Vector<long>.Count);
            ptr += Vector<long>.Count;
        }

        return res;
    }
}

@saucecontrol (Member, Author) commented:

cc @EgorBo I believe you were the last to touch most of this

@EgorBo self-requested a review March 10, 2025 19:43
@EgorBo (Member) commented Mar 11, 2025

/azp run Fuzzlyn, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo (Member) left a comment

LGTM, thanks!

@EgorBo merged commit f565711 into dotnet:main Mar 11, 2025 (126 of 139 checks passed)
@saucecontrol deleted the vdot branch March 11, 2025 20:59
@github-actions bot locked and limited conversation to collaborators Apr 11, 2025