Conversation

@saucecontrol (Member) commented Jan 27, 2025

Resolves #85207 (Finish Avx512 specific lightup for Vector128/256/512&lt;T&gt;)

  • Replaces the SSE4.1 fallback for long vector multiply with a faster SSE2 version and removes the restrictions on the op_Multiply and MultiplyAddEstimate intrinsics, since these can always be accelerated now.
  • Removes the AVX2 requirement for Vector256.Sum to be treated as intrinsic (only AVX instructions are used).
  • Removes the restrictions on byte and long element types so that Dot can be treated as intrinsic for all types (see the usage sketch below).
  • Adds Vector512.Dot as an intrinsic.

Diffs look good. The only regressions are due to inlining or the slightly larger (but faster) SSE2 multiply code.
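
For a sense of what this unlocks, here's a minimal usage sketch of the Dot calls that can now be treated as intrinsic for long elements, including the new Vector512.Dot path (illustrative values only, not code from the PR):

using System;
using System.Runtime.Intrinsics;

// Illustrative only: with this change, Dot over long lanes can be treated as
// intrinsic, and Vector512.Dot is recognized as an intrinsic as well.
Vector128<long> a = Vector128.Create(2L, 3L);
Vector128<long> b = Vector128.Create(5L, 7L);
Console.WriteLine(Vector128.Dot(a, b));   // 2*5 + 3*7 = 31

Vector512<long> c = Vector512.Create(1L); // broadcast 1 to all 8 lanes
Vector512<long> d = Vector512.Create(4L);
Console.WriteLine(Vector512.Dot(c, d));   // 8 lanes of 1*4 = 32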

@ghost added the area-CodeGen-coreclr label Jan 27, 2025
@dotnet-policy-service bot added the community-contribution label Jan 27, 2025
@saucecontrol marked this pull request as ready for review January 27, 2025 20:27
@saucecontrol (Member, Author) commented Jan 27, 2025

I was just looking at the SSE4.1/AVX2 fallback for long multiply, and I think we should just replace it with the SSE2 one I added here (extended to AVX2 as well, ofc).

The current fallback has only two multiplications, compared to three for the new one, but one of those is a pmulld, which is slow on Intel (10 cycles vs 5 for pmuludq), and it also needs a phaddd and 2x pshufd, which can bottleneck on older Intel as well.
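
For reference, both sequences compute the same per-lane result; the scalar shape of the split into 32-bit halves looks roughly like this (my paraphrase, not code from the PR):

// Low 64 bits of a 64x64 multiply, built from 32-bit halves:
//   a * b == aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)   (mod 2^64)
static ulong MulViaHalves(ulong a, ulong b)
{
    ulong aLo = (uint)a, aHi = a >> 32;
    ulong bLo = (uint)b, bHi = b >> 32;
    return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
}

In the disasm below, the SSE4.1 version computes both cross products with a single pmulld on dword-swapped inputs, while the SSE2 version spends two extra pmuludq instead; the per-instruction costs above are what make the three-multiply version win.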

Quick benchmark, tested on main vs local with the SSE4.1 fallback removed:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

using BenchmarkDotNet.Attributes;

[SimpleJob, DisassemblyDiagnoser]
public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 16);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res += Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }
}

Skylake

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-SDELDU | \core_root_main\corerun.exe | 618.0 ns | 9.43 ns | 8.36 ns |  1.00 |    0.02 |      82 B |
| Multiply | Job-UZJSFT | \core_root_vdot\corerun.exe | 486.0 ns | 7.40 ns | 6.18 ns |  0.79 |    0.01 |      89 B |

Meteor Lake

| Method   | Job        | Toolchain                   |     Mean |    Error |   StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|---------:|---------:|------:|--------:|----------:|
| Multiply | Job-IBHRLY | \core_root_main\corerun.exe | 539.4 ns | 20.20 ns | 59.55 ns |  1.01 |    0.16 |      82 B |
| Multiply | Job-WVQFRW | \core_root_vdot\corerun.exe | 455.4 ns |  8.86 ns | 17.69 ns |  0.85 |    0.10 |      89 B |

Turns out the SSE2-only version is faster on AMD as well.

Zen 5

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|----------|------------|-----------------------------|---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-NUXFIL | \core_root_main\corerun.exe | 309.7 ns | 0.99 ns | 0.93 ns |  1.00 |    0.00 |      82 B |
| Multiply | Job-PQZTMG | \core_root_vdot\corerun.exe | 243.9 ns | 0.83 ns | 0.78 ns |  0.79 |    0.00 |      89 B |

Here's the disasm for SSE4.1

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       pshufd    xmm2,xmm2,0B1
       pmulld    xmm1,xmm2
       xorps     xmm2,xmm2
       phaddd    xmm1,xmm2
       pshufd    xmm1,xmm1,73
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 82

And here's the SSE2 replacement

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       movaps    xmm4,xmm2
       psrlq     xmm4,20
       pmuludq   xmm4,xmm1
       psrlq     xmm1,20
       pmuludq   xmm1,xmm2
       paddq     xmm1,xmm4
       psllq     xmm1,20
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 89
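
For anyone mapping the instructions back to managed code, here's a rough C# intrinsics rendering of that SSE2 sequence (my own sketch from the disassembly, not the JIT's actual expansion):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Sse2LongMultiplySketch
{
    public static Vector128<long> Multiply(Vector128<long> a, Vector128<long> b)
    {
        // pmuludq: lo(a) * lo(b), one 64-bit product per lane
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());

        // psrlq 32 + pmuludq: the two cross products lo(a)*hi(b) and hi(a)*lo(b)
        Vector128<ulong> cross1 = Sse2.Multiply(Sse2.ShiftRightLogical(b.AsUInt64(), 32).AsUInt32(), a.AsUInt32());
        Vector128<ulong> cross2 = Sse2.Multiply(Sse2.ShiftRightLogical(a.AsUInt64(), 32).AsUInt32(), b.AsUInt32());

        // paddq + psllq 32: fold the cross terms into the upper 32 bits, then add the low product
        Vector128<ulong> high = Sse2.ShiftLeftLogical(Sse2.Add(cross1, cross2), 32);
        return Sse2.Add(low, high).AsInt64();
    }
}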

@saucecontrol (Member, Author) commented:

@EgorBot -amd -intel --envvars DOTNET_EnableAVX512F:0

using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 64);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply128()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector256<long> Multiply256()
    {
        long* ptr = data, end = ptr + nitems - Vector256<long>.Count;
        var res = Vector256<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<long>.Count);
            ptr += Vector256<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector<long> MultiplyVectorT()
    {
        long* ptr = data, end = ptr + nitems - Vector<long>.Count;
        var res = Vector<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector.Load(ptr) * Vector.Load(ptr + Vector<long>.Count);
            ptr += Vector<long>.Count;
        }

        return res;
    }
}

@saucecontrol (Member, Author) commented:

cc @EgorBo I believe you were the last to touch most of this

@EgorBo self-requested a review March 10, 2025 19:43
@EgorBo (Member) commented Mar 11, 2025

/azp run Fuzzlyn, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo (Member) left a comment

LGTM, thanks!

@EgorBo merged commit f565711 into dotnet:main Mar 11, 2025 (126 of 139 checks passed)
@saucecontrol deleted the vdot branch March 11, 2025 20:59
@github-actions bot locked and limited conversation to collaborators Apr 11, 2025