Skip to content

Possible optimization opportunity of integer multiplication by constant #75119

@MineCake147E

Description

@MineCake147E

Description

Unlike the LLVM, RyuJIT is reluctant to optimize integer multiplication.

Example: Multiplying by 7

C#

using System;
public static class C
{
    public static ulong M7(ulong value) => value * 7;
}
C.M7(UInt64)
    L0000: imul rax, rcx, 7
    L0004: ret

C++ with Clang 16.0.0

#include <cstdint>

uint64_t M7(uint64_t value) { return value * 7; }
M7(unsigned long):                                 # @M7(unsigned long)
        lea     rax, [8*rdi]
        sub     rax, rdi
        ret

Configuration

For unsigned integer constants less than 127 that is optimized by LLVM with no imul instruction:
SharpLab
Compiler Explorer

Regression?

No

Data

Analysis

imul instruction with 8bit/32bit immediate value

This is what we are trying to avoid. uops.info says imul instruction takes at least 3 clock cycles to calculate the result, and takes around 1 clock cycle to execute another instruction that can only be run in the same execution port, on typical x86 CPUs produced by either Intel or AMD.

The way of multiplying integer with constant integer without imul instruction

Advanced lea sequence

lea stands for 'Load Effective Address'. x86 lea instruction is intended to be used for calculating memory addresses, but not only that, it enables us to calculate some simple constant integer multiplications, and even integer fused multiply add in some cases.
Formulas for Effective Addresses can be one of:

  • $Displacement$
  • $Base + Displacement$
  • $(Index * Scale) + Displacement$
  • $Base + (Index * Scale) + Displacement$

The "Scale" can be 1, 2, 4, or 8, and the "Displacement" can be a constant integer immediate value including 0.
The "Base" and the "Index" can be any of the following registers: RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI.

Using a single lea instruction, we can multiply integers with some constant integers like 3, 5, and 9.
Current RyuJIT can handle this situation correctly:

using System;
public static class C
{
    public static ulong M3(ulong value) => value * 3;
    public static ulong M5(ulong value) => value * 5;
    public static ulong M9(ulong value) => value * 9;
}
; Core CLR 6.0.722.32202 on amd64
## value is in rcx
C.M3(UInt64)
    L0000: lea rax, [rcx+rcx*2]
    L0004: ret

C.M5(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: ret

C.M9(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: ret

But we can do even more.
For instance, let's see how two compilers optimize multiplication by 11, 15, and 19:

#include <cstdint>

uint64_t M11(uint64_t value) { return value * 11; }
uint64_t M15(uint64_t value) { return value * 15; }
uint64_t M19(uint64_t value) { return value * 19; }
## value is in rdi
M11(unsigned long):                                # @M11(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (5 * rdi) * 2 = rdi * 11
M15(unsigned long):                                # @M15(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rax + 2*rax]                 # rax = rax * 3
        ret                                        # now rax is (5 * rdi) * 3 = rdi * 15
M19(unsigned long):                                # @M19(unsigned long)
        lea     rax, [rdi + 8*rdi]                 # rax = 9 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (9 * rdi) * 2 = rdi * 19
using System;
public static class C
{
    public static ulong M11(ulong value) => value * 11;
    public static ulong M15(ulong value) => value * 15;
    public static ulong M19(ulong value) => value * 19;
}
; Core CLR 6.0.722.32202 on amd64

C.M11(UInt64)
    L0000: imul rax, rcx, 0xb   # It blindly executes `imul` instruction...
    L0004: ret

C.M15(UInt64)
    L0000: imul rax, rcx, 0xf   # It blindly executes `imul` instruction...
    L0004: ret

C.M19(UInt64)
    L0000: imul rax, rcx, 0x13  # It blindly executes `imul` instruction...
    L0004: ret

We can observe LLVM is doing something new. The $Base + (Index * Scale) + Displacement$ pattern is used twice, with Displacement set to 0, and Base and Index are different sometimes. We can multiply an integer by either 1, 2, 4, or 8, then add with another integer.

shl followed by lea

lea instruction can also be used for multiplying integer with power of 2 added with either 1, 2, 4, or 8, though LLVM mysteriously neglects to do that with scale 8.
Let's see how two compilers optimize multiplication by 66, 68, and 70:

using System;
public static class C
{
    public static ulong M66(ulong value) => value * 66;
    public static ulong M68(ulong value) => value * 68;
    public static ulong M70(ulong value) => value * 70;
}
; Core CLR 6.0.722.32202 on amd64

C.M66(UInt64)
    L0000: imul rax, rcx, 0x42      # Needless to say, it blindly executes `imul` instruction.
    L0004: ret

C.M68(UInt64)
    L0000: imul rax, rcx, 0x44
    L0004: ret

C.M70(UInt64)
    L0000: imul rax, rcx, 0x46
    L0004: ret
#include <cstdint>

uint64_t M66(uint64_t value) { return value * 66; }
uint64_t M68(uint64_t value) { return value * 68; }
uint64_t M70(uint64_t value) { return value * 70; }
M66(unsigned long):                                # @M66(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 2*rdi]                 # rax = rax + rdi * 2
        ret                                        # now rax is rdi * 64 + rdi * 2 = rdi * 66
M68(unsigned long):                                # @M68(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 4*rdi]                 # rax = rax + rdi * 4
        ret                                        # now rax is rdi * 64 + rdi * 2 = rdi * 68
M70(unsigned long):                                # @M70(unsigned long)
        imul    rax, rdi, 70                       # IT EXECUTES `imul` INSTRUCTION!
        ret

shl followed by sub

sub instruction can be used for multiplying integer with power of 2 subtracted by 1.
By executing sub twice, it can be used for multiplying integer with power of 2 subtracted by 2.
Let's see how two compilers optimize multiplication by 62 and 63:

using System;
public static class C
{
    public static ulong M62(ulong value) => value * 62;
    public static ulong M63(ulong value) => value * 63;
}
; Core CLR 6.0.722.32202 on amd64

C.M62(UInt64)
    L0000: imul rax, rcx, 0x3e
    L0004: ret

C.M63(UInt64)
    L0000: imul rax, rcx, 0x3f
    L0004: ret
#include <cstdint>

uint64_t M62(uint64_t value) { return value * 62; }
uint64_t M63(uint64_t value) { return value * 63; }
M62(unsigned long):                                # @M62(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        sub     rax, rdi
        ret
M63(unsigned long):                                # @M63(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        ret

Multiplying by power of 2

When a composite number contains one or more 2s in its prime factorization, both compiler can optimize such multiplication.
For instance, let's see how two compilers optimize multiplication by 6, 20, and 72:

using System;
public static class C
{
    public static ulong M6(ulong value) => value * 6;
    public static ulong M20(ulong value) => value * 20;
    public static ulong M72(ulong value) => value * 72;
}
; Core CLR 6.0.722.32202 on amd64

C.M6(UInt64)
    L0000: lea rax, [rcx+rcx*2]     # first multiply by 3,
    L0004: add rax, rax             # then `add rax, rax`
    L0007: ret

C.M20(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: shl rax, 2
    L0008: ret

C.M72(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: shl rax, 3
    L0008: ret
#include <cstdint>

uint64_t M6(uint64_t value) { return value * 6; }
uint64_t M20(uint64_t value) { return value * 20; }
uint64_t M72(uint64_t value) { return value * 72; }
M6(unsigned long):                                 # @M6(unsigned long)
        add     rdi, rdi                           # multiply by 2 first,
        lea     rax, [rdi + 2*rdi]                 # then multiply by 3
        ret
M20(unsigned long):                                # @M20(unsigned long)
        shl     rdi, 2
        lea     rax, [rdi + 4*rdi]
        ret
M72(unsigned long):                                # @M72(unsigned long)
        shl     rdi, 3
        lea     rax, [rdi + 8*rdi]
        ret

Interestingly, they have different preference of ordering. RyuJIT prefers multiplications to be earlier, while LLVM prefers shifts to be earlier.
Both compilers managed to optimize multiplication by 2 to add rdi, rdi (or whatever adds the same register together), though.

Combining these techniques

By combining these techniques described above, we can further optimize integer multiplication with constant, with more constant values.
Here is a list of formulas of integer multiplication by constant values that are less than 128. Only values that are optimized by LLVM is contained in this list.

Number RyuJIT LLVM
3 $(rdi + 2 * rdi)$ $(rdi + 2 * rdi)$
5 $(rdi + 4 * rdi)$ $(rdi + 4 * rdi)$
7 (imul) $(8 * rdi) - rdi$
9 $(rdi + 8 * rdi)$ $(rdi + 8 * rdi)$
11 (imul) $rdi + 2 * (5 * rdi)$
13 (imul) $rdi + 4 * (3 * rdi)$
14 (imul) $(16 * rdi) - rdi - rdi$
15 (imul) $(5 * rdi) * 3$
17 (imul) $rdi + 16 * rdi$
19 (imul) $rdi + 2 * (9 * rdi)$
21 (imul) $rdi + 4 * (5 * rdi)$
22 (imul) $(rdi + 4 * (5 * rdi)) + rdi$
23 (imul) $(8 * (3 * rdi)) - rdi$
25 (imul) $5 * (5 * rdi)$
26 (imul) $rdi + 5 * (5 * rdi)$
27 (imul) $3 * (9 * rdi)$
28 (imul) $rdi + 3 * (9 * rdi)$
29 (imul) $rdi + rdi + 3 * (9 * rdi)$
30 (imul) $(32 * rdi) - rdi - rdi$
31 (imul) $(32 * rdi) - rdi$
33 (imul) $(32 * rdi) + rdi$
34 (imul) $(32 * rdi) + 2 * rdi$
37 (imul) $rdi + 4 * (9 * rdi)$
41 (imul) $rdi + 8 * (5 * rdi)$
45 (imul) $5 * (9 * rdi)$
48 (imul) $3 * (4 * rdi)$
62 (imul) $(64 * rdi) - rdi - rdi$
63 (imul) $(64 * rdi) - rdi$
65 (imul) $(64 * rdi) + rdi$
66 (imul) $(64 * rdi) + 2 * rdi$
68 (imul) $(64 * rdi) + 4 * rdi$
73 (imul) $rdi + 8 * (9 * rdi)$
80 (imul) $5 * (8 * rdi)$
81 (imul) $9 * (9 * rdi)$
96 (imul) $3 * (32 * rdi)$
126 (imul) $(128 * rdi) - rdi - rdi$
127 (imul) $(128 * rdi) - rdi$

Any multiplication by power of two can be performed by an shl instruction.
Multiplication by 3, 5, or 9 can be performed by a single lea instruction.

Metadata

Metadata

Assignees

Labels

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMIgood first issueIssue should be easy to implement, good for first-time contributorshelp wanted[up-for-grabs] Good issue for external contributorstenet-performancePerformance related issue

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions