# Possible optimization opportunity of integer multiplication by constant #75119

## Description

Unlike LLVM, RyuJIT is reluctant to optimize integer multiplication by a constant.

### Example: Multiplying by 7

#### C#

```csharp
using System;
public static class C
{
    public static ulong M7(ulong value) => value * 7;
}
```

```asm
C.M7(UInt64)
    L0000: imul rax, rcx, 7
    L0004: ret
```

#### C++ with Clang 16.0.0

```cpp
#include <cstdint>

uint64_t M7(uint64_t value) { return value * 7; }
```

```asm
M7(unsigned long):                                 # @M7(unsigned long)
        lea     rax, [8*rdi]
        sub     rax, rdi
        ret
```
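
The Clang output relies on the identity 7x = 8x - x, which also holds under the modulo-2^64 wraparound of unsigned arithmetic. A minimal C++ sketch of that identity follows; the names `mul7_shift_sub` and `check_mul7` are made up for illustration and belong to neither compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the Clang output: value * 8 - value.
// Unsigned arithmetic wraps modulo 2^64, so this equals value * 7
// for every input, including values where value * 8 overflows.
constexpr uint64_t mul7_shift_sub(uint64_t value) {
    return (value << 3) - value;
}

// Spot check against the plain multiplication.
inline void check_mul7() {
    const uint64_t samples[] = {0, 1, 6, 0x123456789ABCDEFull, UINT64_MAX};
    for (uint64_t v : samples) {
        assert(mul7_shift_sub(v) == v * 7);
    }
}
```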

## Configuration

All unsigned integer constants below 128 that LLVM multiplies without an `imul` instruction:  
[SharpLab](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACY8pJ0hgYRoG8aH+mjZqwCuAGwgA7AOYMAsnQAU4qbIBu2MSJgBKBgF4AfAw1aYDAFQM6Abj4D6TFgxUz5AVmUS3p7XqMmmtqWDO521AKCTqLesnJIXqqBZv7GvuZWSOGRjsIusfIA7Ik+QboGaWUhhdkOQs6ucQCcJeplqcnBVk21/LkNBXLkAAytneUB6SEjvVF5jfLk5GPpHVNWS7P9MUlDpCvtFePTpFv1O25DSgurR+sM5Lb2fef5u+QoByl3VRsoZ9E3pdyJ4bodJr8HmFnnMBu9imDvhCzNMajDtkC4uQABxfPw/FEbbEA+aDcgtRH45FdB49dGvBZyUijSkTSqEhjMklwy4UPFs45WCjci5xUj7VlrSHikWY+Ska4FW7UjKcp4ROqAxmkT6Sgk0nWy7WgpXg9kG6Eal5awakBJ6lUhO1G20I01I82q0hoq2w0Xy3EOz1O4n0m27UgU91U4NCum+jGMugs6MC+7Jl27OjLIOC6zkTOXOiKpLK2PWdU5BmDOi61NSjm1wtxOj2+v61Wt5vyOhu0tmvO97tyFAp/sevOj4coHPtx1WGfTk3jmOTy1V8OXFCBufl7fDpAS3d5w8HkulCf3JCVzWk3ZIZcX1dX9e3nlxJBtldpyGfg8778GxpJBQwTatdkKI9AI7apTjDO9LkKc82kvSEkOHbExyfH8OUwjDZ2g+cGGxAt4PfeQmi/bCgNVSjh3IO1+Rok4sjI/09j7aiYI2b1wgAXyAA==)  
[Compiler Explorer](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxCCSAJykAA6oCoRODB7evnrJqY4CQSHhLFEx8baY9vkMQgRMxASZPn5cFVXptfUEhWGR0bEJCnUNTdmtQ109xaUDAJS2qF7EyOwc5gDMwcjeWADUJutuyEP4ggfYJhoAgpdXXsEEAGySAPoEuwCy6xD3gs9vuwAbmIvJhZvsAOwWXbETAEJYMIEgzC7ABUu3WB2hJghABFbr8nq93h8AKw/B7/d7A7xgyHQ2Hw4iImmgtG7UlYyH466EqmfR4Uv7EpG08E4hlwhGitnox5cnE8u6UkUfCFCokA1l0iUwqXMmUo9EQhV4gkqgEfOIa/na8VQvVMlnI9lxU1KvmqrgaG0iu30x3S7Xs73u83Cy1cLi%2BrXI%2B2Sp2GkNcMO8i0krhmGPUuMBxlBl3ozOp5URjPfT2xsV5/XO2khzHrbFmtNlz5SbOG%2BOBg3BouSEuVjPkoddmuJvu7Licpvc8OajPq0f%2B3X53uFqcm2eK%2Bf8j5cAAcnZXDrXddlU4Pg/T7ety9zq9rSaLbu3LdLC8%2BZh99%2Brj4nG7ftebYfGY0a/qC3Zns%2BuxgcBn6gVmEE6qeT6TmYZjwXuZgVjeJ4JgW9bojhWGqmYkjHg%2BqEAURsEDm%2BHo3qBI54VRBHrrRZgzs2jEgWYgrIVBaGAfKDG7mRS6sX%2B1GEReZhbjx4mWmYR6CeOslGrBV5ia2CFmHeUmQepHFya%2Bim6Xu6w/oZKHseemlWaRlrrOBNlCTRF4uU5JLrLhbb4T29nsr53mfOsFFqf%2BGnBfR5kfpZAlucZQXousolxaOXySf5bGBTB6wKXOFmqpI1k5dJdkwaVoUfJIrnlUZUUmZpdU1ZILENbZeWTu1bWqUlTUpbskjaRlTGPEhA0yc17ITTVjx%2BZ%2BAXQZOC3zR1S25StG6PNxRXxaqjyJZ17nRXK6X7Zljz9SdyUwddNUQpNt2DTBT2PYttpbcJtEQo2Y0gQeZWbRV3UbkDNUHvVIONdNQ1QzVcTHTDXXbbRSM1ZmyNfaDaMXljmPyZRuM/fj8klhw8y0JwpK8H4HBaKQqCcG41jWLsCiLMsKIbDwpAEJolPzAA1iApIaAAdFIaWSPEh5cI8EIaKSpL6Jwkh04LTOcLwCggBo/OC/McCwEgaAsIkdDROQlDm5b9AxNshjAG8xBeAwwt8HQBDRHrEARFrETBPUACenB8%2BbbCCAA8gwtBhwzvBYCwzviInpD4LCDh4ICmB6%2BnmCqJgyBeD74e8A8lRa7QeARMQoceFgWsEMQeAsOX8xUAYwAKAAangmAAO7R4kjDlzIggiGI7BSBP8hKGoWu6GY%2BjOygbOWPotd65A8yoIk1T5wAtCcBy4qYljWN%2BuxH9HVBMEMN8p8sCBn2IWDEAYwuYLrlTF9ULgGDuE8M0fwQCph9BiK0XIaQBCjBaEkFIsCGAQJKP0cYf9s4CE6CMEBYw2j/w6MMbowRehoKgbYYh8C9ATAaKgmYXB5icyWCsPQLdMCrB4FTGmmt07Mw4KoA8jwj7PF2E7IwuwIAt3dsLcEEBWaX03rsXAhASD7Bwq0XYHgLZW2IOo9YZhZi8AFonWYIsBgSwPKSBaKsNBQwPAYpGasOAa1IPTRm/Ddb60NqY7hHAzC8I8TrHxWgzGkFzsQVIzhJBAA)

## Regression?

No

## Data

## Analysis

### `imul` instruction with 8bit/32bit immediate value

This is the instruction we are trying to avoid. According to [uops.info](https://uops.info/table.html), on typical x86 CPUs from both Intel and AMD, `imul` takes at least 3 clock cycles to produce its result, and it occupies its execution port for about 1 clock cycle, delaying any other instruction that can only run on that port.

### Multiplying an integer by a constant without the `imul` instruction

#### Advanced `lea` sequence

`lea` stands for 'Load Effective Address'. The x86 `lea` instruction is intended for calculating memory addresses, but it can also perform some simple multiplications by a constant, and in some cases even a fused integer multiply-add.  
Formulas for Effective Addresses can be one of:

- $Displacement$
- $Base + Displacement$
- $(Index * Scale) + Displacement$
- $Base + (Index * Scale) + Displacement$

The "Scale" can be 1, 2, 4, or 8, and the "Displacement" can be a constant integer immediate value, including 0.  
The "Base" and the "Index" can each be a general-purpose register: RAX, RBX, RCX, RDX, RBP, RSI, RDI, or R8 to R15 in 64-bit mode. RSP can serve as the "Base" but cannot be used as the "Index".
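
To make the address formula concrete, here is a small C++ model of the value a 64-bit `lea` computes. The function `lea` and its parameter names are assumptions made for this sketch; it models only the arithmetic, not the instruction encoding or its restrictions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the value computed by a 64-bit `lea`:
// base + index * scale + displacement, wrapping modulo 2^64.
// The real encoding only permits a scale of 1, 2, 4, or 8 (not checked
// here) and a sign-extended displacement; no memory is accessed.
constexpr uint64_t lea(uint64_t base, uint64_t index, unsigned scale,
                       int32_t displacement = 0) {
    return base + index * scale +
           static_cast<uint64_t>(static_cast<int64_t>(displacement));
}

inline void check_lea() {
    const uint64_t x = 10;
    assert(lea(x, x, 2) == x * 3);             // lea rax, [rcx+rcx*2]
    assert(lea(x, x, 4) == x * 5);             // lea rax, [rcx+rcx*4]
    assert(lea(x, x, 8) == x * 9);             // lea rax, [rcx+rcx*8]
    assert(lea(7, x, 4, 1) == 7 + x * 4 + 1);  // multiply-add with a displacement
    assert(lea(0, x, 8, -8) == x * 8 - 8);     // negative displacement
}
```

With Base and Index set to the same register, the scales 2, 4, and 8 give the multipliers 3, 5, and 9.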

Using a single `lea` instruction, we can multiply an integer by constants such as 3, 5, and 9.  
Current RyuJIT already handles these cases:

```csharp
using System;
public static class C
{
    public static ulong M3(ulong value) => value * 3;
    public static ulong M5(ulong value) => value * 5;
    public static ulong M9(ulong value) => value * 9;
}
```

```asm
; Core CLR 6.0.722.32202 on amd64
## value is in rcx
C.M3(UInt64)
    L0000: lea rax, [rcx+rcx*2]
    L0004: ret

C.M5(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: ret

C.M9(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: ret
```

But we can do even more.  
For instance, let's see how the two compilers optimize multiplication by 11, 15, and 19:

```cpp
#include <cstdint>

uint64_t M11(uint64_t value) { return value * 11; }
uint64_t M15(uint64_t value) { return value * 15; }
uint64_t M19(uint64_t value) { return value * 19; }
```

```asm
## value is in rdi
M11(unsigned long):                                # @M11(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (5 * rdi) * 2 = rdi * 11
M15(unsigned long):                                # @M15(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rax + 2*rax]                 # rax = rax * 3
        ret                                        # now rax is (5 * rdi) * 3 = rdi * 15
M19(unsigned long):                                # @M19(unsigned long)
        lea     rax, [rdi + 8*rdi]                 # rax = 9 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (9 * rdi) * 2 = rdi * 19
```

```csharp
using System;
public static class C
{
    public static ulong M11(ulong value) => value * 11;
    public static ulong M15(ulong value) => value * 15;
    public static ulong M19(ulong value) => value * 19;
}
```

```asm
; Core CLR 6.0.722.32202 on amd64

C.M11(UInt64)
    L0000: imul rax, rcx, 0xb   # It blindly executes `imul` instruction...
    L0004: ret

C.M15(UInt64)
    L0000: imul rax, rcx, 0xf   # It blindly executes `imul` instruction...
    L0004: ret

C.M19(UInt64)
    L0000: imul rax, rcx, 0x13  # It blindly executes `imul` instruction...
    L0004: ret
```

Here LLVM does something RyuJIT does not: it chains two `lea` instructions. Each one uses the $Base + (Index * Scale) + Displacement$ form with Displacement set to 0, and in the second `lea` the Base and the Index can be different registers. A single `lea` can therefore multiply one value by 1, 2, 4, or 8 and add another value to the result.
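
The two-`lea` sequences can be replayed in C++ to confirm the arithmetic. This is only a sketch: the helper `lea` and the `mul11`, `mul15`, `mul19` functions are illustrative names, not part of LLVM or RyuJIT.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of `lea dst, [base + scale*index]` (displacement 0).
constexpr uint64_t lea(uint64_t base, uint64_t index, unsigned scale) {
    return base + index * scale;
}

// Each function chains two `lea` steps like the Clang output above.
constexpr uint64_t mul11(uint64_t x) { return lea(x, lea(x, x, 4), 2); }  // x + 2 * (5x)
constexpr uint64_t mul19(uint64_t x) { return lea(x, lea(x, x, 8), 2); }  // x + 2 * (9x)
inline uint64_t mul15(uint64_t x) {
    const uint64_t t = lea(x, x, 4);  // t = 5x
    return lea(t, t, 2);              // t + 2t = 15x
}

inline void check_two_lea() {
    const uint64_t samples[] = {0, 1, 12345, UINT64_MAX};
    for (uint64_t x : samples) {
        assert(mul11(x) == x * 11);
        assert(mul15(x) == x * 15);
        assert(mul19(x) == x * 19);
    }
}
```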

#### `shl` followed by `lea`

A `shl` followed by a `lea` can multiply an integer by a power of 2 plus 1, 2, 4, or 8, though LLVM mysteriously neglects to do that with scale 8.  
Let's see how the two compilers handle multiplication by 66, 68, and 70. Note that 66 = 64 + 2 and 68 = 64 + 4 fit this pattern, while 70 = 64 + 6 does not:

```csharp
using System;
public static class C
{
    public static ulong M66(ulong value) => value * 66;
    public static ulong M68(ulong value) => value * 68;
    public static ulong M70(ulong value) => value * 70;
}
```

```asm
; Core CLR 6.0.722.32202 on amd64

C.M66(UInt64)
    L0000: imul rax, rcx, 0x42      # Needless to say, it blindly executes `imul` instruction.
    L0004: ret

C.M68(UInt64)
    L0000: imul rax, rcx, 0x44
    L0004: ret

C.M70(UInt64)
    L0000: imul rax, rcx, 0x46
    L0004: ret
```

```cpp
#include <cstdint>

uint64_t M66(uint64_t value) { return value * 66; }
uint64_t M68(uint64_t value) { return value * 68; }
uint64_t M70(uint64_t value) { return value * 70; }
```

```asm
M66(unsigned long):                                # @M66(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 2*rdi]                 # rax = rax + rdi * 2
        ret                                        # now rax is rdi * 64 + rdi * 2 = rdi * 66
M68(unsigned long):                                # @M68(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 4*rdi]                 # rax = rax + rdi * 4
        ret                                        # now rax is rdi * 64 + rdi * 4 = rdi * 68
M70(unsigned long):                                # @M70(unsigned long)
        imul    rax, rdi, 70                       # IT EXECUTES `imul` INSTRUCTION!
        ret
```
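
The `shl` + `lea` pattern generalizes to any multiplier of the form 2^n + s with s in {1, 2, 4, 8}. A hedged C++ sketch of the arithmetic, where `mul_pow2_plus` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: multiplies x by (2^shift + scale) with one shift
// and one `lea`-style add. `scale` is expected to be 1, 2, 4, or 8 and
// `shift` to be below 64; neither is checked here.
constexpr uint64_t mul_pow2_plus(uint64_t x, unsigned shift, unsigned scale) {
    return (x << shift) + x * scale;  // shl rax, shift; lea rax, [rax + scale*rdi]
}

inline void check_shl_lea() {
    const uint64_t samples[] = {0, 1, 12345, UINT64_MAX};
    for (uint64_t x : samples) {
        assert(mul_pow2_plus(x, 6, 2) == x * 66);
        assert(mul_pow2_plus(x, 6, 4) == x * 68);
        assert(mul_pow2_plus(x, 6, 8) == x * 72);  // the arithmetic also works for scale 8
    }
}
```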

#### `shl` followed by `sub`

A `shl` followed by a `sub` multiplies an integer by a power of 2 minus 1.  
A second `sub` extends this to a power of 2 minus 2.  
Let's see how the two compilers optimize multiplication by 62 and 63:

```csharp
using System;
public static class C
{
    public static ulong M62(ulong value) => value * 62;
    public static ulong M63(ulong value) => value * 63;
}
```

```asm
; Core CLR 6.0.722.32202 on amd64

C.M62(UInt64)
    L0000: imul rax, rcx, 0x3e
    L0004: ret

C.M63(UInt64)
    L0000: imul rax, rcx, 0x3f
    L0004: ret
```

```cpp
#include <cstdint>

uint64_t M62(uint64_t value) { return value * 62; }
uint64_t M63(uint64_t value) { return value * 63; }
```

```asm
M62(unsigned long):                                # @M62(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        sub     rax, rdi
        ret
M63(unsigned long):                                # @M63(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        ret
```
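
The same arithmetic in C++, as a sketch. The helper names `mul_pow2_minus1` and `mul_pow2_minus2` are assumptions made for illustration; the subtraction is safe for unsigned values because the result wraps modulo 2^64 exactly like the multiplication does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; `shift` is expected to be below 64.
// x * (2^shift - 1): shl, then one sub.
constexpr uint64_t mul_pow2_minus1(uint64_t x, unsigned shift) {
    return (x << shift) - x;
}
// x * (2^shift - 2): shl, then two subs.
constexpr uint64_t mul_pow2_minus2(uint64_t x, unsigned shift) {
    return (x << shift) - x - x;
}

inline void check_shl_sub() {
    const uint64_t samples[] = {0, 1, 12345, UINT64_MAX};
    for (uint64_t x : samples) {
        assert(mul_pow2_minus1(x, 6) == x * 63);
        assert(mul_pow2_minus2(x, 6) == x * 62);
        assert(mul_pow2_minus1(x, 7) == x * 127);
        assert(mul_pow2_minus2(x, 7) == x * 126);
    }
}
```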

#### Multiplying by power of 2

When the constant is 3, 5, or 9 multiplied by a small power of 2, both compilers can optimize the multiplication.  
For instance, let's see how the two compilers optimize multiplication by 6, 20, and 72:

```csharp
using System;
public static class C
{
    public static ulong M6(ulong value) => value * 6;
    public static ulong M20(ulong value) => value * 20;
    public static ulong M72(ulong value) => value * 72;
}
```

```asm
; Core CLR 6.0.722.32202 on amd64

C.M6(UInt64)
    L0000: lea rax, [rcx+rcx*2]     # first multiply by 3,
    L0004: add rax, rax             # then `add rax, rax`
    L0007: ret

C.M20(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: shl rax, 2
    L0008: ret

C.M72(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: shl rax, 3
    L0008: ret
```

```cpp
#include <cstdint>

uint64_t M6(uint64_t value) { return value * 6; }
uint64_t M20(uint64_t value) { return value * 20; }
uint64_t M72(uint64_t value) { return value * 72; }
```

```asm
M6(unsigned long):                                 # @M6(unsigned long)
        add     rdi, rdi                           # multiply by 2 first,
        lea     rax, [rdi + 2*rdi]                 # then multiply by 3
        ret
M20(unsigned long):                                # @M20(unsigned long)
        shl     rdi, 2
        lea     rax, [rdi + 4*rdi]
        ret
M72(unsigned long):                                # @M72(unsigned long)
        shl     rdi, 3
        lea     rax, [rdi + 8*rdi]
        ret
```

Interestingly, the two compilers order the instructions differently: RyuJIT emits the `lea` multiplication first, while LLVM emits the shift first.  
Both compilers, however, turn the multiplication by 2 into an `add` of a register to itself (`add rax, rax` or `add rdi, rdi`).
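
Both orderings give the same result because multiplication distributes over addition modulo 2^64. A short C++ sketch, with the hypothetical helpers `mul_lea_then_shl` (RyuJIT's order) and `mul_shl_then_lea` (LLVM's order):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; `scale` is 2, 4, or 8 and `shift` is below 64.
// RyuJIT order: lea first, then shift.
constexpr uint64_t mul_lea_then_shl(uint64_t x, unsigned scale, unsigned shift) {
    return (x + x * scale) << shift;
}
// LLVM order: shift first, then lea.
constexpr uint64_t mul_shl_then_lea(uint64_t x, unsigned scale, unsigned shift) {
    return (x << shift) + (x << shift) * scale;
}

inline void check_ordering() {
    const uint64_t samples[] = {0, 1, 12345, UINT64_MAX};
    for (uint64_t x : samples) {
        assert(mul_lea_then_shl(x, 2, 1) == x * 6);
        assert(mul_shl_then_lea(x, 2, 1) == x * 6);
        assert(mul_lea_then_shl(x, 4, 2) == x * 20);
        assert(mul_shl_then_lea(x, 4, 2) == x * 20);
        assert(mul_lea_then_shl(x, 8, 3) == x * 72);
        assert(mul_shl_then_lea(x, 8, 3) == x * 72);
    }
}
```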

### Combining these techniques

By combining the techniques described above, we can optimize integer multiplication for many more constants.  
The table below lists the formulas for multiplication by constants below 128. Only values that LLVM optimizes are included.

| Number | RyuJIT            | LLVM                          |
| ------ | ----------------- | ----------------------------- |
| 3      | $(rdi + 2 * rdi)$ | $(rdi + 2 * rdi)$             |
| 5      | $(rdi + 4 * rdi)$ | $(rdi + 4 * rdi)$             |
| 7      | (imul)            | $(8 * rdi) - rdi$             |
| 9      | $(rdi + 8 * rdi)$ | $(rdi + 8 * rdi)$             |
| 11     | (imul)            | $rdi + 2 * (5 * rdi)$         |
| 13     | (imul)            | $rdi + 4 * (3 * rdi)$         |
| 14     | (imul)            | $(16 * rdi) - rdi - rdi$      |
| 15     | (imul)            | $(5 * rdi) * 3$               |
| 17     | (imul)            | $rdi + 16 * rdi$              |
| 19     | (imul)            | $rdi + 2 * (9 * rdi)$         |
| 21     | (imul)            | $rdi + 4 * (5 * rdi)$         |
| 22     | (imul)            | $(rdi + 4 * (5 * rdi)) + rdi$ |
| 23     | (imul)            | $(8 * (3 * rdi)) - rdi$       |
| 25     | (imul)            | $5 * (5 * rdi)$               |
| 26     | (imul)            | $rdi + 5 * (5 * rdi)$         |
| 27     | (imul)            | $3 * (9 * rdi)$               |
| 28     | (imul)            | $rdi + 3 * (9 * rdi)$         |
| 29     | (imul)            | $rdi + rdi + 3 * (9 * rdi)$   |
| 30     | (imul)            | $(32 * rdi) - rdi - rdi$      |
| 31     | (imul)            | $(32 * rdi) - rdi$            |
| 33     | (imul)            | $(32 * rdi) + rdi$            |
| 34     | (imul)            | $(32 * rdi) + 2 * rdi$        |
| 37     | (imul)            | $rdi + 4 * (9 * rdi)$         |
| 41     | (imul)            | $rdi + 8 * (5 * rdi)$         |
| 45     | (imul)            | $5 * (9 * rdi)$               |
| 48     | (imul)            | $3 * (16 * rdi)$              |
| 62     | (imul)            | $(64 * rdi) - rdi - rdi$      |
| 63     | (imul)            | $(64 * rdi) - rdi$            |
| 65     | (imul)            | $(64 * rdi) + rdi$            |
| 66     | (imul)            | $(64 * rdi) + 2 * rdi$        |
| 68     | (imul)            | $(64 * rdi) + 4 * rdi$        |
| 73     | (imul)            | $rdi + 8 * (9 * rdi)$         |
| 80     | (imul)            | $5 * (16 * rdi)$              |
| 81     | (imul)            | $9 * (9 * rdi)$               |
| 96     | (imul)            | $3 * (32 * rdi)$              |
| 126    | (imul)            | $(128 * rdi) - rdi - rdi$     |
| 127    | (imul)            | $(128 * rdi) - rdi$           |

Any multiplication by a power of two can be performed by a single `shl` instruction.  
Multiplication by 3, 5, or 9 can be performed by a single `lea` instruction.
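
Some of the three-step rows in the table (22, 23, 26, and 29) can be checked the same way. The sketch below only replays the formulas from the LLVM column; the function names and the `lea` helper are assumptions for illustration, and the exact instruction selection is up to the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of `lea dst, [base + scale*index]` (displacement 0).
constexpr uint64_t lea(uint64_t base, uint64_t index, unsigned scale) {
    return base + index * scale;
}

inline uint64_t mul22(uint64_t x) {
    const uint64_t t = lea(x, x, 4);  // 5x
    return lea(x, t, 4) + x;          // (x + 4 * 5x) + x
}
inline uint64_t mul23(uint64_t x) {
    const uint64_t t = lea(x, x, 2);  // 3x
    return (t << 3) - x;              // 8 * 3x - x
}
inline uint64_t mul26(uint64_t x) {
    const uint64_t t = lea(x, x, 4);  // 5x
    return x + lea(t, t, 4);          // x + 5 * 5x
}
inline uint64_t mul29(uint64_t x) {
    const uint64_t t = lea(x, x, 8);  // 9x
    return x + x + lea(t, t, 2);      // x + x + 3 * 9x
}

inline void check_combined() {
    const uint64_t samples[] = {0, 1, 12345, UINT64_MAX};
    for (uint64_t x : samples) {
        assert(mul22(x) == x * 22);
        assert(mul23(x) == x * 23);
        assert(mul26(x) == x * 26);
        assert(mul29(x) == x * 29);
    }
}
```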

