-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
Description
Unlike the LLVM, RyuJIT is reluctant to optimize integer multiplication.
Example: Multiplying by 7
C#
using System;
public static class C
{
public static ulong M7(ulong value) => value * 7;
}C.M7(UInt64)
L0000: imul rax, rcx, 7
L0004: retC++ with Clang 16.0.0
#include <cstdint>
uint64_t M7(uint64_t value) { return value * 7; }M7(unsigned long): # @M7(unsigned long)
lea rax, [8*rdi]
sub rax, rdi
retConfiguration
For unsigned integer constants less than 127 that is optimized by LLVM with no imul instruction:
SharpLab
Compiler Explorer
Regression?
No
Data
Analysis
imul instruction with 8bit/32bit immediate value
This is what we are trying to avoid. uops.info says imul instruction takes at least 3 clock cycles to calculate the result, and takes around 1 clock cycle to execute another instruction that can only be run in the same execution port, on typical x86 CPUs produced by either Intel or AMD.
The way of multiplying integer with constant integer without imul instruction
Advanced lea sequence
lea stands for 'Load Effective Address'. x86 lea instruction is intended to be used for calculating memory addresses, but not only that, it enables us to calculate some simple constant integer multiplications, and even integer fused multiply add in some cases.
Formulas for Effective Addresses can be one of:
$Displacement$ $Base + Displacement$ $(Index * Scale) + Displacement$ $Base + (Index * Scale) + Displacement$
The "Scale" can be 1, 2, 4, or 8, and the "Displacement" can be a constant integer immediate value including 0.
The "Base" and the "Index" can be any of the following registers: RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI.
Using a single lea instruction, we can multiply integers with some constant integers like 3, 5, and 9.
Current RyuJIT can handle this situation correctly:
using System;
public static class C
{
public static ulong M3(ulong value) => value * 3;
public static ulong M5(ulong value) => value * 5;
public static ulong M9(ulong value) => value * 9;
}; Core CLR 6.0.722.32202 on amd64
## value is in rcx
C.M3(UInt64)
L0000: lea rax, [rcx+rcx*2]
L0004: ret
C.M5(UInt64)
L0000: lea rax, [rcx+rcx*4]
L0004: ret
C.M9(UInt64)
L0000: lea rax, [rcx+rcx*8]
L0004: retBut we can do even more.
For instance, let's see how two compilers optimize multiplication by 11, 15, and 19:
#include <cstdint>
uint64_t M11(uint64_t value) { return value * 11; }
uint64_t M15(uint64_t value) { return value * 15; }
uint64_t M19(uint64_t value) { return value * 19; }## value is in rdi
M11(unsigned long): # @M11(unsigned long)
lea rax, [rdi + 4*rdi] # rax = 5 * rdi
lea rax, [rdi + 2*rax] # rax = rdi + rax * 2
ret # now rax is rdi + (5 * rdi) * 2 = rdi * 11
M15(unsigned long): # @M15(unsigned long)
lea rax, [rdi + 4*rdi] # rax = 5 * rdi
lea rax, [rax + 2*rax] # rax = rax * 3
ret # now rax is (5 * rdi) * 3 = rdi * 15
M19(unsigned long): # @M19(unsigned long)
lea rax, [rdi + 8*rdi] # rax = 9 * rdi
lea rax, [rdi + 2*rax] # rax = rdi + rax * 2
ret # now rax is rdi + (9 * rdi) * 2 = rdi * 19using System;
public static class C
{
public static ulong M11(ulong value) => value * 11;
public static ulong M15(ulong value) => value * 15;
public static ulong M19(ulong value) => value * 19;
}; Core CLR 6.0.722.32202 on amd64
C.M11(UInt64)
L0000: imul rax, rcx, 0xb # It blindly executes `imul` instruction...
L0004: ret
C.M15(UInt64)
L0000: imul rax, rcx, 0xf # It blindly executes `imul` instruction...
L0004: ret
C.M19(UInt64)
L0000: imul rax, rcx, 0x13 # It blindly executes `imul` instruction...
L0004: retWe can observe LLVM is doing something new. The
shl followed by lea
lea instruction can also be used for multiplying integer with power of 2 added with either 1, 2, 4, or 8, though LLVM mysteriously neglects to do that with scale 8.
Let's see how two compilers optimize multiplication by 66, 68, and 70:
using System;
public static class C
{
public static ulong M66(ulong value) => value * 66;
public static ulong M68(ulong value) => value * 68;
public static ulong M70(ulong value) => value * 70;
}; Core CLR 6.0.722.32202 on amd64
C.M66(UInt64)
L0000: imul rax, rcx, 0x42 # Needless to say, it blindly executes `imul` instruction.
L0004: ret
C.M68(UInt64)
L0000: imul rax, rcx, 0x44
L0004: ret
C.M70(UInt64)
L0000: imul rax, rcx, 0x46
L0004: ret#include <cstdint>
uint64_t M66(uint64_t value) { return value * 66; }
uint64_t M68(uint64_t value) { return value * 68; }
uint64_t M70(uint64_t value) { return value * 70; }M66(unsigned long): # @M66(unsigned long)
mov rax, rdi
shl rax, 6 # rax = rdi * 64
lea rax, [rax + 2*rdi] # rax = rax + rdi * 2
ret # now rax is rdi * 64 + rdi * 2 = rdi * 66
M68(unsigned long): # @M68(unsigned long)
mov rax, rdi
shl rax, 6 # rax = rdi * 64
lea rax, [rax + 4*rdi] # rax = rax + rdi * 4
ret # now rax is rdi * 64 + rdi * 2 = rdi * 68
M70(unsigned long): # @M70(unsigned long)
imul rax, rdi, 70 # IT EXECUTES `imul` INSTRUCTION!
retshl followed by sub
sub instruction can be used for multiplying integer with power of 2 subtracted by 1.
By executing sub twice, it can be used for multiplying integer with power of 2 subtracted by 2.
Let's see how two compilers optimize multiplication by 62 and 63:
using System;
public static class C
{
public static ulong M62(ulong value) => value * 62;
public static ulong M63(ulong value) => value * 63;
}; Core CLR 6.0.722.32202 on amd64
C.M62(UInt64)
L0000: imul rax, rcx, 0x3e
L0004: ret
C.M63(UInt64)
L0000: imul rax, rcx, 0x3f
L0004: ret#include <cstdint>
uint64_t M62(uint64_t value) { return value * 62; }
uint64_t M63(uint64_t value) { return value * 63; }M62(unsigned long): # @M62(unsigned long)
mov rax, rdi
shl rax, 6
sub rax, rdi
sub rax, rdi
ret
M63(unsigned long): # @M63(unsigned long)
mov rax, rdi
shl rax, 6
sub rax, rdi
retMultiplying by power of 2
When a composite number contains one or more 2s in its prime factorization, both compiler can optimize such multiplication.
For instance, let's see how two compilers optimize multiplication by 6, 20, and 72:
using System;
public static class C
{
public static ulong M6(ulong value) => value * 6;
public static ulong M20(ulong value) => value * 20;
public static ulong M72(ulong value) => value * 72;
}; Core CLR 6.0.722.32202 on amd64
C.M6(UInt64)
L0000: lea rax, [rcx+rcx*2] # first multiply by 3,
L0004: add rax, rax # then `add rax, rax`
L0007: ret
C.M20(UInt64)
L0000: lea rax, [rcx+rcx*4]
L0004: shl rax, 2
L0008: ret
C.M72(UInt64)
L0000: lea rax, [rcx+rcx*8]
L0004: shl rax, 3
L0008: ret#include <cstdint>
uint64_t M6(uint64_t value) { return value * 6; }
uint64_t M20(uint64_t value) { return value * 20; }
uint64_t M72(uint64_t value) { return value * 72; }M6(unsigned long): # @M6(unsigned long)
add rdi, rdi # multiply by 2 first,
lea rax, [rdi + 2*rdi] # then multiply by 3
ret
M20(unsigned long): # @M20(unsigned long)
shl rdi, 2
lea rax, [rdi + 4*rdi]
ret
M72(unsigned long): # @M72(unsigned long)
shl rdi, 3
lea rax, [rdi + 8*rdi]
retInterestingly, they have different preference of ordering. RyuJIT prefers multiplications to be earlier, while LLVM prefers shifts to be earlier.
Both compilers managed to optimize multiplication by 2 to add rdi, rdi (or whatever adds the same register together), though.
Combining these techniques
By combining these techniques described above, we can further optimize integer multiplication with constant, with more constant values.
Here is a list of formulas of integer multiplication by constant values that are less than 128. Only values that are optimized by LLVM is contained in this list.
| Number | RyuJIT | LLVM |
|---|---|---|
| 3 | ||
| 5 | ||
| 7 | (imul) | |
| 9 | ||
| 11 | (imul) | |
| 13 | (imul) | |
| 14 | (imul) | |
| 15 | (imul) | |
| 17 | (imul) | |
| 19 | (imul) | |
| 21 | (imul) | |
| 22 | (imul) | |
| 23 | (imul) | |
| 25 | (imul) | |
| 26 | (imul) | |
| 27 | (imul) | |
| 28 | (imul) | |
| 29 | (imul) | |
| 30 | (imul) | |
| 31 | (imul) | |
| 33 | (imul) | |
| 34 | (imul) | |
| 37 | (imul) | |
| 41 | (imul) | |
| 45 | (imul) | |
| 48 | (imul) | |
| 62 | (imul) | |
| 63 | (imul) | |
| 65 | (imul) | |
| 66 | (imul) | |
| 68 | (imul) | |
| 73 | (imul) | |
| 80 | (imul) | |
| 81 | (imul) | |
| 96 | (imul) | |
| 126 | (imul) | |
| 127 | (imul) |
Any multiplication by power of two can be performed by an shl instruction.
Multiplication by 3, 5, or 9 can be performed by a single lea instruction.