Fix #10513 - powmod is slow for ulong by Aditya-132 · Pull Request #10688 · dlang/phobos

Aditya-132 · 2025-03-18T13:38:18Z

The key optimizations I've made are:

128-bit multiplication via inline assembly:

For 64-bit types on x86_64 platforms, I've added direct use of hardware multiplication that produces a 128-bit result (stored in high:low registers)
This avoids the repeated addition loop in the original algorithm

Efficient modular reduction:

Implemented a reduction algorithm that handles the 128-bit result efficiently
Uses a simplified approach for when the high bits are zero

Early reduction:

The base value is reduced modulo m at the beginning, which improves performance when the base is large and modulus is small

Special case handling:

Added explicit handling for modulus=1 and exponent=0 cases up front

Version checks:

The code falls back to the original algorithm if inline assembly is not available
This ensures portability while providing optimizations where possible
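
As a rough illustration of the early reduction and special-case handling listed above, here is a minimal sketch for 32-bit operands, where a 64-bit intermediate is wide enough that no assembly is needed. The name `powmodSketch` and its exact signature are assumptions made for this example and are not the code in the PR.

```d
// Square-and-multiply with an early reduction of the base.
// Hypothetical helper for illustration only; not the Phobos powmod.
uint powmodSketch(uint x, uint n, uint m)
{
    assert(m != 0, "modulus must be non-zero");
    if (m == 1) return 0;   // every value is congruent to 0 modulo 1
    if (n == 0) return 1;   // x^0 == 1 for any x

    ulong base = x % m;     // early reduction: base < m from here on
    ulong result = 1;
    while (n > 0)
    {
        if (n & 1)
            result = (result * base) % m; // both factors < 2^32, no overflow
        base = (base * base) % m;
        n >>= 1;
    }
    return cast(uint) result;
}
```

The 64-bit case is harder precisely because there is no built-in 128-bit intermediate for the two multiplications inside the loop.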

…lang#10513

dlang-bot · 2025-03-18T13:38:22Z

Thanks for your pull request and interest in making D better, @Aditya-132! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#10688"

ibuclaw · 2025-03-18T13:52:37Z

@Aditya-132 before I go through it, if you can run and post benchmarks, that'll help with review.

Out of curiosity, would you also be able to do gcc-style asm for gdc and ldc?

Aditya-132 · 2025-03-18T13:57:04Z

sure

thewilsonator · 2025-03-18T13:58:54Z

Please give the PR and commit titles a description based on the issues this solves e.g. Fix #10513 - powmod is slow for ulong

Aditya-132 · 2025-03-18T14:03:36Z

Can you provide an example that is already present in the code?

Aditya-132 · 2025-03-18T14:40:13Z

Instead of modifying powmod, we can directly modify mulmod.

ibuclaw · 2025-03-18T16:05:19Z

So, unlike dmd iasm, which ~~is an uninlineable black box~~ forces you to manually put in all loads and stores, gcc iasm instead has the notion of input and output operands (first outputs, then inputs).

As all you're doing is mul, and taking the result from two registers, the equivalent should be

low = a;
asm @trusted nothrow @nogc {
    "mull %0"
    : "=d" (high), "+a"(low)
    : "0" (b);
}

Aditya-132 · 2025-03-18T21:52:23Z

@ibuclaw

TESTING RESULT

BENCHMARK 1: Testing mulmod implementations
Testing with 32-bit integers:
Testing first implementation (ASM-based):
Testing second implementation (Standard D):
First implementation (ASM):  9 ms, 270 μs, and 5 hnsecs
Second implementation (D):   13 ms, 119 μs, and 8 hnsecs
Performance ratio: 141.52% (higher means ASM is faster)
Verification values: 312463588 312463588

Testing with 64-bit integers:
Testing first implementation (ASM-based):
Testing second implementation (Standard D):
First implementation (ASM):  45 ms, 296 μs, and 4 hnsecs
Second implementation (D):   556 ms and 119 μs
Performance ratio: 1227.74% (higher means ASM is faster)
Verification values: 311557887622942605 311557887622942605


BENCHMARK 2: Testing powmod implementations
Testing with 32-bit integers:
Testing first powmod implementation (with ASM mulmod):
Testing second powmod implementation (with standard D mulmod):
First implementation (ASM):  1 ms, 862 μs, and 9 hnsecs
Second implementation (D):   1 ms, 904 μs, and 1 hnsec
Performance ratio: 102.26% (higher means ASM is faster)
Verification values: 8312587 8312587

Testing with 64-bit integers:
Testing first powmod implementation (with ASM mulmod):
Testing second powmod implementation (with standard D mulmod):
First implementation (ASM):  2 ms and 932 μs
Second implementation (D):   42 ms, 178 μs, and 2 hnsecs
Performance ratio: 1438.54% (higher means ASM is faster)
Verification values: 12615239 12615239
(dmd-2.110.0)➜  testing

Script I used for testing:

import std.stdio;
import std.datetime.stopwatch;
import std.random;
import std.traits;
import std.meta;
import std.algorithm;
import core.time;

// First implementation from the pasted function
static T mulmod1(T)(T a, T b, T c) {
    static if (T.sizeof == 8) {
        version (D_InlineAsm_X86_64) {
            // Fast path for small values
            ulong low = void;
            ulong high = void;
            
            // Perform 128-bit multiplication: a * b = [high:low]
            asm pure @trusted nothrow @nogc
            {
                mov RAX, a;      // Load a into RAX
                mul b;           // Multiply by b (RDX:RAX = a * b)
                mov low, RAX;    // Store low 64 bits
                mov high, RDX;   // Store high 64 bits
            }
            
            // Handle overflow for large values
            if (high >= c) {
                high %= c;
            }
            
            if (high == 0) {
                return low % c;
            }
            
            // Reduce the high 64 bits modulo mod
            asm pure @trusted nothrow @nogc
            {
                mov RAX, high;   // Load high part
                xor RDX, RDX;    // Clear RDX for division
                div c;         // RAX = high / mod, RDX = high % mod
                mov high, RDX;   // Store remainder (high % mod)

                mov RAX, low;    // Load low part
                mov RDX, high;   // Load reduced high part
                div c;         // Divide full 128-bit number by mod
                mov low, RDX;    // Store remainder (final result)
            }

            return low;  // Return (a * b) % mod
        }
        else {
            // Fallback for non-x86_64 platforms
            T result = 0;
            a %= c;
            b %= c;
            
            while (b > 0) {
                if (b & 1) {
                    result = (result + a) % c;
                }
                a = (a * 2) % c;
                b >>= 1;
            }
            
            return result;
        }
    }
    else static if (T.sizeof == 4) {
        version (D_InlineAsm_X86) {
            // Fast path for small values
            uint low = void;  // Changed to uint for 32-bit
            uint high = void; // Changed to uint for 32-bit
            
            // Perform 64-bit multiplication: a * b = [high:low]
            asm pure @trusted nothrow @nogc
            {
                mov EAX, a;      // Load a into EAX
                mul b;           // Multiply by b
                mov low, EAX;    // Store low bits
                mov high, EDX;   // Store high bits
            }
            
            // Handle overflow for large values
            if (high >= c) {
                high %= c;
            }
            
            if (high == 0) {
                return low % c;
            }
            
            // Reduce the high bits modulo mod
            asm pure @trusted nothrow @nogc
            {
                mov EAX, high;   // Load high part
                xor EDX, EDX;    // Clear EDX for division
                div c;           // EAX = high / mod, EDX = high % mod
                mov high, EDX;   // Store remainder (high % mod)

                mov EAX, low;    // Load low part
                mov EDX, high;   // Load reduced high part
                div c;           // Divide full number by mod
                mov low, EDX;    // Store remainder (final result)
            }

            return low;  // Return (a * b) % mod
        }
        else {
            // Use 64-bit type for the calculation
            ulong result = (cast(ulong)a * cast(ulong)b) % c;
            return cast(T)result;
        }
    }
    else {
        // Fallback for other sizes
        import std.bigint;
        return cast(T)((BigInt(a) * BigInt(b)) % BigInt(c));
    }
}

// Second implementation from function provided in the query
template Largest(T...) {
    static if (T.length == 1)
        alias Largest = T[0];
    else static if (T[0].sizeof > Largest!(T[1..$]).sizeof)
        alias Largest = T[0];
    else
        alias Largest = Largest!(T[1..$]);
}

static T mulmod2(T)(T a, T b, T c) {
    static if (T.sizeof == 8) {
        static T addmod(T a, T b, T c) {
            b = c - b;
            if (a >= b)
                return a - b;
            else
                return c - b + a;
        }
        T result = 0;
        b %= c;
        while (a > 0) {
            if (a & 1)
                result = addmod(result, b, c);
            a >>= 1;
            b = addmod(b, b, c);
        }
        return result;
    }
    else {
        alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];
        DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);
        return result % c;
    }
}

// FIX: Explicitly specify the return type
F powmod1(F, G, H)(F x, G n, H m)
if (isUnsigned!F && isUnsigned!G && isUnsigned!H)
{
    alias ReturnT = std.traits.Unqual!(Largest!(F, H));
    ReturnT base = cast(ReturnT)(x % m);
    ReturnT result = 1;
    ReturnT modulus = cast(ReturnT)m;
    std.traits.Unqual!G exponent = n;
    
    while (exponent > 0) {
        if (exponent & 1)
            result = mulmod1!ReturnT(result, base, modulus);
        base = mulmod1!ReturnT(base, base, modulus);
        exponent >>= 1;
    }
    return cast(F)result;
}

// FIX: Explicitly specify the return type
F powmod2(F, G, H)(F x, G n, H m)
if (isUnsigned!F && isUnsigned!G && isUnsigned!H)
{
    alias ReturnT = std.traits.Unqual!(Largest!(F, H));
    ReturnT base = cast(ReturnT)(x % m);
    ReturnT result = 1;
    ReturnT modulus = cast(ReturnT)m;
    std.traits.Unqual!G exponent = n;
    
    while (exponent > 0) {
        if (exponent & 1)
            result = mulmod2!ReturnT(result, base, modulus);
        base = mulmod2!ReturnT(base, base, modulus);
        exponent >>= 1;
    }
    return cast(F)result;
}

// DELETED conflicting Unqual template

void main() {
    // Run tests first to ensure correctness
    writeln("VERIFICATION TESTS:");
    runTests();
    
    // Test for 32-bit integers
    writeln("\nBENCHMARK 1: Testing mulmod implementations");
    writeln("Testing with 32-bit integers:");
    benchmarkMulmod!(uint)();
    
    // Test for 64-bit integers
    writeln("\nTesting with 64-bit integers:");
    benchmarkMulmod!(ulong)();
    
    // Test powmod implementations
    writeln("\n\nBENCHMARK 2: Testing powmod implementations");
    writeln("Testing with 32-bit integers:");
    benchmarkPowmod!(uint)();
    
    writeln("\nTesting with 64-bit integers:");
    benchmarkPowmod!(ulong)();
}

void runTests() {
    writeln("Running unit tests...");
    
    // Test cases provided in the query
    // First set of tests
    assert(powmod1(1U, 10U, 3U) == 1);
    assert(powmod1(3U, 2U, 6U) == 3);
    assert(powmod1(5U, 5U, 15U) == 5);
    assert(powmod1(2U, 3U, 5U) == 3);
    assert(powmod1(2U, 4U, 5U) == 1);
    assert(powmod1(2U, 5U, 5U) == 2);
    
    // Second implementation
    assert(powmod2(1U, 10U, 3U) == 1);
    assert(powmod2(3U, 2U, 6U) == 3);
    assert(powmod2(5U, 5U, 15U) == 5);
    assert(powmod2(2U, 3U, 5U) == 3);
    assert(powmod2(2U, 4U, 5U) == 1);
    assert(powmod2(2U, 5U, 5U) == 2);
    
    // Second set of tests - ulong
    ulong a = 18446744073709551615u, b = 20u, c = 18446744073709551610u;
    assert(powmod1(a, b, c) == 95367431640625u);
    a = 100; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 18223853583554725198u);
    a = 117; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 11493139548346411394u);
    a = 134; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 10979163786734356774u);
    a = 151; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 7023018419737782840u);
    a = 168; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 58082701842386811u);
    a = 185; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 17423478386299876798u);
    a = 202; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 5522733478579799075u);
    a = 219; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 15230218982491623487u);
    a = 236; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 5198328724976436000u);
    
    a = 0; b = 7919; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 0);
    a = 123; b = 0; c = 18446744073709551557u;
    assert(powmod1(a, b, c) == 1);
    
    immutable ulong a1 = 253, b1 = 7919, c1 = 18446744073709551557u;
    assert(powmod1(a1, b1, c1) == 3883707345459248860u);
    
    // Second implementation
    a = 18446744073709551615u; b = 20u; c = 18446744073709551610u;
    assert(powmod2(a, b, c) == 95367431640625u);
    a = 100; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 18223853583554725198u);
    a = 117; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 11493139548346411394u);
    a = 134; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 10979163786734356774u);
    a = 151; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 7023018419737782840u);
    a = 168; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 58082701842386811u);
    a = 185; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 17423478386299876798u);
    a = 202; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 5522733478579799075u);
    a = 219; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 15230218982491623487u);
    a = 236; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 5198328724976436000u);
    
    a = 0; b = 7919; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 0);
    a = 123; b = 0; c = 18446744073709551557u;
    assert(powmod2(a, b, c) == 1);
    
    assert(powmod2(a1, b1, c1) == 3883707345459248860u);
    
    // Third set of tests - uint
    uint x = 100, y = 7919, z = 1844674407u;
    assert(powmod1(x, y, z) == 1613100340u);
    x = 134; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 734956622u);
    x = 151; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 1738696945u);
    x = 168; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 1247580927u);
    x = 185; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 1293855176u);
    x = 202; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 1566963682u);
    x = 219; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 181227807u);
    x = 236; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 217988321u);
    x = 253; y = 7919; z = 1844674407u;
    assert(powmod1(x, y, z) == 1588843243u);
    
    x = 0; y = 7919; z = 184467u;
    assert(powmod1(x, y, z) == 0);
    x = 123; y = 0; z = 1844674u;
    assert(powmod1(x, y, z) == 1);
    
    // Second implementation
    x = 100; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1613100340u);
    x = 134; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 734956622u);
    x = 151; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1738696945u);
    x = 168; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1247580927u);
    x = 185; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1293855176u);
    x = 202; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1566963682u);
    x = 219; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 181227807u);
    x = 236; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 217988321u);
    x = 253; y = 7919; z = 1844674407u;
    assert(powmod2(x, y, z) == 1588843243u);
    
    x = 0; y = 7919; z = 184467u;
    assert(powmod2(x, y, z) == 0);
    x = 123; y = 0; z = 1844674u;
    assert(powmod2(x, y, z) == 1);
    
    writeln("All tests passed!");
}

void benchmarkMulmod(T)() {
    // Number of iterations for benchmarking
    immutable iterations = 1_000_000;
    
    // Generate random test cases
    T[] a_values = new T[iterations];
    T[] b_values = new T[iterations];
    T[] m_values = new T[iterations];
    
    auto rnd = Random(unpredictableSeed);
    
    // Fill arrays with random values
    for (size_t i = 0; i < iterations; i++) {
        static if (is(T == ulong)) {
            a_values[i] = uniform!("[]")(T.min, T.max / 2, rnd);
            b_values[i] = uniform!("[]")(T.min, T.max / 2, rnd);
            // Avoid modulus being too small or zero
            m_values[i] = uniform!("[]")(T.max / 4, T.max - 1, rnd);
        } else {
            a_values[i] = uniform!("[]")(T.min, T.max, rnd);
            b_values[i] = uniform!("[]")(T.min, T.max, rnd);
            // Avoid modulus being too small or zero
            m_values[i] = uniform!("[]")(T.max / 2, T.max - 1, rnd);
        }
    }
    
    // Variables to store results to prevent compiler optimizations
    T result1 = 0;
    T result2 = 0;
    
    // Test first implementation
    writeln("Testing first implementation (ASM-based):");
    auto sw1 = StopWatch(AutoStart.yes);
    
    for (size_t i = 0; i < iterations; i++) {
        result1 ^= mulmod1!T(a_values[i], b_values[i], m_values[i]);
    }
    
    sw1.stop();
    
    // Test second implementation
    writeln("Testing second implementation (Standard D):");
    auto sw2 = StopWatch(AutoStart.yes);
    
    for (size_t i = 0; i < iterations; i++) {
        result2 ^= mulmod2!T(a_values[i], b_values[i], m_values[i]);
    }
    
    sw2.stop();
    
    // Print results
    writefln("First implementation (ASM):  %s", sw1.peek());
    writefln("Second implementation (D):   %s", sw2.peek());
    double ratio = (sw2.peek().total!"usecs" * 100.0 / sw1.peek().total!"usecs");
    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);
    
    // Print dummy value to prevent optimization
    writefln("Verification values: %s %s", result1, result2);
}

void benchmarkPowmod(T)() {
    // Number of iterations for benchmarking
    immutable iterations = 10_000;
    
    // Generate random test cases
    T[] x_values = new T[iterations];
    T[] n_values = new T[iterations];
    T[] m_values = new T[iterations];
    
    auto rnd = Random(unpredictableSeed);
    
    // Fill arrays with random values - using realistic ranges
    for (size_t i = 0; i < iterations; i++) {
        static if (is(T == ulong)) {
            x_values[i] = uniform!("[]")(2, 1000000, rnd);
            n_values[i] = uniform!("[]")(10, 10000, rnd);
            // Avoid modulus being too small or zero
            m_values[i] = uniform!("[]")(1000, 10000000, rnd);
        } else {
            x_values[i] = uniform!("[]")(2, 1000000, rnd);
            n_values[i] = uniform!("[]")(10, 10000, rnd);
            // Avoid modulus being too small or zero
            m_values[i] = uniform!("[]")(1000, 10000000, rnd);
        }
    }
    
    // Variables to store results to prevent compiler optimizations
    T result1 = 0;
    T result2 = 0;
    
    // Test first implementation
    writeln("Testing first powmod implementation (with ASM mulmod):");
    auto sw1 = StopWatch(AutoStart.yes);
    
    for (size_t i = 0; i < iterations; i++) {
        result1 ^= powmod1!T(x_values[i], n_values[i], m_values[i]);
    }
    
    sw1.stop();
    
    // Test second implementation
    writeln("Testing second powmod implementation (with standard D mulmod):");
    auto sw2 = StopWatch(AutoStart.yes);
    
    for (size_t i = 0; i < iterations; i++) {
        result2 ^= powmod2!T(x_values[i], n_values[i], m_values[i]);
    }
    
    sw2.stop();
    
    // Print results
    writefln("First implementation (ASM):  %s", sw1.peek());
    writefln("Second implementation (D):   %s", sw2.peek());
    double ratio = (sw2.peek().total!"usecs" * 100.0 / sw1.peek().total!"usecs");
    writefln("Performance ratio: %.2f%% (higher means ASM is faster)", ratio);
    
    // Print dummy value to prevent optimization
    writefln("Verification values: %s %s", result1, result2);
}

Aditya-132 · 2025-03-18T23:08:40Z

@ibuclaw can you explain why exactly this test failed on x86 systems?

a = 100; b = 7919; c = 18446744073709551557u;
assert(powmod(a, b, c) == 18223853583554725198u);

Aditya-132 · 2025-03-18T23:45:55Z

If anyone has a solution, please tell me. I can't understand why these errors occur on x86 systems.

As far as I can figure out, this happens because the value overflowed on 32-bit x86 systems.

ibuclaw · 2025-03-19T20:18:07Z

Yes. The new fallback code is prone to overflow. The original implementation can't be made simpler by removing the addmod function.
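
To make the overflow concrete: with a modulus close to 2^64, doubling a residue as `(a * 2) % c` wraps around before the reduction happens, whereas the `addmod` helper never forms a value above `c`. The following is a minimal sketch under that reading of the thread; `naiveDouble` is a hypothetical name for the doubling step in the new fallback, and `addmod` is copied in shape from the benchmark script above.

```d
// Doubling as written in the new fallback: a * 2 can wrap past 2^64
// before the modulo is applied. Hypothetical name, for illustration.
ulong naiveDouble(ulong a, ulong c)
{
    return (a * 2) % c;
}

// Overflow-safe modular addition, same shape as addmod in the thread.
// Assumes a < c and b < c.
ulong addmod(ulong a, ulong b, ulong c)
{
    b = c - b;            // b is now the distance from the old b up to c
    if (a >= b)
        return a - b;     // a + old b reached or passed c: subtract c once
    else
        return c - b + a; // a + old b stayed below c: plain sum
}
```

With `c = 18446744073709551557` and `a = c - 1`, the true value of `2a mod c` is `c - 2`. `addmod(a, a, c)` returns exactly that, while `naiveDouble(a, c)` returns a different value because `a * 2` has already wrapped.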

Aditya-132 · 2025-03-19T20:23:52Z

I have a solution for that. In assembly (on the 8085, for example) we treat a 16-bit number as a pair of registers instead of one and then perform operations on the pair. We can do the same here by implementing 128-bit multiplication using more registers on 32-bit systems (it's complex).
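
The register-pair idea can be sketched in plain D without any assembly: split each 64-bit operand into 32-bit halves and combine the four partial products. This is a schoolbook multiplication written only to illustrate the proposal; the names `U128` and `mulWide` are hypothetical and do not appear in the PR.

```d
// 128-bit result held as two 64-bit halves (hypothetical type).
struct U128
{
    ulong hi;
    ulong lo;
}

// 64x64 -> 128-bit multiply built from 32-bit limbs.
U128 mulWide(ulong a, ulong b)
{
    immutable ulong aLo = a & 0xFFFF_FFFF, aHi = a >> 32;
    immutable ulong bLo = b & 0xFFFF_FFFF, bHi = b >> 32;

    // Four partial products; each fits in 64 bits.
    immutable ulong ll = aLo * bLo;
    immutable ulong lh = aLo * bHi;
    immutable ulong hl = aHi * bLo;
    immutable ulong hh = aHi * bHi;

    // Middle column: carry out of ll plus the low halves of the cross terms.
    // The sum is below 2^34, so it cannot overflow.
    immutable ulong mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);

    U128 r;
    r.lo = (mid << 32) | (ll & 0xFFFF_FFFF);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```

The multiplication is the easy half. Reducing the 128-bit product modulo a 64-bit value on a 32-bit target needs a multi-word division as well, which is where most of the complexity would come from.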

ibuclaw · 2025-03-19T20:32:24Z

I don't understand why you need inline assembler for 32bit powmod, when the original D code should be sufficient?

alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];
DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);
return result % c;

Aditya-132 · 2025-03-19T20:34:22Z

OK, then I will make the changes for it.

ibuclaw · 2025-03-19T20:57:58Z

For the slow path, the addmod() function that protects against overflow is only required for modulus values greater than uint.max+1 IIUC - (4294967296 * 4294967296 = 18446744073709551616 which is 0 with ulong, so no overflow gets lost when modulus is also <= 4294967296).

Running the benchmark locally, the D code measured faster than the inline asm when the modulus was within these bounds.

static T mulmod2(T)(T a, T b, T c) {
    static if (T.sizeof == 8) {
        static T addmod(T a, T b, T c) {
            b = c - b;
            if (a >= b)
                return a - b;
            else
                return c - b + a;
        }
        T result = void;        // <---- start new snip
        if (c <= 0x100000000)
        {
            result = a * b;
            return result % c;
        }
        result = 0;             // <---- end new snip
        b %= c;
        while (a > 0) {
            if (a & 1)
                result = addmod(result, b, c); 
            a >>= 1;
            b = addmod(b, b, c); 
        }
        return result;
    }
    else {
        alias DoubleT = AliasSeq!(void, ushort, uint, void, ulong)[T.sizeof];
        DoubleT result = cast(DoubleT) (cast(DoubleT) a * cast(DoubleT) b);
        return result % c;
    }
}

Testing with 64-bit integers:
Testing first implementation (ASM-based):
Testing second implementation (Standard D):
First implementation (ASM):  72 ms, 230 μs, and 8 hnsecs
Second implementation (D):   31 ms, 106 μs, and 6 hnsecs
Performance ratio: 43.07% (higher means ASM is faster)
Verification values: 270309499 270309499

Testing with 64-bit integers:
Testing first powmod implementation (with ASM mulmod):
Testing second powmod implementation (with standard D mulmod):
First implementation (ASM):  699 ms, 729 μs, and 5 hnsecs
Second implementation (D):   601 ms, 519 μs, and 3 hnsecs
Performance ratio: 85.96% (higher means ASM is faster)
Verification values: 1636129629 1636129629
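
The bound behind the fast path can be checked directly: if both operands are already reduced, so that `a, b < c <= 2^32`, then `a * b <= (2^32 - 1)^2 < 2^64` and the plain 64-bit product cannot wrap. A minimal sketch, assuming a hypothetical helper name `mulmodSmallModulus`; the explicit reduction of `a` and `b` is an addition for this example (inside powmod the operands are already below the modulus).

```d
// Fast path for moduli up to 2^32: no 128-bit intermediate needed.
// Hypothetical helper for illustration; not the Phobos mulmod.
ulong mulmodSmallModulus(ulong a, ulong b, ulong c)
{
    assert(c != 0 && c <= 0x1_0000_0000UL);
    a %= c;               // after this, a <= 2^32 - 1
    b %= c;               // after this, b <= 2^32 - 1
    return (a * b) % c;   // product <= (2^32 - 1)^2, which fits in ulong
}
```

Without the reduction the shortcut is only safe when the caller guarantees that both operands are below the modulus.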

Aditya-132 · 2025-03-19T21:08:13Z

OK, I agree. Then what is a better way of optimizing it?

ibuclaw · 2025-03-19T21:12:42Z

So I think it should be sufficient to optimize mulmod in the following way:

static T mulmod(T a, T b, T c)
{
    static if (T.sizeof == 8)
    {
        if (c <= 0x100000000)
        {
            T result = a * b;
            return result % c;
        }
        T result = 0;
        version (D_InlineAsm_X86_64)
        {
            // 128-bit mul with asm
        }
        else version (GNU_OR_LDC_X86_64)
        {
            // 128-bit mul with gcc extended asm
        }
        else
        {
            // slow addmod() method (does pragma(inline, true) help?)
        }
        return result;
    }
    else
    {
        // DoubleT method
    }
}

Does this make sense?

ibuclaw · 2025-03-19T21:34:39Z

It might even help when T.sizeof == 4 and c <= 0x10000 as well to elide using DoubleT (64-bit) mul/div.
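
The same reasoning carries over to 32-bit operands: when `c <= 0x10000` and both operands are reduced, the product stays below 2^32 and fits in a `uint`. A minimal sketch of that idea, with `mulmod32` as a hypothetical name; whether it is actually faster would need to be benchmarked.

```d
// 32-bit mulmod that skips the 64-bit mul/div for small moduli.
// Hypothetical helper for illustration only.
uint mulmod32(uint a, uint b, uint c)
{
    assert(c != 0);
    if (c <= 0x1_0000)
        return ((a % c) * (b % c)) % c;          // product < 2^32, no widening
    return cast(uint)((cast(ulong) a * b) % c);  // DoubleT-style fallback
}
```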

Aditya-132 · 2025-03-19T21:43:05Z

I think you are correct here; this looks similar to how other open-source languages implement this function.

Aditya-132 · 2025-03-19T22:06:18Z

Sorry for the inconvenience.
I will need one day to implement it because of exams.
I hope you will consider it.

Aditya-132 · 2025-03-24T00:44:55Z

Sorry for wasting your time with inline ASM; I should have thought of using core.int128 earlier.

Aditya-132 · 2025-03-24T00:54:07Z

@kinke @ibuclaw you can review this implementation with core.int128


ibuclaw · 2025-03-25T07:04:35Z

std/math/exponential.d

+            const product128 = mul(Cent(a), Cent(b));
+            Cent centRemainder;
+            udivmod(product128, Cent(c), centRemainder);
+            return cast(T)(centRemainder.lo);


Sorry @Aditya-132, looks like @kinke wasn't done with changes to core.int128.

dlang/dmd#21081

With that PR merged, the correct overloads to call are:

Suggested change

-            const product128 = mul(Cent(a), Cent(b));
-            Cent centRemainder;
-            udivmod(product128, Cent(c), centRemainder);
-            return cast(T)(centRemainder.lo);
+            const product128 = mul(a, b);
+            T remainder = void;
+            udivmod(product128, c, remainder);
+            return remainder;

Thanks for the patience. 🙇

UFCS should also be possible

T remainder = void;
mul(a, b).udivmod(c, remainder);
return remainder;

It's giving an error.

The suggested change is also giving an error.

DMD PR is merged now, please rebase on top of phobos/master.

@kinke done with changes

So the error is floating point exception?

There was overflow code between the mul and div once, but that disappeared for some reason?

auto product = mul(a, b);
if (product.hi >= c)
    product.hi %= c;
T remainder = void;
udivmod(product, c, remainder);
return remainder;

Should I make the above changes in the code?

Yes, adding the overflow condition should allow you to use 64bit udivmod.
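
Why the overflow condition is safe: the product equals `hi * 2^64 + lo`, and replacing `hi` with `hi % c` changes that value by a multiple of `c`, so the remainder is unchanged while the quotient is guaranteed to fit in 64 bits. On x86 a hardware `div` whose quotient does not fit raises a divide error, which Linux reports as "floating point exception"; that is presumably the error seen above. The sketch below uses only the general `Cent`-based overloads of `mul` and `udivmod` from core.int128 rather than the newer `ulong` overloads discussed in this thread, and `mulmod128` is a hypothetical name.

```d
import core.int128 : Cent, mul, udivmod;

// (a * b) % c through a 128-bit intermediate.
// Hypothetical helper for illustration; not the code merged in the PR.
ulong mulmod128(ulong a, ulong b, ulong c)
{
    assert(c != 0);
    Cent ca, cb, cc;      // fields default to zero
    ca.lo = a;
    cb.lo = b;
    cc.lo = c;

    Cent product = mul(ca, cb);
    // Reduce the high word first. This keeps the quotient below 2^64
    // and does not change the remainder modulo c.
    product.hi %= c;

    Cent remainder;
    udivmod(product, cc, remainder);
    return remainder.lo;
}
```

The fields are assigned by name here because the order of `lo` and `hi` inside `Cent` depends on endianness.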


ibuclaw · 2025-03-26T14:31:14Z

std/math/exponential.d

+            else
            {
-                if (a & 1)
-                    result = addmod(result, b, c);
-
-                a >>= 1;
-                b = addmod(b, b, c);
+            import core.int128 : Cent, mul, udivmod;
+            auto product = mul(a, b);
+            if (product.hi >= c)
+            product.hi %= c;
+            T remainder = void;
+            udivmod(product, c, remainder);
+            return remainder;
+            }


Is this being rendered wrong or is indentation missing?

Done with the correction.

Aditya-132 · 2025-03-27T09:12:38Z

Is there any specific reason why it fails for Main / FreeBSD 13.2 x64 (pull_request)?

ibuclaw · 2025-03-27T09:51:09Z

Is there any specific reason why it fails for Main / FreeBSD 13.2 x64 (pull_request)?

The action 'Run in VM' has timed out after 19 minutes.

Unrelated.

Co-authored-by: Iain Buclaw <[email protected]>

Aditya-132 added 2 commits March 18, 2025 18:55

The current implementation of powmod is very slow for the ulong type d…

b889996

…lang#10513

some imports changes

f860160

Aditya-132 requested a review from ibuclaw as a code owner March 18, 2025 13:38

Aditya-132 changed the title ~~Fixing Issue-#10513~~ Fix #10513 - powmod is slow for ulong Mar 18, 2025

made asm functions for mulmod and

7fc002d

changes made for x86 32 bits system

43c35de

Aditya-132 force-pushed the issue-#10513 branch 2 times, most recently from 8c7d8c0 to 43c35de Compare March 19, 2025 10:50

Merge branch 'dlang:master' into issue-dlang#10513

394af25

implmented the given changes

e1cb9a9

Merge branch 'dlang:master' into issue-dlang#10513

23dffa7

kinke mentioned this pull request Mar 24, 2025

[druntime] core.int128: Optimize a bit and prepare for more LDC/GDC optimizations dlang/dmd#21071

Merged

added method with core.int128

933d72f

ibuclaw reviewed Mar 24, 2025

Aditya-132 requested review from ibuclaw and kinke March 24, 2025 01:07

Aditya-132 and others added 2 commits March 24, 2025 07:41

removed line 61 to 65

2bf559a

Co-authored-by: Iain Buclaw <[email protected]>

styling changees

ef8a738

Co-authored-by: Iain Buclaw <[email protected]>

ibuclaw reviewed Mar 24, 2025

Aditya-132 force-pushed the issue-#10513 branch from 6edc6ab to ef8a738 Compare March 25, 2025 03:21

styling changes

dfcfb4c

ibuclaw reviewed Mar 25, 2025

Aditya-132 and others added 3 commits March 26, 2025 02:24

changes udivmod implementaion

8f643bb

Co-authored-by: Iain Buclaw <[email protected]>

removed Cent from mul operation

71a1e9f

changed to Cent(lo: c)

4f79940

Aditya-132 requested a review from ibuclaw March 26, 2025 02:43

changes as per review

6e7531c

ibuclaw reviewed Mar 26, 2025

Aditya-132 added 2 commits March 27, 2025 05:53

indentation changes

c8a2400

indentation changes

6d6fad6

Aditya-132 requested a review from ibuclaw March 27, 2025 00:47

Merge branch 'dlang:master' into issue-dlang#10513

bced460

ibuclaw approved these changes Mar 27, 2025

thewilsonator merged commit dfc3d02 into dlang:master Mar 27, 2025
9 of 10 checks passed

Aditya-132 deleted the issue-#10513 branch March 27, 2025 10:54

vpanteleev-sym pushed a commit to vpanteleev-sym/phobos that referenced this pull request Jun 5, 2025

Fix dlang#10513 - powmod is slow for ulong (dlang#10688)

96fc0f3

Co-authored-by: Iain Buclaw <[email protected]>
