
Basic addition of int4 functionality#343

Merged
HenryNdubuaku merged 19 commits into cactus-compute:main from jrajala6:int4_kernel_unpacking
Feb 23, 2026

Conversation

@jrajala6
Contributor

No description provided.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 795220b to f574155 on February 13, 2026 05:35
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 3 times, most recently from b6b38e4 to ee85502 on February 18, 2026 21:31
@ncylich ncylich marked this pull request as ready for review February 18, 2026 21:36
Copilot AI review requested due to automatic review settings February 18, 2026 21:36
Contributor

Copilot AI left a comment


Pull request overview

Adds initial INT4 (4-bit) grouped-weight matmul support end-to-end (Python export → graph loader/executor → NEON kernel), plus a new kernel correctness test.

Changes:

  • Introduces INT4 GEMV/GEMM kernels with packed (nibble) weight format and integrates dispatch in graph matmul.
  • Updates Python tensor export to pack INT4 weights (planar layout) and avoids INT4 for embedding weights.
  • Adjusts graph byte sizing logic to use packed sizes for INT4 across buffer sizing, I/O, and debug capture.
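The planar packing described above can be sketched as follows; this is an illustration of the layout in C++ (the actual writer lives in python/src/tensor_io.py), and `pack_int4_planar` is a hypothetical name, not the repository's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of planar INT4 packing as described in the PR summary (an
// illustration of the layout, not the tensor_io.py implementation): each
// block of 32 values becomes 16 bytes, values 0..15 in the low nibbles and
// values 16..31 in the high nibbles, with offset-binary encoding
// (v in [-8, 7] stored as v + 8).
inline std::vector<uint8_t> pack_int4_planar(const std::vector<int8_t>& values) {
    assert(values.size() % 32 == 0);  // planar layout works in 32-value blocks
    std::vector<uint8_t> packed(values.size() / 2);
    for (size_t block = 0; block < values.size() / 32; ++block) {
        for (size_t j = 0; j < 16; ++j) {
            uint8_t lo = static_cast<uint8_t>(values[block * 32 + j] + 8) & 0x0F;
            uint8_t hi = static_cast<uint8_t>(values[block * 32 + 16 + j] + 8) & 0x0F;
            packed[block * 16 + j] = static_cast<uint8_t>(lo | (hi << 4));
        }
    }
    return packed;
}
```

One general motivation for planar over pairwise packing is that a 16-byte load yields two contiguous 16-value runs with a single mask and a single shift, rather than interleaved values.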

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.

File Description
tests/test_kernel.cpp Adds INT4 matmul correctness test (GEMV + GEMM).
python/src/tensor_io.py Changes INT4 packing to planar layout; forces embedding weights to INT8 when INT4 requested.
cactus/kernel/kernel_quants.cpp Updates INT4 nibble decode to offset-binary; (unpack layout needs follow-up).
cactus/kernel/kernel_matmul.cpp Adds INT4 GEMV/GEMM/matmul implementation (packed weights + optional bias correction).
cactus/kernel/kernel.h Exposes INT4 matmul APIs.
cactus/graph/graph_ops_tensor.cpp Uses packed byte sizing for gather copies (INT4-aware sizing).
cactus/graph/graph_ops_nn.cpp Routes grouped INT4 RHS to cactus_matmul_int4 and updates error messages.
cactus/graph/graph_io.cpp Uses packed byte sizing; adds INT4 handling in mmap paths; (INT4 unpack + save scales need follow-up).
cactus/graph/graph_execute.cpp Uses packed byte sizing for debug capture writes.
cactus/graph/graph_core.cpp Computes BufferDesc::byte_size using packed byte sizing.
cactus/graph/graph.h Adds is_grouped_int4 and packed_size_of() helper used across graph.
cactus/engine/engine.h Adds constructors to index structs.
.gitignore Adds *_profile.txt and normalizes *.bin entry.
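The grouped INT4 matmul that the kernel and graph files above wire up can be illustrated with a scalar reference; this sketches the general group-wise dequantization scheme, not the repository's NEON kernel, and all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for one row of a grouped INT4 GEMV (an illustration of
// the scheme, not cactus_matmul_int4): weights are already unpacked to int8
// nibble values in [-8, 7], and each run of `group_size` weights shares one
// float scale. The real kernel operates on packed nibbles with SIMD.
inline float int4_grouped_dot(const std::vector<int8_t>& weights,
                              const std::vector<float>& scales,
                              const std::vector<float>& activations,
                              size_t group_size) {
    assert(weights.size() == activations.size());
    assert(weights.size() % group_size == 0);
    float acc = 0.0f;
    for (size_t g = 0; g < weights.size() / group_size; ++g) {
        float group_acc = 0.0f;  // accumulate within the group first
        for (size_t j = 0; j < group_size; ++j) {
            size_t i = g * group_size + j;
            group_acc += static_cast<float>(weights[i]) * activations[i];
        }
        acc += scales[g] * group_acc;  // apply the per-group scale once
    }
    return acc;
}
```

Applying the scale once per group, rather than per element, is what keeps the inner loop in the integer domain and cheap to vectorize.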
Comments suppressed due to low confidence (2)

cactus/graph/graph_io.cpp:33

  • This INT4 unpack helper assumes pairwise nibble packing (alternating low/high outputs). The Python INT4 writer now uses planar packing (first 16 values in low nibbles, next 16 in high nibbles), so this unpacking will reorder values incorrectly when enable_int4_packing is disabled. Update it to expand each byte into two outputs placed 16 apart within each 16-byte block (or otherwise reconstruct low[0..15] then high[0..15]).
    inline void unpack_int4_to_int8(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            uint8_t byte = packed[i];

            int8_t low = static_cast<int8_t>((byte & 0x0F) - 8);

            int8_t high = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);

            unpacked[i * 2] = low;
            unpacked[i * 2 + 1] = high;
        }
    }
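A planar-aware version along the lines the comment requests might look like this; it is a sketch of the suggested fix, not the repository's actual follow-up:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of a planar-aware unpack (what the review comment asks for, not the
// repository's actual fix): each 16-byte packed block expands to 32 int8
// values, with low nibbles filling positions 0..15 and high nibbles 16..31.
// Nibbles use offset-binary encoding, so stored value n decodes to n - 8.
inline void unpack_int4_planar(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
    assert(packed_size % 16 == 0);  // planar layout packs whole 16-byte blocks
    for (size_t block = 0; block < packed_size / 16; ++block) {
        const uint8_t* src = packed + block * 16;
        int8_t* dst = unpacked + block * 32;
        for (size_t j = 0; j < 16; ++j) {
            dst[j]      = static_cast<int8_t>((src[j] & 0x0F) - 8);         // low[0..15]
            dst[j + 16] = static_cast<int8_t>(((src[j] >> 4) & 0x0F) - 8);  // high[0..15]
        }
    }
}
```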

cactus/graph/graph_io.cpp:192

  • save_node() only sets FLAG_HAS_SCALES / group_size / num_groups for grouped INT8 buffers. Group-wise INT4 tensors also have scales (and are now supported elsewhere), so saving an INT4 node will drop its scales metadata and produce an invalid/incomplete file. Extend has_scales to include grouped INT4 (and ensure the header encodes the group params/scales bytes for INT4 too).
    size_t byte_size = PrecisionTraits::packed_size_of(precision, total_elements);

    bool has_scales = (precision == Precision::INT8 && buffer.is_grouped_int8() && buffer.scales_data);
    size_t N = shape.size() >= 1 ? shape[0] : 1;
    size_t scales_bytes = has_scales ? (N * buffer.num_groups * sizeof(__fp16)) : 0;
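The requested extension could look roughly like the following; `Precision` and `Buffer` here are simplified stand-ins for the repository's real types, so this shows only the shape of the condition, not a drop-in patch:

```cpp
#include <cstddef>

// Simplified stand-ins for the repository's types (assumptions, not the real
// definitions) used to illustrate the extended has_scales condition.
enum class Precision { FP16, INT8, INT4 };

struct Buffer {
    bool grouped_int8 = false;
    bool grouped_int4 = false;
    const void* scales_data = nullptr;
};

// Grouped INT4 tensors carry per-group scales just like grouped INT8, so
// save_node() should keep their scales metadata instead of dropping it.
inline bool buffer_has_scales(Precision precision, const Buffer& buffer) {
    const bool grouped_int8 = precision == Precision::INT8 && buffer.grouped_int8;
    const bool grouped_int4 = precision == Precision::INT4 && buffer.grouped_int4;
    return (grouped_int8 || grouped_int4) && buffer.scales_data != nullptr;
}
```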


@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 9f8284c to fd9d491 on February 18, 2026 21:47
@ncylich ncylich closed this Feb 18, 2026
@ncylich ncylich reopened this Feb 18, 2026
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 383194f to 3ba0048 on February 18, 2026 22:30
@ncylich ncylich requested a review from Copilot February 18, 2026 22:32
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 2 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 6 comments.



Comment on lines +63 to +65
std::memcpy(output + PrecisionTraits::byte_offset_of(prec, i * element_size),
tensor_data + PrecisionTraits::byte_offset_of(prec, idx * element_size),
bytes_per_element);

Copilot AI Feb 20, 2026


The byte_offset_of function for INT4 precision requires element offsets to be multiples of 32 (as enforced by the assert in graph.h line 153). However, in the gather operation, element_size (which is the product of all dimensions except the first) may not be a multiple of 32. This means that for indices i > 0, the expression i * element_size may not satisfy the alignment requirement, causing an assertion failure at runtime. This issue will occur when gathering from INT4 tensors where the inner dimensions don't multiply to a value that's a multiple of 32.
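The constraint being described can be made concrete with a small stand-in for the INT4 case of byte_offset_of; the 32-element assert and the halving are inferred from the review text, not copied from graph.h:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in illustrating the constraint the comment describes (inferred from
// the review text, not the actual graph.h code): INT4 packs two elements per
// byte, and element offsets are only accepted on 32-element boundaries, so a
// gather whose element_size is not a multiple of 32 trips the assert.
inline size_t int4_byte_offset_of(size_t element_offset) {
    assert(element_offset % 32 == 0);  // fires for misaligned gather offsets
    return element_offset / 2;         // two 4-bit elements per byte
}
```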

Comment on lines +79 to +81
std::memcpy(output + PrecisionTraits::byte_offset_of(prec, i * element_size),
tensor_data + PrecisionTraits::byte_offset_of(prec, idx * element_size),
bytes_per_element);

Copilot AI Feb 20, 2026


The byte_offset_of function for INT4 precision requires element offsets to be multiples of 32 (as enforced by the assert in graph.h line 153). However, in the gather operation, element_size (which is the product of all dimensions except the first) may not be a multiple of 32. This means that for indices i > 0, the expression i * element_size may not satisfy the alignment requirement, causing an assertion failure at runtime. This issue will occur when gathering from INT4 tensors where the inner dimensions don't multiply to a value that's a multiple of 32.

}

const size_t copy_block_elements = slice_length * inner_elements;
const size_t axis_stride_elements = axis_size * inner_elements;

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, copy_block_elements and axis_stride_elements may not be multiples of 32 if the slice dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors along axes where the slice length or stride doesn't result in multiples of 32 elements.

Suggested change

    const size_t axis_stride_elements = axis_size * inner_elements;
    // For INT4 precision, element offsets passed to byte_offset_of must be multiples of 32.
    // Enforce this alignment for the copy block size, axis stride, and slice start offset.
    if (input_buffer.precision == Precision::INT4) {
        const size_t slice_start_elements = slice_start * inner_elements;
        if ((copy_block_elements % 32) != 0 ||
            (axis_stride_elements % 32) != 0 ||
            (slice_start_elements % 32) != 0) {
            throw std::runtime_error("Slice on INT4 tensors requires 32-element alignment of slice length, axis stride, and slice start.");
        }
    }


  for (size_t outer = 0; outer < outer_elements; ++outer) {
-     const char* src = input_ptr + outer * axis_stride_bytes + slice_start * inner_elements * element_size;
+     const char* src = input_ptr + outer * axis_stride_bytes + PrecisionTraits::byte_offset_of(input_buffer.precision, slice_start * inner_elements);

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, slice_start * inner_elements may not be a multiple of 32 if the slice start position or inner dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors.

Comment on lines +379 to +381
char* output_offset_bytes = output_data + PrecisionTraits::byte_offset_of(input_buffer.precision, output_idx);
const char* input_offset_bytes = input_data + PrecisionTraits::byte_offset_of(input_buffer.precision, input_base);
size_t length = PrecisionTraits::byte_offset_of(input_buffer.precision, slice_size);

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, output_idx, input_base, and slice_size may not be multiples of 32 depending on the tensor shape and indexing parameters. This could cause assertion failures when indexing into INT4 tensors.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch from e41881c to b4d4345 on February 23, 2026 06:53
@HenryNdubuaku
Copy link
Copy Markdown
Collaborator

Thanks @ncylich and @jrajala6, this improved decode speed by 20%!

@HenryNdubuaku HenryNdubuaku merged commit 7af3e4b into cactus-compute:main Feb 23, 2026
1 of 4 checks passed
ncylich added a commit that referenced this pull request Feb 24, 2026
* Basic addition of int4 functionality

Signed-off-by: Jisha Rajala <[email protected]>

* basic working/optimized int4 gemv with testing

Signed-off-by: Noah Cylich <[email protected]>

* Added gemm functionality

Signed-off-by: Jisha Rajala <[email protected]>

* int4 is working!!!

Signed-off-by: Noah Cylich <[email protected]>

* semi-working int4 (too slow still)

Signed-off-by: Noah Cylich <[email protected]>

* gemma matmul benchmark

Signed-off-by: Noah Cylich <[email protected]>

* Added e2e gemma 3 model testing with int4 quantization vs int8 quantization

Signed-off-by: Jisha Rajala <[email protected]>

* further optimized alu ops using imm8 simd instructions

Signed-off-by: Noah Cylich <[email protected]>

* adding synthetic fake int4 testing option

Signed-off-by: Noah Cylich <[email protected]>

* trying to copy ggml int4 - but it has worse perf, did take inspiration from it for a colocated kernel though

Signed-off-by: Noah Cylich <[email protected]>

* optimized int4 further

Signed-off-by: Noah Cylich <[email protected]>

* cleaned code in preparation for proper pr

Signed-off-by: Noah Cylich <[email protected]>

* fixed unpacking

Signed-off-by: Noah Cylich <[email protected]>

* fixed execution graph basic unpacking and analysis

Signed-off-by: Noah Cylich <[email protected]>

* added int4 support to graph ops

Signed-off-by: Noah Cylich <[email protected]>

* adding int4 to moe matmul

Signed-off-by: Noah Cylich <[email protected]>

* improved matmul code patterns

Signed-off-by: Noah Cylich <[email protected]>

* got rid of i8mm instructions, but significantly slower

Signed-off-by: Noah Cylich <[email protected]>

---------

Signed-off-by: Jisha Rajala <[email protected]>
Signed-off-by: Noah Cylich <[email protected]>
Co-authored-by: Noah Cylich <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request Feb 28, 2026

4 participants