
Basic addition of int4 functionality#343

Merged
HenryNdubuaku merged 19 commits into cactus-compute:main from jrajala6:int4_kernel_unpacking
Feb 23, 2026

Conversation

@jrajala6
Contributor

No description provided.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 795220b to f574155 on February 13, 2026 05:35
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 3 times, most recently from b6b38e4 to ee85502 on February 18, 2026 21:31
@ncylich ncylich marked this pull request as ready for review February 18, 2026 21:36
Copilot AI review requested due to automatic review settings February 18, 2026 21:36
Contributor

Copilot AI left a comment


Pull request overview

Adds initial INT4 (4-bit) grouped-weight matmul support end-to-end (Python export → graph loader/executor → NEON kernel), plus a new kernel correctness test.

Changes:

  • Introduces INT4 GEMV/GEMM kernels with packed (nibble) weight format and integrates dispatch in graph matmul.
  • Updates Python tensor export to pack INT4 weights (planar layout) and avoids INT4 for embedding weights.
  • Adjusts graph byte sizing logic to use packed sizes for INT4 across buffer sizing, I/O, and debug capture.
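The planar packing described above can be sketched as follows; this is an illustration of the layout in C++ (the actual writer lives in python/src/tensor_io.py), and `pack_int4_planar` is a hypothetical name, not the repository's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of planar INT4 packing as described in the PR summary (an
// illustration of the layout, not the tensor_io.py implementation): each
// block of 32 values becomes 16 bytes, values 0..15 in the low nibbles and
// values 16..31 in the high nibbles, with offset-binary encoding
// (v in [-8, 7] stored as v + 8).
inline std::vector<uint8_t> pack_int4_planar(const std::vector<int8_t>& values) {
    assert(values.size() % 32 == 0);  // planar layout works in 32-value blocks
    std::vector<uint8_t> packed(values.size() / 2);
    for (size_t block = 0; block < values.size() / 32; ++block) {
        for (size_t j = 0; j < 16; ++j) {
            uint8_t lo = static_cast<uint8_t>(values[block * 32 + j] + 8) & 0x0F;
            uint8_t hi = static_cast<uint8_t>(values[block * 32 + 16 + j] + 8) & 0x0F;
            packed[block * 16 + j] = static_cast<uint8_t>(lo | (hi << 4));
        }
    }
    return packed;
}
```

One general motivation for planar over pairwise packing is that a 16-byte load yields two contiguous 16-value runs with a single mask and a single shift, rather than interleaved values.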

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.

File Description
tests/test_kernel.cpp Adds INT4 matmul correctness test (GEMV + GEMM).
python/src/tensor_io.py Changes INT4 packing to planar layout; forces embedding weights to INT8 when INT4 requested.
cactus/kernel/kernel_quants.cpp Updates INT4 nibble decode to offset-binary; (unpack layout needs follow-up).
cactus/kernel/kernel_matmul.cpp Adds INT4 GEMV/GEMM/matmul implementation (packed weights + optional bias correction).
cactus/kernel/kernel.h Exposes INT4 matmul APIs.
cactus/graph/graph_ops_tensor.cpp Uses packed byte sizing for gather copies (INT4-aware sizing).
cactus/graph/graph_ops_nn.cpp Routes grouped INT4 RHS to cactus_matmul_int4 and updates error messages.
cactus/graph/graph_io.cpp Uses packed byte sizing; adds INT4 handling in mmap paths; (INT4 unpack + save scales need follow-up).
cactus/graph/graph_execute.cpp Uses packed byte sizing for debug capture writes.
cactus/graph/graph_core.cpp Computes BufferDesc::byte_size using packed byte sizing.
cactus/graph/graph.h Adds is_grouped_int4 and packed_size_of() helper used across graph.
cactus/engine/engine.h Adds constructors to index structs.
.gitignore Adds *_profile.txt and normalizes *.bin entry.
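The grouped INT4 matmul that the kernel and graph files above wire up can be illustrated with a scalar reference; this sketches the general group-wise dequantization scheme, not the repository's NEON kernel, and all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for one row of a grouped INT4 GEMV (an illustration of
// the scheme, not cactus_matmul_int4): weights are already unpacked to int8
// nibble values in [-8, 7], and each run of `group_size` weights shares one
// float scale. The real kernel operates on packed nibbles with SIMD.
inline float int4_grouped_dot(const std::vector<int8_t>& weights,
                              const std::vector<float>& scales,
                              const std::vector<float>& activations,
                              size_t group_size) {
    assert(weights.size() == activations.size());
    assert(weights.size() % group_size == 0);
    float acc = 0.0f;
    for (size_t g = 0; g < weights.size() / group_size; ++g) {
        float group_acc = 0.0f;  // accumulate within the group first
        for (size_t j = 0; j < group_size; ++j) {
            size_t i = g * group_size + j;
            group_acc += static_cast<float>(weights[i]) * activations[i];
        }
        acc += scales[g] * group_acc;  // apply the per-group scale once
    }
    return acc;
}
```

Applying the scale once per group, rather than per element, is what keeps the inner loop in the integer domain and cheap to vectorize.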
Comments suppressed due to low confidence (2)

cactus/graph/graph_io.cpp:33

  • This INT4 unpack helper assumes pairwise nibble packing (alternating low/high outputs). The Python INT4 writer now uses planar packing (first 16 values in low nibbles, next 16 in high nibbles), so this unpacking will reorder values incorrectly when enable_int4_packing is disabled. Update it to expand each byte into two outputs placed 16 apart within each 16-byte block (or otherwise reconstruct low[0..15] then high[0..15]).
    inline void unpack_int4_to_int8(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            uint8_t byte = packed[i];

            int8_t low = static_cast<int8_t>((byte & 0x0F) - 8);

            int8_t high = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);

            unpacked[i * 2] = low;
            unpacked[i * 2 + 1] = high;
        }
    }
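A planar-aware version along the lines the comment requests might look like this; it is a sketch of the suggested fix, not the repository's actual follow-up:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of a planar-aware unpack (what the review comment asks for, not the
// repository's actual fix): each 16-byte packed block expands to 32 int8
// values, with low nibbles filling positions 0..15 and high nibbles 16..31.
// Nibbles use offset-binary encoding, so stored value n decodes to n - 8.
inline void unpack_int4_planar(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
    assert(packed_size % 16 == 0);  // planar layout packs whole 16-byte blocks
    for (size_t block = 0; block < packed_size / 16; ++block) {
        const uint8_t* src = packed + block * 16;
        int8_t* dst = unpacked + block * 32;
        for (size_t j = 0; j < 16; ++j) {
            dst[j]      = static_cast<int8_t>((src[j] & 0x0F) - 8);         // low[0..15]
            dst[j + 16] = static_cast<int8_t>(((src[j] >> 4) & 0x0F) - 8);  // high[0..15]
        }
    }
}
```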

cactus/graph/graph_io.cpp:192

  • save_node() only sets FLAG_HAS_SCALES / group_size / num_groups for grouped INT8 buffers. Group-wise INT4 tensors also have scales (and are now supported elsewhere), so saving an INT4 node will drop its scales metadata and produce an invalid/incomplete file. Extend has_scales to include grouped INT4 (and ensure the header encodes the group params/scales bytes for INT4 too).
    size_t byte_size = PrecisionTraits::packed_size_of(precision, total_elements);

    bool has_scales = (precision == Precision::INT8 && buffer.is_grouped_int8() && buffer.scales_data);
    size_t N = shape.size() >= 1 ? shape[0] : 1;
    size_t scales_bytes = has_scales ? (N * buffer.num_groups * sizeof(__fp16)) : 0;
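The requested extension could look roughly like the following; `Precision` and `Buffer` here are simplified stand-ins for the repository's real types, so this shows only the shape of the condition, not a drop-in patch:

```cpp
#include <cstddef>

// Simplified stand-ins for the repository's types (assumptions, not the real
// definitions) used to illustrate the extended has_scales condition.
enum class Precision { FP16, INT8, INT4 };

struct Buffer {
    bool grouped_int8 = false;
    bool grouped_int4 = false;
    const void* scales_data = nullptr;
};

// Grouped INT4 tensors carry per-group scales just like grouped INT8, so
// save_node() should keep their scales metadata instead of dropping it.
inline bool buffer_has_scales(Precision precision, const Buffer& buffer) {
    const bool grouped_int8 = precision == Precision::INT8 && buffer.grouped_int8;
    const bool grouped_int4 = precision == Precision::INT4 && buffer.grouped_int4;
    return (grouped_int8 || grouped_int4) && buffer.scales_data != nullptr;
}
```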


@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 9f8284c to fd9d491 on February 18, 2026 21:47
@ncylich ncylich closed this Feb 18, 2026
@ncylich ncylich reopened this Feb 18, 2026
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 383194f to 3ba0048 on February 18, 2026 22:30
@ncylich ncylich requested a review from Copilot February 18, 2026 22:32
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 2 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 6 comments.



Comment on lines +63 to +65
std::memcpy(output + PrecisionTraits::byte_offset_of(prec, i * element_size),
tensor_data + PrecisionTraits::byte_offset_of(prec, idx * element_size),
bytes_per_element);

Copilot AI Feb 20, 2026


The byte_offset_of function for INT4 precision requires element offsets to be multiples of 32 (as enforced by the assert in graph.h line 153). However, in the gather operation, element_size (which is the product of all dimensions except the first) may not be a multiple of 32. This means that for indices i > 0, the expression i * element_size may not satisfy the alignment requirement, causing an assertion failure at runtime. This issue will occur when gathering from INT4 tensors where the inner dimensions don't multiply to a value that's a multiple of 32.
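The constraint being described can be made concrete with a small stand-in for the INT4 case of byte_offset_of; the 32-element assert and the halving are inferred from the review text, not copied from graph.h:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in illustrating the constraint the comment describes (inferred from
// the review text, not the actual graph.h code): INT4 packs two elements per
// byte, and element offsets are only accepted on 32-element boundaries, so a
// gather whose element_size is not a multiple of 32 trips the assert.
inline size_t int4_byte_offset_of(size_t element_offset) {
    assert(element_offset % 32 == 0);  // fires for misaligned gather offsets
    return element_offset / 2;         // two 4-bit elements per byte
}
```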

Comment on lines +79 to +81
std::memcpy(output + PrecisionTraits::byte_offset_of(prec, i * element_size),
tensor_data + PrecisionTraits::byte_offset_of(prec, idx * element_size),
bytes_per_element);

Copilot AI Feb 20, 2026


The byte_offset_of function for INT4 precision requires element offsets to be multiples of 32 (as enforced by the assert in graph.h line 153). However, in the gather operation, element_size (which is the product of all dimensions except the first) may not be a multiple of 32. This means that for indices i > 0, the expression i * element_size may not satisfy the alignment requirement, causing an assertion failure at runtime. This issue will occur when gathering from INT4 tensors where the inner dimensions don't multiply to a value that's a multiple of 32.

}

const size_t copy_block_elements = slice_length * inner_elements;
const size_t axis_stride_elements = axis_size * inner_elements;

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, copy_block_elements and axis_stride_elements may not be multiples of 32 if the slice dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors along axes where the slice length or stride doesn't result in multiples of 32 elements.

Suggested change

    const size_t axis_stride_elements = axis_size * inner_elements;
    // For INT4 precision, element offsets passed to byte_offset_of must be multiples of 32.
    // Enforce this alignment for the copy block size, axis stride, and slice start offset.
    if (input_buffer.precision == Precision::INT4) {
        const size_t slice_start_elements = slice_start * inner_elements;
        if ((copy_block_elements % 32) != 0 ||
            (axis_stride_elements % 32) != 0 ||
            (slice_start_elements % 32) != 0) {
            throw std::runtime_error("Slice on INT4 tensors requires 32-element alignment of slice length, axis stride, and slice start.");
        }
    }


  for (size_t outer = 0; outer < outer_elements; ++outer) {
-     const char* src = input_ptr + outer * axis_stride_bytes + slice_start * inner_elements * element_size;
+     const char* src = input_ptr + outer * axis_stride_bytes + PrecisionTraits::byte_offset_of(input_buffer.precision, slice_start * inner_elements);

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, slice_start * inner_elements may not be a multiple of 32 if the slice start position or inner dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors.

Comment on lines +379 to +381
char* output_offset_bytes = output_data + PrecisionTraits::byte_offset_of(input_buffer.precision, output_idx);
const char* input_offset_bytes = input_data + PrecisionTraits::byte_offset_of(input_buffer.precision, input_base);
size_t length = PrecisionTraits::byte_offset_of(input_buffer.precision, slice_size);

Copilot AI Feb 20, 2026


The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, output_idx, input_base, and slice_size may not be multiples of 32 depending on the tensor shape and indexing parameters. This could cause assertion failures when indexing into INT4 tensors.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch from e41881c to b4d4345 on February 23, 2026 06:53
@HenryNdubuaku
Copy link
Copy Markdown
Collaborator

Thanks @ncylich and @jrajala6, this improved decode speed by 20%!

@HenryNdubuaku HenryNdubuaku merged commit 7af3e4b into cactus-compute:main Feb 23, 2026
1 of 4 checks passed
ncylich added a commit that referenced this pull request Feb 24, 2026
* Basic addition of int4 functionality

Signed-off-by: Jisha Rajala <[email protected]>

* basic working/optimized int4 gemv with testing

Signed-off-by: Noah Cylich <[email protected]>

* Added gemm functionality

Signed-off-by: Jisha Rajala <[email protected]>

* int4 is working!!!

Signed-off-by: Noah Cylich <[email protected]>

* semi-working int4 (too slow still)

Signed-off-by: Noah Cylich <[email protected]>

* gemma matmul benchmark

Signed-off-by: Noah Cylich <[email protected]>

* Added e2e gemma 3 model testing with int4 quantization vs int8 quantization

Signed-off-by: Jisha Rajala <[email protected]>

* further optimized alu ops using imm8 simd instructions

Signed-off-by: Noah Cylich <[email protected]>

* adding synthetic fake int4 testing option

Signed-off-by: Noah Cylich <[email protected]>

* trying to copy ggml int4 - but it has worse perf, did take inspiration from it for a colocated kernel though

Signed-off-by: Noah Cylich <[email protected]>

* optimized int4 further

Signed-off-by: Noah Cylich <[email protected]>

* cleaned code in preparation for proper pr

Signed-off-by: Noah Cylich <[email protected]>

* fixed unpacking

Signed-off-by: Noah Cylich <[email protected]>

* fixed execution graph basic unpacking and analysis

Signed-off-by: Noah Cylich <[email protected]>

* added int4 support to graph ops

Signed-off-by: Noah Cylich <[email protected]>

* adding int4 to moe matmul

Signed-off-by: Noah Cylich <[email protected]>

* improved matmul code patterns

Signed-off-by: Noah Cylich <[email protected]>

* got rid of i8mm instructions, but significantly slower

Signed-off-by: Noah Cylich <[email protected]>

---------

Signed-off-by: Jisha Rajala <[email protected]>
Signed-off-by: Noah Cylich <[email protected]>
Co-authored-by: Noah Cylich <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request Feb 28, 2026

4 participants