Basic addition of int4 functionality (#343)
Conversation
Force-pushed from 795220b to f574155.
Force-pushed from b6b38e4 to ee85502.
Pull request overview
Adds initial INT4 (4-bit) grouped-weight matmul support end-to-end (Python export → graph loader/executor → NEON kernel), plus a new kernel correctness test.
Changes:
- Introduces INT4 GEMV/GEMM kernels with packed (nibble) weight format and integrates dispatch in graph matmul.
- Updates Python tensor export to pack INT4 weights (planar layout) and avoids INT4 for embedding weights.
- Adjusts graph byte sizing logic to use packed sizes for INT4 across buffer sizing, I/O, and debug capture.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_kernel.cpp | Adds INT4 matmul correctness test (GEMV + GEMM). |
| python/src/tensor_io.py | Changes INT4 packing to planar layout; forces embedding weights to INT8 when INT4 requested. |
| cactus/kernel/kernel_quants.cpp | Updates INT4 nibble decode to offset-binary; (unpack layout needs follow-up). |
| cactus/kernel/kernel_matmul.cpp | Adds INT4 GEMV/GEMM/matmul implementation (packed weights + optional bias correction). |
| cactus/kernel/kernel.h | Exposes INT4 matmul APIs. |
| cactus/graph/graph_ops_tensor.cpp | Uses packed byte sizing for gather copies (INT4-aware sizing). |
| cactus/graph/graph_ops_nn.cpp | Routes grouped INT4 RHS to cactus_matmul_int4 and updates error messages. |
| cactus/graph/graph_io.cpp | Uses packed byte sizing; adds INT4 handling in mmap paths; (INT4 unpack + save scales need follow-up). |
| cactus/graph/graph_execute.cpp | Uses packed byte sizing for debug capture writes. |
| cactus/graph/graph_core.cpp | Computes BufferDesc::byte_size using packed byte sizing. |
| cactus/graph/graph.h | Adds is_grouped_int4 and packed_size_of() helper used across graph. |
| cactus/engine/engine.h | Adds constructors to index structs. |
| .gitignore | Adds *_profile.txt and normalizes *.bin entry. |
Comments suppressed due to low confidence (2)
cactus/graph/graph_io.cpp:33
- This INT4 unpack helper assumes pairwise nibble packing (alternating low/high outputs). The Python INT4 writer now uses planar packing (first 16 values in low nibbles, next 16 in high nibbles), so this unpacking will reorder values incorrectly when enable_int4_packing is disabled. Update it to expand each byte into two outputs placed 16 apart within each 16-byte block (or otherwise reconstruct low[0..15] then high[0..15]).
```cpp
inline void unpack_int4_to_int8(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
    for (size_t i = 0; i < packed_size; i++) {
        uint8_t byte = packed[i];
        int8_t low = static_cast<int8_t>((byte & 0x0F) - 8);
        int8_t high = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);
        unpacked[i * 2] = low;
        unpacked[i * 2 + 1] = high;
    }
}
```
cactus/graph/graph_io.cpp:192
- save_node() only sets FLAG_HAS_SCALES / group_size / num_groups for grouped INT8 buffers. Group-wise INT4 tensors also have scales (and are now supported elsewhere), so saving an INT4 node will drop its scales metadata and produce an invalid/incomplete file. Extend has_scales to include grouped INT4 (and ensure the header encodes the group params/scales bytes for INT4 too).
```cpp
size_t byte_size = PrecisionTraits::packed_size_of(precision, total_elements);
bool has_scales = (precision == Precision::INT8 && buffer.is_grouped_int8() && buffer.scales_data);
size_t N = shape.size() >= 1 ? shape[0] : 1;
size_t scales_bytes = has_scales ? (N * buffer.num_groups * sizeof(__fp16)) : 0;
```
Force-pushed from 9f8284c to fd9d491.
Force-pushed from 383194f to 3ba0048.
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 6 comments.
cactus/graph/graph_ops_tensor.cpp
Outdated
```cpp
std::memcpy(output + PrecisionTraits::byte_offset_of(prec, i * element_size),
            tensor_data + PrecisionTraits::byte_offset_of(prec, idx * element_size),
            bytes_per_element);
```
The byte_offset_of function for INT4 precision requires element offsets to be multiples of 32 (as enforced by the assert in graph.h line 153). However, in the gather operation, element_size (which is the product of all dimensions except the first) may not be a multiple of 32. This means that for indices i > 0, the expression i * element_size may not satisfy the alignment requirement, causing an assertion failure at runtime. This issue will occur when gathering from INT4 tensors where the inner dimensions don't multiply to a value that's a multiple of 32.
```cpp
const size_t copy_block_elements = slice_length * inner_elements;
const size_t axis_stride_elements = axis_size * inner_elements;
```
The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, copy_block_elements and axis_stride_elements may not be multiples of 32 if the slice dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors along axes where the slice length or stride doesn't result in multiples of 32 elements.
Suggested change:
```cpp
const size_t axis_stride_elements = axis_size * inner_elements;
// For INT4 precision, element offsets passed to byte_offset_of must be multiples of 32.
// Enforce this alignment for the copy block size, axis stride, and slice start offset.
if (input_buffer.precision == Precision::INT4) {
    const size_t slice_start_elements = slice_start * inner_elements;
    if ((copy_block_elements % 32) != 0 ||
        (axis_stride_elements % 32) != 0 ||
        (slice_start_elements % 32) != 0) {
        throw std::runtime_error("Slice on INT4 tensors requires 32-element alignment of slice length, axis stride, and slice start.");
    }
}
```
```diff
 for (size_t outer = 0; outer < outer_elements; ++outer) {
-    const char* src = input_ptr + outer * axis_stride_bytes + slice_start * inner_elements * element_size;
+    const char* src = input_ptr + outer * axis_stride_bytes + PrecisionTraits::byte_offset_of(input_buffer.precision, slice_start * inner_elements);
```
The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, slice_start * inner_elements may not be a multiple of 32 if the slice start position or inner dimensions don't align with the INT4 packing boundary. This could cause assertion failures when slicing INT4 tensors.
```cpp
char* output_offset_bytes = output_data + PrecisionTraits::byte_offset_of(input_buffer.precision, output_idx);
const char* input_offset_bytes = input_data + PrecisionTraits::byte_offset_of(input_buffer.precision, input_base);
size_t length = PrecisionTraits::byte_offset_of(input_buffer.precision, slice_size);
```
The byte_offset_of function requires element offsets to be multiples of 32 for INT4 precision. However, output_idx, input_base, and slice_size may not be multiples of 32 depending on the tensor shape and indexing parameters. This could cause assertion failures when indexing into INT4 tensors.
Force-pushed from e41881c to b4d4345.
Squashed commit history:
* Basic addition of int4 functionality
* basic working/optimized int4 gemv with testing
* Added gemm functionality
* int4 is working!!!
* semi-working int4 (too slow still)
* gemma matmul benchmark
* Added e2e gemma 3 model testing with int4 quantization vs int8 quantization
* further optimized alu ops using imm8 simd instructions
* adding synthetic fake int4 testing option
* trying to copy ggml int4 - but it has worse perf, did take inspiration from it for a colocated kernel though
* optimized int4 further
* cleaned code in preparation for proper pr
* fixed unpacking
* fixed execution graph basic unpacking and analysis
* added int4 support to graph ops
* adding int4 to moe matmul
* improved matmul code patterns
* got rid of i8mm instructions, but significantly slower

Signed-off-by: Jisha Rajala <[email protected]>
Signed-off-by: Noah Cylich <[email protected]>
Co-authored-by: Noah Cylich <[email protected]>
No description provided.