Fix hybrid attention sliding window for Gemma (#320) #338
Merged
HenryNdubuaku merged 5 commits into cactus-compute:main on Feb 12, 2026
Conversation
Force-pushed from 1ac7c56 to c0a7973
Collaborator: thanks a million for this @jrajala6
ncylich pushed a commit that referenced this pull request on Feb 24, 2026:
* Added hybrid attention functionality to gemma
  Signed-off-by: Jisha Rajala <[email protected]>
* Reduced diff in kernel_attention
  Signed-off-by: Jisha Rajala <[email protected]>
* Reduced diffs in kernel_attention.cpp
  Signed-off-by: Jisha Rajala <[email protected]>
* Remove int4_learning_lab.cpp from PR
  Signed-off-by: Jisha Rajala <[email protected]>
* some changes
  Signed-off-by: HenryNdubuaku <[email protected]>

Signed-off-by: Jisha Rajala <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
The hybrid INT8 attention kernel (cactus_attention_hybrid_int8_fp16) wasn't applying window_size during decode, so Gemma's local attention layers attended to the full context instead of only the most recent tokens, breaking the hybrid (global/local) attention mechanism.
This PR threads window_size from model_gemma.cpp through the graph API down to the kernel. In the kernel, positions outside the window are masked to -inf before softmax. The parameter defaults to 0 (global attention), so behaviour is unchanged for other models.
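For reference, a minimal sketch of the masking idea during decode. The function name, fp32 scores, and loop structure below are illustrative assumptions, not the actual INT8/FP16 kernel code:

```cpp
// Sketch of sliding-window masking before softmax (illustrative only).
// The real kernel operates on INT8/FP16 data with its own layout.
#include <cstddef>
#include <limits>
#include <vector>

// Apply a sliding-window mask to the attention scores of one decoded token.
// window_size == 0 means global attention, i.e. no masking (the default).
void apply_sliding_window_mask(std::vector<float>& scores,  // [seq_len] scores for the current query
                               size_t query_pos,            // absolute position of the decoded token
                               size_t window_size) {        // 0 = global attention
    if (window_size == 0) return;  // global layers and other models are untouched

    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (size_t key_pos = 0; key_pos < scores.size(); ++key_pos) {
        // Keep only the last `window_size` positions up to and including query_pos.
        bool in_window = (key_pos <= query_pos) && (key_pos + window_size > query_pos);
        if (!in_window) scores[key_pos] = neg_inf;  // masked out before softmax
    }
}
```

Because the mask is applied before softmax, masked positions receive zero attention weight, and the window_size = 0 default keeps global-attention layers on their existing path.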
Tested with test_kernel (10/10), test_graph (38/38), and end-to-end chat with Gemma 3 1B.