This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
We test our kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.
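A quick sanity check that the local toolchain matches the tested versions (plain PyTorch introspection, nothing MoA-specific):

```python
import torch

# The kernel is tested with CUDA 12.4 and PyTorch 2.4; other versions
# may work but are unverified.
print(f"PyTorch: {torch.__version__}")   # expect 2.4.x
print(f"CUDA:    {torch.version.cuda}")  # expect 12.4
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"
```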
Build and install the kernel:

```bash
cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
```

Then verify the installation with the accuracy test:

```bash
python accuracy_test.py
```
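For a quick hand-rolled check along the same lines, one can compare kernel output against PyTorch's dense attention as a reference. Note that the `moa_kernel` import and `sparse_attention` call below are hypothetical placeholders, not the package's actual API:

```python
import torch
import torch.nn.functional as F

# from moa_kernel import sparse_attention  # hypothetical import; see the python/ package

torch.manual_seed(0)
batch, heads, seq, dim = 1, 32, 1024, 128
q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dense causal attention as the reference output.
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# out = sparse_attention(q, k, v)  # hypothetical kernel call
# With a no-sparsity (full-attention) config the outputs should match closely:
# torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)
print(ref.shape)  # torch.Size([1, 32, 1024, 128])
```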
Our kernel is built upon the FlashInfer project. On top of it, we additionally:
- support batch size > 1
- support multi-GPU inference
- support GQA
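For reference, GQA (grouped-query attention) lets a group of query heads share one key/value head. A minimal dense PyTorch sketch of that layout (shapes assumed for illustration; this is not the MoA kernel API):

```python
import torch
import torch.nn.functional as F

# 32 query heads share 8 KV heads, so each KV head serves 4 query heads.
batch, seq, num_q_heads, num_kv_heads, head_dim = 1, 128, 32, 8, 128
group = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, num_kv_heads, seq, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

# Expand each KV head across its query-head group, then run dense attention.
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 128, 128])
```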