This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
We test our kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.
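A quick sanity check that the local toolchain matches the tested versions (plain PyTorch introspection, nothing MoA-specific):

```python
import torch

# The kernel is tested with CUDA 12.4 and PyTorch 2.4; other versions
# may work but are unverified.
print(f"PyTorch: {torch.__version__}")   # expect 2.4.x
print(f"CUDA:    {torch.version.cuda}")  # expect 12.4
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"
```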
Build and install the kernel:

```bash
cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
```

Then verify the installation with the accuracy test:

```bash
python accuracy_test.py
```
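For a quick hand-rolled check along the same lines, one can compare kernel output against PyTorch's dense attention as a reference. Note that the `moa_kernel` import and `sparse_attention` call below are hypothetical placeholders, not the package's actual API:

```python
import torch
import torch.nn.functional as F

# from moa_kernel import sparse_attention  # hypothetical import; see the python/ package

torch.manual_seed(0)
batch, heads, seq, dim = 1, 32, 1024, 128
q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dense causal attention as the reference output.
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# out = sparse_attention(q, k, v)  # hypothetical kernel call
# With a no-sparsity (full-attention) config the outputs should match closely:
# torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)
print(ref.shape)  # torch.Size([1, 32, 1024, 128])
```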
Our kernel is built upon the FlashInfer project. On top of it, we additionally:
- support batch size > 1
- support multi-GPU inference
- support GQA
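For reference, GQA (grouped-query attention) lets a group of query heads share one key/value head. A minimal dense PyTorch sketch of that layout (shapes assumed for illustration; this is not the MoA kernel API):

```python
import torch
import torch.nn.functional as F

# 32 query heads share 8 KV heads, so each KV head serves 4 query heads.
batch, seq, num_q_heads, num_kv_heads, head_dim = 1, 128, 32, 8, 128
group = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, num_kv_heads, seq, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

# Expand each KV head across its query-head group, then run dense attention.
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 128, 128])
```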