
Releases: vectorch-ai/ScaleLLM

v0.2.6

13 Sep 00:06


What's Changed

  • fix: choose cuda architectures based on cuda version by @guocuimi in #463
  • kernel: add grouped gemm support for moe by @guocuimi in #458
  • kernel: added oob handling for grouped gemm kernel by @guocuimi in #465
  • refactor: add _1 into stride for contiguous dim by @guocuimi in #466
  • ci: set cuda arch to native for ci workflows by @guocuimi in #467
  • refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in #468
  • refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in #469
  • fix: skip failed unittests for blackwell gpus by @guocuimi in #472
  • feat: added single tile scheduler for attn kernel by @guocuimi in #473
  • feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in #474
  • refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in #475
  • feat: use global residue_mnk for oob handling by @guocuimi in #476
  • feat: simplify mask logic to avoid manual index computation by @guocuimi in #477
  • feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in #478 (see the sketch after this list)
  • feat: added gtest_main with filters based on compute_capabilities by @guocuimi in #479
  • ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in #481
  • feat: add tma copy for paged kv by @guocuimi in #480
  • feat: added gather tma copy to control smem box size by @guocuimi in #482
  • feat: use aggressive compress-mode for fatbin by @guocuimi in #484
  • feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in #485
  • feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in #483
  • feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in #486
  • refactor: move kernel code into different folders by @guocuimi in #487
  • feat: added multi-stage KV support for attn sm120 by @guocuimi in #489
  • refactor: simplify mha block tiling logic by @guocuimi in #488
  • feat: added smem and gmem layout selector for attn kernel by @guocuimi in #490
  • feat: added args and params for attn kernels by @guocuimi in #491
  • feat: added universal fmha runner by @guocuimi in #492
  • feat: added kernel builder for attn by @guocuimi in #493
  • refactor: change stride for Q/K/V to MNKL by @guocuimi in #494
  • upgrade torch to 2.8 by @guocuimi in #496
  • ci: fix nccl related build error by @guocuimi in #497
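
Several entries above revolve around tile scheduling for the grouped-GEMM and attention kernels. As a rough illustration of what the static persistent tile scheduler with swizzled rasterization (#478) does, here is a minimal host-side C++ sketch; the names, the group width, and the linear-index mapping are illustrative assumptions, not ScaleLLM's actual CuTe-based implementation.

```cpp
// Sketch: map a linear tile index to a swizzled (m, n) tile coordinate.
// Tiles are rasterized along M within groups of kGroupN output columns,
// which keeps concurrently running CTAs on nearby tiles for L2 reuse.
#include <cstdio>

struct TileCoord { int m, n; };

template <int kGroupN>
TileCoord swizzled_tile(int linear_idx, int tiles_m, int tiles_n) {
  int tiles_per_group = tiles_m * kGroupN;
  int group = linear_idx / tiles_per_group;
  int rem = linear_idx % tiles_per_group;
  int group_n = group * kGroupN;
  // The last group may be narrower than kGroupN.
  int width = (tiles_n - group_n < kGroupN) ? (tiles_n - group_n) : kGroupN;
  return {rem / width, group_n + rem % width};
}

int main() {
  const int tiles_m = 4, tiles_n = 6;
  // A persistent worker would start at its worker id and stride by the
  // number of workers instead of visiting every index like this demo does.
  for (int idx = 0; idx < tiles_m * tiles_n; ++idx) {
    TileCoord t = swizzled_tile<2>(idx, tiles_m, tiles_n);
    std::printf("tile %2d -> (m=%d, n=%d)\n", idx, t.m, t.n);
  }
}
```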

Full Changelog: v0.2.5...v0.2.6

v0.2.5

27 May 07:58


What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4

02 Mar 02:34


What's Changed

  • ci: add option to skip nvbench build by @guocuimi in #390
  • ci: build devel image with cuda 12.8 for blackwell by @guocuimi in #391
  • kernel: added query packing support for attention by @guocuimi in #392
  • refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
  • kernel: added triton aot compiler by @guocuimi in #394
  • kernel: generate smaller kernel instantiations by @guocuimi in #395
  • kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
  • upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
  • kernel: added simple MLA kernel by @guocuimi in #396
  • kernel: added pipeline support for mla by @guocuimi in #399
  • kernel: added ping-pong rmem support for MLA by @guocuimi in #400
  • kernel: revert experimental TiledMMA separation change by @guocuimi in #401
  • kernel: put query always in registers for mha by @guocuimi in #402
  • kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
  • kernel: revert mla ping-pong rmem change by @guocuimi in #404
  • kernel: refactor mask logic to avoid using hard-coded stride by @guocuimi in #405
  • kernel: added causal mask for MLA kernel by @guocuimi in #406
  • kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100KB smem by @guocuimi in #407
  • kernel: fix mask bugs for MLA by @guocuimi in #408
  • kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
  • kernel: added stage support for MLA kernel by @guocuimi in #410
  • misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
  • kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
  • kernel: optimize mask loop for MLA kernel by @guocuimi in #414
  • kernel: added paged kv support for MLA kernel by @guocuimi in #415
  • kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
  • kernel: use FastDivmod in attention kernels by @guocuimi in #417 (see the sketch after this list)
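
The FastDivmod change (#417) replaces integer division and modulo, which are slow on GPUs, with a precomputed multiply-and-shift, in the spirit of cutlass::FastDivmod. A minimal sketch of the idea (the narrow 16-bit validity bound is a simplification for clarity; production code widens the multiplier and shift to cover full 32-bit operands):

```cpp
// Fast division by a runtime-constant divisor via multiply-and-shift.
#include <cassert>
#include <cstdint>
#include <cstdio>

struct FastDivmod {
  uint32_t divisor;
  uint64_t multiplier;  // ceil(2^32 / divisor)
  explicit FastDivmod(uint32_t d) : divisor(d) {
    multiplier = ((uint64_t(1) << 32) + d - 1) / d;
  }
  // Exact for x < 2^16 with divisor < 2^16 -- enough for head counts and
  // block sizes; CUTLASS uses a wider shift to handle any 32-bit x.
  void operator()(uint32_t x, uint32_t& quot, uint32_t& rem) const {
    quot = uint32_t((x * multiplier) >> 32);
    rem = x - quot * divisor;
  }
};

int main() {
  FastDivmod by_heads(12);  // e.g. split a flat (seq * n_heads + head) index
  for (uint32_t x = 0; x < (1u << 16); ++x) {
    uint32_t q, r;
    by_heads(x, q, r);
    assert(q == x / 12 && r == x % 12);
  }
  std::printf("fast divmod matches plain div/mod\n");
}
```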

Full Changelog: v0.2.3...v0.2.4

v0.2.3

26 Jan 22:13


What's Changed

  • misc: remove legacy logic to support quantization for other types by @guocuimi in #350
  • upgrade pytorch to 2.5.1 by @guocuimi in #351
  • added cuda 12.6 build image by @guocuimi in #353
  • fix cmake version issue for manylinux image by @guocuimi in #354
  • kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
  • ci: fix package test workflow by @guocuimi in #357
  • kernel: refactor attention kernel for readability by @guocuimi in #358
  • dev: config dev container with proper extensions by @guocuimi in #359
  • kernel: added attention bench for profiling before optimization by @guocuimi in #360
  • kernel: added logits soft cap support for attention by @guocuimi in #362
  • tools: added attention traits viewer by @guocuimi in #363
  • kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
  • kernel: added causal, alibi, sliding window mask for attention by @guocuimi in #365 (see the sketch after this list)
  • kernel: refactor attention kernel and add more unittests by @guocuimi in #366
  • kernel: added M/N OOB handling for attention by @guocuimi in #367
  • tools: update svg build to generate smaller files by @guocuimi in #368
  • kernel: added attention params and tile for different input types by @guocuimi in #369
  • kernel: added mqa and gqa support for attention by @guocuimi in #370
  • kernel: added var len and paged kv cache support for attention by @guocuimi in #371
  • kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
  • kernel: added attention kernel launch by @guocuimi in #373
  • kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
  • kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
  • kernel: added head_dim=96 support for attention by @guocuimi in #376
  • kernel: optimize attention kernel performance by @guocuimi in #377
  • upgrade cutlass to 3.7.0 by @guocuimi in #379
  • kernel: handle kv block range for attention kernel by @guocuimi in #382
  • kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
  • kernel: separate oob iterations for better performance by @guocuimi in #384
  • refactor: remove batch_prefill interface by @guocuimi in #385
  • refactor: stop building the flash_infer kernel by @guocuimi in #386
  • feat: integrate in-house scale attention and use it by default by @guocuimi in #380
  • kernel: only zfill k once to improve perf for attention by @guocuimi in #387
  • refactor: skip flash_attn build by @guocuimi in #388
  • refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
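
For readers unfamiliar with the masking terms above, #365 adds causal, ALiBi, and sliding-window masks and #362 adds a logits soft cap. Below is a minimal host-side sketch of those rules for a single (query, key) position, with illustrative names; the kernels apply the same predicates element-wise inside each attention tile.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// Additive bias applied to a raw attention score; -inf masks the position.
float mask_bias(int q_idx, int kv_idx, int q_len, int kv_len,
                bool causal, int sliding_window, float alibi_slope) {
  // Align queries to the end of the kv sequence (standard for decoding,
  // where kv_len >= q_len and the last query may attend to everything).
  int q_pos = q_idx + (kv_len - q_len);
  if (causal && kv_idx > q_pos)
    return -std::numeric_limits<float>::infinity();
  if (sliding_window > 0 && q_pos - kv_idx >= sliding_window)
    return -std::numeric_limits<float>::infinity();
  // ALiBi: per-head linear penalty that grows with distance.
  return alibi_slope * float(kv_idx - q_pos);
}

// Logits soft cap (#362): squash scores smoothly into (-cap, cap).
float soft_cap(float score, float cap) { return cap * std::tanh(score / cap); }

int main() {
  // Single query over a 4-token kv cache: it sits at position 3.
  std::printf("bias(kv=2)    = %f\n", mask_bias(0, 2, 1, 4, true, 0, 0.1f));
  std::printf("soft_cap(100) = %f\n", soft_cap(100.f, 30.f));
}
```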

Full Changelog: v0.2.2...v0.2.3

v0.2.2

26 Oct 03:12


What's Changed

  • kernel: added flash infer attention impl by @guocuimi in #327
  • refactor: flatten block tables to 1d tensor by @guocuimi in #328 (see the sketch after this list)
  • kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
  • refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
  • kernel: port flash infer handler + wrapper logics by @guocuimi in #331
  • ut: added unittests for flash infer kernels by @guocuimi in #332
  • refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
  • feat: added pass-in alibi slopes support for flash infer kernel by @guocuimi in #334
  • refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
  • ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
  • ci: added pip cache to avoid redownloading by @guocuimi in #337
  • upgrade pytorch to 2.4.1 by @guocuimi in #341
  • ci: run package test in docker by @guocuimi in #345
  • ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
  • upgrade pytorch to 2.5.0 by @guocuimi in #347
  • ut: add more tests for different warp layout by @guocuimi in #340
  • misc: attention kernel refactoring by @guocuimi in #339
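
The flattened block table (#328) and the kv_indptr change (#333) amount to storing every sequence's page list back-to-back in one 1-D array, CSR-style, instead of a padded 2-D table. A minimal sketch with hypothetical names (PagedKv, locate), not ScaleLLM's actual paged_kv_t API:

```cpp
#include <cstdio>
#include <vector>

struct PagedKv {
  std::vector<int> block_ids;  // all sequences' physical pages, back-to-back
  std::vector<int> kv_indptr;  // kv_indptr[s]..kv_indptr[s+1] index seq s's pages
  int page_size;

  // Map (sequence, token position) to (physical block, in-block offset).
  void locate(int seq, int pos, int& block, int& offset) const {
    block = block_ids[kv_indptr[seq] + pos / page_size];
    offset = pos % page_size;
  }
};

int main() {
  // Two sequences: seq 0 owns pages {7, 3}, seq 1 owns page {5}; 16 tokens/page.
  PagedKv kv{{7, 3, 5}, {0, 2, 3}, 16};
  int block, offset;
  kv.locate(0, 20, block, offset);  // token 20 of seq 0 -> page 3, offset 4
  std::printf("block=%d offset=%d\n", block, offset);
}
```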

Full Changelog: v0.2.1...v0.2.2

v0.2.1

04 Sep 23:00


What's Changed

  • feat: added awq marlin qlinear by @guocuimi in #315
  • build: speed up compilation for marlin kernels by @guocuimi in #316
  • test: added unittests for marlin kernels by @guocuimi in #317
  • refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
  • fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
  • cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
  • ci: allow build without requiring a physical gpu device by @guocuimi in #321
  • fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
  • refactor: added static switch for marlin kernel dispatch by @guocuimi in #325 (see the sketch after this list)
  • feat: fix and use marlin kernel for awq by default by @guocuimi in #326
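
The static switch in #325 is a common dispatch idiom (flash-attention uses the same trick) that lifts a runtime flag into a compile-time constant so each kernel variant is compiled as a fully specialized template. A minimal sketch; BOOL_SWITCH and marlin_gemm_impl are illustrative stand-ins, not ScaleLLM's actual macro or kernel:

```cpp
#include <cstdio>

template <bool kHasZeroPoint>
void marlin_gemm_impl() {  // stand-in for a templated kernel launch
  std::printf("dispatched with kHasZeroPoint=%d\n", int(kHasZeroPoint));
}

// Turn a runtime bool into a constexpr visible to the callback body.
#define BOOL_SWITCH(cond, kConstName, ...)  \
  do {                                      \
    if (cond) {                             \
      constexpr bool kConstName = true;     \
      __VA_ARGS__();                        \
    } else {                                \
      constexpr bool kConstName = false;    \
      __VA_ARGS__();                        \
    }                                       \
  } while (0)

int main() {
  bool has_zero_point = true;  // known only at runtime (e.g. from a checkpoint)
  BOOL_SWITCH(has_zero_point, kHasZeroPoint,
              [&] { marlin_gemm_impl<kHasZeroPoint>(); });
}
```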

Full Changelog: v0.2.0...v0.2.1

v0.2.0

22 Aug 01:49


What's Changed

  • kernel: port softcap support for flash attention by @guocuimi in #298
  • test: added unittests for attention sliding window by @guocuimi in #299
  • model: added gemma2 with softcap and sliding window support by @guocuimi in #300
  • kernel: support kernel test in python via pybind by @guocuimi in #301
  • test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
  • fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
  • refactor: move models to upper folder by @guocuimi in #306
  • kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
  • rust: upgrade rust libs to latest version by @guocuimi in #309
  • refactor: remove the logic that loads individual weights from shared partitions by @guocuimi in #311
  • feat: added fused column parallel linear by @guocuimi in #313 (see the sketch after this list)
  • feat: added gptq marlin qlinear layer by @guocuimi in #312
  • kernel: port awq repack kernel by @guocuimi in #314
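
The fused column-parallel linear (#313) stacks the Q/K/V projection weights along the output (column) dimension so a single GEMM replaces three, and the individual projections fall out as column slices. A toy sketch with a naive reference GEMM (illustrative only; the real layer runs one fused GEMM on the GPU):

```cpp
#include <cstdio>
#include <vector>

// y[m][n] = sum_k x[m][k] * w[k][n]; row-major, naive reference GEMM.
void gemm(const std::vector<float>& x, const std::vector<float>& w,
          std::vector<float>& y, int M, int K, int N) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) acc += x[m * K + k] * w[k * N + n];
      y[m * N + n] = acc;
    }
}

int main() {
  const int M = 2, K = 4, Nq = 4, Nk = 2, Nv = 2;
  // w_qkv is [K, Nq+Nk+Nv]: q/k/v weights stacked column-wise, so one GEMM
  // produces [q | k | v] side by side and splitting is just slicing.
  std::vector<float> x(M * K, 1.f), w_qkv(K * (Nq + Nk + Nv), 0.5f);
  std::vector<float> qkv(M * (Nq + Nk + Nv));
  gemm(x, w_qkv, qkv, M, K, Nq + Nk + Nv);
  // In row m, columns [0, Nq) are q, [Nq, Nq+Nk) are k, the rest are v.
  std::printf("q[0][0]=%.1f k[0][0]=%.1f\n", qkv[0], qkv[Nq]);
}
```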

Full Changelog: v0.1.9...v0.2.0

v0.1.9

04 Aug 00:38


What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8

25 Jul 12:02


What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7

24 Jul 06:12


What's Changed

  • build: fix build error with gcc-13 by @guocuimi in #264
  • kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
  • cmake: define header-only library instead of symlink for cutlass and flashinfer by @guocuimi in #266
  • feat: added range to support Range-for loops by @guocuimi in #267
  • kernel: added attention cpu implementation for testing by @guocuimi in #268
  • build: added nvbench as submodule by @guocuimi in #269
  • build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
  • ci: build and test in devel docker image by @guocuimi in #272
  • ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
  • attention: added tiling logic using cute::local_tile to the cpu attention by @guocuimi in #273
  • kernel: added playground for learning and experimenting with cute by @guocuimi in #274
  • feat: added rope scaling support for llama3.1 by @guocuimi in #277 (see the sketch after this list)
  • update docs for llama3.1 support and bump up version by @guocuimi in #278
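
The Llama 3.1 rope scaling from #277 leaves high-frequency rotary components untouched, divides low-frequency ones by the scaling factor, and interpolates smoothly in between. A sketch using the published Llama 3.1 hyper-parameters (factor 8, low/high frequency factors 1 and 4, original context length 8192); in practice these come from the model config rather than being hard-coded:

```cpp
#include <cstdio>

// Rescale one rotary inverse frequency per the Llama 3.1 rule.
float scale_inv_freq(float inv_freq) {
  const float factor = 8.f, lo = 1.f, hi = 4.f, old_len = 8192.f;
  const float kPi = 3.14159265f;
  const float wavelen = 2.f * kPi / inv_freq;
  if (wavelen < old_len / hi) return inv_freq;           // high freq: keep
  if (wavelen > old_len / lo) return inv_freq / factor;  // low freq: scale down
  // Medium frequencies: blend between the scaled and unscaled values.
  const float smooth = (old_len / wavelen - lo) / (hi - lo);
  return (1.f - smooth) * inv_freq / factor + smooth * inv_freq;
}

int main() {
  std::printf("high freq: 0.1    -> %g (kept)\n", scale_inv_freq(0.1f));
  std::printf("low  freq: 0.0001 -> %g (/8)\n", scale_inv_freq(1e-4f));
}
```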

Full Changelog: v0.1.6...v0.1.7