
Releases: vectorch-ai/ScaleLLM

v0.2.6

13 Sep 00:06


What's Changed

  • fix: choose cuda architectures based on cuda version by @guocuimi in #463
  • kernel: add grouped gemm support for moe by @guocuimi in #458
  • kernel: added oob handling for grouped gemm kernel by @guocuimi in #465
  • refactor: add _1 into stride for contiguous dim by @guocuimi in #466
  • ci: set cuda arch to native for ci workflows by @guocuimi in #467
  • refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in #468
  • refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in #469
  • fix: skip failed unittests for blackwell gpus by @guocuimi in #472
  • feat: added single tile scheduler for attn kernel by @guocuimi in #473
  • feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in #474
  • refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in #475
  • feat: use global residue_mnk for oob handling by @guocuimi in #476
  • feat: simplify mask logic to avoid manual index computation by @guocuimi in #477
  • feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in #478 (see the sketch after this list)
  • feat: added gtest_main with filters based on compute_capabilities by @guocuimi in #479
  • ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in #481
  • feat: add tma copy for paged kv by @guocuimi in #480
  • feat: added gather tma copy to control smem box size by @guocuimi in #482
  • feat: use aggressive compress-mode for fatbin by @guocuimi in #484
  • feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in #485
  • feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in #483
  • feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in #486
  • refactor: move kernel code into different folders by @guocuimi in #487
  • feat: added multi-stage KV support for attn sm120 by @guocuimi in #489
  • refactor: simplify mha block tiling logic by @guocuimi in #488
  • feat: added smem and gmem layout selector for attn kernel by @guocuimi in #490
  • feat: added args and params for attn kernels by @guocuimi in #491
  • feat: added universal fmha runner by @guocuimi in #492
  • feat: added kernel builder for attn by @guocuimi in #493
  • refactor: change stride for Q/K/V to MNKL by @guocuimi in #494
  • upgrade torch to 2.8 by @guocuimi in #496
  • ci: fix nccl related build error by @guocuimi in #497
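
Several entries above revolve around tile scheduling for the grouped-GEMM and attention kernels. As a rough illustration of what the static persistent tile scheduler with swizzled rasterization (#478) does, here is a minimal host-side C++ sketch; the names, the group width, and the linear-index mapping are illustrative assumptions, not ScaleLLM's actual CuTe-based implementation.

```cpp
// Sketch: map a linear tile index to a swizzled (m, n) tile coordinate.
// Tiles are rasterized along M within groups of kGroupN output columns,
// which keeps concurrently running CTAs on nearby tiles for L2 reuse.
#include <cstdio>

struct TileCoord { int m, n; };

template <int kGroupN>
TileCoord swizzled_tile(int linear_idx, int tiles_m, int tiles_n) {
  int tiles_per_group = tiles_m * kGroupN;
  int group = linear_idx / tiles_per_group;
  int rem = linear_idx % tiles_per_group;
  int group_n = group * kGroupN;
  // The last group may be narrower than kGroupN.
  int width = (tiles_n - group_n < kGroupN) ? (tiles_n - group_n) : kGroupN;
  return {rem / width, group_n + rem % width};
}

int main() {
  const int tiles_m = 4, tiles_n = 6;
  // A persistent worker would start at its worker id and stride by the
  // number of workers instead of visiting every index like this demo does.
  for (int idx = 0; idx < tiles_m * tiles_n; ++idx) {
    TileCoord t = swizzled_tile<2>(idx, tiles_m, tiles_n);
    std::printf("tile %2d -> (m=%d, n=%d)\n", idx, t.m, t.n);
  }
}
```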

Full Changelog: v0.2.5...v0.2.6

v0.2.5

27 May 07:58


What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4

02 Mar 02:34


What's Changed

  • ci: add option to skip nvbench build by @guocuimi in #390
  • ci: build devel image with cuda 12.8 for blackwell by @guocuimi in #391
  • kernel: added query packing support for attention by @guocuimi in #392
  • refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
  • kernel: added triton aot compiler by @guocuimi in #394
  • kernel: generate smaller kernel instantiations by @guocuimi in #395
  • kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
  • upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
  • kernel: added simple MLA kernel by @guocuimi in #396
  • kernel: added pipeline support for mla by @guocuimi in #399
  • kernel: added ping-pong rmem support for MLA by @guocuimi in #400
  • kernel: revert experimental TiledMMA separation change by @guocuimi in #401
  • kernel: put query always in registers for mha by @guocuimi in #402
  • kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
  • kernel: revert mla ping-pong rmem change by @guocuimi in #404
  • kernel: refactor mask logic to avoid using hard-coded stride by @guocuimi in #405
  • kernel: added causal mask for MLA kernel by @guocuimi in #406
  • kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100KB smem by @guocuimi in #407
  • kernel: fix mask bugs for MLA by @guocuimi in #408
  • kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
  • kernel: added stage support for MLA kernel by @guocuimi in #410
  • misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
  • kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
  • kernel: optimize mask loop for MLA kernel by @guocuimi in #414
  • kernel: added paged kv support for MLA kernel by @guocuimi in #415
  • kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
  • kernel: use FastDivmod in attention kernels by @guocuimi in #417 (see the sketch after this list)
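
The FastDivmod change (#417) replaces integer division and modulo, which are slow on GPUs, with a precomputed multiply-and-shift, in the spirit of cutlass::FastDivmod. A minimal sketch of the idea (the narrow 16-bit validity bound is a simplification for clarity; production code widens the multiplier and shift to cover full 32-bit operands):

```cpp
// Fast division by a runtime-constant divisor via multiply-and-shift.
#include <cassert>
#include <cstdint>
#include <cstdio>

struct FastDivmod {
  uint32_t divisor;
  uint64_t multiplier;  // ceil(2^32 / divisor)
  explicit FastDivmod(uint32_t d) : divisor(d) {
    multiplier = ((uint64_t(1) << 32) + d - 1) / d;
  }
  // Exact for x < 2^16 with divisor < 2^16 -- enough for head counts and
  // block sizes; CUTLASS uses a wider shift to handle any 32-bit x.
  void operator()(uint32_t x, uint32_t& quot, uint32_t& rem) const {
    quot = uint32_t((x * multiplier) >> 32);
    rem = x - quot * divisor;
  }
};

int main() {
  FastDivmod by_heads(12);  // e.g. split a flat (seq * n_heads + head) index
  for (uint32_t x = 0; x < (1u << 16); ++x) {
    uint32_t q, r;
    by_heads(x, q, r);
    assert(q == x / 12 && r == x % 12);
  }
  std::printf("fast divmod matches plain div/mod\n");
}
```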

Full Changelog: v0.2.3...v0.2.4

v0.2.3

26 Jan 22:13


What's Changed

  • misc: remove legacy logic to support quantization for other types by @guocuimi in #350
  • upgrade pytorch to 2.5.1 by @guocuimi in #351
  • added cuda 12.6 build image by @guocuimi in #353
  • fix cmake version issue for manylinux image by @guocuimi in #354
  • kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
  • ci: fix package test workflow by @guocuimi in #357
  • kernel: refactor attention kernel for readability by @guocuimi in #358
  • dev: config dev container with proper extensions by @guocuimi in #359
  • kernel: added attention bench for profiling before optimization by @guocuimi in #360
  • kernel: added logits soft cap support for attention by @guocuimi in #362
  • tools: added attention traits viewer by @guocuimi in #363
  • kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
  • kernel: added causal, alibi, sliding window mask for attention by @guocuimi in #365 (see the sketch after this list)
  • kernel: refactor attention kernel and add more unittests by @guocuimi in #366
  • kernel: added M/N OOB handling for attention by @guocuimi in #367
  • tools: update svg build to generate smaller files by @guocuimi in #368
  • kernel: added attention params and tile for different input types by @guocuimi in #369
  • kernel: added mqa and gqa support for attention by @guocuimi in #370
  • kernel: added var len and paged kv cache support for attention by @guocuimi in #371
  • kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
  • kernel: added attention kernel launch by @guocuimi in #373
  • kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
  • kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
  • kernel: added head_dim=96 support for attention by @guocuimi in #376
  • kernel: optimize attention kernel performance by @guocuimi in #377
  • upgrade cutlass to 3.7.0 by @guocuimi in #379
  • kernel: handle kv block range for attention kernel by @guocuimi in #382
  • kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
  • kernel: separate oob iterations for better performance by @guocuimi in #384
  • refactor: remove batch_prefill interface by @guocuimi in #385
  • refactor: stop building the flash_infer kernel by @guocuimi in #386
  • feat: integrate in-house scale attention and use it by default by @guocuimi in #380
  • kernel: only zfill k once to improve perf for attention by @guocuimi in #387
  • refactor: skip flash_attn build by @guocuimi in #388
  • refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
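
For readers unfamiliar with the masking terms above, #365 adds causal, ALiBi, and sliding-window masks and #362 adds a logits soft cap. Below is a minimal host-side sketch of those rules for a single (query, key) position, with illustrative names; the kernels apply the same predicates element-wise inside each attention tile.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// Additive bias applied to a raw attention score; -inf masks the position.
float mask_bias(int q_idx, int kv_idx, int q_len, int kv_len,
                bool causal, int sliding_window, float alibi_slope) {
  // Align queries to the end of the kv sequence (standard for decoding,
  // where kv_len >= q_len and the last query may attend to everything).
  int q_pos = q_idx + (kv_len - q_len);
  if (causal && kv_idx > q_pos)
    return -std::numeric_limits<float>::infinity();
  if (sliding_window > 0 && q_pos - kv_idx >= sliding_window)
    return -std::numeric_limits<float>::infinity();
  // ALiBi: per-head linear penalty that grows with distance.
  return alibi_slope * float(kv_idx - q_pos);
}

// Logits soft cap (#362): squash scores smoothly into (-cap, cap).
float soft_cap(float score, float cap) { return cap * std::tanh(score / cap); }

int main() {
  // Single query over a 4-token kv cache: it sits at position 3.
  std::printf("bias(kv=2)    = %f\n", mask_bias(0, 2, 1, 4, true, 0, 0.1f));
  std::printf("soft_cap(100) = %f\n", soft_cap(100.f, 30.f));
}
```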

Full Changelog: v0.2.2...v0.2.3

v0.2.2

26 Oct 03:12


What's Changed

  • kernel: added flash infer attention impl by @guocuimi in #327
  • refactor: flatten block tables to 1d tensor by @guocuimi in #328 (see the sketch after this list)
  • kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
  • refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
  • kernel: port flash infer handler + wrapper logics by @guocuimi in #331
  • ut: added unittests for flash infer kernels by @guocuimi in #332
  • refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
  • feat: added pass-in alibi slopes support for flash infer kernel by @guocuimi in #334
  • refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
  • ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
  • ci: added pip cache to avoid redownloading by @guocuimi in #337
  • upgrade pytorch to 2.4.1 by @guocuimi in #341
  • ci: run package test in docker by @guocuimi in #345
  • ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
  • upgrade pytorch to 2.5.0 by @guocuimi in #347
  • ut: add more tests for different warp layout by @guocuimi in #340
  • misc: attention kernel refactoring by @guocuimi in #339
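
The flattened block table (#328) and the kv_indptr change (#333) amount to storing every sequence's page list back-to-back in one 1-D array, CSR-style, instead of a padded 2-D table. A minimal sketch with hypothetical names (PagedKv, locate), not ScaleLLM's actual paged_kv_t API:

```cpp
#include <cstdio>
#include <vector>

struct PagedKv {
  std::vector<int> block_ids;  // all sequences' physical pages, back-to-back
  std::vector<int> kv_indptr;  // kv_indptr[s]..kv_indptr[s+1] index seq s's pages
  int page_size;

  // Map (sequence, token position) to (physical block, in-block offset).
  void locate(int seq, int pos, int& block, int& offset) const {
    block = block_ids[kv_indptr[seq] + pos / page_size];
    offset = pos % page_size;
  }
};

int main() {
  // Two sequences: seq 0 owns pages {7, 3}, seq 1 owns page {5}; 16 tokens/page.
  PagedKv kv{{7, 3, 5}, {0, 2, 3}, 16};
  int block, offset;
  kv.locate(0, 20, block, offset);  // token 20 of seq 0 -> page 3, offset 4
  std::printf("block=%d offset=%d\n", block, offset);
}
```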

Full Changelog: v0.2.1...v0.2.2

v0.2.1

04 Sep 23:00


What's Changed

  • feat: added awq marlin qlinear by @guocuimi in #315
  • build: speed up compilation for marlin kernels by @guocuimi in #316
  • test: added unittests for marlin kernels by @guocuimi in #317
  • refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
  • fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
  • cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
  • ci: allow build without requiring a physical gpu device by @guocuimi in #321
  • fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
  • refactor: added static switch for marlin kernel dispatch by @guocuimi in #325 (see the sketch after this list)
  • feat: fix and use marlin kernel for awq by default by @guocuimi in #326
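
The static switch in #325 is a common dispatch idiom (flash-attention uses the same trick) that lifts a runtime flag into a compile-time constant so each kernel variant is compiled as a fully specialized template. A minimal sketch; BOOL_SWITCH and marlin_gemm_impl are illustrative stand-ins, not ScaleLLM's actual macro or kernel:

```cpp
#include <cstdio>

template <bool kHasZeroPoint>
void marlin_gemm_impl() {  // stand-in for a templated kernel launch
  std::printf("dispatched with kHasZeroPoint=%d\n", int(kHasZeroPoint));
}

// Turn a runtime bool into a constexpr visible to the callback body.
#define BOOL_SWITCH(cond, kConstName, ...)  \
  do {                                      \
    if (cond) {                             \
      constexpr bool kConstName = true;     \
      __VA_ARGS__();                        \
    } else {                                \
      constexpr bool kConstName = false;    \
      __VA_ARGS__();                        \
    }                                       \
  } while (0)

int main() {
  bool has_zero_point = true;  // known only at runtime (e.g. from a checkpoint)
  BOOL_SWITCH(has_zero_point, kHasZeroPoint,
              [&] { marlin_gemm_impl<kHasZeroPoint>(); });
}
```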

Full Changelog: v0.2.0...v0.2.1

v0.2.0

22 Aug 01:49


What's Changed

  • kernel: port softcap support for flash attention by @guocuimi in #298
  • test: added unittests for attention sliding window by @guocuimi in #299
  • model: added gemma2 with softcap and sliding window support by @guocuimi in #300
  • kernel: support kernel test in python via pybind by @guocuimi in #301
  • test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
  • fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
  • refactor: move models to upper folder by @guocuimi in #306
  • kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
  • rust: upgrade rust libs to latest version by @guocuimi in #309
  • refactor: remove the logic that loads individual weights from shared partitions by @guocuimi in #311
  • feat: added fused column parallel linear by @guocuimi in #313 (see the sketch after this list)
  • feat: added gptq marlin qlinear layer by @guocuimi in #312
  • kernel: port awq repack kernel by @guocuimi in #314
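
The fused column-parallel linear (#313) stacks the Q/K/V projection weights along the output (column) dimension so a single GEMM replaces three, and the individual projections fall out as column slices. A toy sketch with a naive reference GEMM (illustrative only; the real layer runs one fused GEMM on the GPU):

```cpp
#include <cstdio>
#include <vector>

// y[m][n] = sum_k x[m][k] * w[k][n]; row-major, naive reference GEMM.
void gemm(const std::vector<float>& x, const std::vector<float>& w,
          std::vector<float>& y, int M, int K, int N) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) acc += x[m * K + k] * w[k * N + n];
      y[m * N + n] = acc;
    }
}

int main() {
  const int M = 2, K = 4, Nq = 4, Nk = 2, Nv = 2;
  // w_qkv is [K, Nq+Nk+Nv]: q/k/v weights stacked column-wise, so one GEMM
  // produces [q | k | v] side by side and splitting is just slicing.
  std::vector<float> x(M * K, 1.f), w_qkv(K * (Nq + Nk + Nv), 0.5f);
  std::vector<float> qkv(M * (Nq + Nk + Nv));
  gemm(x, w_qkv, qkv, M, K, Nq + Nk + Nv);
  // In row m, columns [0, Nq) are q, [Nq, Nq+Nk) are k, the rest are v.
  std::printf("q[0][0]=%.1f k[0][0]=%.1f\n", qkv[0], qkv[Nq]);
}
```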

Full Changelog: v0.1.9...v0.2.0

v0.1.9

04 Aug 00:38


What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8

25 Jul 12:02


What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7

24 Jul 06:12


What's Changed

  • build: fix build error with gcc-13 by @guocuimi in #264
  • kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
  • cmake: define header-only library instead of symlink for cutlass and flashinfer by @guocuimi in #266
  • feat: added range to support Range-for loops by @guocuimi in #267
  • kernel: added attention cpu implementation for testing by @guocuimi in #268
  • build: added nvbench as submodule by @guocuimi in #269
  • build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
  • ci: build and test in devel docker image by @guocuimi in #272
  • ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
  • attention: added tiling logic using cute::local_tile to the cpu attention by @guocuimi in #273
  • kernel: added playground for learning and experimenting with cute by @guocuimi in #274
  • feat: added rope scaling support for llama3.1 by @guocuimi in #277 (see the sketch after this list)
  • update docs for llama3.1 support and bump up version by @guocuimi in #278
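
The Llama 3.1 rope scaling from #277 leaves high-frequency rotary components untouched, divides low-frequency ones by the scaling factor, and interpolates smoothly in between. A sketch using the published Llama 3.1 hyper-parameters (factor 8, low/high frequency factors 1 and 4, original context length 8192); in practice these come from the model config rather than being hard-coded:

```cpp
#include <cstdio>

// Rescale one rotary inverse frequency per the Llama 3.1 rule.
float scale_inv_freq(float inv_freq) {
  const float factor = 8.f, lo = 1.f, hi = 4.f, old_len = 8192.f;
  const float kPi = 3.14159265f;
  const float wavelen = 2.f * kPi / inv_freq;
  if (wavelen < old_len / hi) return inv_freq;           // high freq: keep
  if (wavelen > old_len / lo) return inv_freq / factor;  // low freq: scale down
  // Medium frequencies: blend between the scaled and unscaled values.
  const float smooth = (old_len / wavelen - lo) / (hi - lo);
  return (1.f - smooth) * inv_freq / factor + smooth * inv_freq;
}

int main() {
  std::printf("high freq: 0.1    -> %g (kept)\n", scale_inv_freq(0.1f));
  std::printf("low  freq: 0.0001 -> %g (/8)\n", scale_inv_freq(1e-4f));
}
```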

Full Changelog: v0.1.6...v0.1.7