We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?
Introducing Log-Linear Attention with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
Han Guo
3,449 posts
PhD Student @MIT_CSAIL | Past: @togethercompute @LTIatCMU @MITIBMLab @UNCNLP, @SFResearch, @BaiduResearch | Machine Learning, NLP.
Joined August 2016
- Introducing LQ-LoRA. Decomposing pretrained matrices into (fixed) quantized + (trainable) low-rank components enables more aggressive quantization. We can quantize LLaMA-2 70B to 2.5 bits with minimal degradation in instruction-tuning performance. arxiv.org/abs/2311.12023 🧵 1/n
- Introducing FLUTE, a CUDA kernel for non-uniformly quantized (via a lookup table) LLM Inference. It accelerates QLoRA's NormalFloat (NF) out of the box and more. As an application, we extended NF4 and are releasing quantized models for LLaMA-3 (8B/70B) and Gemma-2 (9B/27B).
- Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:
  - Transformer-PSM by @MorrisYau et al., and
  - Radial Attention by Xingyang and @lmxyy1999 et
- While I'm not at #EMNLP2022, we have two works on the intersection of RL + NLP. RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning (arxiv.org/abs/2205.12548) Efficient (Soft) Q-Learning for Text Generation with Limited Good Data (arxiv.org/abs/2106.07704)
- Glad to share our latest work "FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging"! Joint work with @nazneenrajani @peterbhase @mohitban47 @caimingxiong (@uncnlp @sfresearch). Paper: arxiv.org/abs/2012.15781 Code: github.com/salesforce/fas… 1/5
- Super excited to be among this cohort of amazing people! A huge thanks to @ericxing, @yoonrkim, @ZhitingHu, @mohitban47, and everyone who provided mentorship and advice!!
  At Microsoft Research, we aim to empower the next generation of computing related research talent. Today, we're thrilled to announce and congratulate this year's Microsoft Research PhD Fellowship recipients from around the world. Meet the 2022 recipients: aka.ms/phdfellowship
- Excited to share that I'll be joining @LTIatCMU as a PhD student this fall after three wonderful undergraduate years at @UNCNLP! Huge thanks to everyone who gave me mentorship and help along the way, especially my advisor Mohit @mohitban47 and collaborator Ram @ramakanth1729!
- I've had some chances recently to share what we've been working on. In doing so, I made a few basic background slides that explain `torch.matmul` from GPU/CUDA's point of view, why LLM decoding is memory bound, and how weight-only quantization could speed up decoding. Slides
- Happy to share that LQ-LoRA will appear at #ICLR2024. TLDR: using matrix decomposition to enable more aggressive quantization before LoRA fine-tuning.
  - Paper (updated): arxiv.org/abs/2311.12023
  - Code (with more artifacts uploaded such as models): github.com/HanGuo97/lq-lo…
- Excited to share our latest work with Bowen Tan @waterluffy Eric Xing @ZhitingHu! Tldr, a new NLG formulation from soft Q-learning perspective, with app. such as learning from noisy data, text attacks, prompt generation. Paper arxiv.org/abs/2106.07704 Code github.com/HanGuo97/soft-…
- Unfortunately, I won't be at #ICLR2023, but please check out our recent works on Machine Learning + Systems! 1. Federated Learning as Variational Inference iclr.cc/virtual/2023/p… 2. MPCFormer: Fast, Performant, and Private Transformer inference with MPC iclr.cc/virtual/2023/p…
- Replying to @HanGuo97: There has been much recent work on efficient alternatives with sub-quadratic compute and sub-linear memory, including linear attention, state-space models, and long convolution models. Despite their differences, many of these approaches can be captured by the following equation:
- Happy to share that our FastIF paper's been accepted at #EMNLP2021! Thanks to wonderful coauthors @nazneenrajani @peterbhase @mohitban47 @CaimingXiong @uncnlp @SFResearch @LTIatCMU Updated paper/code (w. more exps on ANLI/WILDS): arxiv.org/abs/2012.15781 github.com/salesforce/fas…
























