Quadratic attention has been indispensable for information-dense modalities such as language... until now.
Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried.
With @tri_dao 1/
Joined December 2018
- pretty sure one of these does not belong, but thanks TIME 🤔
  > TIME's new cover: The 100 most influential people in AI ti.me/4dQcJ1Q
- Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
  > Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
- excited to finally release Mamba-2!! 8x larger states, 50% faster training, and even more S's 🐍🐍 Mamba-2 aims to advance the theory of sequence models, developing a framework of connections between SSMs and (linear) attention that we call state space duality (SSD) w/ @tri_dao
- (1/n) Excited to release 2 preprints that describe our progress on sequence modeling for long-range dependencies! arxiv.org/abs/2110.13985 (NeurIPS ’21) arxiv.org/abs/2111.00396 We build a new class of state space models that improve perf. on the Long Range Arena by 20 points!
- I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
- a common belief is that Transformers scale well because of less inductive bias, when it actually does have specific inductive biases. we developed H-Nets not to fix tokenization, but because I think that dynamic chunking represents a fundamental primitive that captures a bias
  > A common takeaway from "the bitter lesson" is we don't need to put effort into encoding inductive biases, we just need compute. Nothing could be further from the truth! Better inductive biases mean better scaling exponents, which means exponential improvements with computation.
- I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think that the tradeoffs change when we start thinking about building
  > SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: arxiv.org/pdf/2510.14826 🧵
- immensely proud of the team for our best model yet. grateful to be able to work with such a strong team of researchers who are always curious and willing to explore the untrodden path
- SSMs go brrrr super excited to announce the first model powered by our latest research into efficient architectures 👀 stay tuned for more details soon!
  > Today, we’re excited to release the first step in our mission to build real time multimodal intelligence for every device: Sonic, a blazing fast (🚀 135ms model latency), lifelike generative voice model and API. Read cartesia.ai/blog/sonic and try Sonic play.cartesia.ai
- distillation.... mmm 🍻 state-of-the-art Mamba models with 1% of the compute, by leveraging pretrained Transformers! key insight: project the (quadratic) attention matrices onto (structured) SSM matrix mixers before end-to-end training led by students @avivbick @kevinyli_
  > Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!
- Cool demo and really nice blog post on H-Net inference: main-horse.github.io/posts/hnet-inf/ > On stage2_XL, this completely flipped. Instead of getting chunks every char, I was getting chunks after huge spans of repeats had been generated. This is a great demonstration of the power of