Albert Gu (@_albertgu) / X

Albert Gu

@_albertgu

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

Joined December 2018

21K Followers

Albert Gu
@_albertgu
Dec 4, 2023
Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
Albert Gu
@_albertgu
Sep 5, 2024
pretty sure one of these does not belong, but thanks TIME 🤔
TIME
@TIME
Sep 5, 2024
TIME's new cover: The 100 most influential people in AI ti.me/4dQcJ1Q
Albert Gu
@_albertgu
Jul 11, 2025
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Sukjun (June) Hwang
@sukjun_hwang
Jul 11, 2025
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
Albert Gu
@_albertgu
Jun 3, 2024
excited to finally release Mamba-2!! 8x larger states, 50% faster training, and even more S's 🐍🐍 Mamba-2 aims to advance the theory of sequence models, developing a framework of connections between SSMs and (linear) attention that we call state space duality (SSD) w/@tri_dao
Albert Gu
@_albertgu
Nov 3, 2021
(1/n) Excited to release 2 preprints that describe our progress on sequence modeling for long-range dependencies! arxiv.org/abs/2110.13985 (NeurIPS ‘21) arxiv.org/abs/2111.00396 We build a new class of state space models that improve perf. on the Long Range Arena by 20 points!
Albert Gu
@_albertgu
Jul 8, 2025
I converted one of my favorite talks I've given over the past year into a blog post: "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit). In a few days, we'll release what I believe is the next major advance for architectures.
Albert Gu
@_albertgu
Oct 3, 2022
A belated announcement: I'll be joining the Machine Learning Department at CMU @mldcmu as an assistant professor starting Fall 2023! In the meantime, I'll be at @DeepMind with @NandoDF's team, working from the MTV office. Looking forward to what's ahead!
Albert Gu
@_albertgu
Aug 9, 2025
a common belief is that Transformers scale well because of less inductive bias, when they actually do have specific inductive biases. we developed H-Nets not to fix tokenization, but because I think that dynamic chunking represents a fundamental primitive that captures a bias
Andrew Gordon Wilson
@andrewgwils
Aug 8, 2025
A common takeaway from "the bitter lesson" is we don't need to put effort into encoding inductive biases, we just need compute. Nothing could be further from the truth! Better inductive biases mean better scaling exponents, which means exponential improvements with computation.
Albert Gu
@_albertgu
Oct 20, 2025
I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think that the tradeoffs change when we start thinking about building
Eran Malach
@EranMalach
Oct 17, 2025
SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: arxiv.org/pdf/2510.14826 🧵
Albert Gu
@_albertgu
Dec 12, 2023
1/ With @tri_dao, we’re collaborating with @cartesia and @togethercomputer and we’re releasing a Mamba 3B model trained on 600B tokens on the SlimPajama dataset. Mamba scales well with data size, matching some of the strongest 3B Transformers out there.
Albert Gu
@_albertgu
Oct 28, 2025
immensely proud of the team for our best model yet. grateful to be able to work with such a strong team of researchers who are always curious and willing to explore the untrodden path
Karan Goel
Albert Gu
@_albertgu
May 29, 2024
SSMs go brrrr. super excited to announce the first model powered by our latest research into efficient architectures 👀 stay tuned for more details soon!
Cartesia
@cartesia
May 29, 2024
Today, we’re excited to release the first step in our mission to build real time multimodal intelligence for every device: Sonic, a blazing fast (🚀 135ms model latency), lifelike generative voice model and API. Read cartesia.ai/blog/sonic and try Sonic play.cartesia.ai
Albert Gu
@_albertgu
Aug 20, 2024
distillation.... mmm 🍻 state-of-the-art Mamba models with 1% of the compute, by leveraging pretrained Transformers! key insight: project the (quadratic) attention matrices onto (structured) SSM matrix mixers before end-to-end training. led by students @avivbick @kevinyli_
Kevin Li
@kevinyli_
Aug 20, 2024
Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!
Albert Gu
@_albertgu
Jul 15, 2025
Cool demo and really nice blog post on H-Net inference: main-horse.github.io/posts/hnet-inf/ "On stage2_XL, this completely flipped. Instead of getting chunks every char, I was getting chunks after huge spans of repeats had been generated." This is a great demonstration of the power of