Albert Gu
Cartesia
559 posts
@_albertgu
assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.
Joined December 2018
77 Following
21K Followers
  • Albert Gu
    Cartesia
    @_albertgu
    Dec 4, 2023
    Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
    804K
  • Albert Gu
    Cartesia
    @_albertgu
    Sep 5, 2024
    pretty sure one of these does not belong, but thanks TIME 🤔
    TIME
    @TIME
    Sep 5, 2024
    TIME's new cover: The 100 most influential people in AI ti.me/4dQcJ1Q
    273K
  • Albert Gu
    Cartesia
    @_albertgu
    Jul 11, 2025
    Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
    Sukjun (June) Hwang
    @sukjun_hwang
    Jul 11, 2025
    Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
    231K
  • Albert Gu
    Cartesia
    @_albertgu
    Jun 3, 2024
    excited to finally release Mamba-2!! 8x larger states, 50% faster training, and even more S's 🐍🐍 Mamba-2 aims to advance the theory of sequence models, developing a framework of connections between SSMs and (linear) attention that we call state space duality (SSD) w/@tri_dao
    120K
  • Albert Gu
    Cartesia
    @_albertgu
    Nov 3, 2021
    (1/n) Excited to release 2 preprints that describe our progress on sequence modeling for long-range dependencies! arxiv.org/abs/2110.13985 (NeurIPS ‘21) arxiv.org/abs/2111.00396 We build a new class of state space models that improve perf. on the Long Range Arena by 20 points!
  • Albert Gu
    Cartesia
    @_albertgu
    Jul 8, 2025
    I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
    120K
  • Albert Gu
    Cartesia
    @_albertgu
    Oct 3, 2022
    A belated announcement: I'll be joining the Machine Learning Department at CMU @mldcmu as an assistant professor starting Fall 2023! In the meantime, I'll be at @DeepMind with @NandoDF's team, working from the MTV office. Looking forward to what's ahead!
  • Albert Gu
    Cartesia
    @_albertgu
    Aug 9, 2025
    a common belief is that Transformers scale well because of less inductive bias, when it actually does have specific inductive biases. we developed H-Nets not to fix tokenization, but because I think that dynamic chunking represents a fundamental primitive that captures a bias
    Andrew Gordon Wilson
    @andrewgwils
    Aug 8, 2025
    A common takeaway from "the bitter lesson" is we don't need to put effort into encoding inductive biases, we just need compute. Nothing could be further from the truth! Better inductive biases mean better scaling exponents, which means exponential improvements with computation.
    72K
  • Albert Gu
    Cartesia
    @_albertgu
    Oct 20, 2025
    I really like this research direction! For a long time, I've been talking about the "brain vs. database" analogy of SSMs vs Transformers. An extension of this that I've mentioned offhand a few times is that I think that the tradeoffs change when we start thinking about building
    Eran Malach
    @EranMalach
    Oct 17, 2025
    SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: arxiv.org/pdf/2510.14826 🧵
    67K
  • Albert Gu
    Cartesia
    @_albertgu
    Dec 12, 2023
    1/ With @tri_dao, we’re collaborating with @cartesia and @togethercomputer and we’re releasing a Mamba 3B model trained on 600B tokens on the SlimPajama dataset. Mamba scales well with data size, matching some of the strongest 3B Transformers out there.
    68K
  • Albert Gu
    Cartesia
    @_albertgu
    Oct 28, 2025
    immensely proud of the team for our best model yet. grateful to be able to work with such a strong team of researchers who are always curious and willing to explore the untrodden path
    Karan Goel
    Cartesia
    50K
  • Albert Gu
    Cartesia
    @_albertgu
    May 29, 2024
    SSMs go brrrr super excited to announce the first model powered by our latest research into efficient architectures 👀 stay tuned for more details soon!
    Cartesia
    @cartesia
    May 29, 2024
    Today, we’re excited to release the first step in our mission to build real time multimodal intelligence for every device: Sonic, a blazing fast (🚀 135ms model latency), lifelike generative voice model and API. Read cartesia.ai/blog/sonic and try Sonic play.cartesia.ai
    43K
  • Albert Gu
    Cartesia
    @_albertgu
    Aug 20, 2024
    distillation.... mmm 🍻 state-of-the-art Mamba models with 1% of the compute, by leveraging pretrained Transformers! key insight: project the (quadratic) attention matrices onto (structured) SSM matrix mixers before end-to-end training led by students @avivbick @kevinyli_
    Kevin Li
    @kevinyli_
    Aug 20, 2024
    Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!
    35K
  • Albert Gu
    Cartesia
    @_albertgu
    Jul 15, 2025
    Cool demo and really nice blog post on H-Net inference: main-horse.github.io/posts/hnet-inf/ > On stage2_XL, this completely flipped. Instead of getting chunks every char, I was getting chunks after huge spans of repeats had been generated. This is a great demonstration of the power of
    This Post is from an account that no longer exists.
    31K
