Tri Dao
978 posts
@tri_dao
Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.
Stanford, CA
tridao.me
Joined May 2012
657 Following
41.7K Followers
  • Pinned
    Tri Dao
    @tri_dao
    Jul 11, 2024
    FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
    342K
  • Tri Dao
    @tri_dao
    Jul 17, 2023
    Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4x faster, and it is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
    903K
  • Tri Dao
    @tri_dao
    Jul 6, 2023
    Very excited to announce that I've finished my PhD @Stanford and will be joining @Princeton CS department as an Assistant Professor in Fall 2024. Looking forward to working with students and colleagues @PrincetonCS on ML & systems!
    296K
  • Tri Dao
    @tri_dao
    May 31, 2022
    Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
  • Tri Dao
    @tri_dao
    Dec 4, 2023
    Transformers power most advances in LLMs, but their core attention layer can’t scale to long context. With @_albertgu, we’re releasing Mamba, an SSM architecture that matches/beats Transformers in language modeling, yet with linear scaling and 5x higher inference throughput. 1/
    Albert Gu
    Cartesia
    @_albertgu
    Dec 4, 2023
    Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
    533K
  • Tri Dao
    @tri_dao
    Jun 6, 2025
    State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical
    Han Guo
    @HanGuo97
    Jun 6, 2025
    We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
    99K
  • Tri Dao
    @tri_dao
    Feb 20, 2025
    One way to tell that the AI-written kernel is wrong without even reading the code is that it's way too fast: ~1800 TFLOPS of FP32 on H100, 30x the theoretical max! If your verifier (correctness check) is even slightly wrong the model will reward-hack its way to crazy numbers
    main
    @main_horse
    Feb 20, 2025
    This example from their paper (pub.sakana.ai/static/paper.p…), which is claimed to have 150x speedup, is actually 3x slower if you bench it...
    205K
  • Tri Dao
    @tri_dao
    Jul 11, 2025
    They’ve finally done it. They got rid of tokenizers!
    Sukjun (June) Hwang
    @sukjun_hwang
    Jul 11, 2025
    Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
    56K
  • Tri Dao
    @tri_dao
    Jun 3, 2024
    With @_albertgu, we’ve built a rich theoretical framework of state-space duality, showing that many linear attn variants and SSMs are equivalent! The resulting model, Mamba-2 is better & faster than Mamba-1, and still matching strong Transformer arch on language modeling. 1/
    81K
  • Tri Dao
    @tri_dao
    Oct 13, 2023
    Announcing Flash-Decoding, to make long-context LLM inference up to 8x faster! Great collab with @d_haziza, @fvsmassa and Grigory Sizov. Main idea: load the KV cache in parallel as fast as possible, then separately rescale to combine the results. 1/7
    125K
  • Tri Dao
    @tri_dao
    Nov 29, 2022
    We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀! This trains 3-5x faster than the Huggingface version, reaching up to 189 TFLOPs/sec per A100, 60.6% (model) FLOPs util of the theoretical maximum. 1/6 github.com/HazyResearch/f…
  • Tri Dao
    @tri_dao
    Feb 18, 2025
    ML algorithm design + systems optimization is the way!
    DeepSeek
    @deepseek_ai
    Feb 18, 2025
    🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With
    45K
  • Tri Dao
    @tri_dao
    Feb 24, 2025
    Love that DeepSeek is building on FlashAttention-3 code, this is why OSS can move so fast ❤️ FA3 recently enabled MLA as well, thanks to my student @tedzadouri. If you want MLA prefill & decode with full features (arbitrary page size, sliding window, rotary...), check out FA3!
    DeepSeek
    @deepseek_ai
    Feb 24, 2025
    🚀 Day 1 of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production. ✅ BF16 support ✅ Paged KV cache (block size 64) ⚡ 3000 GB/s memory-bound & 580 TFLOPS
    61K
  • Tri Dao
    @tri_dao
    Jun 27, 2025
    Crazy that we now have an open source model with 13B params that’s competitive w o1. And Mamba layers help bring much higher inference throughput
    Tencent Hy
    @TencentHunyuan
    Jun 27, 2025
    🚀 Introducing Hunyuan-A13B, our latest open-source LLM. As an MoE model, it leverages 80B total parameters with just 13B active, delivering powerful performance that scores on par with o1 and DeepSeek across multiple mainstream benchmarks. Hunyuan-A13B features a hybrid
    75K
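
The Flash-Decoding post in the timeline above (Oct 13, 2023) describes loading the KV cache in parallel and then rescaling to combine the results. The NumPy sketch below is one rough way to illustrate that idea for a single query vector and a single head. The function names, the `num_splits` parameter, and the sequential loop standing in for parallel work are all illustrative assumptions; this is not the released kernel.

```python
import numpy as np

def attention_reference(q, K, V):
    """Standard softmax attention for one query vector q against keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def partial_attention(q, K_chunk, V_chunk):
    """Attention over one chunk of the KV cache.

    Returns the chunk-local output and the log-sum-exp of the chunk's scores,
    which is what the combine step needs for rescaling.
    """
    scores = K_chunk @ q / np.sqrt(q.shape[0])
    m = scores.max()
    w = np.exp(scores - m)
    s = w.sum()
    return (w @ V_chunk) / s, m + np.log(s)

def split_kv_attention(q, K, V, num_splits=4):
    """Split the KV cache into chunks, attend to each, then rescale and combine.

    Assumes num_splits <= K.shape[0] so that no chunk is empty. On a GPU the
    chunks would be processed in parallel; here they are handled one by one.
    """
    K_chunks = np.array_split(K, num_splits)
    V_chunks = np.array_split(V, num_splits)
    parts = [partial_attention(q, Kc, Vc) for Kc, Vc in zip(K_chunks, V_chunks)]
    outs = np.stack([out for out, _ in parts])
    lses = np.array([lse for _, lse in parts])
    # Each chunk's share of the global softmax mass is exp(lse_i) / sum_j exp(lse_j).
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()
    return weights @ outs
```

Because each chunk carries its own log-sum-exp, the combine step reproduces exact softmax attention rather than an approximation, so the result should match `attention_reference` up to floating-point error.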