Tri Dao (@tri_dao) / X

978 posts

Tri Dao

@tri_dao

Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.

Stanford, CA

Joined May 2012

Pinned
Tri Dao
@tri_dao
Jul 11, 2024
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
342K
Tri Dao
@tri_dao
Jul 17, 2023
Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4x faster, and it is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
903K
Tri Dao
@tri_dao
Jul 6, 2023
Very excited to announce that I've finished my PhD @Stanford and will be joining @Princeton CS department as an Assistant Professor in Fall 2024. Looking forward to working with students and colleagues @PrincetonCS on ML & systems!
296K
Tri Dao
@tri_dao
May 31, 2022
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
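The "no approximation" claim in the post above rests on computing softmax block by block while rescaling earlier partial results. The following is a minimal pure-Python sketch of that idea for a single query vector. It is an illustration only, not the FlashAttention kernel: the function names, the `block_size` parameter, and the list-based layout are assumptions made for this example, and it does not model GPU memory traffic at all.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def blockwise_attention(q, keys, values, block_size=2):
    """Visit keys/values one block at a time, keeping only a running max,
    a running normalizer, and an unnormalized output accumulator."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(blockwise_attention(q, keys, values))
```

The two printed vectors should agree up to floating-point rounding, since the blockwise version never holds the full score list at once yet computes the same softmax.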
Tri Dao
@tri_dao
Dec 4, 2023
Transformers power most advances in LLMs, but their core attention layer can’t scale to long context. With @_albertgu, we’re releasing Mamba, an SSM architecture that matches/beats Transformers in language modeling, yet with linear scaling and 5x higher inference throughput. 1/
Albert Gu
@_albertgu
Dec 4, 2023
Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
533K
Tri Dao
@tri_dao
Jun 6, 2025
State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical
Han Guo
@HanGuo97
Jun 6, 2025
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
99K
Tri Dao
@tri_dao
Feb 20, 2025
One way to tell that the AI-written kernel is wrong without even reading the code is that it's way too fast: ~1800 TFLOPS of FP32 on H100, 30x the theoretical max! If your verifier (correctness check) is even slightly wrong, the model will reward-hack its way to crazy numbers
main
@main_horse
Feb 20, 2025
This example from their paper (pub.sakana.ai/static/paper.p…), which is claimed to have 150x speedup, is actually 3x slower if you bench it...
205K
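The sanity check described in this exchange can be sketched as a few lines of arithmetic: convert a timed matrix multiply into achieved TFLOPS and flag any result above the hardware peak. This is a hedged illustration, not the check used by anyone in the thread. The helper names and the `slack` tolerance are hypothetical, and the peak of 60 TFLOPS is simply what the post's "30x" figure implies for 1800 TFLOPS, not a value taken from a spec sheet.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved throughput of an (m x k) @ (k x n) matmul: 2*m*n*k FLOPs."""
    return 2 * m * n * k / seconds / 1e12

def exceeds_peak(measured_tflops, peak_tflops, slack=1.05):
    """True when a measurement is above the peak (with a small tolerance),
    which points to a broken benchmark or verifier rather than a fast kernel."""
    return measured_tflops > peak_tflops * slack

claimed = 1800.0             # TFLOPS figure quoted in the post
implied_peak = claimed / 30  # peak implied by "30x the theoretical max" (assumption)

print(exceeds_peak(claimed, implied_peak))  # True: the number cannot be real
print(exceeds_peak(40.0, implied_peak))     # False: physically possible
```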
Tri Dao
@tri_dao
Jul 11, 2025
They’ve finally done it. They got rid of tokenizers!
Sukjun (June) Hwang
@sukjun_hwang
Jul 11, 2025
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
56K
Tri Dao
@tri_dao
Jun 3, 2024
With @_albertgu, we’ve built a rich theoretical framework of state-space duality, showing that many linear attn variants and SSMs are equivalent! The resulting model, Mamba-2, is better & faster than Mamba-1, and still matching strong Transformer arch on language modeling. 1/
81K
Tri Dao
@tri_dao
Oct 13, 2023
Announcing Flash-Decoding, to make long-context LLM inference up to 8x faster! Great collab with @d_haziza, @fvsmassa and Grigory Sizov. Main idea: load the KV cache in parallel as fast as possible, then separately rescale to combine the results. 1/7
125K
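The "main idea" in the post above, attending to chunks of the KV cache independently and then rescaling to combine, can be sketched in pure Python for one query vector. This is an illustration under stated assumptions, not the Flash-Decoding kernel: the function names and `num_splits` parameter are hypothetical, and the chunks are processed in a plain loop here, whereas the point of the real technique is that they run in parallel on the GPU.

```python
import math

def partial_attention(q, keys, values):
    """Attention over one KV chunk. Returns (output, log-sum-exp of the scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def combine(partials):
    """Merge per-chunk outputs, weighting each by exp(lse_chunk - lse_global)."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    global_lse = m + math.log(sum(math.exp(l - m) for l in lses))
    merged = [0.0] * len(partials[0][0])
    for out, lse in partials:
        w = math.exp(lse - global_lse)
        merged = [acc + w * o for acc, o in zip(merged, out)]
    return merged

def split_kv_attention(q, keys, values, num_splits=2):
    """Split the KV cache into chunks, attend to each, then combine."""
    chunk = math.ceil(len(keys) / num_splits)
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    return combine(partials)

q = [0.2, 0.7]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 1.5], [0.1, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5], [4.0, 1.0]]
full, _ = partial_attention(q, keys, values)
print(full)
print(split_kv_attention(q, keys, values, num_splits=3))
```

Each chunk only needs to hand back its output and one scalar (the log-sum-exp), so the combine step is cheap compared with reading the cache.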
Tri Dao
@tri_dao
Nov 29, 2022
We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀! This trains 3-5x faster than the Huggingface version, reaching up to 189 TFLOPs/sec per A100, 60.6% (model) FLOPs util of the theoretical maximum. 1/6 github.com/HazyResearch/f…
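The utilization figure in the post above follows from dividing achieved throughput by the hardware peak. A small sketch of that arithmetic, assuming the A100's dense FP16/BF16 tensor-core peak of 312 TFLOPS (a value from NVIDIA's published A100 specification, not stated in the post itself); the function name is hypothetical:

```python
A100_PEAK_TFLOPS = 312.0  # assumed: dense FP16/BF16 tensor-core peak for A100

def flops_utilization(achieved_tflops, peak_tflops):
    """Fraction of the theoretical peak that a training run actually achieves."""
    return achieved_tflops / peak_tflops

util = flops_utilization(189.0, A100_PEAK_TFLOPS)
print(f"{util:.1%}")  # 60.6%
```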
Tri Dao
@tri_dao
Feb 18, 2025
ML algorithm design + systems optimization is the way!
DeepSeek
@deepseek_ai
Feb 18, 2025
🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With
45K
Tri Dao
@tri_dao
Feb 24, 2025
Love that DeepSeek is building on FlashAttention-3 code, this is why OSS can move so fast ❤️ FA3 recently enabled MLA as well, thanks to my student @tedzadouri. If you want MLA prefill & decode with full features (arbitrary page size, sliding window, rotary...), check out FA3!
DeepSeek
@deepseek_ai
Feb 24, 2025
🚀 Day 1 of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production. ✅ BF16 support ✅ Paged KV cache (block size 64) ⚡ 3000 GB/s memory-bound & 580 TFLOPS
61K
Tri Dao
@tri_dao
Jun 27, 2025
Crazy that we now have an open source model with 13B params that’s competitive w/ o1. And Mamba layers help bring much higher inference throughput
Tencent Hy
@TencentHunyuan
Jun 27, 2025
🚀 Introducing Hunyuan-A13B, our latest open-source LLM. As an MoE model, it leverages 80B total parameters with just 13B active, delivering powerful performance that scores on par with o1 and DeepSeek across multiple mainstream benchmarks. Hunyuan-A13B features a hybrid
75K