Songlin Yang
2,456 posts
@SonglinYang4
pretraining @thinkymachines. Prev. PhD @MIT_CSAIL. she/her/hers. INTP 🐱
San Francisco, CA
sustcsonglin.github.io
Joined January 2021
3,372
Following
17.5K
Followers
  • Songlin Yang
    @SonglinYang4
    Jun 4, 2025
    lol my research direction is fun-based architectures
    stochasm
    Arcee.ai
    @stochasticchasm
    Jun 4, 2025
    When you know it’s gonna be an interesting arch paper
    170K
  • Songlin Yang
    @SonglinYang4
    Sep 10, 2025
    science is better when shared ❤️
    Thinking Machines
    @thinkymachines
    Sep 10, 2025
    Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
    116K
  • Songlin Yang
    @SonglinYang4
    Nov 8, 2025
    Hi @JeffDean, what’s the plan for releasing the code for this line of work? None of these papers so far seem to have released any code
    Jeff Dean
    @JeffDean
    Nov 7, 2025
    An exciting new approach for doing continual learning, using nested optimization for enhancing long context processing.
    250K
  • Songlin Yang
    @SonglinYang4
    Jan 18, 2025
    I've created slides for those curious about the recent rapid progress in linear attention: from linear attention to Lightning-Attention, Mamba2, DeltaNet, and TTT/Titans. Check it out here: sustcsonglin.github.io/assets/pdf/tal…
    107K
  • Songlin Yang
    @SonglinYang4
    Jun 11, 2025
    Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
    GitHub - fla-org/flash-linear-attention: 🚀 Efficient implementations for emerging model architec...
    From github.com
    87K
  • Songlin Yang
    @SonglinYang4
    Feb 21, 2025
    Introducing the first open-source implementation of native sparse attention: github.com/fla-org/native…. Give it a spin and cook your NSA model! 🐳🐳🐳
    GitHub - fla-org/native-sparse-attention: 🐳 Efficient Triton implementations for "Native Sparse...
    From github.com
    72K
  • Songlin Yang
    @SonglinYang4
    Jun 6, 2025
    Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
    Han Guo
    @HanGuo97
    Jun 6, 2025
    We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
    46K
  • Songlin Yang
    @SonglinYang4
    May 24, 2025
    📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks
    arxiv.org
    PaTH Attention: Position Encoding via Accumulating Householder...
    The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
    77K
  • Songlin Yang
    @SonglinYang4
    Sep 11, 2025
    Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron
    Qwen
    @Alibaba_Qwen
    Sep 11, 2025
    🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B. (esp. @ 32K+ context!) 🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &
    86K
  • Songlin Yang
    @SonglinYang4
    Sep 9, 2025
    hybrid is the future:)
    Joey (e/λ)
    @shxf0072
    Sep 9, 2025
    Qwen3-Next is hybrid GatedAttention (for outliers fix) GatedDelta net rnn for kv saving all new models will be either sink+swa hybrids like gpt oss or gated attn + linear rnn hybrids (mamba, gated deltanet etc) like qwen3-next age of pure attn for timemixing layer is over,
    77K
  • Songlin Yang
    @SonglinYang4
    Oct 30, 2025
    Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually
    62K
  • Songlin Yang
    @SonglinYang4
    Sep 26, 2025
    math matters in scaling!
    Thinking Machines
    @thinkymachines
    Sep 26, 2025
    Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
    63K
  • Songlin Yang
    @SonglinYang4
    Nov 8, 2025
    where is appendix and arxiv 😂
    Google Research
    Google
    @GoogleResearch
    Nov 7, 2025
    Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: goo.gle/47LJrzI
    83K
  • Songlin Yang
    @SonglinYang4
    Jun 13, 2025
    How can models learn to generate weight updates in token space?
    Jyo Pari
    @jyo_pari
    Jun 13, 2025
    What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
    51K
