Songlin Yang (@SonglinYang4) / X

Songlin Yang

2,456 posts

@SonglinYang4

pretraining @thinkymachines. Prev. PhD @MIT_CSAIL. she/her/hers. INTP 🐱

San Francisco, CA

sustcsonglin.github.io

Joined January 2021

Songlin Yang
@SonglinYang4
Jun 4, 2025
lol my research direction is fun-based architectures
stochasm
@stochasticchasm
Jun 4, 2025
When you know it’s gonna be an interesting arch paper
Songlin Yang
@SonglinYang4
Sep 10, 2025
science is better when shared ❤️
Thinking Machines
@thinkymachines
Sep 10, 2025
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”. We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Songlin Yang
@SonglinYang4
Nov 8, 2025
Hi @JeffDean, what’s the plan for releasing the code for this line of work? None of these papers so far seem to have released any code
Jeff Dean
@JeffDean
Nov 7, 2025
An exciting new approach for doing continual learning, using nested optimization for enhancing long context processing.
Songlin Yang
@SonglinYang4
Jan 18, 2025
I've created slides for those curious about the recent rapid progress in linear attention: from linear attention to Lightning-Attention, Mamba2, DeltaNet, and TTT/Titans. Check it out here: sustcsonglin.github.io/assets/pdf/tal…
Songlin Yang
@SonglinYang4
Jun 11, 2025
Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
GitHub - fla-org/flash-linear-attention: 🚀 Efficient implementations for emerging model architec...
Songlin Yang
@SonglinYang4
Feb 21, 2025
Introducing the first open-source implementation of native sparse attention: github.com/fla-org/native…. Give it a spin and cook your NSA model! 🐳🐳🐳
GitHub - fla-org/native-sparse-attention: 🐳 Efficient Triton implementations for "Native Sparse...
Songlin Yang
@SonglinYang4
Jun 6, 2025
Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
Han Guo
@HanGuo97
Jun 6, 2025
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
Songlin Yang
@SonglinYang4
May 24, 2025
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks
arxiv.org
PaTH Attention: Position Encoding via Accumulating Householder...
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
Songlin Yang
@SonglinYang4
Sep 11, 2025
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
Qwen
@Alibaba_Qwen
Sep 11, 2025
🚀 Introducing Qwen3-Next-80B-A3B: the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. @ 32K+ context!) 🔹 Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &
Songlin Yang
@SonglinYang4
Sep 9, 2025
hybrid is the future:)
Joey (e/λ)
@shxf0072
Sep 9, 2025
Qwen3-Next is a hybrid: GatedAttention (for the outliers fix) plus a Gated DeltaNet RNN for KV saving. All new models will be either sink+SWA hybrids like gpt-oss or gated attn + linear RNN hybrids (Mamba, Gated DeltaNet, etc.) like Qwen3-Next. The age of pure attn for the time-mixing layer is over,
Songlin Yang
@SonglinYang4
Oct 30, 2025
Many people are confused by MiniMax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually
Songlin Yang
@SonglinYang4
Sep 26, 2025
math matters in scaling!
Thinking Machines
@thinkymachines
Sep 26, 2025
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
Songlin Yang
@SonglinYang4
Nov 8, 2025
where is appendix and arxiv 😂
Google Research
@GoogleResearch
Nov 7, 2025
Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: goo.gle/47LJrzI
Songlin Yang
@SonglinYang4
Jun 13, 2025
How can models learn to generate weight updates in token space?
Jyo Pari
@jyo_pari
Jun 13, 2025
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.