i'm not "cracked"
i'm structurally solid. this egg is tough. my shell won't break. it's my armor
- Single thread to learn CUDA? Seems inefficient…
- >be me >scaling law supervisor >in charge of making sure the models do, in fact, scale >occasionally have to burn a billion dollars to check if the scaling laws still hold >one day i go to work and the benchmarks are no longer scaling >distress.jpg >ask my boss what to do
- Pass@8192 is a crazy metric
  We believe formal math is the future. 🔥 Introducing Kimina-Prover Preview, a Numina & @Kimi_Moonshot collaboration, the first large formal reasoning model for Lean 4, achieving 80.78% miniF2F. github.com/MoonshotAI/Kim…
- Another win for physics of language models (part 3.3)
  Metadata Conditioning Accelerates Language Model Pre-training "MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally
- Replying to @basedjensen
  Actually one day you won’t be able to rinse and repeat, crazy to think about
- Replying to @stochasticchasm
  >he says “just scale up the model again” >i say “how” >he says “i don’t know, you’re the supervisor” >rage.jpg >quit my job >become a neurosymbolic model supervisor >first day on the job, check the scaling plots >it scales
- You can just train things
  Pack it in boys
- Why is MCP stuff all over my timeline all of a sudden
- first blog post! around 2000 words, link in replies. first time writing something like this
- Replying to @jxmnop
  Well I feel like it’s understandable by the fact that you get more training signal from matching a probability distribution than matching a one-hot vector: less zero outputs means less zero gradients and you’ll get more training signal. You’re kinda using the large model to get