stochasm (@stochasticchasm) / X

10.9K posts

stochasm

@stochasticchasm

pretraining lead @arcee_ai • 25 • opinions my own

🌖

Joined August 2024

stochasm
@stochasticchasm
Mar 2, 2025
kalomaze
@kalomaze
Mar 2, 2025
i'm not "cracked" i'm structurally solid. this egg is tough. my shell won't break. it's my armor
61K
stochasm
@stochasticchasm
Oct 31, 2025
Replying to @scaling01
Damn he got the modded 4090s
250K
stochasm
@stochasticchasm
May 9, 2025
Single thread to learn CUDA? Seems inefficient…
16K
stochasm
@stochasticchasm
Dec 2, 2024
>be me
>scaling law supervisor
>in charge of making sure the models do, in fact, scale
>occasionally have to burn a billion dollars to check if the scaling laws still hold
>one day i go to work and the benchmarks are no longer scaling
>distress.jpg
>ask my boss what to do
18K
stochasm
@stochasticchasm
Apr 14, 2025
Pass@8192 is a crazy metric
Jia Li
@JiaLi52524397
Apr 14, 2025
We believe formal math is the future. 🔥Introducing Kimina-Prover Preview, a Numina & @Kimi_Moonshot collaboration, the first large formal reasoning model for Lean 4, achieving 80.78% miniF2F. github.com/MoonshotAI/Kim…
41K
stochasm
@stochasticchasm
Jan 6, 2025
Another win for physics of language models (part 3.3)
Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
Jan 6, 2025
Metadata Conditioning Accelerates Language Model Pre-training "MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally
27K
stochasm
@stochasticchasm
Jun 4, 2025
When you know it’s gonna be an interesting arch paper
192K
stochasm
@stochasticchasm
Dec 20, 2024
Replying to @basedjensen
Actually one day you won’t be able to rinse and repeat, crazy to think about
30K
stochasm
@stochasticchasm
Dec 2, 2024
Replying to @stochasticchasm
>he says “just scale up the model again”
>i say “how”
>he says “i don’t know, you’re the supervisor”
>rage.jpg
>quit my job
>become a neurosymbolic model supervisor
>first day on the job, check the scaling plots
>it scales
1.9K
stochasm
@stochasticchasm
Oct 21, 2025
You can just train things
Rota 🚪🧎‍♂️
@pli_cachete
Oct 20, 2025
Pack it in boys
27K
stochasm
@stochasticchasm
Mar 6, 2025
Why is MCP stuff all over my timeline all of a sudden
18K
stochasm
@stochasticchasm
Mar 19, 2025
They totally didn't compile it
27K
stochasm
@stochasticchasm
Nov 15, 2024
first blog post! around 2000 words, link in replies. first time writing something like this
24K
stochasm
@stochasticchasm
Jan 10, 2025
Replying to @jxmnop
Well I feel like it’s understandable by the fact that you get more training signal from matching a probability distribution than matching a one-hot vector: less zero outputs means less zero gradients and you’ll get more training signal. You’re kinda using the large model to get
21K
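
A rough sketch of the point in the reply above, not the author's code: with a one-hot target only the labeled class contributes a term to the cross-entropy loss, while a teacher's full probability distribution gives every class a nonzero term. The logits are made-up values and the helper names (`softmax`, `cross_entropy_terms`) are hypothetical. It only illustrates how sparse the target is, not a full distillation setup.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_terms(target, logits):
    """Per-class terms -t_i * log(p_i) of the cross-entropy loss."""
    probs = softmax(logits)
    return [-t * math.log(p) if t > 0 else 0.0 for t, p in zip(target, probs)]

# Assumed, illustrative values (not taken from any real model).
student_logits = [2.0, 1.0, 0.5, -1.0]
teacher_logits = [3.0, 2.5, 0.0, -2.0]

one_hot = [1.0, 0.0, 0.0, 0.0]   # hard label: class 0
soft = softmax(teacher_logits)   # teacher distribution as the target

hard_terms = cross_entropy_terms(one_hot, student_logits)
soft_terms = cross_entropy_terms(soft, student_logits)

# Count how many classes contribute a nonzero term to each loss.
hard_active = sum(1 for t in hard_terms if t > 0)
soft_active = sum(1 for t in soft_terms if t > 0)
print(hard_active, soft_active)
```

With these values the hard-label loss has one contributing class and the soft-label loss has all four, which is one way to read "more training signal" per example.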