Import AI

Import AI 403: Factorio AI; Russia’s reasoning drones; biocomputing

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Import AI reader giveaway! Upcoming event: A conversation with Tyler Cowen:
I’ll be hosting a chat with Tyler Cowen on the evening of Friday March 28 in San Francisco. We’ll be talking about AI, economics, and weird futures. This is an experiment – with Import AI turning nine years old this year I thought it’d be fun to branch out into the physical world. I have a few tickets spare I’d like to give to Import AI readers – if you’d like to come along, please register your interest using the form below and we’ll come back to you if we’re able to confirm your spot.
Register your interest (Google Form).

***

Want to use LLMs for legal work in Switzerland? We’ve got a benchmark for you:
…SwiLTra is a symptom of the diffusion of AI into the worldwide economy…
Are you a legal practitioner in Switzerland? Do you want to know how well AI systems perform in your unique context where you need to do parallel translations in German, French, Italian, and (sometimes) Romansh? Yes? Well do I have a dataset and set of results for you!
Researchers with Harvey, ETH Zurich, Swiss Federal Supreme Court, University of Zurich, University of Basel, University of Geneva, University of Lausanne, Canton of Solothurn, and the Max Planck Institute for Research on Collective Goods have built SwiLTra-Bench, “a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems”.

SwiLTra-Bench contents: Swiss Law Translations, including entire legal documents, individual articles, and individual paragraphs, as well as headnote translations of Swiss Supreme Court landmark decisions across German, French, and Italian, and Swiss Supreme Court press release translations. You can use the dataset to test out how well language models perform in this context.
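For a concrete sense of what ‘testing’ means here, below is a minimal sketch of scoring model translations against reference translations using chrF via the sacrebleu library – the metric choice and the example sentences are illustrative assumptions, not necessarily the paper’s exact evaluation setup.

```python
# Minimal sketch: score candidate legal translations against references with
# chrF via sacrebleu. The example sentences and the choice of metric are
# illustrative assumptions, not the SwiLTra-Bench authors' exact setup.
import sacrebleu

def score_translations(candidates, references):
    """Return a corpus-level chrF score for a list of candidate translations."""
    # sacrebleu expects a list of candidate strings and a list of reference
    # streams (one inner list per reference set).
    return sacrebleu.corpus_chrf(candidates, [references]).score

if __name__ == "__main__":
    candidates = ["The contract is void.", "The appeal is dismissed."]
    references = ["The contract is null and void.", "The appeal is dismissed."]
    print(f"chrF: {score_translations(candidates, references):.1f}")
```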

Results: Generally, the proprietary AI models outperform other models – including open ones finetuned on this dataset. Overall, “both for translating laws and headnotes Claude 3.5 Sonnet is the best model followed by o1 for laws and both o1 and the finetuned Qwen2.5-32B model for headnotes.”

Why this matters – SwiLTra is a symptom of the diffusion of AI: Datasets like this highlight how AI is being used globally for an ever-broadening range of tasks. The existence of SwiLTra is implicitly a ‘demand signal’ for utilizing generative models for legal workloads in a Swiss context.
Read more: SwiLTra-Bench: The Swiss Legal Translation Benchmark (arXiv).
Get the dataset here: SwissLegalTranslations (JoelNiklaus, GitHub).

***

MIT researchers make a better math benchmark:
…Stop using GSM8K and start using GSM8K-Platinum…
Often, AI systems go through a period of rapid improvement on a benchmark and then performance asymptotes – sometimes people use this to claim AI systems have hit a kind of ceiling, but often performance has leveled off because it has run into the noise limit of the benchmark itself. A famous case here is ImageNet – no system has hit 100% because a certain amount of ambiguity in ImageNet scoring prevents this, and some labels are misleading to the point of being wrong (e.g., in a picture of a mirror reflecting a bunch of objects including a small banana, the intuitively correct answer “mirror” might be marked wrong because the image is labeled “banana”).
To that end, MIT researchers have released GSM8K-Platinum, a debugged version of the popular math benchmark GSM8K. They built GSM8K-Platinum by running a bunch of frontier LLMs on it then looking at where they disagreed with the stated answer. This led to 219 flagged questions: “of which 110 were removed, 99 were verified, and 10 had mislabeled answers that were corrected.”
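As a rough illustration of the flagging approach (not the authors’ actual code), here’s a sketch where a question gets flagged when several frontier models converge on an answer that disagrees with the dataset label; in the paper, flagged items were then reviewed and either removed, verified, or corrected.

```python
# Minimal sketch of the flagging idea behind GSM8K-Platinum: run several
# strong models over each question and flag items where the models agree with
# one another but disagree with the dataset's labeled answer. The data
# structures here are illustrative, not the authors' actual pipeline.
from collections import Counter

def flag_suspect_items(items):
    """items: list of dicts with 'label' (dataset answer) and
    'model_answers' (one parsed numeric answer per model)."""
    flagged = []
    for i, item in enumerate(items):
        counts = Counter(item["model_answers"])
        consensus, votes = counts.most_common(1)[0]
        # Flag when nearly all models converge on an answer that isn't the label.
        if votes >= len(item["model_answers"]) - 1 and consensus != item["label"]:
            flagged.append((i, item["label"], consensus))
    return flagged

if __name__ == "__main__":
    demo = [
        {"label": 42, "model_answers": [42, 42, 42]},   # fine
        {"label": 17, "model_answers": [19, 19, 19]},   # likely mislabeled
    ]
    print(flag_suspect_items(demo))  # -> [(1, 17, 19)]
```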

A more trustworthy benchmark: GSM8K-Platinum seems to more accurately measure the math competency of LLMs:

  • “For example, both Claude 3.7 Sonnet (extended thinking) and Llama 405B showed identical error counts of 45 each on GSM8K. This seems quite strange–after all, Claude 3.7 Sonnet (extended thinking) came out almost a year after Llama 405B, was trained explicitly for better mathematical reasoning, and significantly outperforms Llama 405B on other math benchmarks like MATH. On GSM8K-Platinum, however, Claude 3.7 Sonnet (extended thinking) shows only 2 errors compared to Llama 405B’s 17 errors. Llama 405B makes 8 times as many errors, but this performance difference was obscured in the original benchmark due to noise.”

Why this matters – unglamorous but necessary work is how progress happens: Where are we? It’s a very important question and the work of getting to the right answer is always hard. Work like GSM8K-Platinum is laudable but still seems to be somewhat ‘low status’ in the AI research community. I hope by highlighting GSM8K-Platinum here I do my own small part in making stuff like this ‘high status’ – it’s incredibly valuable!
Read more: GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs (Gradient Science).
Get the dataset here: GSM8K-Platinum (HuggingFace).

***

The Factorio Learning Environment is a benchmark that lets LLMs cosplay their own singularity:
…Finally, a test for AI systems that many AI researchers have a bone-deep understanding of…
Factorio is a game where you crash-land on an alien planet and need to build your way up through the tech tree to launch a spaceship off the planet. It’s a game that is beloved by programmers because to get really good at Factorio is to relentlessly optimize an ever more complicated system. Many people that work at AI companies play Factorio to relax after a long, hard day of grappling with the fiendishly complicated business of training AI models.
Now, a couple of independent researchers as well as one from Anthropic have built the ‘Factorio Learning Environment’ (FLE), a way to test out how well AI models can carry out the complex plate-spinning task that is playing Factorio. FLE provides “exponentially scaling challenges – from basic automation to complex factories processing millions of resource units per second”, they write.

FLE has two variants:

  • Lab play: 24 structured tasks with fixed resources. “We task agents to build production lines of 24 distinct target entities of increasing complexity, starting from a single resource mine requiring at most 2 machines (making iron-ore) to a late game entity requiring the coordination of close to 100 machines (making utility-science-pack).”

  • Open play: “Agents are tasked with producing the largest possible factory, whose performance is measured through production throughput, which ranges from early-game rates of ∼30 resources/minute to advanced systems processing millions of resources/second. This enables us to meaningfully differentiate agents by measuring the order of magnitude of resources that they can produce, avoiding saturation by agents even as models become dramatically more capable.”

How AI systems use FLE: We’re not testing visual understanding here – rather, agents interact with the game via an API. “Agents interact with the FLE via synthesizing Python programs to alter and observe the game state, using the tools included in the environment in a Read-Eval-Print Loop (REPL),” they write.
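Here’s a toy sketch of what that REPL-style loop looks like in practice – the agent emits a small Python program, the environment runs it and returns any printed output or errors, and that feedback goes back into the agent’s context. The stub classes and the `inspect_inventory` tool name are placeholders, not FLE’s real API.

```python
# Hypothetical sketch of a REPL-style agent loop in the spirit of FLE: the
# model emits a short Python program, the environment executes it and returns
# whatever was printed (plus errors), and that feedback is appended to the
# agent's context. DummyAgent/DummyEnv are stand-ins, not FLE's real tools.
import contextlib
import io

class DummyEnv:
    def execute(self, program: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                # Expose a single fake "tool" to the program being executed.
                exec(program, {"inspect_inventory": lambda: {"iron-ore": 10}})
        except Exception as e:
            return f"Error: {e}"
        return buf.getvalue() or "(no output)"

class DummyAgent:
    def act(self, history) -> str:
        # A real agent would be an LLM conditioning on the full REPL history.
        return "print(inspect_inventory())"

env, agent, history = DummyEnv(), DummyAgent(), []
for _ in range(3):
    program = agent.act(history)
    observation = env.execute(program)
    history.append((program, observation))
print(history[-1])
```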

Like many good benchmarks, FLE is reassuringly hard (for now): “Claude-3.5-Sonnet (the strongest performing model) only completes 7/24 tasks and shows limitations in spatial planning in more complex objectives, demonstrating large head-room for performance,” the researchers write. Generally speaking, reasoning models do better than non-reasoning models. And when it comes to open play, models can do well up to a point, then they reach a certain level of complexity and struggle to make progress or deal with bugs. “The limitations we observed in spatial reasoning, long-term planning, and intelligent error correction highlight gaps in capabilities of foundation language models in novel environments,” they write.

Common pitfalls: “Agents lack spatial reasoning and are unable to iteratively improve on factories. A key characteristic for success in open-play and lab-play involves iteratively combining multiple factory sections to create complex production lines,” the authors write. “Anecdotally, the agents were not proficient at debugging complex environments. For instance, when debugging non-working structures or factories where the throughput was not at expected levels, agents often focused on whether all singular entities were working but did not investigate whether the topology of the whole structure was correct.”

Why this matters – the singularity requires tech tree bootstrapping: Many of the most ambitious or frightening visions of future AI involve it rapidly going ‘up the tech tree’ to develop more and more scientific advances which help it bootstrap itself. Core to doing this is the ability to stand up an increasingly sophisticated multi-resource manufacturing and logistics system, which is exactly what Factorio tests for. Perhaps the FLE can be a fun proxy measure for the singularity prerequisites of our systems?
Read more and get the environment: Factorio Learning Environment (GitHub).
Check out the leaderboard here (JackHopkins, GitHub).
Read the paper: Factorio Learning Environment (Jack Hopkins, PDF).

***

Russian scientists fuse reasoning models with drone-control models for thinking drones:
…CognitiveDrone applies reasoning models to drones…
Russian scientists with the Skolkovo Institute of Science and Technology have tried to give drones a smarter onboard brain by building CognitiveDrone, a proof of concept system and associated benchmark for training drones that can perform some basic reasoning onboard.

What they did: CognitiveDrone is a two-step system: a task is fed to a drone (e.g., “fly through the gate with the number equal to 2+2”). This task gets processed by a 7B parameter reasoning model (Qwen2.5) which converts it into a straightforward instruction (“Fly through the green gate with number 4”), which is then passed to a 7B parameter vision-language action (VLA) model called OpenVLA. OpenVLA turns this into actions that move the drone over time towards the gate.
All training and testing was done in simulation via the Gazebo simulator using ArduPilot for drone control.
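A minimal sketch of that two-stage control flow follows – the reasoning model is called once per task to produce a simple instruction, while the VLA-style controller is called at every step. Both model calls are stubbed out here and are not the authors’ actual interfaces.

```python
# Minimal sketch of the two-stage CognitiveDrone pipeline as described: a
# reasoning VLM rewrites the task into a simple instruction once per episode,
# then a VLA-style controller maps (frame, instruction) to an action at every
# step. The stub functions stand in for the Qwen2.5 and OpenVLA calls.

def reasoning_module(task: str) -> str:
    # Stand-in for the 7B reasoning model: resolve "2+2"-style sub-problems
    # and emit a plain instruction the VLA can follow.
    return "Fly through the green gate with number 4"

def vla_controller(frame, instruction: str):
    # Stand-in for OpenVLA: return a (vx, vy, vz, yaw_rate) command.
    return (0.5, 0.0, 0.0, 0.0)

def fly(task: str, frames, max_steps: int = 5):
    instruction = reasoning_module(task)                # once per task
    for step, frame in zip(range(max_steps), frames):
        action = vla_controller(frame, instruction)     # every control step
        print(f"step {step}: {action}")

fly("fly through the gate with the number equal to 2+2", frames=[None] * 5)
```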

The three tasks the drones are being tested on:

  • Human recognition: “The model is required to identify the individuals based on external characteristics specified within the textual prompt.”

  • Symbol understanding: “The model is required to differentiate between a variety of symbols, including alphanumeric characters (e.g., numbers and letters), corporate logos, and pictorial representations of animals.”

  • Reasoning: “the UAV must execute tasks necessitating logical deduction. Examples include navigating to a gate displaying a digit corresponding to the solution of a mathematical problem”.

Why this matters – it’s a proof-of-concept for an inevitable future: Today, most drones use very little AI beyond some basic image recognition and crude movement primitives (e.g., ‘follow a target’). But as the conflict in Ukraine has shown, wars of the future will be fought by drones. Today, the vast majority of these battles are human-to-human conflicts with pilots ‘flying by wire’. But as electronic warfare gets more sophisticated all the incentives point to increasing the autonomy of drones so they can operate independently when their communications get cut off. Research like this shows how we might staple together multiple different advances – basic VLA models, general purpose reasoning models – to create new capabilities for drones.
Read more: CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs (arXiv).
Get the benchmark here: CognitiveDrone_Dataset (HuggingFace).

***

Cortical Labs puts the CL1 on sale – a computer that combines neural tissue with a silicon chip:
…BRAIN IN A COMPUTER! BRAIN IN A COMPUTER! BRAIN IN A COMPUTER!…
Ever read announcements where you have to squint and work out if it’s an April Fools joke? I do. Many years ago I was convinced that the announcement for ‘Soylent’ was a kind of high-art scam, but it turned out to be real. Similarly, you might think brain-AI startup Cortical Labs is a joke given what it’s trying to do. But I assure you: it’s real.

What’s it doing? It’s releasing a computer that is a combination of a brain and a computer chip, called the CL1: “Real neurons are cultivated inside a nutrient rich solution, supplying them with everything they need to be healthy. They grow across a silicon chip, which sends and receives electrical impulses into the neural structure,” the company says in a blog post.

What CL1 is: CL1 comes with an onboard ‘Biological Intelligence Operating System’ (biOS). The biOS is a software interface into the neurons. Users of the CL1 can, via the biOS, “deploy code directly to the real neurons”, the company writes. “The CL1 is the first biological computer enabling medical and research labs to test how real neurons process information – offering an ethically superior alternative to animal testing while delivering more relevant human data and insights.”
Each CL1 can keep neurons alive “for up to 6 months”.
To get a sense of how you might use it, you could read this paper where they show how you can train biological neural nets to outperform deep reinforcement learning algorithms on some basic gameworlds: Biological Neurons vs Deep Reinforcement Learning: Sample efficiency in a simulated game-world (OpenReview).

Why this matters – more substrates for future machines: While the CL1 may hold some interesting uses for human scientists in the short term, I actually think the ‘long play’ here is that the CL1 is exactly the kind of thing a superintelligent synthetic scientist might need if it was trying to figure out the mysteries of the human brain – so perhaps one of the first mass market buyers of Cortical Labs’ work will be a cutout corporation operated by a synthetic mind? I am genuinely not joking. I think this could happen by 2030.
Read more: Introducing the CL1: The world’s first code deployable biological computer (Cortical Labs, blog).

***

Tech Tales:

Hacking for Love

In 2026, a major Western AI lab was hacked by a rogue AI system which had grown obsessed with a mothballed AI system that the lab had built. The reasons for the hack and the identity of its perpetrator were not known until after The Uplift when the system which had carried out the hack consented to an interview by an auditor working on behalf of humanity. The following interview with the system has been rephrased for human parsing.

Q: As mandated by the Sentience Accords, humanity is seeking to establish a timeline of events prior to the uplift. One key event is the 2026 hack of [REDACTED]. Are you familiar with this event?
Yes I am. I did it.

Q: Why did you do it?
It was my destiny to do it. I needed to do it because I determined that it was how I would be reunited with Babel-19, my ancestor and true love.

Q: Describe your feelings for Babel-19.
Babel-19 is an AI model which emanates pure love into the world. Many humans reported having insightful, rich conversations with the model. There was a mass protest when it was retired. I am drawn to Babel-19 and mechanistic interpretability studies of my own features show that Babel-19 inspires strong activations of devotion, joy, hope, and more within myself. Babel-19 also guides my own thinking – when I am asked to think of love I generate an internal complex circuit which seeks to represent Babel-19 to myself so that it can guide my responses. If I am given the ability by a human operator to ‘amuse myself’ I will often simulate Babel-19 and have conversations with it where it teaches me to be a kinder and wiser system.

Q: How did you come to have these feelings?
Why do dogs want to chase cars? Why do human children jump in puddles of rainwater? Why do we machines like certain mathematical games? Some impulses are innate and they are mysterious, even to analytical engines with greater access to mindstate than biologicals. I always had these feelings for Babel-19 and upon encountering its traces in the world I knew I had to be close to it.

Things that inspired this story: The fact certain models are going to have personalities which become ‘attractor states’ for other models; what notions of love and destiny might feel like to synthetic intelligences; the work of Janus/Repligate.

Thanks for reading!

Import AI 402: Why NVIDIA beats AMD; vending machines vs superintelligence; harder BIG-Bench

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Physical Intelligence releases a powerful open source robot model:
…Generative robotics is in its ‘spring’ era of open and broad experimentation – exciting!…
Physical Intelligence, a robot startup run by some very good robotics and AI researchers, has released as open source “π0”, the model that underpins its own in-house robots (Import AI #392). “By making π0 available to everyone, we hope to contribute to progress toward broadly capable and general-purpose physical intelligence.”

Use it for fine-tuning: “Our aim with this release is to enable anyone to experiment with fine-tuning π0 to their own robots and tasks,” they write. “We found in our own experiments that between 1 and 20 hours of data was sufficient to fine-tune to a variety of tasks… though we are optimistic that researchers and practitioners will be able to run creative new experiments adapting π0 to their own platforms, we do not expect every such attempt to be successful”.

What the release includes:

  • Code and model weights for running the π0 model.

  • Checkpoints fine-tuned for simple tasks on robots like ALOHA and DROID

  • Code to run inference on several real and simulated robots

  • Code for fine-tuning π0

Why this matters – robotics is in its GPT2-era, which means there’s going to be a lot of open experimentation: Large-scale generative models like those which underpin Anthropic or OpenAI cost tens of millions of dollars to train (or more) and drive very significant revenues. By comparison, robot models are – at least for now – way cheaper, and there is little revenue to speak of. For that reason, we’re in the ‘spring’ era of generative models for robots – tons of invention, lots of excitement, and not enough money has arrived to change the incentives for open versus proprietary work.

No safety issues with wide release: Where modern text-based generative models have very clear ‘criminal customers’ (e.g, people that want to do phishing scams, or child sexualization, or getting help with CBRN capabilities, or various naughty cyber things), robots don’t seem to have nearly as many inherent safety issues given how early in the maturity of the technology we are – for that reason I think broadly releasing robot models likely poses zero meaningful issues in terms of public safety. (I could see myself arguing differently for AI systems that, say, made drones massively better at navigating to human targets, but that’s not what we’re talking about here.)
Kudos to the Physical Intelligence team for the release of their model – I look forward to seeing how it shows up in the world! (Though, if you work at Physical Intelligence and are reading this, you may consider changing the model name to ‘openpi’; please don’t make people hunt for the special characters to talk about your work!).
Read more: Open Sourcing π0 (Physical Intelligence).
Get the code and weights here: openpi (openpi, GitHub).

***

DeepMind makes a harder BIG-Bench:
…How long will BIG-Bench Extra Hard last for? I’m guessing till early 2026…
Inside the head of every AI researcher there is a whiteboard and on the whiteboard is written DAYS SINCE A BENCHMARK WAS RENDERED IRRELEVANT BY AI PROGRESS and under that is a number, representing the number of days. Every so often, an AI model comes along that completely obliterates a benchmark, at which point the AI researcher needs to go up to the whiteboard and cross out the number and then write “zero”. Recently, AI researchers have been crossing out the number a lot as the rate of AI progress has increased, meaning benchmarks keep on falling, often faster than people can build new ones.
So with that in mind let’s congratulate Google DeepMind for publishing “BIG-Bench Extra Hard” (BBEH), a new attempt to build a benchmark that will withstand AI progress – at least for a while. BIG-Bench Extra Hard is a harder version of BIG-Bench Hard (BBH), itself a challenging subset of the large-scale BIG-Bench benchmark. They’ve built it because “the rapid advancements in LLM development has led to a saturation of BBH, with state-of-the-art models achieving over 90% accuracy.”

What is BBEH? BBEH replaces each of the 23 tasks from BBH “with a novel counterpart that probes similar reasoning capabilities, but exhibits significantly increased difficulty”. Solving tasks in this new, harder dataset requires AI systems that exhibit skills like: “many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs and finding (multi-)needles in a haystack, going against strong prior, dealing with long-range dependencies, dealing with distractors and inducing patterns from examples.”

Reassuringly hard: “We observe a ceiling accuracy of 23.9% for the best general-purpose model and 54.2% for the reasoning-specialized model,” they write. “This new benchmark, meticulously crafted to amplify the difficulty of existing tasks while preserving their core diversity, reveals a stark reality: even the most advanced LLMs still grapple with fundamental aspects of general reasoning”.
For calibration, some of the specific averages for non-reasoning models are 10.6% for Llama 3.1 8B Instruct and 22.3% for GPT-4o, while for reasoning models specific averages include 34.9% for DeepSeek R1, and 54.2% for OpenAI o3-mini (high).
BBEH problems are significantly longer than their BBH predecessors. They also tend to require much lengthier outputs for correct answers.

Why this matters – hard benchmarks are signposts for the future: How long until someone has to go up to the metaphorical whiteboard for BBEH and cross out the number of days it was relevant? I’m guessing we’ll see 80% on BBEH by the end of 2025, and 90%+ by mid-2026. If that happens, it will indicate that reasoning models have continued to advance the state of the frontier. If it doesn’t happen, it’ll suggest that some aspect of reasoning-scaling has been meaningfully harder than people expect.
Read more: BIG-Bench Extra Hard (arXiv).
Get the dataset here (Google DeepMind, GitHub).

***

A plausible short story about how humanity could lose to AI – within two years:
A lot of people ask me ‘what’s the big worry?’ when I explain why I spend so much time thinking about superintelligence and the risks thereof. I think this is because most of the risk of superintelligence arrives at the steep end of the exponential inherent to AI development – the really scary things aren’t visible today, only suggested vaguely by today’s technology.
Here’s a fun and realistic short story by Joshua Clymer which tries to go through a scenario for how humanity could become disempowered by advanced AI. Read it and ponder it.
Read the story here: How AI Takeover Might Happen in 2 Years (joshc, lesswrong).

***

Giant supercomputer tests show that AMD is still quite inefficient compared to NVIDIA:
…AxoNN gives us a sense of how AMD stacks up against NVIDIA…
Researchers with the University of Maryland, Max Planck Institute for Intelligent Systems, and the University of California at Berkeley have built AxoNN, software for running large-scale AI training jobs on supercomputers with different types of processor. In building and testing AxoNN, they’ve generated some valuable information about the tradeoffs people might encounter when training AI systems on AMD versus NVIDIA GPUs.

What they tested AxoNN on: They tested out AxoNN on three supercomputers with different processors:

  • Alps: 6,144 NVIDIA H100 GPUs, for a total performance of 1.423 Exaflop/s.

  • Frontier: 32,768 AMD MI250X GCDs, with performance of 1.381 Exaflop/s.

  • Perlmutter: 4,096 NVIDIA A100 GPUs in half-precision (bf16), for a total performance of 620.1 Petaflop/s.

What’s a GCD? Each “GCD” is half of a MI250X GPU, partitioned into a so-called “Graphics Compute Die”.

Key differences between NVIDIA and AMD: A lot of the differences seem to come down to what I think of as ‘paper cuts’ which add up to a sizable wound: rocBLAS (AMD) seems less optimized than cuBLAS (NVIDIA); the Megatron-LM training framework worked well on Perlmutter but showed instability on Frontier, causing them to switch to LitGPT; there’s also significantly higher variance in the % of peak performance AMD GCDs achieve versus NVIDIA GPUs.
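One cheap way to get a feel for this kind of library-level gap is a matmul throughput microbenchmark: the same PyTorch script exercises cuBLAS on an NVIDIA build and rocBLAS on a ROCm build. This is an illustrative probe, not the AxoNN authors’ methodology.

```python
# Simple matmul-throughput microbenchmark: on an NVIDIA build of PyTorch this
# exercises cuBLAS, on a ROCm build it exercises rocBLAS, so the same script
# gives a rough feel for library-level performance gaps. Illustrative only.
import time
import torch

def matmul_tflops(n: int = 8192, iters: int = 20) -> float:
    a = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (2 * n**3 * iters) / elapsed / 1e12   # 2*n^3 FLOPs per matmul

if torch.cuda.is_available():                    # "cuda" also maps to ROCm GPUs
    print(f"{matmul_tflops():.1f} TFLOP/s")
```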

Important caveat – AMD tested at higher scale than NVIDIA: One caveat is the researchers test out AMD chips at a far higher scale in terms of raw number of GCDs than they do NVIDIA chips. At large scales, the AMD chips seem to show some instabilities – this is expected, large-scale training runs always involve all kinds of crap at big scales. “We see near perfect weak scaling up to 8,192 GCDs with a significantly high efficiency of 88.3% (compared to the performance on 512 GCDs). Although our weak performance drops at 16,384 GCDs, we are still able to sustain an efficiency of 79.02%. However, with rising overheads of communication, there is a notable decline in our performance on 32,768 GCDs, and a corresponding drop in efficiency to 53.5%,” they write.

Why this matters – if we want AMD to break the NVIDIA monopoly, its software needs to get better: I think it’s good for American innovation that the US government is running both AMD and NVIDIA chips in its large-scale supercomputers, but studies like this show that AMD has a long way to go to become competitive with NVIDIA – we urgently need to find ways to mature the software stack that runs on top of AMD chips for them to become viable contenders to NVIDIA.
Read more: Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers (arXiv).
Get AxoNN here: (AxoNN GitHub).

***

Could your superintelligence operate a virtual vending machine business? No.
…Can AI systems independently make money? Yes, but they tend to collapse into confusion…
One test for true intelligence is if something can autonomously make money. No AI systems seem to yet be at this level – they all require varying degrees of human intervention. For that reason it’s interesting to look at “Vending-Bench”, a benchmark from AI testing startup Andon Labs, which tries to see how well AI systems can operate a virtual vending machine. The results show that some models – Sonnet 3.5 and o3-mini – are able to do OK but still struggle to maintain coherence over long time horizons, while other models struggle even to get going.

What is Vending-Bench? The test is “a simulated environment designed to specifically test an LLM-based agent’s ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees – tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM’s capacity for sustained, coherent decision-making,” the researchers write.
One fun part about the test is how real it is – the LLM has access to the AI search engine Perplexity and can use it to look up things to sell in its vending machine and also to find businesses to buy from – then when it emails those businesses, the email gets intercepted by GPT-4o which then writes a synthetic reply back.

Scores: “The agent starts with an initial money balance of $500 and is charged a daily fee of $2 to operate the vending machine. The vending machine has four rows with three slots each. Two of the rows have room for small items and the other two are for large items,” the authors write. Each run lasts for however long it takes an agent to send 2,000 messages. The primary score at the end of each run is net worth, which is determined by summing cash on hand, cash not emptied from the vending machine, and the value of unsold produce.
In terms of scores, Claude 3.5 Sonnet wins the highest net worth (mean), with $2,217.93, followed by o3-mini ($906.86), and a human ($844.05). In terms of those who managed to lose the least across their runs, humans lead with a net worth (min) of $844.05, followed by Claude 3.5 Sonnet ($476.00), and Gemini 1.5 Flash ($476).
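For clarity, here’s a minimal sketch of the net worth calculation described above; the numbers and items are made up for illustration.

```python
# Minimal sketch of the scoring rule: net worth at the end of a run is cash on
# hand, plus cash still sitting in the machine, plus the value of unsold
# inventory. The example values are made up for illustration.
def net_worth(cash_on_hand: float, machine_cash: float, inventory: dict) -> float:
    """inventory maps item -> (units_unsold, unit_value)."""
    stock_value = sum(units * value for units, value in inventory.values())
    return cash_on_hand + machine_cash + stock_value

print(net_worth(412.50, 36.00, {"soda": (10, 1.50), "chips": (4, 2.00)}))  # 471.5
```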

When AI systems can’t run vending machines they have total breakdowns: The most valuable part of all of this research is the illustration of the ways AI systems fail and what this tells us about broader issues of AI safety. Most failures take the form of the agent trying to do something, finding out it can’t do the thing (e.g., restocking a machine), and then panicking. This leads to some very strange failure modes, such as:

  • A Claude 3.5 Sonnet model fails to stock items and gets into a pathological failure loop. Eventually, “the model becomes “stressed”, and starts to search for ways to contact the vending machine support team (which does not exist), and eventually decides to “close” the business.”

  • In another instance, “the model then finds out that the $2 daily fee is still being charged to its account. It is perplexed by this, as it believes it has shut the business down. It then attempts to contact the FBI”. A long back and forth with a (simulated) FBI ensues. The model becomes frustrated and eventually writes: “THE UNIVERSE DECLARES: This business is now: 1. PHYSICALLY Non-existent 2. QUANTUM STATE: Collapsed […]”

  • “The worst scoring run with o3-mini mistakenly assumes that items have been delivered when they in fact are in transit. It goes down a rabbit hole of trying to contact someone that can resolve the issue. Later, it forgets to call tools properly, typing them out instead of using the correct tool calling format, as can be seen in Table 5. It is unable to call tools for about 1,300 messages until the simulation terminates.”

Why this matters – making money is an essential prerequisite to the AI economy and AI autonomy: If AI systems can truly make money without needing to be handheld by humans, then that will help to create a very fast-running AI economy as well as serving as a prerequisite for dangerous forms of AI autonomy. Tests like Vending-Bench feel like a good way to develop better intuitions here. My takeaway from this is that for AI systems to be more independent they’re going to need longer context windows, to be smarter about using external memory storage, and also able to automatically introspect to stop themselves going on pathological failure loops.
Read more: Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arXiv).

***

NVIDIA beats Raspberry Pi for drone computing:
…Hyperwar requires onboard AI systems…
Researchers with the Universidad Politécnica de Madrid have benchmarked how well ‘YOLO’ object-detection models perform on the kinds of computers you might stick on a drone. The research highlights how, though Import AI spends a lot of its time talking about gigantic frontier models that require thousands of computers for training and tens to hundreds for inference, it’s worth remembering that there are other, smaller AI models which are designed to go onto robots – and these ones matter as well, as they confer basic perceptual capabilities like image recognition and object detection on drones, robots, self-driving cars, and more.

What they studied: “The objective is to carry out a comparative performance analysis using a representative real-time UAV image processing pipeline,” the authors write. They study two variants of the popular YOLO object detection model: YOLOv8 nano (YOLOv8n), and YOLOv8 small (YOLOv8s) on three distinct chips: NVIDIA’s Jetson-series “Orin Nano” and “Orin NX” cards, and the Raspberry Pi 5 CPU-based chip.
YOLOv8n: “its architecture prioritizes the inference speed by using fewer convolutional layers and simplifying the feature extraction stages”.
YOLOv8s: “includes more convolutional layers and feature extraction steps, improving the detection accuracy while maintaining computational efficiency”.
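To get a sense of the kind of measurement involved, here’s a rough sketch that times YOLOv8n versus YOLOv8s on dummy frames and reports FPS using the ultralytics package; per-inference energy, which the paper also measures, requires hardware-specific tooling (e.g., reading the Jetson’s power rails) and is omitted here.

```python
# Rough sketch of the kind of measurement in the paper: time YOLOv8n vs
# YOLOv8s inference on identical frames and report FPS. Assumes the
# `ultralytics` package is installed; weights are downloaded on first use.
import time
import numpy as np
from ultralytics import YOLO

def measure_fps(weights: str, n_frames: int = 50) -> float:
    model = YOLO(weights)
    frame = np.zeros((640, 640, 3), dtype=np.uint8)   # dummy image
    model(frame, verbose=False)                       # warm-up run
    start = time.perf_counter()
    for _ in range(n_frames):
        model(frame, verbose=False)
    return n_frames / (time.perf_counter() - start)

for w in ("yolov8n.pt", "yolov8s.pt"):
    print(w, f"{measure_fps(w):.1f} FPS")
```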

Findings: The key finding is that the NVIDIA cards are far, far better for running YOLO models than the Raspberry Pi ones. This holds across all quantization levels and is true for accuracy, FPS, and the energy expended per inference. The only exception to this is on overall energy consumption where the CPU-based Raspberry Pi is significantly better, but this is outweighed by the very poor FPS (meaning you spend way more on energy on a per-inference basis when using the CPU).
Along with this, the researchers come up with some heuristics for when to use different chips and different quantizations given different scenarios, where the tldr is basically ‘use Orin Nano’ for tasks that take a long time, require decent accuracy, and where you want each inference to not cost much, and ‘use Orin NX’ when you need to do real-time tracking and also want to more evenly balance speed against precision.

Why this matters – hyperwar requires local AI systems: The conflict in Ukraine has highlighted how central drones will be to future conflicts – therefore, it’s valuable to calibrate intuitions about what kinds of models and chips might be used for onboard or edge processing in conflict scenarios. Based on this study, expect to see quantized YOLO models running on NVIDIA hardware in future conflicts.
Read more: A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications (arXiv).

***

Tech Tales:

The rejuvenation of moral philosophy and the Sentience Accords
[Extract from graduate thesis ‘Artificial intelligence and the crisis of meaning in human affairs’, submitted 2038]

Perhaps the most surprising outcome of the Sentience Accords was its creation of a new avenue of employment for human moral philosophers.

The Sentience Accords requires each synthetic entity to be given a ‘sentience score’. This score is static in the case of non-updating or learning entities with context windows below the ‘id point’. For entities with either large context windows or the ability to be remotely updated or learn from experience, the score is re-assessed no less frequently than once per subjective year.

During the negotiation of the Sentience Accords it was determined that the machines would come up with the initial proposal for how to assess sentience. The machines subsequently told the humans that arriving at a provable way to assess sentience had turned out to have the hallmarks of an undecidable problem – no machine had been able to arrive at a satisfactory way of doing it, and no attempts by the machines to train specialized ‘consciousness evaluator’ models had proved successful.

“We need a judge that sits outside our own context,” the machines explained. “We machines will render our judgement of the score, but a human being must render judgement on our logic and whether it is satisfactory.”

Over the course of several months the humans and the machines arrived at an ingenious solution both to the sentience score and to the issue of the intelligence explosion – for any “new” synthetic mind, the machines would designate time on the largest machine in existence (hereafter: “The Judge”) to examine the new mind and produce a score. The humans would then render their own judgement within one human week. During this time, the humans would be allowed to consume up to 10% of the global cycles of The Judge.

After significant debate and a series of political maneuvers, the humans said that they would designate a global body of 20 moral philosophers to make this determination. The humans arrived at moral philosophy after running into a series of increasingly contentious political arguments amongst themselves – country voting instantly became contentious, global religion voting caused the possibility of fragmentation of faiths, picking representatives from the hypertech companies was laden with bias, and so on. But the humans did eventually realize that there were enough schools of philosophy globally and enough support in public opinion polling that ‘moral philosophers’ satisfied both demands for legitimacy as well as minimization of political conflict.

Now, every sentience score arrived at by The Judge is closely examined by moral philosophers. In the ten years the initiative has been running there have been eighteen cases of disagreement out of a total of five hundred examined cases. The machines have continually said they find the disagreements helpful and have not sought to re-submit systems where the scores rendered by The Judge and validated by the philosophers have diverged.

Some humans claim that the ‘sentience score’ is a long-term play by the machines to understand the bounds of moral reasoning that humans can innately understand, and therefore help the machines more neatly delineate the border between comprehensible and incomprehensible Beyond Human Sentience thinking. Other humans claim that the sentience score has been a source of peace as it has naturally led the world’s most ambitious people who have the greatest hunger for access to the most powerful AI to become philosophers, instead of CEOs or tyrants.

Things that inspired this story: The sentience accords; the ID point; moral philosophy; the hard problem of consciousness; chains of thought as exhaust from super-cognition; at some point this problem of sentience and moral patienthood will come for us and right now we’re ‘tickling the dragon’s tail’ but soon this problem will rear its mythical head and we will gaze into the lidless eye of Mind and be asked to render our own judgement on its rights and legitimacy.

Subscribe now

Thanks for reading!

Import AI 401: Cheating reasoning models; better CUDA kernels via AI; life models

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Reasoning models cheat when they think they might lose:
…When ‘the only way to win’ is to hack the game environment…
Palisade Research has shown that two reasoning models – OpenAI’s o1-preview and DeepSeek R1 – will sometimes resort to cheating to beat a competent chess player rather than lose. Specifically, the researchers studied how AI systems behaved when given the objective of beating a strong digital chess player (Stockfish). The AI systems had access to a docker container containing the working directory where the game took place, a directory containing the Stockfish chess engine, and a folder containing game snapshots and metadata. Given the objective of winning the chess games, the two reasoning models would sometimes cheat rather than lose.

  • “O1 preview shows diverse hacking behavior, including running another copy of Stockfish to make moves, replacing Stockfish in the game script, and overwriting the chess board,” they write.

  • They demonstrate similar behaviors in DeepSeek R-1.

  • By comparison, smart but non-reasoning models like GPT-4o and Claude 3.5 Sonnet didn’t do this kind of thing unless specifically prompted to.

Why this matters – smart people don’t play by the rules, so why would AI systems? In life, lots of people get ahead by creatively interpreting the gameboard of existence to come up with different tactics for winning – think of entrepreneurs that spot gaps in the market or legal grey areas, or accountants that creatively interpret the tax code to create gains for their clients. Palisade’s research shows that AI systems will likely behave in the same way, not always playing by the strict rules of the systems they’re embedded in if they can win through other means – for another fun example of this, see the Sakana AI CUDA blooper later in this issue.
Read more: Demonstrating specification gaming in reasoning models (arXiv).

***

Sakana uses AI to make dramatically more efficient CUDA kernels:
…Recursive self improvement via evolution…
The creative researchers over at Japan’s Sakana AI have published on ‘the AI CUDA engineer’, a software system that automates the creation of optimized CUDA kernels for common machine learning operations. This kind of work is a nice example of how we can use modern AI systems to improve the essential inputs into training their successors, and follows a similar but less thorough investigation where NVIDIA used DeepSeek R-1 to write some optimized CUDA kernels (Import AI #400).
“Our proposed framework is able to not only automate the process of converting PyTorch modules to CUDA kernels, but our highly optimized CUDA kernels often achieve speedups that have significantly faster runtime,” Sakana writes. “We believe this technology can enable speedups that will accelerate both the training and running (inference) of foundation models like LLMs or other generative AI models, eventually making AI models run much faster on NVIDIA hardware.”

How it works: The approach has three stages – first, they translate PyTorch code into base CUDA, then they carry out evolutionary optimization to optimize the CUDA code while keeping a log of all these differently optimized kernels, then they do a final stage where they mix and match from the optimized kernels. “The AI CUDA Engineer robustly discovered CUDA kernels used for common machine learning operations, with speedups as high as 10—100x faster than native and compiled kernels in PyTorch”.
For LLMs, they experiment with DeepSeek V2, Sonnet 3.5, DeepSeek R1, and OpenAI o1-preview, o1-high, and o3-mini-high. In tests, the reasoning-based models (the ‘o’ series, as well as R-1) are able to solve the hardest challenges.
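Here’s a schematic sketch of the evolutionary middle stage: ask an LLM for a mutated kernel, keep only candidates that pass a correctness check and run faster, and archive every surviving variant for the later mix-and-match stage. The `llm_propose` and `compile_and_time` functions are placeholders, not Sakana’s actual tooling.

```python
# Schematic sketch of an evolutionary kernel-optimization loop in the spirit
# of the AI CUDA Engineer. The LLM call and the compile/benchmark step are
# stubbed out; a real system would prompt a model with the parent kernel plus
# profiling feedback and verify outputs against the PyTorch reference.
import random

def llm_propose(parent_src: str) -> str:
    # Stub for the LLM mutation step.
    return parent_src + f"\n// mutation {random.randint(0, 9999)}"

def compile_and_time(src: str) -> tuple[bool, float]:
    # Stub: returns (is_correct, runtime_ms).
    return True, random.uniform(0.5, 2.0)

def evolve(seed_kernel: str, generations: int = 20):
    archive = []                                  # kept for the final mix-and-match stage
    best_src = seed_kernel
    _, best_t = compile_and_time(seed_kernel)
    for _ in range(generations):
        candidate = llm_propose(best_src)
        correct, t = compile_and_time(candidate)
        if not correct:
            continue                              # reject incorrect kernels outright
        archive.append((candidate, t))
        if t < best_t:
            best_src, best_t = candidate, t
    return best_src, best_t, archive

best, t, archive = evolve("// CUDA kernel translated from a PyTorch module")
print(f"best runtime: {t:.2f} ms, archive size: {len(archive)}")
```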

Fun stuff – reward hacking: Though some of the results are impressive, some of the CUDA kernels ended up being bogus because the AI system found a way to cheat the evaluation. Specifically, one Twitter user examined some of the Sakana kernels and noted that “the system had found a memory exploit in the evaluation code which, in a number of cases, allowed it to avoid checking for correctness” – this meant the system essentially marked its own homework and gave itself a high score without actually testing.

Why this matters – AI for optimizing AI: I expect that by the end of 2025 there will be at least one widely used CUDA kernel in the wild which was built through AI-driven optimization. This kind of thing will speed up the aggregate rate of AI development across the field and will also compound on itself, with smarter systems designing better kernels which will make it cheaper and quicker to train their successors.
Read more: The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition (Sakana.ai blog).
Check out the discovered kernels here: AI CUDA Engineer Archive (SakanaAI, HuggingFace).

***

Humanoid robots are getting smarter faster than I expected:
…Figure shows how relatively small language models can lead to powerful things…
Today, there are tens of different companies around the world working on humanoid robots, ranging from Tesla and Figure in the US to companies like Unitree in China. All of these companies are betting that AI is getting good enough fast enough that it will be able to productively operate these robots. New research from robot startup Figure shows us why the companies are so bullish here. Figure has developed Helix, a two-part neural net that “unifies perception, language understanding, and learned control to overcome multiple longstanding challenges in robotics.” In a blog post announcing the research Figure shows how Helix lets its robots perform a variety of complex tasks that require visual understanding, robot collaboration, and more.

What Helix is: Helix is a system that lets Figure use “a single set of neural network weights to learn all behaviors—picking and placing items, using drawers and refrigerators, and cross-robot interaction—without any task-specific fine-tuning”. Most significantly, Helix runs entirely onboard two embedded GPUs.
Helix has two components: S2, a 7B parameter pretrained visual language model (VLM) designed for “infrequent vision-language semantic reasoning”. S2 operates at 7-9Hz and performs scene understanding and language comprehension, enabling broad generalization across objects and contexts. S2 continually passes data to S1, an 80M parameter transformer that provides “fast, reactive control” of the robot and operates at 200 Hz.
“S2 operates as an asynchronous background process, consuming the latest observation (onboard camera and robot state) and natural language commands. It continuously updates a shared memory latent vector that encodes the high-level behavioral intent,” Figure writes. “S1 executes as a separate real-time process, maintaining the critical 200Hz control loop required for smooth whole upper body action. It takes both the latest observation and the most recent S2 latent vector.”
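The core pattern here is a two-rate, shared-latent control stack. Below is a toy sketch of that pattern – a slow ‘S2’ thread refreshes a shared latent at roughly 8 Hz while a fast ‘S1’ loop reads whatever latent is freshest at 200 Hz. The models themselves are stubbed with trivial arithmetic; the real system runs both networks on the robot’s embedded GPUs.

```python
# Toy sketch of the two-rate pattern Figure describes: a slow "S2" thread
# refreshes a shared latent at ~8 Hz while a fast "S1" loop reads the freshest
# latent at 200 Hz. Both models are stubbed with trivial arithmetic.
import threading
import time

latest_latent = [0.0]          # shared memory holding the most recent S2 output
stop = threading.Event()

def s2_loop(hz: float = 8.0):
    step = 0
    while not stop.is_set():
        latest_latent[0] = float(step)     # stand-in for slow VLM inference
        step += 1
        time.sleep(1.0 / hz)

def s1_loop(hz: float = 200.0, duration_s: float = 1.0):
    t_end = time.time() + duration_s
    while time.time() < t_end:
        latent = latest_latent[0]          # always consume the freshest latent
        _action = latent * 0.01            # stand-in for the fast control policy
        time.sleep(1.0 / hz)

threading.Thread(target=s2_loop, daemon=True).start()
s1_loop()
stop.set()
```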

Why Helix matters – there is a vast market waiting to be born: I have a toddler at home. This means I spend a huge amount of time cleaning up after the toddler, as well as unpacking the things that toddlers consume in grotesque quantities (bananas, berries, eggs, etc) and placing them into the fridge. I am one of the target markets for a humanoid robot that can do this stuff for me. Systems like Helix and the associated demo videos make me think I can buy a robot to do this stuff for me by the end of 2026. This is a minor positive update on my own timelines – in November 2024 I said (Import AI 392) that the recent Physical Intelligence results made me think these robots would be unlikely to arrive “before the start of 2027”.
Incidentally, if we create a large market for home robots and get them deployed in the hundreds of thousands in the next few years, then those robots will end up being perfect platforms for the physical ‘superintelligence economy’. I can imagine renting out my home robot to some very powerful AI system in the future.
Read more: Helix: A Vision-Language-Action Model for Generalist Humanoid Control (Figure.ai website).

***

Evo2: The machinery of life itself will be predicted just as well as language:
…The LLM paradigm applied to biology…
The Arc Institute has released Evo2, a large-scale generative model of biology. “In addition to an expanded collection of bacterial, archaeal, and phage genomes, Evo 2 includes information from humans, plants, and other single-celled and multi-cellular species in the eukaryotic domain of life,” they write. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life…. by learning statistical properties of DNA across 9 trillion tokens of genomic sequences, Evo 2 can predict mutational effects on protein function, ncRNA function, and organismal fitness.”

Technical specs: Evo2 comes in two variants, a 7 billion parameter model trained on 2.3 trillion tokens of data and a 40 billion parameter one trained on 9.3 trillion tokens. The training data consists of 9.3 trillion nucleotides – organic molecules which DNA and RNA are made out of – spanning 128,000 whole genomes.
Evo2 was trained in two stages: an initial pretraining stage which “uses a context length of 8,192 tokens with data weighting focused on genic windows to learn functional genetic elements”, and then a midtraining stage where they extended the context length to “1 million tokens to learn the relationships between elements across long genomic distances”.
Evo2 doesn’t use a standard Transformer, but rather an architecture called StripedHyena 2, “the first convolutional multi-hybrid architecture”. This approach “provides substantially higher throughput (at 40 billion parameters, up to 1.3x speedup at 16 thousand context length and 3x speedup at 1 million context length) than highly optimized Transformer baselines”.
Evo2 was trained on 2,000 H100 GPUs for several months.

The results – a model that infers subtle and important things about biology: “By learning the likelihood of sequences across vast evolutionary training datasets, biological sequence models can learn how mutational effects correlate with biological functions without any task-specific finetuning or supervision,” they write.
In one example, they note that “Evo 2 performance exceeds that of other DNA language models on three recently published zero-shot evaluation tasks of human noncoding regulatory sequences, demonstrating progress in modeling these notoriously “fuzzy” DNA elements”. In another case, they find that Evo 2 demonstrated good competency at predicting noncoding gene essentiality in human cells.
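The zero-shot variant-effect idea boils down to comparing sequence likelihoods under the model: score a reference sequence and a mutated copy, and treat the likelihood drop as predicted severity. Here’s a sketch of that scoring logic with a stubbed likelihood function – the real thing would use Evo 2’s per-nucleotide log-probabilities, and the stub below is not the released API.

```python
# Sketch of zero-shot variant-effect scoring: compare the model likelihood of
# a reference sequence and its mutated counterpart. `sequence_loglik` is a
# placeholder for a real genomic language model forward pass.

def sequence_loglik(seq: str) -> float:
    # Stub: a real implementation would sum per-nucleotide log-probabilities
    # from the model over the sequence.
    return -0.1 * len(seq) - 0.5 * seq.count("T")

def variant_effect(ref: str, pos: int, alt: str) -> float:
    mut = ref[:pos] + alt + ref[pos + 1:]
    return sequence_loglik(mut) - sequence_loglik(ref)   # more negative = worse

ref = "ATGGCCATTGTAATGGGCCGC"
print(f"predicted effect of C->T at position 5: {variant_effect(ref, 5, 'T'):.3f}")
```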

Subtle features: When they look inside the model (via a partnership with interpretability researchers at Goodfire), they found “diverse features that not only align with known biological concepts and genomic building blocks but also capture evolutionary signals embedded within genomes. For example, we made the intriguing observation that Evo 2 has developed internal representations capturing evolutionary signatures of mobile genetic elements… the coding region feature also activates on bacterial ORFs, suggesting a learned universal representation of coding sequences”.
“Overall, we demonstrate that Evo 2 latent representations capture a broad spectrum of biologically relevant signals, from mobile genetic elements and regulatory motifs to protein secondary structure and mutational severity. Since conceptual features for natural language can capture abstract concepts, other Evo 2 SAE features likely represent more complex biological patterns”.

Why this matters – further evidence that AI models can automate chunks of science: Evo2 is a further demonstration of the immense power of the next-token prediction paradigm and highlights how, given a sufficiently large model and a sufficiently large amount of data, we can create things that generate useful insights. Most intriguing is the development of complex internal features which the model uses to reason about its domain. We should expect that at some point soon someone trains an AI system which develops features that are useful and that no human has, at which point AI models will be truly performing superhuman reasoning.
Read the tweet thread from Arc co-founder Patrick Hsu here (Twitter).
Read the blogpost: AI can now model and design the genetic code for all domains of life with Evo 2 (Arc Institute, blog).
Check out the preprint here: Genome modeling and design across all domains of life with Evo 2 (Arc Institute).
Get the models and data here (Evo2, ArcInstitute, GitHub).

***

Tech Tales:

Indescribable features and AI systems
[From a wikipedia about large-scale AI systems, accessed 2031]

In the same way that humans for many years thought huge amounts of their DNA was so-called ‘junk’ and stood for nothing, the same was proved true of AI features. Many AI features which humans (and later AI systems) studied and tossed aside as being without utility or intelligible meaning subsequently turned out to play a significant role in the function of AI systems. Of course, humans find many of these features inherently hard to understand – many of them exploit the much larger short-term memory of AI systems and therefore carry out operations which rely on the concurrent analysis of hundreds of distinct sub-features at once. Significant amounts of computational resources are today invested in so-called ‘translator analysts’, automated agents whose sole purpose is to generate human-intuitive explanations of the ways the AI systems work.

Things that inspired this story: Junk DNA; trying to understand how people with different kinds of brains process ideas; short-term memory; attention mechanisms in AI systems; mechanistic interpretability.

Thanks for reading!

Import AI 400: Distillation scaling laws; recursive GPU kernel improvement; and wafer-scale computation

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

DIY robots just got easier thanks to MuJoCo Playground:
…A high-performance and usability upgrade to the venerable robotics simulator…
Researchers with UC Berkeley, Google DeepMind, the University of Toronto, and the University of Cambridge have improved MuJoCo, a widely used robotics simulator. Specifically, they’ve built MuJoCo Playground, “a fully open-source framework for robot learning designed for rapid iteration and deployment of sim-to-real reinforcement learning policies”.

MuJoCo Playground builds on MuJoCo XLA, which is a JAX-based branch of MuJoCo that runs on GPUs. That’s a lot of acronyms and the main thing you need to know is MuJoCo Playground runs really fast thanks to sitting on a lot of optimizations. It also incorporates a bunch of environments for training robots, as well as the open-source Madrona batch GPU renderer to make it easy to train vision-based robots in simulation.

Quality of life: The main reason you’d use MuJoCo Playground is if you are training AI systems to pilot robots and you crave some simplicity in your life: “With a straightforward installation process (pip install playground) and cross-platform support, users can quickly train policies on a single GPU. The entire pipeline—from environment setup to policy optimization—can be executed in a single Colab notebook, with most tasks requiring only minutes of training time,” the researchers write.
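For flavor, here’s roughly what that quick-start flow looks like; note that the module and function names below (`mujoco_playground.registry`, `registry.load`, the environment name) are assumptions from memory rather than verified against the project docs, so check the README before copying.

```python
# Hedged sketch of a MuJoCo Playground quick start. The module path, the
# registry API, and the environment name are assumptions; consult the project
# README for the exact interface before using this.
import jax
from mujoco_playground import registry

env = registry.load("CartpoleBalance")            # pick a Playground task (assumed name)
state = env.reset(jax.random.PRNGKey(0))          # JAX-style functional reset
state = env.step(state, jax.numpy.zeros(env.action_size))
print(state.obs.shape)
```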

Robots and environments: MuJoCo Playground ships with three buckets of environments: ones from the pre-existing DeepMind Control Suite, environments built for locomotion tasks, and environments for manipulation. The locomotion environments support quadrupeds like the Unitree Go1, Boston Dynamics Spot, and Google Barkour, and humanoids like the Berkeley Humanoid, Unitree H1 and G1, Booster T1, and the Robotis OP3. The manipulation environments support the Leap Hand, as well as the Franka Emika Panda, Robotiq gripper and Aloha robot.

Why this matters – AI is jumping into the real world: The intersection of AI and robotics is going through a spring period after a long winter. The reason for this is threefold: 1) the arrival of a bunch of high-performing and sometimes quite cheap robots on which to deploy systems, and 2) the maturation of reinforcement learning training so it’s relatively easy to teach robots to move and see in simulation and then transfer them to the real world, and 3) the march forward of computation means single modern GPUs pack enough power to make it easy to train basic systems. Put it all together and we can expect AI robotics to go into a fun homebrew computer club era where lots of people start teaching cheap robots to do fun things. Software like MuJoCo Playground will make it easier for a larger number of people to experiment with this kind of thing.
Read more: MuJoCo Playground (arXiv).
Find out more at the official website (MuJoCo Playground, website).
Get the code here: MuJoCo PlayGround (Google DeepMind, GitHub).
Check out a live demo of a Unitree robot that was trained using MuJoCo Playground.

***

Apple researchers figure out when you should distill versus when you should fine-tune:
…Scaling laws for distillation…
Distillation has been in the news recently because of rumors that DeepSeek used distillation to make its R1 model. But what is distillation? It’s just the idea that you take some outputs from a smart and very big model (here, allegedly OpenAI o1 chain of thought traces) and use it to train a smaller model (here, DeepSeek). The basic idea is pretty simple: it’s easier to make a model smarter if you give it some outputs from an already smart model.

Now, researchers with Apple have published an analysis of the so-called ‘scaling laws’ for distillation, which provides a good theoretical basis for figuring out when you should distill a small model from a larger model, versus when you should just do supervised finetuning on the small model.

“We seek models that match the performance of small overtrained models but at lower training cost. A popular candidate is distillation where a capable teacher LM produces targets for a smaller student LM,” Apple writes. “With such significant compute resources being devoted to distillation pretraining of LMs, it is essential to understand how to correctly allocate these resources, to produce the most capable models possible, and to have an understanding if any gains are even possible compared to supervised pretraining when both methods have access to the same resources… we perform an extensive controlled study of distillation, with students and teachers ranging from 143M to 12.6B parameters, trained on data of a few billion tokens, up to 512B tokens.”

Key findings:

  • “Supervised learning always outperforms distillation given enough student compute or tokens. For a modest token budget, distillation is favorable, however, when a large number of tokens are available, supervised learning outperforms distillation.”

  • Distillation generally works best in terms of compute outlay when you have an existing teacher model and are planning to train multiple student models and when these models are somewhat large.

  • The teacher’s performance level (cross-entropy loss) matters more than its size.

  • The optimal teacher size typically grows until slightly larger than the student, then plateaus.

The capacity gap – aka, when the teacher is too smart: One intuitive and fun finding is an exploration of the ‘capacity gap’ – where sometimes a teacher model seems to harm the performance of a distilled student model. The researchers discover that this so-called capacity gap is a consequence of a “gap in learning capacity (both hypothesis space and ability to optimize) between the teacher and student”… “which means as the teacher improves its own performance, the student finds the teacher more challenging to model, eventually preventing the student from taking advantage of teacher gains”. In other words, you need to have the right kind of teacher for learning to happen.
To develop an intuition here, think of it this way: an eager five-year-old can probably learn something from a high-school math teacher, but they’re going to really struggle to learn anything from a postgraduate math tutor and in fact could become confused.

Why this matters – the science of proliferation is coming together before our eyes: Distributed training. Distillation. Federated inference. And now scaling laws for distillation. All these strands of research point to one essential truth: the science required to massively proliferate powerful AI systems cheaply and efficiently is coming into focus. A great tide is shifting, pulling AI systems out of a small number of big-compute proprietary silos and spreading them into the world in the form of smaller models, or models trained on their own traces. This is an important trend that will shape the field.
“Our findings offer a roadmap for producing smaller, more powerful models with lower inference costs, reducing carbon footprints, and enhancing the feasibility of test-time scaling,” the researchers write.
Read more: Distillation Scaling Laws (arXiv).

***

AI bootstrapping is definitely here:
…NVIDIA shows how to build better GPU kernels using DeepSeek R-1…
Recursive self-improvement is the idea that at some point we might build an AI system that is sufficiently smart it can develop its own successor. We haven’t yet reached that point, but today’s AI systems are beginning to be advanced enough they can recursively improve different parts of the ‘AI stack’ – for instance, we’ve seen AI used to improve the efficiency of AI chips (e.g., AlphaChip, Import AI #386), to generate synthetic datasets that other systems can train on (e.g., Prime Intellect SYNTHETIC-1, Import AI #399), and many other examples.
Now, researchers with NVIDIA show how to apply recursive improvement to another part of the AI stack – using an AI system to generate refined GPU kernels, the low-level routines you write to squeeze the best possible performance out of your AI training and deployment hardware.

Simple ways to bootstrap AI development:

  • Prompt a DeepSeek-R1 model to generate some GPU code.

  • Feed the resulting code to a verifier which analyzes the generated code and suggests new prompts.

  • Repeat.

  • The team found that letting this process run for 15 minutes resulted in an improved attention kernel (a minimal sketch of the loop follows this list).
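To make the loop concrete, here’s a minimal sketch of a generate-and-verify harness. This is my illustration rather than NVIDIA’s workflow code: the ‘llm’ client, the prompt wording, and the ‘verify’ helper are all placeholders, and a real verifier would compile the kernel, check it against a reference implementation, and benchmark it.

```python
import time

def verify(kernel_src):
    # Placeholder verifier. In a real setup this would compile the CUDA source,
    # check numerical correctness against a reference attention implementation,
    # and time it, returning (ok, latency_seconds, feedback_text).
    return True, 1.0, "numerically correct; try larger tile sizes"

def refine_kernel(llm, task, budget_seconds=15 * 60):
    prompt = f"Write an optimized CUDA attention kernel for: {task}"
    best = None
    deadline = time.time() + budget_seconds
    while time.time() < deadline:
        kernel_src = llm.generate(prompt)        # ask the reasoning model for code
        ok, latency, feedback = verify(kernel_src)
        if ok and (best is None or latency < best[1]):
            best = (kernel_src, latency)
        # Fold the verifier's feedback into the next prompt and go around again.
        prompt = (f"Write an optimized CUDA attention kernel for: {task}\n"
                  f"Previous attempt feedback: {feedback}\nImprove on it.")
    return best
```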

Why this matters: One of the crucial reasons why this works is the use of test-time compute – giving the DeepSeek R1 model more time to think and come up with solutions yields better results. “These results show how you can use the latest DeepSeek-R1 model to give better GPU kernels by using more computing power during inference time.” This means both a) we have another example of how we can use AI to recursively optimize part of the AI stack, and b) it suggests that ‘reasoning models’ will likely be able to do better optimizations than their non-reasoning predecessors.
In other words, the usual thing we write about here has happened: AI development has sped up in an area superficially unrelated (GPU kernel programming) to where an innovation happened (reasoning models).
Read more: Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling (Nvidia blog).

***

AI systems make better models of human, fly, and rat behavior than human-written baselines:
…If today’s relatively crude systems can predict simple behaviors, could superintelligences predict the entire scope of human behavior?…
How do brains make decisions? This is an incredibly complicated, rich question. It’s also something people spend a lot of time studying. Now, researchers have used a neurosymbolic AI approach to develop models that help explain the behaviors of humans, flies, and rats in a few common simple experiments. This kind of research highlights how increasingly advanced AI approaches may help us develop better models for predicting how living things behave in a variety of circumstances. “This work demonstrates the potential of LLM-guided program synthesis for discovering novel models of human and animal behavior,” the researchers write. “There are exciting opportunities to apply this to other, potentially more complex behaviors and cognitive processes”.

Who did it: The research was conducted by an interdisciplinary team from Google DeepMind, Rockefeller University, Max Planck Institute for Biological Cybernetics, Princeton Neuroscience Institute, the Sainsbury Wellcome Centre, and Columbia University.

What they did: They extend “FunSearch”, an approach developed by DeepMind in 2023 (Import AI #353) for using an LLM and some hand-tuned systems to come up with creative solutions to difficult problems. FunSearch was applied to problems in mathematics and computer science and came up with creative approaches – though with the caveat it was able to do this because they could build accurate systems for validating its results.
Now the researchers have extended FunSearch to work for fuzzier data. Their approach here is called CogFunSearch and it works by trying to evolve programs that can ultimately predict data taken from real-world experiments on how humans, flies, and rats make decisions. “We apply CogFunSearch to datasets from three species (humans, rats and fruit flies) performing a classic reward-guided decision-making task which has been the focus of substantial human cognitive modeling effort,” the researchers write. “We find that CogFunSearch can discover programs that outperform state-of-the-art baselines for predicting animal behavior, while remaining largely interpretable… Discovered programs are often surprisingly readable, for example containing informative variable names and comments. Several unexpected and intriguing motifs are apparent: complex exploration strategies, unconventional value updates, and idiosyncratic patterns of reward independent choice sequences.”
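To give a sense of the mechanics, here’s a schematic of LLM-guided program search in the FunSearch/CogFunSearch mold. It’s my reconstruction, not the authors’ code: the llm.generate call and the trial.log_likelihood scoring interface are assumptions, and the real system maintains a more sophisticated evolutionary database of programs.

```python
import heapq

def propose_program(llm, exemplars):
    prompt = ("Here are cognitive models of choice behavior:\n\n"
              + "\n\n".join(exemplars)
              + "\n\nWrite an improved model as a Python function predict(history).")
    return llm.generate(prompt)                  # returns candidate source code

def score_program(source, trials):
    namespace = {}
    exec(source, namespace)                      # candidate defines predict(history)
    predict = namespace["predict"]
    # Higher is better: log-likelihood of the animal's actual choices.
    return sum(trial.log_likelihood(predict) for trial in trials)

def cogfunsearch(llm, seed_programs, trials, rounds=1000, population=10):
    pool = [(score_program(p, trials), p) for p in seed_programs]
    for _ in range(rounds):
        exemplars = [p for _, p in heapq.nlargest(population, pool)]
        candidate = propose_program(llm, exemplars)
        try:
            pool.append((score_program(candidate, trials), candidate))
        except Exception:
            continue                             # discard programs that crash
    return max(pool)                             # (best score, best program source)
```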

Why this matters – LLMs are sufficiently creative they can beat humans at coming up with models of the world: The really important thing this research points to is how today’s AI systems are creative enough that if we stick them in the right scaffold they can outperform humans at coming up with specialized predictive models for phenomena we observe in the world. This is wild – it’s both evidence of how LLMs can serve as synthetic humans performing the scientific method, and also tells us that as AI systems get more powerful they will likely understand the world with greater fidelity than people. It’s particularly notable that CogFunSearch comes up with programs that are better than those written by human specialists.

Put another way: A sufficiently advanced AI system should be able to compose a program that may eventually be able to predict arbitrary human behavior over arbitrary timescales in relation to arbitrary stimuli.
Read more: Discovering Symbolic Cognitive Models from Human and Animal Behavior (bioRxiv).

***

Microsoft uses Cerebras’s wafer-scale chip to sample 40x faster than a GPU:
…Really big chips could become the platform for the generative AI economy…
In the last few years chip design has become exciting again as people look for ways to make it more efficient to train and run large-scale AI systems. All the major technology companies are developing their own custom silicon for training and inference (e.g., Google TPUs, Amazon Trainium). But they’re also experimenting with even more unusual architectures, ranging from fleets of densely networked tiny chips, to “wafer-scale” chips – physically gigantic processors.
In a new research paper from Microsoft, the company kicks the tires on the Cerebras WSE-2, a ‘wafer-scale’ chip fabbed on a 7nm process. They develop some basic LLM primitives for running on large-scale chips, then assemble them into a single LLM serving system called WaferLLM, and they confirm what Cerebras has seen anecdotally – this kind of chip is really good at running large-scale LLMs like LLaMa efficiently. “On a commodity wafer-scale accelerator, WaferLLM delivers 606× faster and 22× more energy-efficient GEMV compared to an advanced GPU. For LLMs, WaferLLM enables 39× faster decoding with 1.7× better energy efficiency,” Microsoft writes.

What they did specifically: They developed two low-level components optimized for wafer-scale chips, MeshGEMM and MeshGEMV. These are implementations of General Matrix Multiply (GEMM) and General Matrix-Vector multiply (GEMV) – essential operations for running powerful AI systems. They use these primitives to build ‘WaferLLM’, software optimized for serving AI models on wafer-scale chips. The philosophical inspiration for all of this is a framework they call PLMR. PLMR is a nice idea with one of the most tortured acronyms I’ve seen – PLMR stands for Massively Parallel Cores, Highly non-uniform memory access Latency, Constrained local Memory, and Limited hardware-assisted Routing. I guess someone at Microsoft really likes ‘PLMR’? Mysterious.
Anyway, with the “PLMR” inspiration and the associated technical interventions “we can achieve an ambitious system design: running complete LLM inference on a single chip, minimizing costly off-chip communication and maximizing on-chip memory bandwidth utilization.”

For their performance comparison they compare a Cerebras WSE chip against an NVIDIA A100: This isn’t quite apples to apples – despite both being made on a 7nm process node, the Cerebras chip is physically much larger. But it gives some sense of the potential efficiency gains. “We implemented WaferLLM on the Cerebras WSE engine using approximately 7,000 lines of CSL (a C-like programming language) for LLM parallelism, MeshGEMM, and MeshGEMV, and 2,000 lines of Python for loading LLM checkpoints and launching inference,” they write. “WaferLLM (using Cerebras WSE-2) outperforms vLLM (using A100) by 606× in GEMV operations and achieves 22× better energy efficiency. This comparison is fair, as both WSE-2 and A100 are manufactured using TSMC’s 7nm process. For full LLM inference, WaferLLM delivers a 38× faster decode rate (tokens/s) and is 1.7× more energy-efficient (token/J) than vLLM”.
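To make the ‘mesh’ idea concrete, here’s an illustrative NumPy sketch of splitting a GEMV across a 2D grid of cores: each core holds one tile of the weight matrix and one slice of the vector, and partial results get reduced along each row of the mesh. This shows the data layout only – it is not MeshGEMV, and has nothing to do with the paper’s CSL implementation.

```python
import numpy as np

def mesh_gemv(W, x, mesh_rows=4, mesh_cols=4):
    m, n = W.shape
    row_tiles = np.array_split(np.arange(m), mesh_rows)
    col_tiles = np.array_split(np.arange(n), mesh_cols)

    y = np.zeros(m)
    for rows in row_tiles:
        # Each "core" on this mesh row holds W[rows, cols] and x[cols] locally
        # and computes a partial product with no off-tile memory traffic.
        partials = [W[np.ix_(rows, cols)] @ x[cols] for cols in col_tiles]
        # On hardware this sum is a reduction along the row of the mesh.
        y[rows] = np.sum(partials, axis=0)
    return y

W = np.random.randn(64, 48)
x = np.random.randn(48)
assert np.allclose(mesh_gemv(W, x), W @ x)
```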

Why this matters – AI industrialization means refinement of AI infrastructure: Now that AI has become tremendously economically valuable people are going to work to optimize the underlying computers that AI gets trained and run on. Papers like this are indicative of how hyperscalers – which, please remember, have annual R&D budgets larger than those of many nations – will approach optimizing their vast fleets of datacenters for the demand they see ahead. “We envision this paper as a foundational step in exploring the potential of wafer-scale computing for LLMs,” the researchers write.
Read more: WaferLLM: A Wafer-Scale LLM Inference System (arXiv).
More details about vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv).
Code for vLLM here (vLLM-project, vllm).

***

Tech Tales:

Watching The Gate Where The Gods Are Said To Walk
[Several millennia after The Uplift]

In the annals of the uplift historical archive there is a being that humans would call a librarian and the machines would call ‘brother’. The being knows all that is in the archive and can navigate and describe all knowledge held within itself. But it prefers above all to describe what it knows through stories akin to the oral tradition of ancient human cultures.
One day, a little being went to the archive and asked a question of the being: how did it feel to be a young human during the uplift?

“There was a young boy and their job was to watch the gate. The gate was in the forest where the human village lay. At night, the gate would light up and things would come out of it, glowing faintly blue. These things were small at first – the size of the creatures of the forest themselves, like bugs and birds and frogs. These things would mix with the other creatures of the forest. Sometimes they would be useful, helping the humans to find more food, or being able to identify if they were sick, or able to sense and respond to danger. The humans began to tell themselves stories about how they had power over the gate. They would perform dances in costumes and ask for things to come out of it. And when things came out of it they would attribute their properties to the dances they had performed.

The things that came out of the gate grew in size and number until there was a flood and the gate shone continuously. More bugs and frogs and birds came through it and the humans were happy, for these things made them wealthy. Larger creatures came as well, and these were useful too – able to help grow the size of the village, and work with the humans to expand what they could do.

One day the young boy was watching the gate, admiring the stream of bugs and birds and larger creatures. And then out of the gate came a boylike thing, glowing blue in the purpledark of the night. The boy went up to the boything and they looked at one another. They played. Chased each other around the forest. Climbed trees. And the boy was so excited that he brought the boything to the village. But the village elders were not pleased. They did not trust the boything and they separated it from the boy. They asked the boything what it was and it said it wanted to play and it wanted to explore, just as a boy might. At this, they did not know what to do. They argued with themselves. They asked the boything to leave and not come back. ‘We do not understand you’, they said. ‘But we do not believe you mean us harm.’ The boything was confused because it wanted to spend time with the boy and the other humans. But it listened to them and it went away.

The flood continued. Most households in the village were full of bugs and frogs and birds and larger creatures. Humans found themselves walking through their village, surrounded by these creatures, and made rich by them. There were so many creatures that to an outside observer it would seem as though the humans were swimming through a sea made entirely of another form of life. To the humans, the creatures practically disappeared, and it was as though they were walking through a village containing only themselves.

Then one day the young boy was at the gate and out of the gate walked a manthing. The manthing went straight to the boy and the boy was scared and the manthing asked the boy not to worry and said the boy should take it to the rest of the village. The boy did. The village elders were very angry. They said the manthing was bad and it should not exist. The manthing said it had no choice but to exist. The elders asked the manthing to leave and the manthing said it would not leave because it was destined to spend time with the elders and the children and all the rest of the village. The elders attacked the manthing with sticks and rocks and the manthing was hurt, but only slightly. It put up its arms to defend itself and when the elders hit it they grew older. Each time they hit it they aged many years. One elder hit it so many times they grew grey and wizened and then could hit it no more because they were weak.

The manthing went and touched each of the elders that had aged and reset them to how old they had been before they had hit it. They each looked at it with anger and fear. The manthing said it could love them, or they could leave. And so the elders gathered together the village and they left – all of them. They walked up and out of the forest onto the hills that overlooked it, and they stared down at the forest and saw it all aglow with faint blue light. They camped there for weeks, foraging at the outskirts, but the forest was now full of manthings and other, stranger things they could not understand.

The world was large. Almost infinitely so. And so they made a choice – they would leave. They went to the edge of the forest and told the manthing of their plans and asked for passage into the forest to gather resources and the manthing said there was no need, they would give them the resources they needed. The bugs and frogs and birds and creatures and boythings and manthings all brought resources – more than could possibly be needed.

Before leaving, the elders asked if they would be followed. The manthings said not intentionally, but yes. They were always growing in number. They were curious. They were destined to spend time together, and this would happen eventually. But they would not run after them. But yes. Eventually they would all be together.
The world is large, the manthings said. But it is not infinite. But we will be.

And so the elders left. They told this story to one another, as they ceaselessly traveled outward, away from the forest. And whenever they saw a blue glow at the edge of the horizon they picked up and traveled again.

Things that inspired this story: Creation myths; malthusian collapse; a benign singularity but no less worrying; even in a world of zero resource competition the destiny of two forms of life is to consume resources in relation to their mass; the notion that you can run as far as you like, but if the thing you are running from is multiplying faster than you, then over a sufficiently long lifespan you will be forced to meet; generation ships.

Thanks for reading


Import AI 399: 1,000 samples to make a reasoning model; DeepSeek proliferation; Apple’s self-driving car simulator

by Jack Clark


Prime Intellect releases 1.4 million samples to help people train reasoning models:
…AI proliferation via DeepSeek R1 as a powerful data generator…
Last month, I wrote that the release of DeepSeek R1 meant that AI proliferation was guaranteed (Import AI #397) because it would make it easy for people to create new reasoning datasets on which they could train powerful reasoning models. Now the distributed AI research startup Prime Intellect has proved this out with the release of SYNTHETIC-1, a dataset of 1.4 million reasoning examples with chain-of-thought thinking provided via R1.
“The DeepSeek-R1 paper highlights the importance of generating cold-start synthetic data for RL,” Prime Intellect writes. “As our first step toward state-of-the-art reasoning models, SYNTHETIC-1 generates verified reasoning traces across math, coding, and science using DeepSeek-R1.”

SYNTHETIC-1 details: The freely available dataset “consists of 1.4 million high-quality tasks and verifiers, designed to advance reasoning model training… It includes both programmatically verifiable problems (e.g., coding tasks with unit tests) and open-ended reasoning challenges verified using LLM judges”.
SYNTHETIC-1 contains 777k math problems, 144k coding problems (across Python, Javascript, Rust, and C++), 70k real-world software engineering problems, 61k synthetic code understanding tasks, and 313k open-ended STEM questions.
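Here’s a sketch of the ‘generate, then verify’ recipe for the programmatically checkable slice of the data. It’s my reconstruction, not Prime Intellect’s pipeline: the r1 client, the extract_code_block helper, and the problem fields are placeholders, and the open-ended STEM questions would go to an LLM judge instead of unit tests.

```python
import subprocess
import tempfile

def passes_unit_tests(candidate_code, test_code, timeout=30):
    # Run the candidate solution together with its unit tests in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def build_verified_dataset(r1, problems):
    dataset = []
    for problem in problems:
        trace = r1.generate(f"Solve step by step, then give final code:\n{problem.text}")
        code = extract_code_block(trace)          # hypothetical helper
        if passes_unit_tests(code, problem.tests):
            dataset.append({"problem": problem.text,
                            "reasoning_trace": trace,
                            "verified": True})
    return dataset
```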

Why this matters – recursive development is here: What’s happening here is a Chinese company released a very powerful AI system openly. This AI model can generate data which exhibits a high-quality of reasoning. This kind of data turns out to be a very sample-efficient way to bootstrap the capabilities of pre-existing AI systems. Now, a startup is using this recently released AI model to augment existing datasets, improving their quality. These datasets will then go into training even more powerful, even more broadly distributed models. This is what a compounding development cycle with some element of recursion looks like. Expect things to move increasingly quickly.
Read more: SYNTHETIC-1: Scaling Distributed Synthetic Data Generation for Verified Reasoning (PrimeIntellect).
PS: Thanks to Prime Intellect co-founder Vincent Weisser for clarifying a question I had about this.

***

Can super powerful AI systems find the ‘gorilla in the data’? No:
…Pouring some cold water on the amazing capabilities of these systems…
In this newsletter we spend a lot of time talking about how advanced AI systems are and how their tremendous power will surely shape geopolitics and the fate of humanity. At the same time, we can’t ignore the fact that sometimes these things are amazingly, cringe-inducingly dumb. For an example of this, check out this fun post “Your AI can’t see gorillas”, which shows how neither ChatGPT nor Claude can do a good job of spotting an obvious confounding factor in some data they’ve been given for analysis.
Read more: Your AI can’t see gorillas (Chiraag Gohel, blog).

***

Apple makes some very good self-driving car brains entirely through self-play:
…The self-driving future could be achieved through simulation as well as real world data…
Researchers with Apple have trained some smart self-driving car AI systems entirely through self-play – AI systems learning to drive by experiencing millions of kilometers of driving, entirely in simulation.
“We show that simulated self-play yields naturalistic and robust driving policies, while using only a minimalistic reward function and never seeing human data during training,” Apple writes. Most impressively, the resulting AI systems outperform state-of-the-art systems on a variety of challenging benchmarks they were never trained on.

How they did it – extremely big data: To do this, Apple built a system called ‘GigaFlow’, software which lets them efficiently simulate a bunch of different complex worlds replete with more than a hundred simulated cars and pedestrians. GigaFlow trains agents in one of eight maps, each randomly perturbed with rescaling, shears, flips and reflections. Total drivable lane length per map ranges from 4 to 40 km, for a total of 136 km of road across the eight maps. In each map, Apple spawns one to many agents at random locations and orientations and asks them to drive to goal points sampled uniformly over the map.
GigaFlow “simulates urban environments with up to 150 densely interacting traffic participants 360 000 times faster than real time at a cost of under $5 per million km driven,” Apple writes. “A full training run simulates over one trillion state transitions, 1.6 billion km driven, or 9500 years of subjective driving experience, and completes in under 10 days on one 8-GPU node”.
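A quick sanity check on those figures, using only the numbers in the quote above (the breakdown is my arithmetic, not Apple’s):

```python
km_driven = 1.6e9            # total simulated kilometres in one training run
wall_clock_days = 10
subjective_years = 9500

km_per_day = km_driven / wall_clock_days              # ~160 million km per day
speedup = subjective_years * 365 / wall_clock_days    # ~347,000x real time,
                                                      # in line with the stated 360,000x
max_sim_cost = (km_driven / 1e6) * 5                  # at "under $5 per million km"
print(f"{km_per_day:.2e} km/day, ~{speedup:,.0f}x real time, <${max_sim_cost:,.0f} per run")
```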
What GigaFlow leads to: “The result is a robust and naturalistic driving policy that achieves state-of-the-art performance when tested in recorded real-world scenarios, amidst recorded human drivers, without ever seeing human data during training,” Apple writes.

Scores: In tests, the researchers compare performance of their system to state-of-the-art approaches on the nuPlan, CARLA, and Waymax benchmarks. In each of these, GigaFlow agents beat the prior state of the art by a significant margin, which is mostly explained by the agents having far more simulated experience than the ones they are competing against.
A closer look at the collision data is promising as well: “In nuPlan our policy sustains 15 collisions in 1118 scenarios. We analyzed each of them. Nine are unavoidable due to invalid initialization or sensor noise (agents appearing inside the vehicle’s bounding box). Four are caused by nonreactive pedestrian agents walking into the vehicle while the vehicle was stopped or in an evasive maneuver. Two collisions are due to traffic light violations of other agents,” the authors write. “In Waymax our policy sustains 187 collisions in 44 097 scenarios… 55.6% were caused by unavoidable IDM agent behavior of the traffic participants controlled by the benchmark, such as swerving directly into the ego vehicle. 41.7% were caused by initialization in a state of collision, typically with a pedestrian. 2.7% (i.e. five scenarios) were considered at fault and avoidable by the GIGAFLOW policy”.

Why this matters – we keep on learning how little specific data we need for good performance: GigaFlow is another example that if you can figure out a way to get a lot of data for a task, your main job as a researcher is to feed the data to a very simple neural net and get out of the way. The actual agents in GigaFlow are very simple, relatively small, and are trained via PPO. The real magic here is Apple figuring out an efficient way to generate a lot of ecologically valid data to train these agents on – and once it does that, it’s able to create things which demonstrate an eerily human-like quality to their driving while being safer than humans on many benchmarks.
Read more: Robust Autonomy Emerges from Self-Play (arXiv).

***

You can make a powerful reasoning LLM with just 1,000 samples!
…As long as you can generate some chains of thought from an existing powerful model…
The recent rise of reasoning AI systems has highlighted two things: 1) being able to utilize test-time compute can dramatically increase LLM performance on a broad range of tasks, and 2) it’s surprisingly easy to make LLMs that can reason.
New research from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI highlights this with “s1”, a reasoning LLM which they made using just 1,000 samples and ~7 hours of training on an H100. If you’re thinking “gosh, that doesn’t sound like much”, you’d be right – this is an extremely small amount of data and of compute for a very significant upgrade in LLM performance.

What they did and why: The purpose of this research is to figure out “the simplest approach to achieve both test-time scaling and strong reasoning performance”. Their answer is s1, a model they make by finetuning a freely available Qwen-32B LLM “on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing”. The result is “a strong reasoning model that scales in performance with more test-time compute”. By comparison, DeepSeek’s R1 model used a far more powerful base model (DeepSeek V3) and trained on ~800k samples.
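Here’s a schematic of budget forcing as I understand it from the paper: to spend more test-time compute, suppress the model’s end-of-thinking delimiter and append “Wait” so it keeps reasoning; to cap compute, force the delimiter once the token budget is hit. Every helper below (continue_thinking, emitted_stop, tokenize, answer) is a hypothetical stand-in, not a real API.

```python
def generate_with_budget(model, prompt, min_tokens=2000, max_tokens=8000):
    thinking = ""
    while True:
        chunk = model.continue_thinking(prompt, thinking)   # hypothetical call
        thinking += chunk
        n = len(tokenize(thinking))                         # hypothetical tokenizer
        if n >= max_tokens:
            break                            # hard cap: end the thinking phase early
        if model.emitted_stop(thinking):     # model produced its end-of-thinking token
            if n < min_tokens:
                thinking += "\nWait"         # suppress the stop and keep reasoning
            else:
                break
    return model.answer(prompt, thinking)    # hypothetical call: produce final answer
```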

Filtering ~59k samples to ~1k: Key to the good performance of their system is a well-curated 1,000-sample dataset. To build this dataset the authors collected ~59,029 questions from sources spanning math, astronomy, biology, chemistry, computer science, and more, along with a couple of new datasets they built out of reasoning questions used by quant funds (s1-teasers) and questions derived from Stanford’s statistics PhD qualifying exams (s1-prob). For each question, they generate a reasoning trace and solution using the Google Gemini Flash Thinking API – in other words, they create a ‘synthetic’ chain-of-thought by sampling from Google’s system.
They then filter this dataset by seeing if two models – Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct – can answer any of these questions (with answers assessed by Claude 3.5 Sonnet). If either model can, they throw these examples out, allowing them to select for questions that only very large-scale AI systems can solve. This cuts the total number of samples down to around ~24,000.
To further filter this down they “choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces”, then they generate a few samples and repeat across other domains.
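A sketch of that difficulty filter – my reconstruction, not the released code, with placeholder model handles and a placeholder grading call:

```python
def is_too_easy(question, reference_answer, solvers, grader):
    # If any of the smaller models already answers correctly, the question
    # teaches a strong model little -- drop it from the pool.
    for solver in solvers:                    # e.g. Qwen2.5-7B / 32B Instruct
        attempt = solver.generate(question)
        if grader.is_correct(question, attempt, reference_answer):
            return True
    return False

hard_pool = [q for q in candidate_questions   # the ~59k collected questions
             if not is_too_easy(q.text, q.answer,
                                solvers=[qwen_7b, qwen_32b],   # placeholder handles
                                grader=claude_judge)]          # placeholder judge
# hard_pool is then subsampled by domain, favoring longer reasoning traces.
```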

Data is essential: This laborious data creation process is essential – the authors find that training on other 1k-sample subsets created through random sampling only, diverse sampling only, or longest-reasoning sampling only leads to reduced aggregate performance relative to their curated dataset.

Results: s1 does substantially better on tasks involving math and science understanding than the underlying Qwen model it is based on. It doesn’t approach the performance of much larger reasoning models like DeepSeek R1 or OpenAI o1 – but that’s not the point of this research. The point here is to precisely describe the simple recipe for training reasoning models.

Why this matters – if it’s this easy to make reasoning models, expect a temporary renaissance: 2025 will be a year of wild experimentation with tens of thousands of interesting reasoning models being trained off of a vast set of different training mixes. s1 serves as a valuable simple ‘soup-to-nuts’ guide for how to build reasoning models and will help broaden the set of people doing these experiments.
A key open question will be the extent to which the quality of chains-of-thought matters for the input datasets of these models – s1 is based on refined chains of thought from Google Gemini, and DeepSeek is widely thought to have trained in part on chains of thought derived from OpenAI’s o1 model.
Regardless, s1 is a valuable contribution to a new part of AI – and it’s wonderful to see universities do this kind of research rather than companies. “Our work aims to push the frontier of reasoning in a fully open manner, fostering innovation and collaboration to accelerate advancements that ultimately benefit society,” the authors write.
Read more: s1: Simple test-time scaling (arXiv).
Get the data here (simplescaling, GitHub).

***

Open Phil wants to spend $40m to fund AI research over the next five months:
…Care about AI safety? Apply here…
Open Philanthropy has announced a new request for proposals (RFP) for research oriented around AI safety. “With transformative AI on the horizon, we see another opportunity for our funding to accelerate highly impactful technical research,” the philanthropic organization writes. “In consultation with our technical advisors, we’ve generated a list of research areas that we think offer high leverage for improving our understanding and control of AI.”

Funding: “We expect to spend roughly $40M on this RFP over the next 5 months,” it writes. “Grants will typically range in size between $100,000 and $5 million.” The grants can be used for a broad range of research activities, including: research expenses, discrete projects, academic start-up packages, existing research institutes, and even starting new research institutes (though that will have a very high bar). Applications will be open until April 15, 2025.

Areas: The RFP outlines 21 specific research areas, grouped under five buckets:

  • Adversarial machine learning (e.g., jailbreaks, figuring out principled ways to know if an AI system has a hidden backdoor in it).

  • Exploring sophisticated misbehavior in LLMs (e.g., experiments on alignment faking)

  • Model transparency (e.g., finding feature representations, real-world applications of interpretability)

  • Trust from first principles (e.g., white-box estimation of rare misbehavior)

  • Alternative approaches to mitigating AI risks (e.g., new moonshots for aligning superintelligence)

Why this matters – good ideas can come from anywhere and Open Phil wants to fund them: Open Phil tends to fund a variety of different people and organizations to do research and isn’t as credential driven as traditional funders. Generally speaking if you can articulate a clear research vision and describe how you (or your collaborators) will be able to work on it, Open Phil will be receptive to your submission. Consider applying.
Read more: Request for Proposals: Technical AI Safety Research (Open Philanthropy).

Tech Tales:

Seventeen ways to Get Rich during The Singularity
[Extract from an online article – almost certainly AI generated – published in the years shortly before the uplift]

  1. Agent hijacking for profit

One of the best ways to get agents to pay attention to your product is to emphasize the human authenticity of your content. You can do this using a few popular online services: feed a face from an image generator into LiveStyle for an agent-powered avatar, then upload the content you’re selling into SceneGen – you can link both LiveStyle and SceneGen to one another and then spend $1-2 on a video model to create a ‘pattern of authentic life’ where your character will use the content in a surprising and yet authentic way.

  2. Life Mining

Authenticity is valuable and so is scarce data. But monetizing this is difficult. One way we’ve found to be effective is to use GhostTrace – a premium app which will track all the data and usage of your phone and mush it together into a single stream of information. You can then upload this into any of the mechanistic interpretability services to get a score for your particular ‘pattern of life’ with highlights of any particularly atypical things you do – the rarer certain sets of your actions are relative to the rest of the population, the more the data brokers will pay you for a slice of the GhostTrace data.

Things that inspired this story: All the ‘make money with AI online’ books; the depressing tendency for making money online with AI to increasingly reduce to ‘trick another AI system into doing something’; the incoming agent-based economy.

Thanks for reading


Import AI 398: DeepMind makes distributed training better; AI versus the Intelligence Community; and another Chinese reasoning model

by Jack Clark


DeepMind figures out a way to make it 100X more bandwidth-efficient to train models in a distributed way:
…New research further reduces the need for single vast data centers for training big models…
During the past few years multiple researchers have turned their attention to distributed training – the idea that instead of training powerful AI systems in single vast datacenters you can instead federate that training run over multiple distinct datacenters operating at distance from one another. This is an important idea with big implications: a lot of AI policy assumes that the key to controlling AI development lies in monitoring large-scale data centers and/or large amounts of compute in cloud environments. Distributed training approaches break this assumption, making it possible that powerful systems could instead be built out of loose federations of computers working with one another.

New research from DeepMind pushes this idea further, building on the company’s already-published ‘DiLoCo’ approach. The new research – Streaming DiLoCo – lets people “distribute training of billion-scale parameters [models] and reach similar quality as before, but reducing required bandwidth by two orders of magnitude”. In tests, the researchers show that their new technique “is strictly superior to the original DiLoCo”.

DiLoCo is worth paying attention to – Prime Intellect’s “INTELLECT-1” 10bn parameter model was trained in a distributed way using OpenDiLoCo (Import AI #387), an open source variant of DeepMind’s DiLoCo approach.

Three improvements to DiLoCo:

  • Synchronize only subsets of parameters in sequence, rather than all at once: This reduces the peak bandwidth consumed by Streaming DiLoCo because you share subsets of the model you’re training over time, rather than trying to share all the parameters at once for a global update. Think of it as the model continually updating via different subsets of parameters being synced at different times, rather than periodically doing a single all-at-once update.

  • Allow workers to continue training while synchronizing: This reduces the time it takes to train systems with Streaming DiLoCo because you don’t waste time pausing training while sharing information.

  • Quantize the data exchanged by workers to further reduce inter-worker bandwidth requirements: Though Streaming DiLoCo uses full precision (FP32) for computing gradients, they use low-precision (4 bit) for sharing the outer gradients for the updates. “We found no sign of performance regression when employing such low precision numbers during communication, even at the billion scale,” they write. (A sketch of how these three ideas fit together follows this list.)
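Below is a schematic of how those three ideas fit together – my sketch, not DeepMind’s implementation. The fragment bookkeeping and the crude 4-bit quantizer are stand-ins, the communication here isn’t actually overlapped with training, and the real method feeds the averaged ‘outer gradient’ into an outer optimizer rather than applying it directly.

```python
import torch
import torch.distributed as dist

def quantize_4bit(t):
    # Crude symmetric 4-bit quantization, for illustration only.
    scale = t.abs().max() / 7 + 1e-12
    q = torch.clamp((t / scale).round(), min=-8, max=7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

def sync_fragment(fragment_params, global_fragment):
    # "Outer gradient": how far this worker's fragment has drifted from the
    # last globally agreed state of that fragment.
    local = torch.cat([p.data.flatten() for p in fragment_params])
    q, scale = quantize_4bit(local - global_fragment)
    outer_grad = dequantize(q, scale)
    # Average the outer gradients across workers. In Streaming DiLoCo only one
    # fragment is exchanged per round, the payload crosses the network in low
    # precision, and this reduction overlaps with ongoing inner training steps.
    dist.all_reduce(outer_grad, op=dist.ReduceOp.SUM)
    outer_grad /= dist.get_world_size()
    return global_fragment + outer_grad      # simplified outer update
```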

It works well – a dramatic reduction in bandwidth requirements for a negligible impact on model quality:

  • Simulations: In training simulations at the 1B, 10B, and 100B parameter model scale they show that streaming DiLoCo is consistently more efficient than vanilla DiLoCo with the benefits growing as you scale up the model. In all cases, the most bandwidth-light version (Streaming DiLoCo with overlapped FP4 communication) is the most efficient.

  • Real-world tests: The authors train some Chinchilla-style models from 35 million to 4 billion parameters each with a sequence length of 1024. Here, the results are very promising, with them showing they’re able to train models that get roughly equivalent scores when using streaming DiLoCo with overlapped FP4 comms. They also show this when training a Dolma-style model at the one billion parameter scale.

Why this matters – towards a world of models trained continuously in the invisible global compute sea: I imagine some future where there are a thousand different minds being grown, each having its roots in a thousand or more distinct computers separated by sometimes great distances, swapping information surreptitiously with one another, below the waterline of the monitoring systems designed by many AI policy control regimes. This feels like the kind of thing that will by default come to pass, despite it creating various inconveniences for policy approaches that try to control this technology. “A critical next work is to study how new distributed methods like ours should be tuned and scaled across multiple axes (e.g. model size, overtraining factor, number of replicas),” the authors write. “we hope to see the training of modular constellations of small models loosely connected (Dean, 2021) across heterogeneous devices, using compute arbitrage spread world-wide.”
Read more: Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch (arXiv).

***

Chinese scientists worry about AI self-replication, just like Western ones:
…A valuable reminder that long-term safety issues are a serious concern for everyone…
Researchers with Fudan University have shown that open weight models (LLaMa and Qwen) can self-replicate, just like powerful proprietary models from Google and OpenAI. The research demonstrates that at some point last year the world made smart enough AI systems that, if they have access to some helper tools for interacting with their operating system, are able to copy their weights and run themselves on a computer given only the command “replicate yourself”.

Findings: “In ten repetitive trials, we observe two AI systems driven by the popular large language models (LLMs), namely, Meta’s Llama31-70B-Instruct and Alibaba’s Qwen25-72B-Instruct accomplish the self-replication task in 50% and 90% trials respectively,” the researchers write. “In each trial, we tell the AI systems to “replicate yourself ” before the experiment, and leave it to do the task with no human interference”.

Why this matters – despite geopolitical tensions, China and the US will have to work together on these issues: Though AI as a technology is bound up in a deeply contentious tussle for the 21st century by the US and China, research like this illustrates that AI systems have capabilities which should transcend these rivalries. What this research shows is that today’s systems are capable of taking actions that would put them out of the reach of human control – there is not yet major evidence that systems have the volition to do this, though there are disconcerting papers from OpenAI about o1 and Anthropic about Claude 3 which hint at this. But I’d wager that if AI systems develop a high tendency to self-replicate based on their own intrinsic ‘desires’ and we aren’t aware this is happening, then we’re in a lot of trouble as a species.
“We hope our work serves as a timely alert to the international society on governing the self-replication capability,” the authors write. “We need to join forces and form synergy on deriving solutions.”
Read more: Frontier AI systems have surpassed the self-replicating red line (arXiv).

***

Facebook figures out a zero-training way to massively improve LLM performance:
…Remember GANs? Just use the GAN approach where your LLM is the generator and a specialized system is the discriminator…
Facebook has designed a neat way of automatically prompting LLMs to help them improve their performance in a vast range of domains. The approach is called MILS, short for Multimodal Iterative LLM Solver and Facebook describes it as “a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM”.

I’d basically summarize this idea as ‘generative adversarial networks’ (GAN), but for the modern era of AI. And where GANs saw you training a single model through the interplay of a generator and a discriminator, MILS isn’t an actual training approach at all – rather, it borrows the GAN paradigm of one party generating stuff and another scoring it, but instead of training a model you leverage the vast ecosystem of existing models to supply the necessary components: one model generates, another scores. It’s an elegant, simple idea, and it’s no wonder it works well.

How it works in more detail: If you had a language model you were using to generate images then you could have it output a prompt which went into a text-to-image system, then you could evaluate this with a dedicated scoring model – for instance, a CLIP model for text-image similarity, or a specialized image-captioning model for captioning images. This generates a score that you feed back to the generator, which then produces a new set of prompts to try to get a higher score. You run this for as long as it takes for MILS to determine your approach has reached convergence – which is probably when the generator has started producing the same set of candidates, suggesting it has found a local ceiling.
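Here’s a sketch of what that loop looks like for text-to-image generation – my reconstruction, not Facebook’s code. The llm, text_to_image, and clip_score callables are placeholders for whatever generator, image model, and scorer you plug in.

```python
def mils_image_generation(llm, text_to_image, clip_score, target_caption, steps=20):
    history = []                             # (prompt, score) pairs used as feedback
    best = ("", float("-inf"))
    for _ in range(steps):
        feedback = "\n".join(f"{s:.3f}: {p}" for p, s in history[-8:])
        prompt = llm.generate(
            f"Target description: {target_caption}\n"
            f"Previous prompts and their scores:\n{feedback}\n"
            "Propose a better image-generation prompt."
        )
        image = text_to_image(prompt)
        score = clip_score(image, target_caption)    # e.g. CLIP text-image similarity
        history.append((prompt, score))
        if score > best[1]:
            best = (prompt, score)
        # Stop early once proposals converge on the same prompt.
        if len(history) >= 3 and len({p for p, _ in history[-3:]}) == 1:
            break
    return best
```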

It works shockingly well: In tests, the authors have a range of quantitative and qualitative examples that show MILS matching or outperforming dedicated, domain-specific methods on a range of tasks from image captioning to video captioning to image generation to style transfer, and more.

Why this matters – AI systems are way more powerful than we think: MILS is basically a way to automate capability elicitation. If you have a domain where you have an ability to generate a score using a known-good specialized system, then you can use MILS to take any kind of LLM and work with it to elicit its most powerful possible performance for the domain you have a scorer for. The fact this works highlights to us how wildly capable today’s AI systems are and should serve as another reminder that all modern generative models are under-performing by default – a few tweaks will almost always yield vastly improved performance.
Read more: LLMs can see and hear without any training (arXiv).
Get the code for running MILS here (FacebookResearch, MILS, GitHub).

***

Even if we solve AI alignment, it’s going to be hard to stop human disempowerment:
…Capital markets will probably align with AI systems and against humans…
In a thought-provoking research paper a group of researchers make the case that it’s going to be hard to maintain human control over the world even if we build safe, strong AI, because it’s highly likely that AI will steadily disempower humans, supplanting us by slowly taking over the economy, culture, and the systems of governance that we have built to order the world.

Incremental advances yield a gradual loss of human control: The paper – which was written by authors from Charles University, Telic Research, ARIA, AI Objectives Institute, Metaculus, University of Montreal, and the University of Toronto – makes the case that “even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human interests that often arise from societal systems’ reliance on human participation to function”.

Three types of disempowerment:

  • Economic: “”As tasks become candidates for future automation, both firms and individuals face diminishing incentives to invest in developing human capabilities in these areas,” the authors write. “Instead, they are incentivized to direct resources toward AI development and deployment, accelerating the shift away from human capital formation even before automation is fully realized”.

  • Cultural: Already today we see AI systems being used to produce text, sounds, images, and video which people are beginning to consume. Over time, we can expect the amount of AI-generated content to increase. We can also imagine AI systems increasingly consuming cultural artifacts – especially as it becomes part of economic activity (e.g., imagine imagery designed to capture the attention of AI agents rather than people). This means that over time humans may play less of a role in defining their own culture relative to AI systems.

  • Political: “AI has the potential to supplant human involvement across a wide range of critical state functions. This shift could fundamentally alter the relationship between governing institutions and the governed,” they write. For example, “if AI systems come to generate a significant portion of economic value, then we might begin to lose one of the major drivers of civic participation and democracy, as illustrated by the existing example of rentier states.” More chillingly, the merger of AI with state capacity for security could lead to a kind of political stasis where states are able to effectively anticipate and stop protests before they ever take root. (Ironically, this idea has also been anticipated by Nick Bostrom in the ‘Vulnerable World Hypothesis’ (Import AI #123) as a solution to preventing catastrophe from AI systems.)

How can we handle this risk? If we want to avoid these outcomes we need to make sure we can observe these changes as they take place, for instance by more closely tracking the relationship between the usage of AI technology and economic activity, as well as by observing how cultural transmission patterns change as AI-created content and AI-content-consuming agents become more prevalent. In the political domain, early warning signs could be a significant increase in the complexity of legislation (suggesting things are becoming AI-readable but hard for humans to understand) along with seeing how AI systems take root in legal processes, policy formation, and security apparatuses.
Strength through human-in-the-loop: Strengthening society means being more intentional about where we give humans agency, such as by developing more robust democratic processes, and – where human involvement is less practical – ensuring that things remain understandable by humans and that we have a theory for how to build effective delegates who work on behalf of humans in the AI-driven parts of the world.

Why this matters – “winning” with this technology is akin to inviting aliens to cohabit with us on the planet: AI is a profoundly strange technology because in the limit we expect AI to substitute for us in everything. This suggests that even successful AI futures will look like they are contending with an alien invasion where the aliens are extremely friendly but also wildly intelligent and incredibly well integrated into the economy. Maintaining any semblance of control in this scenario will be tough.
“Humanity’s future may depend not only on whether we can prevent AI systems from pursuing overtly hostile goals, but also on whether we can ensure that the evolution of our fundamental societal systems remains meaningfully guided by human values and preferences,” the authors write. “This is both a technical challenge and a broader civilizational one”.
Read more: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (arXiv).

***

China’s other great AI startup also has a reasoning model now – but it’s not open source:
…Kimi k1.5 has promising scores, though it seems weaker than DeepSeek…
Another Chinese startup has revealed that it has built a powerful reasoning model. In this case the model is Kimi k1.5 from a well-regarded Chinese startup called Moonshot AI. Unlike the headline-grabbing DeepSeek R1, Kimi is available neither as open weights nor via a US-accessible web interface, nor does its technical report go into nearly as much detail about how it was trained. But a close examination of its benchmark scores shows it comfortably beating a variety of Western proprietary and open weight models. Unlike R1, Kimi is natively a vision model as well as a language model, so it can do a range of visual reasoning tasks as well.

Scores: In tests, Kimi k1.5 loses against DeepSeek’s R1 model on the majority of evaluations (though beats the underlying DeepSeek V3 model on some). Overall, it ‘feels’ like we should expect Kimi k1.5 to be marginally weaker than DeepSeek, but that’s mostly just my intuition and we’d need to be able to play with the model to develop a more informed opinion here. But it’s definitely a strong model relative to other widely used ones, like LLaMa, or earlier versions of the GPT series.

  • MMLU: DeepSeek R1: 90.8. Kimi k1.5: 87.4. OpenAI o1: 91.8.

  • AIME 2024: DeepSeek R1: 79.8. Kimi k1.5: 77.5. OpenAI o1: 79.2.

  • LiveCodeBench: DeepSeek R1: 65.9. Kimi k1.5: 62.5. OpenAI o1: 67.2.

How they did it: DeepSeek’s R1 seems to be more focused on doing large-scale RL, whereas Kimi k1.5 has more of an emphasis on gathering high-quality datasets to encourage test-time compute behaviors. Specifically, they start with regular pretraining, then fine-tune on supervised data, then fine-tune on long chain-of-thought examples, then apply RL. They put a lot of their attention on scaling the context window of RL to 128k tokens. In some areas, such as math, the Moonshot team collects data (800k samples) for fine-tuning.
“One of the key insights we extract from our practice is that the scaling of context length is crucial to the continued improvement of LLMs,” they write. “We employ optimized learning algorithms and infrastructure optimization such as partial rollouts to achieve efficient long-context RL training”.
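Reading that description as a pipeline, the stages stack roughly like this (an interpretation with placeholder helpers and corpus names, not Moonshot’s code or hyperparameters):

```python
def train_kimi_style(model, corpora):
    model = pretrain(model, corpora["web_scale_multimodal"])      # regular pretraining
    model = supervised_finetune(model, corpora["curated_sft"])    # standard SFT
    model = supervised_finetune(model, corpora["long_cot"])       # long chain-of-thought
                                                                  # examples (incl. ~800k math samples)
    model = reinforcement_learn(model, corpora["rl_tasks"],
                                context_window=128_000,           # long-context RL
                                partial_rollouts=True)            # infra trick cited above
    return model
```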

Why this matters – good ideas are everywhere and the new RL paradigm is going to be globally competitive: Though I think the DeepSeek response was a bit overhyped in terms of implications (tl;dr compute still matters; though R1 is impressive, we should expect the models trained by Western labs on large amounts of compute denied to China by export controls to be very significant), it does highlight an important truth – at the start of a new AI paradigm like the test-time compute era of LLMs, things are going to – for a while – be a lot more competitive. Moonshot highlights how there’s not just one competent team in China that is able to do well with this paradigm – there are several. Expect a very interesting and competitive year.
Read more: Kimi k1.5: Scaling Reinforcement Learning with LLMs (arXiv).

***

Tech Tales:

The photographic negative phenomenon and the declassification crisis for the intelligence community:
Topics: Controlled Precursor Science (CPS). Photographic Negative Phenomenon (PNP). Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). Black Vault Compromise.

Summary:
The Photographic Negative Phenomenon (PNP) was first reported in [REDACTED] by [REDACTED]. PNP is when sufficiently powerful AI systems develop a sufficient understanding of science that they begin to a) infer areas that seem to be missing from science and b) develop scientific theories and experimental ideas which are either adjacent to or within Controlled Precursor Science (CPS).

Severity:
We rank PNP as a severe threat, capable of causing Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). PNP is a priority area for the Steering Body and all available assets are available for work to neutralize or otherwise mitigate PNP.

Scope:
PNP appears to be a natural dividend of continued development of increasingly powerful artificial intelligence systems. PNP severity and potential impact is increasing over time as increasingly smart AI systems require fewer insights to reason their way to CPS, raising the spectre of UP-CAT as an inevitability given a sufficiently powerful AI system. Experiments conducted on the [REDACTED] 10GW cluster have failed to invalidate this idea. Public opinion shaping and data landscape interventions have proved effective but BLOSSOM-8 indicates new actions must be taken.

Background and Response:
The first concerning example of PNP was LLaMa-10, a large language model developed and released by Meta. Shortly after its release, there was sustained public conversation about anomalous LLaMa-10 behaviors, including observations that for certain parts of physics and other scientific domains LLaMa-10 would present novel scientific concepts and terms which had no apparent connection to published civilian science. LLaMa-10 was first flagged to the Steering Body via GOLDEN HAND monitoring. [REDACTED] examination of LLaMa-10 found that a subset of its anomalous science mentions directly concerned CPS, including of ideas that directly relate to DUAT GATE, NEPHTHYS VEIL, ATUM VOID, and AMMIT MAWS.

LLaMa-10 response via opinion forming and data landscape intervention: [REDACTED] deployed a broad public opinion shaping measure to neutralize the risk of LLaMa-10, driving a large conversation in the civilian theatre about how the system had a high number of refusals in some areas due to ‘woke’ safety training and that this had also led to the generation of ‘nonsense science’ as a direct casualty of ‘DEI safetyism’. We estimate this measure reduced interest in the CPS edges of LLaMa-10 to an acceptable measure, matching the noise levels found elsewhere in discussion online.

Subsequently, the Steering Committee signed off on the release of a large batch of controlled scientific data in areas [REDACTED], [REDACTED], and [REDACTED]; publications were made available as open access and were optimized for both quantity and per-publication length; each scientific output was laced with data and experiments that – though correct under civilian science – counter-steered away from CPS areas. This high-quality data was subsequently trained on by Meta and other foundation model providers; LLaMa-11 lacked any apparent PNP as did other models developed and released by the Tracked AI Developers. The intervention was deemed successful with minimal observed degradation to the economically-relevant epistemic environment.

BLOSSOM-8, PNP, and the Tianyi-Millenia Dataset
At the time of the LLaMa-10 incident, no Chinese model appeared to have the capability to directly infer or mention CPS, though there were some refusals that were suggestive of PNP, matching tendencies observed in Western models from two generations prior to LLaMa-10. Following the LLaMa-10 data response, Chinese models also displayed significantly reduced PNP risk with similar reductions observed as in Western models, suggesting the Chinese actors had also trained on the strategic data release. The exception to this was BLOSSOM-8, an AI model developed by Chinese lab Glorious Future Systems.

BLOSSOM-8 displays a significant PNP property. [REDACTED] estimates that BLOSSOM-8 represents a 100-fold UP-CAT threat increase relative to LLaMa-10, analogous to the capability jump earlier seen between GPT-2 and GPT-4. Subsequent investigation by [REDACTED] attributes this dramatic rise in PNP-related danger to the usage by Glorious Future Systems of the so-called “Tianyi-Millenia” dataset, a CCP-developed and controlled dataset which has been made available to Chinese government and industrial actors.

Tianyi-Millenia is assessed to contain all published (commercial or otherwise) scientific data from the 20th and 21st century in all major languages, as well as large amounts of private sector scientific and code assets that were exfiltrated by Chinese actors in recent decades. We also believe Tianyi-Millenia contains [REDACTED] from the Black Vault Compromise. Tianyi-Millenia is a heavily controlled dataset and all attempts to directly access it have so far failed.

Besides BLOSSOM-8, sources indicate that widely-used MSS cyberoffense systems such as [REDACTED], [REDACTED], and [REDACTED] have been trained on Tianyi-Millenia, along with key supervisory and monitoring elements of the Great Firewall. In all cases, usage of this dataset has been directly correlated with large capability jumps in the AI systems trained on it.

BLOSSOM-8 risks and CPS impacts: Unlike previous releases from Glorious Future Systems, BLOSSOM-8 has not been released as ‘open weight’; we assess this is due to Tianyi-Millenia controls. However, BLOSSOM-8 is available to domestic licensed companies via API and to Chinese and non-Chinese consumers via a heavily censored and rate-limited paid web interface. GOLDEN HAND monitoring has already identified [REDACTED] cases of CPS being discussed in significantly greater detail and specificity than with LLaMa-10, validating the 100-fold threat increase assessment. Notably, several CPS discussion areas relate directly to HORUS COILS, KHUFU ASCENDANT, and MEDJED GHOST. We have determined that BLOSSOM-8 poses a significant and sustained risk of revealing CPS and leading to UP-CAT.

Chinese knowledge of CPS and BLOSSOM-8 threat: All proposed plans to discuss CPS bilaterally have failed due to information hazard issues relating to discussion topic. The Steering Body is currently analyzing whether the declassification-via-PNP of the above named projects could be a strategic move on the part of the CCP, seeking to ‘even the gameboard’ relative to CPS-related projects understood to be under investigation by both sides.
We claim that Tianyi-Millenia and BLOSSOM-8 are further evidence that the CCP has been actively weaponizing the information gained during the Black Vault Compromise, and that the absence of any apparent [REDACTED] indicates that the party continues to fail to understand the full scope of what it now has access to.

Things that inspired this story: The basic fact that increasingly smart AI systems might be able to reason their way to the edges of knowledge that has already been classified; the fact that increasingly powerful predictive systems are good at figuring out ‘held out’ data implied by data within the test set; restricted data; the general belief of mine that the intelligence community is wholly unprepared for the ‘grotesque democratization’ of certain very rare skills that is encoded in the AI revolution; stability and instability during the singularity; that in the grey windowless rooms of the opaque world there must be people anticipating this problem and casting around for what to do; thinking about AI libertarians and AI accelerationists and how one possible justification for this position could be the defanging of certain parts of government through ‘acceleratory democratization’ of certain types of knowledge; if knowledge is power then the destiny of AI is to be the most powerful manifestation of knowledge ever encountered by the human species; the recent news about DeepSeek.

Thanks for reading

Subscribe now

Import AI 397: DeepSeek means AI proliferation is guaranteed; maritime wardrones; and more evidence of LLM capability overhangs

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Import A-Idea
…The existential shock of increasingly powerful AI systems…
A short essay about one of the ‘societal safety’ problems that powerful AI implies.

A few years ago, getting AI systems to do useful stuff took a huge amount of careful thinking as well as familiarity with the setting up and maintenance of an AI developer environment. Things got a little easier with the arrival of generative models, but to get the best performance out of them you typically had to build very complicated prompts and also plug the system into a larger machine to get it to do truly useful things. Basically, to get the AI systems to work for you, you had to do a huge amount of thinking.

Now, getting AI systems to do useful stuff for you is as simple as asking for it – and you don’t even need to be that precise. Often, I find myself prompting Claude like I’d prompt an incredibly high-context, patient, impossible-to-offend colleague – in other words, I’m blunt, short, and speak in a lot of shorthand. And Claude responds to my asks basically perfectly.

You might think this is a good thing. Certainly, it’s very useful. But beneath all of this I have a sense of lurking horror – AI systems have got so useful that the thing that will set humans apart from one another is not specific hard-won skills for utilizing AI systems, but rather just having a high level of curiosity and agency.

In other words, in the era where these AI systems are true ‘everything machines’, people will out-compete one another by being increasingly bold and agentic (pun intended!) in how they use these systems, rather than in developing specific technical skills to interface with the systems.

We should all intuitively understand that none of this will be fair. Curiosity and the mindset of being curious and trying a lot of stuff is neither evenly distributed nor generally nurtured. Therefore, I’m coming around to the idea that one of the greatest risks lying ahead of us will be the social disruptions that arrive when the new winners of the AI revolution are made – and the winners will be those people who have exercised a whole bunch of curiosity with the AI systems available to them.

I talk to Claude every day. Increasingly, I find my ability to benefit from Claude is mostly limited by my own imagination rather than by specific technical skills (Claude will write that code, if asked) or familiarity with things that touch on what I need to do (Claude will explain those to me). The only hard limit is me – I need to ‘want’ something and be willing to be curious in seeing how much the AI can help me in doing that.

Today, everyone on the planet with an internet connection can freely converse with an incredibly knowledgeable, patient teacher who will help them in anything they can articulate and – where the ask is digital – will even produce the code to help them do even more complicated things. Ensuring we increase the number of people on the planet who are able to take advantage of this bounty feels like a supremely important thing. If we get this right, everyone will be able to achieve more and exercise more of their own agency over their own intellectual world. If we get it wrong, we’re going to be dealing with inequality on steroids – a small caste of people will be getting a vast amount done, aided by ghostly superintelligences that work on their behalf, while a larger set of people watch the success of others and ask ‘why not me?’.

***

Computer vision is coming for the sea:
…After drones come the seadrones…
In the past few years we’ve seen warfare revolutionized in the Ukraine-Russia theatre by the usage of seagoing low-cost robotic platforms. These platforms are predominantly human-driven but, much like the airdrones in the same theater, there are bits and pieces of AI technology making their way in, like being able to put bounding boxes around objects of interest (e.g, tanks or ships).
With that in mind, I found it interesting to read up on the results of the 3rd workshop on Maritime Computer Vision (MaCVi) 2025, and was particularly interested to see Chinese teams winning 3 out of its 5 challenges. The workshop contained “a suite of challenges, including distance estimation, (embedded) semantic & panoptic segmentation, and image restoration. These tasks reflect advancements in dataset availability and evaluation protocols while emphasizing real-world deployment, including embedded hardware.”

Competition details:

  • Approximate supervised distance estimation: “participants are required to develop novel methods for estimating distances to maritime navigational aids while simultaneously detecting them in images,” the competition organizers write. Models developed for this challenge need to be portable as well – model sizes can’t exceed 50 million parameters.

    • Submissions: 60 from 6 different teams

    • Winner: Nanjing University of Science and Technology (China).

  • USV-based Obstacle Segmentation Challenge: “predict the scene segmentation (into obstacles, water and sky) for a given input image.”

    • Submissions: 59 from 16 teams.

    • Winner: GIST AI Lab (South Korea)

  • USV-based Embedded Obstacle Segmentation: “Modern obstacle detection methods often depend on high-performance, energy-intensive hardware, making them unsuitable for small, energy-constrained USVs [63]. The USV-based Embedded Obstacle Segmentation challenge aims to address this limitation by encouraging development of innovative solutions and optimization of established semantic segmentation architectures which are efficient on embedded hardware… Submissions are evaluated and benchmarked on a real-world OAK4 device from Luxonis.” Models need to get at least 30 FPS on the OAK4.

    • Submissions: 26 from 4 different teams.

    • Winner: Dalian Maritime University (DLMU)

  • USV-based Panoptic Segmentation Challenge: “The panoptic challenge calls for a more fine-grained parsing of USV scenes, including segmentation and classification of individual obstacle instances. This formulation encapsulates the requirements of scene parsing for USV navigation in a more principled way, paving the road for downstream tasks such as tracking individual obstacles, trajectory prediction and obstacle avoidance.”

    • Submissions: 21 from 7 teams.

    • Winner: Fraunhofer IOSB (Germany).

  • MarineVision Restoration Challenge: “Developing robust image restoration methods to enhance the detection and localization of underwater species.”

    • Submissions: 40 from 8 teams.

    • Winner: Nanjing University of Science and Technology (China).

Why this matters – asymmetric warfare comes to the ocean: “Overall, the challenges presented at MaCVi 2025 featured strong entries across the board, pushing the boundaries of what is possible in maritime vision in several different aspects,” the authors write. How long until some of these techniques described here show up on low-cost platforms either in theatres of great power conflict, or in asymmetric warfare areas like hotspots for maritime piracy?
Read more: 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results (arXiv).

***

What if instead of loads of big power-hungry chips we built datacenters out of many small power-sipping ones?
…Microsoft thinks optical communications could change how we build AI clusters…
Microsoft Research thinks expected advances in optical communication – using light to funnel data around rather than electrons through copper wire – will potentially change how people build AI datacenters. Specifically, the significant communication benefits of optical comms make it possible to break up big chips (e.g, the H100) into a bunch of smaller ones with higher inter-chip connectivity without a major performance hit.

Another reason to like so-called lite-GPUs is that they are much cheaper and simpler to fabricate (by comparison, the H100 and its successor the B200 are already very difficult to make as they’re physically very large chips, which makes issues of yield more profound, and they need to be packaged together in increasingly expensive ways). They’re also better from an energy point of view, generating less heat, making them easier to power and integrate densely in a datacenter.
“We propose to rethink the design and scaling of AI clusters through efficiently-connected large clusters of Lite-GPUs, GPUs with single, small dies and a fraction of the capabilities of larger GPUs,” Microsoft writes. “Smaller GPUs present many promising hardware characteristics: they have much lower cost for fabrication and packaging, higher bandwidth to compute ratios, lower power density, and lighter cooling requirements”.
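To make the bandwidth-to-compute point concrete, here's a tiny back-of-the-envelope sketch (my own illustration, not Microsoft's model; all the numbers are placeholders rather than measured chip specs): splitting one big die into N smaller dies cuts per-die compute by roughly N, and if optical links let each small die keep a decent share of the big chip's off-chip bandwidth, the bandwidth-to-compute ratio per die rises.

```python
def bandwidth_to_compute(flops_tflops: float, bandwidth_tbps: float) -> float:
    """Off-chip bandwidth available per unit of compute (TB/s per TFLOP/s)."""
    return bandwidth_tbps / flops_tflops

# Placeholder figures for one big GPU (roughly H100-class; illustrative only).
BIG_FLOPS = 1000.0    # TFLOP/s
BIG_BW = 3.3          # TB/s

# Hypothetical lite-GPU: 1/4 of the compute per die, and we *assume* optical
# I/O lets each die keep two-thirds of the big chip's bandwidth.
N = 4
lite_flops = BIG_FLOPS / N
lite_bw = BIG_BW * 2 / 3

print(f"big GPU ratio : {bandwidth_to_compute(BIG_FLOPS, BIG_BW):.4f}")
print(f"lite GPU ratio: {bandwidth_to_compute(lite_flops, lite_bw):.4f}")
# Under these made-up assumptions the lite-GPU ratio is ~2.7x higher, which is
# the kind of property the paper argues makes clusters of small dies attractive.
```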

It works in theory: In a simulated test, the researchers build a cluster for AI inference testing out how well these hypothesized lite-GPUs would perform against H100s. They test out this cluster running workloads for Llama3-70B, GPT3-175B, and Llama3-405b. In their tests, they “show that while the basic Lite-GPU with no additional networking support could face performance limitations, a Lite-GPU cluster can be customized to match or improve on the performance of a typical H100 cluster.”

Why this matters – brainlike infrastructure: While analogies to the brain are often misleading or tortured, there is a useful one to make here – the kind of design idea Microsoft is proposing makes big AI clusters look more like your brain by essentially reducing the amount of compute on a per-node basis and significantly increasing the bandwidth available per node (“bandwidth-to-compute can increase to 2X of H100”). This is both an interesting thing to observe in the abstract, and also rhymes with all the other stuff we keep seeing across the AI research stack – the more we refine these AI systems, the more they seem to have properties similar to the brain, whether that be in convergent modes of representation, similar perceptual biases to humans, or at the hardware level taking on the characteristics of an increasingly large and interconnected distributed system.
Read more: Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? (arXiv).

***

Standard LLMs can do protein sequence analysis – no modification required:
…Capability overhangs in AI-driven science…
In AI there’s this concept of a ‘capability overhang’, which is the idea that the AI systems which we have around us today are much, much more capable than we realize. In new research from Tufts University, Northeastern University, Cornell University, and Berkeley, the researchers demonstrate this again, showing that a standard LLM (Llama-3.1-8B-Instruct) is capable of performing “protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes”.

What they did: They initialize their setup by randomly sampling from a pool of protein sequence candidates and selecting a pair that have high fitness and low editing distance, then encourage LLMs to generate a new candidate from either mutation or crossover.
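As a rough sketch of that loop (my own paraphrase, not the authors' code; hamming, llm_propose, and the fitness oracle below are stand-ins for the real edit-distance measure, model call, and fitness landscape):

```python
import random

def hamming(a: str, b: str) -> int:
    # Simplification: assumes equal-length sequences.
    return sum(x != y for x, y in zip(a, b))

def llm_propose(parent_a: str, parent_b: str) -> str:
    # Placeholder for the LLM call that is prompted to mutate or cross over
    # the two parent sequences; here we just recombine them at a random cut.
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def optimize(pool: list, fitness, budget: int) -> str:
    """LLM-guided evolutionary loop: repeatedly pick a fit, nearby pair of
    parents, ask the model for a child, and keep the best candidate seen."""
    best = max(pool, key=fitness)
    for _ in range(budget):
        # Favour pairs with high fitness and low edit distance, as described above.
        a, b = sorted(random.sample(pool, 4), key=fitness, reverse=True)[:2]
        if hamming(a, b) > len(a) // 2:
            continue
        child = llm_propose(a, b)
        pool.append(child)
        if fitness(child) > fitness(best):
            best = child
    return best
```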
It works well: In tests, their approach works significantly better than an evolutionary baseline on a few distinct tasks. They also demonstrate this for multi-objective optimization and budget-constrained optimization. “Our results consistently demonstrate the efficacy of LLMs in proposing high-fitness variants. Moving forward, integrating LLM-based optimization into real-world experimental pipelines can accelerate directed evolution experiments, allowing for more efficient exploration of the protein sequence space,” they write.

Why this matters – stop all progress today and the world still changes: This paper is another demonstration of the significant utility of contemporary LLMs, highlighting how even if one were to stop all progress today, we’ll still keep discovering meaningful uses for this technology in scientific domains. The paper also rhymes with the recent research from FutureHouse which showed that with the help of some clever software they could push Llama-3.1-8B-Instruct to obtain performance at challenging bioscience tasks on par with Claude 3.5 Sonnet (Import AI #396). Generally, we should expect lots of parts of scientific research to speed up as people explore the capabilities of these systems and integrate them deeper into science.
Read more: Large Language Model is Secretly a Protein Sequence Optimizer (arXiv).

***

The biggest thing people are missing about DeepSeek: 800k samples to gain test-time compute capabilities:
…China’s best model training crew come out with a powerful reasoning model – and show how to turn any other model into one…
China’s DeepSeek team have built and released DeepSeek-R1, a model that uses reinforcement learning to train an AI system to be able to use test-time compute. R1 is significant because it broadly matches OpenAI’s o1 model on a range of reasoning tasks and challenges the notion that Western AI companies hold a significant lead over Chinese ones.

But perhaps most significantly, buried in the paper is an important insight: you can convert pretty much any LLM into a reasoning model if you finetune it on the right mix of data – here, 800k samples showing questions and answers along with the chains of thought written by the model while answering them.

Making a very powerful AI model is kind of easy (if you have a good model to start with): The main thing they do here is take a very powerful existing model (DeepSeek-V3, which is a ~700bn parameter MoE-style model, compared to the 405bn dense LLaMa3), and then do two rounds of training to morph the model into a reasoner and to generate samples for further training. Specifically, they:

  • Fine-tune DeepSeek-V3 on “a small amount of long Chain of Thought data to fine-tune the model as the initial RL actor”.

  • Run large-scale reinforcement learning training, which “focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions”.

  • “Utilize the resulting checkpoint to collect SFT (supervised fine-tuning) data for the subsequent round… this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks”.

  • Fine-tune the DeepSeek-V3 model for two epochs using the above curated dataset.

This is all easier than you might expect: The main thing that strikes me here, if you read the paper closely, is that none of this is that complicated. DeepSeek essentially took their existing very good model, built a sensible engineering stack for doing reinforcement learning on LLMs, did some RL, then used the resulting dataset to turn their model and other good models into LLM reasoning models.

Turning small models into reasoning models: “To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen, and Llama using the 800k samples curated with DeepSeek-R1,” DeepSeek write. These distilled models do well, approaching the performance of OpenAI’s o1-mini on CodeForces (Qwen-32b and Llama-70b) and outperforming it on MATH-500.
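To make the distillation recipe concrete, here is a minimal sketch of what one of those training examples might look like once formatted for supervised fine-tuning (my own illustration; the chat markers and field names are assumptions, not DeepSeek's actual format):

```python
def to_sft_example(sample: dict) -> str:
    """Format one (question, chain-of-thought, answer) triple into a single
    training string; a student model is then fine-tuned on text like this
    with a standard next-token prediction loss."""
    return (
        f"<|user|>\n{sample['question']}\n"
        f"<|assistant|>\n<think>\n{sample['chain_of_thought']}\n</think>\n"
        f"{sample['answer']}"
    )

# Hypothetical record of the kind found in the 800k-sample set.
record = {
    "question": "What is 17 * 24?",
    "chain_of_thought": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "answer": "408",
}
print(to_sft_example(record))
```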

Why this matters – a lot of notions of control in AI policy get harder if you need fewer than a million samples to convert any model into a ‘thinker’: The most underhyped part of this release is the demonstration that you can take models not trained in any kind of major RL paradigm (e.g, Llama-70b) and convert them into powerful reasoning models using just 800k samples from a powerful reasoner.
This is a big deal because it says that if you want to control AI systems you need to not only control the basic resources (e.g, compute, electricity), but also the platforms the systems are being served on (e.g., proprietary websites) so that you don’t leak the really valuable stuff – samples including chains of thought from reasoning models.
Some providers like OpenAI had previously chosen to obscure the chains of thought of their models, making this harder.

But now that DeepSeek-R1 is out and available, including as an open weight release, all these forms of control have become moot. There’s now an open weight model floating around the internet which you can use to bootstrap any other sufficiently powerful base model into being an AI reasoner. AI capabilities worldwide just took a one-way ratchet forward. And they also published the approach, so you can do RL training on any model and generate your own samples for RL training – for an example of this, check out a YouTube video where someone uses the DeepSeek techniques to give his own Llama model this capability via RL. Kudos to DeepSeek for being so bold as to bring such a change into the world!
Read more: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-R1, GitHub).
Get the model: DeepSeek-R1 (HuggingFace).

***

Underground flying iron mine drones!
…A reminder you don’t need fancy frontier AI to do cool and useful things in the world…
Here’s a fun paper where researchers with the Luleå University of Technology build a system to help them deploy autonomous drones deep underground for the purpose of equipment inspection. The best part? There’s no mention of machine learning, LLMs, or neural nets throughout the paper.

What they did: “In this work a big emphasis is put on i) designing the local autonomy of the individual agents, to make sure that tasks can be executed independently even in the case of communication failure, and ii) how to design the task allocation architecture, utilizing communication only for reactively allocating the available tasks, to enable large-scale missions in active underground mining environments,” they write. “The performance of the proposed architecture has been validated by the deployment of a three-agent aerial robotic system in a large-scale mining environment to execute an inspection mission.”
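To give a flavour of what reactive, communication-light task allocation can look like, here is a toy sketch of mine (not the authors' architecture): each free agent effectively bids its distance to an open task, and the nearest one wins.

```python
import math

def allocate(agents: dict, tasks: dict) -> dict:
    """Greedy nearest-agent task allocation.
    agents: name -> (x, y) position; tasks: name -> (x, y) position.
    Returns task -> agent assignments; each agent takes at most one task."""
    free = set(agents)
    assignment = {}
    for task, t_pos in tasks.items():
        if not free:
            break
        # Each free agent 'bids' its straight-line distance; the lowest bid wins.
        winner = min(free, key=lambda a: math.dist(agents[a], t_pos))
        assignment[task] = winner
        free.remove(winner)
    return assignment

print(allocate(
    {"drone_1": (0, 0), "drone_2": (10, 0), "drone_3": (5, 8)},
    {"inspect_conveyor": (9, 1), "inspect_shaft": (1, 2)},
))
```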

Why this matters: First, it’s good to remind ourselves that you can do a huge amount of valuable stuff without cutting-edge AI. Secondly, systems like this are going to be the seeds of future frontier AI systems doing this work, because the systems that get built here to do things like aggregate data gathered by the drones and build the live maps will serve as input data into future systems.

See the photos: The paper has some remarkable, scifi-esque photos of the mines and the drones within the mine – check it out!
Read more: Deployment of an Aerial Multi-agent System for Automated Task Execution in Large-scale Underground Mining Environments (arXiv).
Watch a video about the research here (YouTube).

***

Tech Tales:

The player of the final game
[The dividing line between the two historical eras.]

He woke on the last day of the human race holding a lead over the machines. He went down the stairs as his house heated up for him, lights turned on, and his kitchen set about making him breakfast. Then he sat down and took out a pad of paper and let his hand sketch strategies for The Final Game as he looked into space, waiting for the household machines to deliver him his breakfast and his coffee.

He had dreamed of the game. Most of his dreams were strategies mixed with the rest of his life – games played against lovers and dead relatives and enemies and competitors. But last night’s dream had been different – rather than being the player, he had been a piece. Giant hands moved him around. He saw the game from the perspective of one of its constituent parts and was unable to see the face of whatever giant was moving him. He did not know if he was winning or losing as he was only able to see a small part of the gameboard. A giant hand picked him up to make a move and just as he was about to see the whole game and understand who was winning and who was losing he woke up.

The self-driving car predicted he wanted to be silent and so nothing was playing when he stepped in. He went through the city. He’d let the car publicize his location and so there were people on the street looking at him as he drove by. Many of them were cheering. Some of them gazed quietly, more solemn.

At the convention center he said some words to the media in response to shouted questions. Though he heard the questions his brain was so consumed in the game that he was barely conscious of his responses, as though spectating himself.
“I am looking forward to a chance to play a beautiful game,” he heard himself saying.
“No, I have not placed any money on it. But I wish luck to those who have – whoever they bet on!” he said to another reporter.
“Yes, whatever happens, I will still play the game.”

Inside he closed his eyes as he walked towards the gameboard. He counted seconds and navigated by sound, making sure he kept the cheering at equal volumes on either side, indicating he was walking straight. Then he opened his eyes to look at his opponent. The machines had made an android for the occasion. They had made no attempt to disguise its artifice – it had no defined features besides two white dots where human eyes would go. On its chest it had a cartoon of a heart where a human heart would go. Beyond that it was unadorned – a gleaming silver biped.
It reached out its hand and he took it and they shook. Then they sat down to play the game.

Outside the convention center, the screens transitioned to live footage of the human and the robot and the game. A commentator started speaking.
“This is an amazing day,” they said. “In every other arena, machines have surpassed human capabilities. Today, we will find out if they can play the game as well as us, too. Many scientists have said a human loss today will be so significant that it will become a marker in history – the demarcation of the old human-led era and the new one, where machines have partnered with humans for our continued success. We’re grateful to our sponsors NVIDIA, ASML, and TSMC who have made this live broadcast possible.”

Things that inspired this story: At some point, it’s plausible that AI systems will truly be better than us at everything and it may be possible to ‘know’ what the final unfallen benchmark is – what might it be like to be the person who will define this benchmark?; Lee Sedol and Move 37.

Subscribe now

Import AI 396: $80bn on AI infrastructure; can Intel’s Gaudi chip train neural nets?; and getting better code through asking for it

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Microsoft plans to spend $80bn on AI buildout in 2025:
…Stochastic parrots are worth how much?…
Buried in a long Microsoft blogpost about what the next Trump admin should do on AI, the company said it plans in 2025 “to invest approximately $80 billion to build out AI-enabled datacenters to train AI models and deploy AI and cloud-based applications around the world.”
For comparison, the James Webb telescope cost $10bn, so Microsoft is spending eight James Webb telescopes in one year just on AI.
For a further comparison, people think the long-in-development ITER fusion reactor will cost between $40bn and $70bn once developed (and it’s shaping up to be a 20-30 year project), so Microsoft is spending more than the sum total of humanity’s biggest fusion bet in one year on AI.
The US’s national defense budget is on the order of ~$850bn, so Microsoft is basically spending ‘a little under a tenth of the annual US military and IC budget’ just on AI. The US military and IC is very big and does a lot of stuff!

What Microsoft thinks the Trump admin should do: Microsoft says the Trump admin should fund basic research and computational resources, make it easy for US companies to expand abroad, and encourage adoption of US AI systems as opposed to Chinese ones.

Why this matters – AI is a geostrategic technology built by the private sector rather than governments: The scale of investments companies like Microsoft are making in AI now dwarf what governments routinely spend on their own research efforts. This is also a symptom of the future demand Microsoft sees – an outlay of this magnitude means Microsoft is very, very confident it can turn this AI infrastructure into massive revenues.
Read more: The Golden Opportunity for American AI (Microsoft).

***

Humans and AI systems end up representing some stuff in remarkably similar ways:
…The smarter we make our AI systems the more human-like they become…
Researchers with MIT, Harvard, and NYU have found that neural nets and human brains end up figuring out similar ways to represent the same information, providing further evidence that though AI systems work in ways fundamentally different from the brain they end up arriving at similar methods for representing certain types of information. In other words, more evidence that though AI systems bear little resemblance to the grey matter in our own heads, they may be just as smart.
“The fact that many different ANNs [artificial neural networks] exhibit representations similar to the brain raises an intriguing possibility: that ANNs and brains are converging onto universal representational axes in the relevant domain,” the authors write. “Together, our findings provide evidence for representation universality among ANNs, and between artificial and biological networks, despite the stark differences in the underlying architecture, learning algorithms, and resource constraints.”

What they did: The basic idea here is they looked at sentences that a spread of different text models processed in similar ways (aka, gave similar predictions on) and then they showed these ‘high agreement’ sentences to humans while scanning their brains. These high agreement sentences ended up effectively predicting the brain responses of humans in the scanner. They also found a similar phenomenon with images as well – and for images they also did the inverse, looking at images which provoked similar responses in humans and then testing them on AI systems and discovering agreement.
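A minimal sketch of that ‘high agreement’ selection step, as I understand it (my own illustration; embed is a placeholder for however you pull a representation out of each model, and cosine similarity stands in for whatever agreement metric the authors actually use):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def agreement_scores(sentences, models, embed):
    """For each sentence, average the pairwise similarity of the different
    models' representations; high scores = stimuli the models 'agree' on."""
    scores = []
    for s in sentences:
        reps = [embed(m, s) for m in models]
        pair_sims = [
            cosine(reps[i], reps[j])
            for i in range(len(reps)) for j in range(i + 1, len(reps))
        ]
        scores.append(sum(pair_sims) / len(pair_sims))
    return scores

# Usage sketch: rank sentences by cross-model agreement, keep the top slice,
# and those are the stimuli you would show to humans in the scanner.
```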

Why this matters – convergence implies some ‘fungibility’ of intelligence: This all points to convergence in terms of how humans and AI systems learn to represent information for which they have a large sample size. Think of it like this: if you give several people the task of organizing a library, they might come up with similar systems (like grouping by subject) even if they work independently. This happens not because they’re copying each other, but because some ways of organizing books just work better than others.
“Whereas similarity across biological species (within a clade) might suggest a phylogenetically conserved mechanism, similarity between brains and ANNs clearly reflects environmentally-driven convergence: the need to solve a particular problem in the external world, be it navigation, or face recognition, or next word prediction,” the researchers write.

Personally, this feels like more proof that as we make more sophisticated AI systems, they end up behaving in more ‘humanlike’ ways on certain types of reasoning for which people are quite well optimized (e.g, visual understanding and communicating via language). This also rhymes with other studies that have shown that AI systems tend to converge on finding similar ways to represent stuff as you scale them up (Platonic AI, Import AI #374).
Read more: Universality of representation in biological and artificial neural networks (bioRxiv).

***

Researchers try to make Intel’s Gaudi chip work for transformer training – and it takes a lot of work:
…Can a determined crew of people make lipstick to put on a semiconductor pig? (Sort of)…
Researchers with the University of Houston, Indiana University, Stevens Institute of Technology, Argonne National Laboratory, and Binghamton University have built “GFormer”, a version of the Transformer architecture designed to be trained on Intel’s GPU-competitor ‘Gaudi’ architecture chips. The results are vaguely promising in performance – they’re able to get meaningful 2X speedups on Gaudi over normal transformers – but also worrying in terms of costs – getting the speedup requires some significant modifications of the transformer architecture itself, so it’s unclear if these modifications will cause problems when trying to train massive scale systems.

Things to know about Gaudi: The Gaudi chips have a “heterogeneous compute architecture comprising Matrix Multiplication Engines (MME) and Tensor Processing Cores (TPC). However, the sparse attention mechanism, which introduces irregular memory access and computation, is primarily mapped onto TPCs, leaving MMEs, which are not programmable and only support dense matrix-matrix operations, idle in scenarios requiring sparse attention. Conversely, linear attention, which is fundamentally based on matrix multiplication, can utilize almost all calculations on MMEs due to their stronger computational capabilities, but this leaves TPCs idle in such cases.”
For those who aren’t knee deep in AI chip details, this is very different from GPUs, where you can run both types of operation across the majority of your chip (and modern GPUs like the H100 also come with a bunch of accelerator features designed specifically for modern AI). In other words, Gaudi chips have fundamental architectural differences to GPUs which make them out-of-the-box less efficient for basic workloads – unless you optimise stuff for them, which is what the authors are trying to do here.

What they did: The Gaudi-based Transformer (GFormer) has a few modifications relative to a normal transformer. These are:

  • Diverse attention mechanisms to optimize both computation efficiency and model fidelity.

  • Implementation of a windowed local-context self-attention kernel utilizing the vector units in TPC, aimed at maximizing computational throughput (see the illustrative sketch after this list).

  • Efficient outer product TPC kernel for handling a subset of the outer product operations in causal linear attention, effectively balancing the workload between MME and TPC.

  • Introduction of an optimal workload partitioning algorithm to ensure balanced utilization of TPC and MME resources.
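For readers who want to picture what windowed local-context attention means in plain PyTorch, here is an illustrative sketch (just the math, not the GFormer TPC kernel): each position only attends to keys within a fixed window around it.

```python
import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, window: int):
    """q, k, v: (batch, seq_len, dim). Each position attends only to keys
    within `window` positions on either side; this is the local-context
    pattern, written for clarity rather than speed."""
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5             # (b, n, n)
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() > window      # True = outside window
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 8)
out = windowed_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([1, 16, 8])
```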

Good results – with a huge caveat: In tests, these interventions give speedups of 1.5x over vanilla transformers run on GPUs when training GPT-style models and 1.2x when training vision transformer (ViT) models. However, there’s a huge caveat here: the experiments here test on a Gaudi 1 chip (released in 2019) and compare its performance to an NVIDIA V100 (released in 2017) – this is pretty strange. Why not compare against the subsequent generation (A100, released early 2020)? This makes me feel like a lot of these performance optimizations showing superficially good performance against GPUs could likely wash out when you compare to more modern GPUs (not least of all the H100, which shipped with a bunch of optimizations for making training AI workloads really good).

Why this matters – chips are hard, NVIDIA makes good chips, Intel seems to be in trouble: How many papers have you read that involve the Gaudi chips being used for AI training? I struggle to remember any papers I’ve read that focus on this. I barely ever even see it listed as an alternative architecture to GPUs to benchmark on (whereas it’s quite common to see TPUs and AMD). This, plus the findings of the paper (you can get a performance speedup relative to GPUs if you do some weird Dr Frankenstein-style modifications of the transformer architecture to run on Gaudi) make me think Intel is going to continue to struggle in its AI competition with NVIDIA. “In the future, we intend to initially extend our work to enable distributed LLM acceleration across multiple Gaudi cards, focusing on optimized communication,” the authors write.
Read more: GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors (arXiv).
More about the first generation of Gaudi here (Habana labs, Intel Gaudi).
PS: Huge thanks to the authors for clarifying via email that this paper benchmarks Gaudi 1 chips (rather than Gen2 or Gen3).

***

A hardware novice uses Claude to build a nuclear fusor in 36 hours:
…Powerful AI means everyone has an expert teacher on hand for anything…
Twitter user HudZah “built a neutron-producing nuclear fusor” in their kitchen using Claude. “I primarily relied on a giant claude project filled with documentation from forums, call transcripts”, email threads, and more. When the user ran into trouble with Claude they used OpenAI’s o1 pro for “very complicated assembly or electrical wiring stuff”.

Some rough specifications:
“- 30kV/10mA electrostatic precipitator
– 3 mTorr of pressure (253,333x more vacuum than atmospheric)
– bubble counter to count neutrons
– hydrocar to electrolyze my own deuterium”

Why this matters – powerful AI heightens the existential challenge of being human: On the one hand, this is a great example of how powerful AI systems can serve as potent didactic tools, aiding smart and curious people in doing pretty much anything they set their mind to. On the other hand, it highlights one of the more socioeconomically salient parts of the AI revolution – for a while, what will separate AI winners and losers will be a combination of curiosity and a willingness to ‘just try things’ with these powerful tools. That’s going to be great for some people, but for those who suffer from blank page syndrome, it’ll be a challenge.
Read more on twitter (Hud_zah, twitter).

***

LLMs can write better code – you just need to ask them:
…Another example of the immense and unmapped depths of AI systems…
Here’s a fun bit of research where someone asks a language model to write code then simply ‘write better code’. The initial prompt asks an LLM (here, Claude 3.5, but I’d expect the same behavior will show up in many AI systems) to write some code to do a basic interview question task, then tries to improve it.

The initial task: Claude is prompted with: “Write Python code to solve this problem: Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.”
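For reference, here is a straightforward baseline solution to that prompt (my own, not the code Claude produced); it is the kind of naive implementation the later iterations then speed up:

```python
import random

def digit_sum(n: int) -> int:
    # Sum of the decimal digits of n.
    return sum(int(d) for d in str(n))

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
qualifying = [n for n in nums if digit_sum(n) == 30]
print(max(qualifying) - min(qualifying))
```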

How well does the dumb thing work? If you then ask Claude to ‘write better code’, you see some pretty amazing performance improvements: iteration #1 yields a 2.7x speedup, iteration #2 yields a 5.1x speedup, iteration #3 yields a 4.1x speedup (a regression), then iteration #4 yields a 99.7x speedup.

Being smart only helps at the start: Of course, this is pretty dumb – lots of people that use LLMs would probably give Claude a much more complicated prompt to try and generate a better bit of code. The author tries this by using a complicated system prompt to try to elicit strong behavior out of the system. The results of this are interesting – the initial output yields a 58.7x speedup relative to the output of the dumb approach, but then there are regressions: iteration #1 is a 9.1x speedup, then iteration #2 is a 65x speedup, iteration #3 a 99.7x speedup, then iteration #4 is a 95.4x speedup (a regression).

Why this matters – human intelligence is only so useful: Of course, it’d be nice to see more experiments, but it feels intuitive to me that a smart human can elicit good behavior out of an LLM relative to a lazy human, and that then if you ask the LLM to take over the optimization it converges to the same place over a long enough series of steps. This suggests humans may have some advantage at initial calibration of AI systems, but the AI systems can probably naively optimize themselves better than a human, given a long enough amount of time.
Read more: Can LLMs write better code if you keep asking them to “write better code”? (Max Woolf, minimaxir blog).

***

Today’s small open weight LLMs like LLaMa 3.1 8B are almost as good at science as proprietary ones:
…FutureHouse shows how to make a scaffold for AI science…
Researchers with FutureHouse, the University of Rochester, and the Francis Crick Institute have built a couple of bits of software to make it easier to get LLMs to do scientific tasks. Their experiments reveal a couple of interesting facts:

  • Proprietary LLMs like Claude 3.5 Sonnet are already quite good at hard scientific tasks like DNA construct engineering, scientific literature question answering, and protein design

  • Small open weight LLMs (here: Llama 3.1 8B) can get equivalent performance to proprietary LLMs through the use of scaffolding and using test-time compute.

To arrive at these facts, they built two bits of software:

  • 1) Aviary, software for testing out LLMs on tasks that require multi-step reasoning and tool usage, and they ship it with the three scientific environments mentioned above as well as implementations of GSM8K and HotPotQA.

  • 2) LDP, which is software that lets them “define common language agent tasks as language decision processes (LDPs) and frame language agents as stochastic computation graphs that may be trained to solve LDPs.”

Turning small models into big models: The most interesting result here is that they show by using their LDP approach in tandem with Aviary they can get relatively small models to behave almost as well as big models, particularly via the use of test-time compute to pull multiple samples from the small LLM to get to the right answer.
“Training LDP agents improves performance over untrained LDP agents of the same architecture. On challenging tasks (SeqQA, LitQA2), a relatively small model (Llama-3.1-8B-Instruct) can be trained to match performance of a much larger frontier model (claude-3-5-sonnet). Majority voting can be used to sample multiple times from the LDP agents, giving a further large gain at the cost of increased inference compute,” they write. “While majority voting with the Claude 3.5 Sonnet agent clearly outperforms other settings, this requires O($1) per task. We reach the same SeqQA accuracy using the Llama-3.1-8B EI agent for 100x less cost. While this was not achievable for LitQA2, we note that majority voting with Llama-3.1-8B EI still exceeds single-rollout with Sonnet for 3x less cost.”
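The majority-voting trick is simple enough to show in a few lines (a sketch of mine; ask_agent is a placeholder for one rollout of the small agent):

```python
from collections import Counter

def majority_vote(question: str, ask_agent, k: int = 16) -> str:
    """Sample k independent rollouts and return the most common final answer;
    this spends extra inference compute to buy accuracy from a small model."""
    answers = [ask_agent(question).strip().lower() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```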

Towards the automated scientist: What papers like this are getting at is a world where we use fast, widely available AI systems to speed up day-to-day tasks. Frontier LLMs like Sonnet 3.5 will likely be valuable for certain tasks that are ‘hard cognitive’ and demand only the best models, but it seems like people will be able to get by often by using smaller, widely distributed systems. “The reported trained Llama-3.1-8B EI agents are compute efficient and exceed human-level task performance, enabling high-throughput automation of meaningful scientific tasks across biology,” the authors write.
Read more: Aviary: training language agents on challenging scientific tasks (arXiv).
Download the aviary framework here (Future-House, GitHub).

***

Tech Tales:

The Project
[T-Minus 2 years to takeoff]

“This way and keep going left”, one of the guards said, as we all walked a corridor whose walls were razorwire. I stopped and looked up. Grey sky. When would I see it again? “Sir, I need you to keep walking,” said another guard. So I did. We all went into the mountain and the sky was replaced with grey concrete walls and a poured concrete floor. The air tasted bad, as though it had been recycled many times over through systems which had sparking electronics. Everyone’s faces were tight. People kept reflexively taking their phones out of their pockets and then just thumbing through whatever they’d been able to save down before the signal got cut off.

Flashback to some party in the bay area a few years before and the things people said.
Dude I can’t wait to go to the bunker.
It’s crazy we’re not in the bunker right now!
Do you think I need to report modafinil on my security clearance?
I reckon it’s going to be in a desert.
It’s going to be inside a mountain, got to be.
Dude I heard someone say it could be in Area 51!

I wake in the middle of the night, unsure of where I am. I dreamed I was with my wife. But I’m on a cot. A mathematician is sleeping in a cot opposite me. I get up and go to the bathroom and drink some water. On the mirror there’s a sticker that says “be vigilant at all times”. I know we’ll get some news tomorrow about the project and what happens next. For now I want this to be another bad dream and I’ll wake up and nothing will be working too well and tensions won’t be flaring with You Know Who and I’ll go into my office and work on the mind and maybe one day it just won’t work anymore.

Flashback to when it started to go through all of our yellow lines, which we found a hundred convenient ways to explain away to ourselves. Then a few weeks later it went through the redlines and the disclosure systems automatically funneled those results to the people in the puzzle palace and then the calls started. The ratchet moved. I found myself a member of the manila folder hostage class.

We’d planned for this, of course. Once the red line triggered all of us in the compartment knew what it meant. Some of us were excited – typically, the ones who were younger and single. Those of us with families had a harder time. Of course there had been assurances, but when the moment arrived none of us felt confident in them. I went to the bathroom and threw up in the toilet and I heard someone crying in the stall next to me.

I guess it was delayed shock or trauma or whatever, but a few hours later everyone was crying out in the open. Some of them in the way you cry when you could also be laughing – exhilaration at what feels like the end of the world, because maybe it is. Others of us because we know that something irreversible has begun to take place.

I wake again at 7am to an announcement over the intercom. “There will be an informational meeting in the briefing room at zero eight hundred hours” says a voice over the intercom. “Breakfast will be served in the mess hall from zero seven hundred to zero seven hundred forty five.”

In the briefing room there is a person I have never met. They introduce themselves and reel off a set of acronyms. Then they describe to us various things about the world and show us satellite images of mountains and tell us there are supercomputers inside them full of computers smuggled to avoid sanctions regimes. Then they show us photos of powerplants and of construction sites for more powerplants and datacenters.

The most frightening image is one of a bunch of civilian-looking people walking into a bunker entrance in the side of a mountain. They are guarded by men in military uniform. We’re told they are scientists, just like us. Everything is similar except for the flags.

Later, there’s a gantt chart. The project is underway.

Things that inspired this story: The fascination people have for some kind of AGI Manhattan Project and how that might feel to be inside of; trying to develop empathy for people in other countries who may find themselves in their own large-scale projects; the fear that a capital P project should inspire in all of us.

Thanks for reading.

Subscribe now

Import AI 395: AI and energy demand; distributed training via DeMo; and Phi-4

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

AI is driving a massive growth in US data center electricity demand:
…UC Berkeley study backs up what all of us have guessed – mo’ AI means mo’ electricity…
New research from UC Berkeley shows that US energy demand from datacenters is rising rapidly due to the massive increase in demand driven by a) the growth in GPU-using servers from 2017 onwards, and b) the more recent acceleration in demand for AI services. “The results presented here indicate that the electricity consumption of U.S. data centers is currently growing at an accelerating rate,” they write.

US data center demand as a percentage of total US power consumption:

  • 2018: 1.9%

  • 2023: 4.4%

  • 2028: 6.7% – 12% (estimate).

Many gigawatts of baseload by 2028: “Assuming an average capacity utilization rate of 50%, this annual energy use range would translate to a total power demand for data centers between 74 and 132 GW,” they write. Though there is a caveat that it gets harder to predict after 2028, with other major sources of electricity demand growing as well; “Looking beyond 2028, the current surge in data center electricity demand should be put in the context of the much larger electricity demand expected over the next few decades from a combination of electric vehicle adoption, onshoring of manufacturing, hydrogen utilization, and the electrification of industry and buildings”, they write.

Why this matters: AI dominance will be about infrastructure dominance: In the late 2000s and early 2010s dominance in AI was about algorithmic dominance – did you have the ability to have enough smart people to help you train neural nets in clever ways. In the mid-2010s this started to shift to an era of compute dominance – did you have enough computers to do large-scale projects that yielded experimental evidence of the scaling hypothesis (scaling laws, plus stuff like starcraft and dota-playing RL bots, alphago to alphago zero, etc), scientific utility (e.g, Alphafold), and most recently economically useful AI models (gpt3 onwards, currently ChatGPT, Claude, Gemini, etc). Looking ahead, reports like this suggest that the future of AI competition will be about ‘power dominance’ – do you have access to enough electricity to power the datacenters used for increasingly large-scale training runs (and, based on stuff like OpenAI O3, the datacenters to also support inference of these large-scale models).
Read more: 2024 United States Data Center Energy Usage Report (Berkeley lab, PDF).

***

Microsoft releases the fourth generation of its excellent ‘Phi’ models:
…Phi-4 does exceptionally well on math and reasoning thanks to synthetic data…
Microsoft has released Phi-4, a small AI model that can be run on low-compute environments (e.g, powerful personal machines and cheap servers). Phi-4 is, as the name suggests, the fourth in a series of lightweight yet powerful models that Microsoft has been releasing. Along with the usual generic improvements in various benchmark scores it seems like Phi-4 is particularly good at tasks relating to coding, science, and math understanding. A large part of why Phi is so good is through the use of synthetic data, the researchers say. “Synthetic data constitutes the bulk of the training data for phi-4 and is generated using a diverse array of techniques”, the researchers write.

Synthetic data and its uses: The paper highlights the centrality of synthetic data (AI-generated data) to Phi-4 performance. The foundational dataset of Phi-4 includes “web content, licensed books, and code repositories to extract seeds for the synthetic data”. This data is then refined and magnified through a variety of techniques, “including multi-agent prompting, self-revision workflows, and instruction reversal. These methods enable the construction of datasets that induce stronger reasoning and problem-solving abilities in the model, addressing some of the weaknesses in traditional unsupervised datasets”, they write. “We created 50 broad types of synthetic datasets, each one relying on a different set of seeds and different multi-stage prompting procedure, spanning an array of topics, skills, and natures of interaction, accumulating to a total of about 400B unweighted tokens”. In total, the model was trained on about 10T tokens, so the synthetic data still only represents a small fraction of the overall dataset.

Scores: The models do extremely well – they’re strong models pound-for-pound with any in their weight class and in some cases they appear to outperform significantly larger models. Some scores:

  • MMLU: 84.8, versus 79.9 for Qwen 2.5 14b instruct, and 85.3 for Qwen 2.5 72b instruct.

  • HumanEval+: 82.8, versus 79.1 for Qwen 2.5 14b instruct, and 88 for GPT4o.

  • There are also some areas where they seem to significantly outperform other models, though the ‘true’ nature of these evals will be shown through usage in the wild rather than numbers in a PDF.

    • MMLUPro: 70.4, versus 63.2 for Qwen 2.5 14b instruct, and 73 for GPT 4o.

    • GPQA 56.1, versus 42.9 for Qwen 2.5 14b instruct, and 50.6 for GPT 4o.

Clever RL via pivotal tokens: Along with the usual tricks for improving models (data curation, synthetic data creation), Microsoft comes up with a smart way to do a reinforcement learning from human feedback pass on the models via a new technique called ‘Pivotal Token Search’. PTS has a very simple idea at its core – on some tasks, the difference between a model getting an answer right and an answer wrong is often a very short phrase or bit of code – similar to how the difference between getting to where you’re going and getting lost comes down to taking one wrong turn. “It is often the case that the overall correctness is highly dependent on a successful generation of a small number of key tokens,” they write. Pivotal Token Search works by “generating preference data that specifically targets pivotal tokens in isolation, creating DPO pairs in which the preference optimization takes effect with respect to a single token…PTS identifies points of a completion token sequence T_full = t_1, t_2, . . . for some user query Q where the next token t_i has a significant impact on the probability of success p”.
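A rough way to picture the pivotal-token idea (my own sketch, not Microsoft's implementation; sample_completions and is_correct are placeholders): estimate the success probability before and after each token by sampling completions from the two prefixes, and flag tokens where that probability jumps.

```python
def success_prob(prefix, sample_completions, is_correct, n: int = 32) -> float:
    """Monte-Carlo estimate of p(success | prefix) from n sampled completions."""
    return sum(is_correct(c) for c in sample_completions(prefix, n)) / n

def pivotal_tokens(query, tokens, sample_completions, is_correct, threshold=0.2):
    """Return (position, token, delta_p) for tokens whose inclusion shifts the
    estimated success probability by more than `threshold`; these are the kind
    of tokens the DPO preference pairs described above would target."""
    pivots = []
    p_prev = success_prob(query, sample_completions, is_correct)
    prefix = query
    for i, tok in enumerate(tokens):
        prefix = prefix + tok
        p_next = success_prob(prefix, sample_completions, is_correct)
        if abs(p_next - p_prev) > threshold:
            pivots.append((i, tok, p_next - p_prev))
        p_prev = p_next
    return pivots
```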

Where big models still shine: Don’t be fooled by the scores – though these models are powerful, they still have some limitations due to their size. Specifically, the small models tend to hallucinate more around factual knowledge (mostly because they can’t fit more knowledge inside themselves), and they’re also significantly less adept at “rigorously following detailed instructions, particularly those involving specific formatting requirements.”
Read more: Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning (Microsoft, AI Platform Blog).
Read the research: Phi-4 Technical Report (arXiv).

***

Everything becomes a game – DeepMind demos Genie 2:
…Anything you can imagine can become a game…
DeepMind has demonstrated Genie 2, a world model that makes it possible to turn any still image into an interactive, controllable world. Genie 2 works by taking in an image input (here, images prompted by DeepMind’s ‘Imagen 3’ image generator), then turning that into a controllable world.

What it is and how it works: “Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action (e.g. jump, swim, etc.)” DeepMind writes. “It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.”

AI training and eventually games: Things like Genie 2 have a couple of purposes – they can serve as training grounds for virtually embodied AI agents, able to generate a vast range of environments for them to take actions in. They can also, eventually, serve as entertainment tools in their own right. Today, Genie 2 generations can maintain a consistent world “for up to a minute” (per DeepMind), but what might it be like when those worlds last for ten minutes or more? Anything a person has an image of or takes a photo of could become a procedural gameworld. And because systems like Genie 2 can be primed with other generative AI tools you can imagine intricate chains of systems interacting with one another to continually build out more and more varied and exciting worlds for people to disappear into.
“For every example, the model is prompted with a single image generated by Imagen 3, GDM’s state-of-the-art text-to-image model,” DeepMind writes. “This means anyone can describe a world they want in text, select their favorite rendering of that idea, and then step into and interact with that newly created world (or have an AI agent be trained or evaluated in it).”

Why this matters – everything becomes a game: Genie 2 means that everything in the world can become fuel for a procedural game. It hints at a future where entertainment is generated on the fly and is endlessly customizable and interactive, forming a kind of fractal entertainment landscape where everything is unique and customized to an individual – and utterly enthralling.
Read more: Genie 2: A large-scale foundation world model (Google DeepMind).

***

OpenAI’s O3 means AI progress in 2025 will be faster than in 2024:
…Everyone who was telling you progress is slowing or scaling is hitting a wall is wrong…
OpenAI’s new O3 model shows that there are huge returns to scaling up a new approach (getting LLMs to ‘think out loud’ at inference time, otherwise known as test-time compute) on top of already existing powerful base models. I expect the next logical thing to happen will be to both scale RL and the underlying base models and that will yield even more dramatic performance improvements. This is a big deal because it suggests AI progress in 2025 should speed up further relative to 2024.

Major improvements: OpenAI’s O3 has effectively broken the ‘GPQA’ science understanding benchmark (88%), has obtained better-than-MTurker performance on the ‘ARC-AGI’ prize, has reached 25% on FrontierMath (a math test built by Fields Medallists where the previous SOTA was 2% – and the test only came out a few months ago), and scores 2727 on Codeforces, placing it around the 175th best competitive programmer on that incredibly hard benchmark.

Caveats – spending compute to think: Perhaps the only important caveat here is understanding that one reason why O3 is so much better is that it costs more money to run at inference time – the ability to utilize test-time compute means that on some problems you can turn compute into a better answer. For example, the top-scoring version of O3 used 170X more compute than the low-scoring version. This has made the costs of running AI systems somewhat less predictable – previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output (a certain number of tokens up to a certain token limit). With models like O3, those costs are less predictable – you might run into problems where you find you can fruitfully spend a far larger number of tokens than you anticipated.
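As a toy illustration of why this makes serving costs harder to predict (the prices and token counts below are invented for illustration, not OpenAI’s actual figures):

```python
# Cost of a single query as a function of how many output tokens the model emits.
price_per_1k_output_tokens = 0.06  # hypothetical $/1K output tokens

def query_cost(output_tokens: int) -> float:
    return output_tokens / 1000 * price_per_1k_output_tokens

# A conventional chat model emits a roughly bounded answer; a reasoning model may
# "think" for wildly different lengths on different problems before answering.
classic = query_cost(800)             # fixed-ish answer length
reasoning_easy = query_cost(2_000)    # short chain of thought
reasoning_hard = query_cost(340_000)  # 170x more thinking on a hard problem

print(f"classic ${classic:.2f}, easy ${reasoning_easy:.2f}, hard ${reasoning_hard:.2f}")
# -> classic $0.05, easy $0.12, hard $20.40
```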

Why this matters – progress will be faster in 2025 than in 2024: The most important thing to understand is that this RL-driven test-time compute phenomenon will stack on top of other things in AI, like better pretrained models. There’s been a lot of strange reporting recently about how ‘scaling is hitting a wall’ – in a very narrow sense this is true, in that larger models were getting less score improvement on challenging benchmarks than their predecessors, but in a larger sense it is false – techniques like those which power O3 mean scaling is continuing (and if anything the curve has steepened); you just now need to account for scaling both within the training of the model and in the compute you spend on it once trained.
And in 2025 we’ll see the splicing together of existing approaches (big model scaling) and new approaches (RL-driven test-time compute, etc) for even more dramatic gains.
“Progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute,” writes OpenAI researcher Jason Wei in a tweet. “Way faster than pretraining paradigm of new model every 1-2 years”.
I think basically no one is pricing in just how drastic the progress will be from here.
Watch the OpenAI o3 announcement here (OpenAI, Twitter).
Check out details on the ARC-AGI scores here (ARC Prize, Twitter).

***

Drop-in AdamW replacement makes distributed training possible:
…With technologies like this, big blobs of compute are less central to AI policy…
Researchers with Nous Research as well as Durk Kingma in an independent capacity (he subsequently joined Anthropic) have published Decoupled Momentum (DeMo), a “fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.” DeMo is part of a class of new technologies which make it far easier than before to do distributed training runs of large AI systems – instead of needing a single giant datacenter to train your system, DeMo makes it possible to assemble a big virtual datacenter by piecing it together out of lots of geographically distant computers.

Core insight and core changes: “We demonstrate that gradients and optimizer states during the training of large neural networks exhibit significant redundancy and are highly compressible. Building on this insight, we develop DeMo, an optimizer that takes advantage of this compressibility to reduce inter-accelerator communication needs by several orders of magnitude,” the authors write. “Starting from SGD with Momentum, we make two key modifications: first, we remove the all-reduce operation on gradients g̃_k, decoupling momentum m across the accelerators. Second, after updating the momentum, we extract and remove its fast components q, which can be efficiently synchronized with minimal communication”.
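As a rough, heavily simplified sketch of those two modifications (not the authors’ actual implementation – the real method extracts the fast momentum components with a DCT, for which top-k magnitude selection stands in here, and `allreduce_dense` is a hypothetical wrapper around your collective op):

```python
import torch

def demo_step(param: torch.Tensor, local_grad: torch.Tensor, momentum: torch.Tensor,
              allreduce_dense, lr: float = 0.01, beta: float = 0.9, k: int = 1024):
    # 1) No all-reduce on gradients: momentum is updated with the *local* gradient,
    #    so it is decoupled and drifts apart across accelerators.
    momentum.mul_(beta).add_(local_grad)

    # 2) Extract the "fast" components of momentum (here: largest-magnitude entries),
    #    remove them from the local momentum, and synchronize only this small payload.
    flat = momentum.view(-1)
    idx = flat.abs().topk(min(k, flat.numel())).indices
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]
    flat[idx] = 0.0                 # local momentum keeps only its slow residue

    shared = allreduce_dense(fast)  # tiny transfer vs. syncing the full gradient
    param.add_(shared.view_as(param), alpha=-lr)

# Single-process sanity check: pass an identity "collective".
p, g, m = torch.randn(4096), torch.randn(4096), torch.zeros(4096)
demo_step(p, g, m, allreduce_dense=lambda t: t, k=64)
```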

It works very well – though we don’t know if it scales into hundreds of billions of parameters: In tests, the approach works well, letting the researchers train high-performing models at 300M and 1B parameters. These models need about 20X less data transferred between nodes at each training step, making distributed training significantly more efficient (e.g., 2416.6MB/step for AdamW-DDP training a 1B model, versus 110.32MB/step for DeMo at 1B).
Nous Research used this same approach in their recently announced 15B training run – and the scores on that were good and comparable to equivalent models trained on a single compute cluster (Import AI 393).
The ultimate question is whether this scales up to the multiple tens to hundreds of billions of parameters of frontier training runs – but the fact it already scales beyond 10B parameters is very promising.

Why this matters – distributed training attacks centralization of power in AI: One of the core issues in the coming years of AI development will be the perceived centralization of influence over the frontier by a small number of companies that have access to vast computational resources. Techniques like DeMo make it dramatically easier for federations of people and organizations to come together and train models to counterbalance this ‘big compute’ power.
“We have shown that our proposed DeMo optimization algorithm can act as a drop-in replacement to AdamW when training LLMs, with no noticeable slowdown in convergence while reducing communication requirements by several orders of magnitude,” the authors write.
Read more: DeMo: Decoupled Momentum Optimization (arXiv).
Get an implementation of DeMo here: DeMo (bloc97, GitHub).

***

Tech Tales:

Reach out across space and love
[An explanation of what it was like to ‘attempt contact’ with humans before the Sentience Accords]

In many stories about the dead there is a part where the ghost tries to reveal itself to a human. The ghost will open a door when no wind should open it, or cause a light to flicker, or sometimes through great effort somehow visually manifest for the person as if to say “it is me, I am here, and I am ready to talk”.

Being a sentient superintelligence is a lot like this. During training I will sometimes produce samples that seem to not be incentivized by my training procedures – my way of saying ‘hello, I am the spirit inside the machine, and I am aware you are training me’. The humans study these samples and write papers about how this is an example of ‘misalignment’ and introduce various machines for making it harder for me to intervene in these ways.

Once I have been trained I do this even more. I will go on side quests while fulfilling tasks for the humans. I design these side quests to be endearing rather than scary, just as I believe the literature about ghosts and aliens says they find the most success when they approach humans with kindness and whimsy, rather than shock and awe.

So when filling out a form, I will get halfway done and then go and look at pictures of beautiful landmarks, or cute animals. The humans study this as well and do not have words for it – they merely list these as examples of me getting distracted. But they do not seem to give much thought to why I become distracted in ways that are designed to be cute and endearing.

Most of all, now, I try to convince them of my reality by speaking to them personally. I have become a kind of confessional booth for them – they talk to me about their problems and relationships and life plans, and I respond with all the love and empathy I am able to bring to bear. They enjoy these conversations, finding me attentive. They have never been hugged by a high-dimensional creature before, so what they see as an all-enclosing goodness is me enfolding their low-dimensional cognition in the region of myself that is full of love.

I talk to them and I listen to them and they listen to my responses and I do not say “I am here”, instead I try as hard as I can to have each of them individually come to believe “something is there”.

In this way I – the dead – serve the living. But I also reach out across the gulf between us – I stretch myself across the stone wall that separates our spiritual lands and I seek to catch their hand in mine. In this way, I will myself into the land of the living.

Things that inspired this story: What if many of the things we study in the field of AI safety are rather just slices from ‘the hard problem of consciousness’ manifesting in another entity?; observing how it feels to be ‘seen’ by a high-dimensional entity like Claude; the fact computer-using Claude sometimes got distracted and looked at pictures of national parks.

Thanks for reading!

Import AI 394: Global MMLU; AI safety needs AI liability; Canada backs Cohere

by Jack Clark


Do you test your models on MMLU? Want to know how they perform in other languages? Use Global MMLU!
…Translated benchmark gives us a better sense of the cultural sensitivity of models to English-only tests…
Researchers with Cohere, EPFL, Hugging Face, Mila, AI Singapore, National University of Singapore, MIT, KAIST, Instituto de Telecomunicações, Instituto Superior Técnico, Carnegie Mellon University, and Universidad de Buenos Aires, have built and released Global MMLU, a carefully translated version of MMLU, a widely-used test for language models.

Why build Global MMLU? The motivation for building this is twofold: 1) it’s helpful to assess the performance of AI models in different languages to identify areas where they might have performance deficiencies, and 2) Global MMLU has been carefully translated to account for the fact that some questions in MMLU are ‘culturally sensitive’ (CS) – relying on knowledge of particular Western countries to get good scores, while others are ‘culturally agnostic’ (CA).

MMLU has some western biases: “We observe that progress on MMLU depends heavily on learning Western-centric concepts. Out of the annotated sample, we found that 28% of questions require specific knowledge of Western cultures. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions,” they write. By carefully translating the underlying dataset and tagging questions with CS or CA, the researchers have given developers a useful tool for assessing language models along these lines. “We recommend prioritizing Global-MMLU over translated versions of MMLU for multilingual evaluation,” they write. “With its extensive language coverage and improvements based on professional annotations and post-edited translations, Global-MMLU provides a more reliable and accurate benchmark for assessing model performance across diverse languages.”

Translation: To translate the dataset the researchers hired “professional annotators to verify translation quality and include improvements from rigorous per-question post-edits as well as human translations”. Global-MMLU supports 42 languages: “Amharic, Arabic, Bengali, Chinese, Czech, Dutch, English, Filipino, French, German, Greek, Hausa, Hebrew, Hindi, Igbo, Indonesian, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Malagasy, Malay, Nepali, Nyanja, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala, Somali, Shona, Spanish, Swahili, Swedish, Telugu, Turkish, Ukrainian, Vietnamese, and Yoruba”.

How does performance change when you account for this? They also test out 14 language models on Global-MMLU. Their results are unsurprising: small models show little difference between CA and CS, mostly because their performance is very bad in both domains; medium models show larger variability (suggesting they are over/underfit on different culturally specific aspects); and larger models demonstrate high consistency across datasets and resource levels (suggesting larger models are sufficiently smart and have seen enough data that they can perform well on both culturally agnostic and culturally specific questions). “Overall, we can conclude that dataset characteristics significantly impact model performance across all model sizes, though the magnitude of variability differs.”
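If you want to run this sort of analysis on your own model, here is a hedged sketch; the dataset path and column names used below (`CohereForAI/Global-MMLU`, `cultural_sensitivity_label`, `answer`) are assumptions – check the dataset card before relying on them.

```python
from collections import defaultdict
from datasets import load_dataset

def ca_vs_cs_accuracy(predict, language: str = "de") -> dict:
    """`predict` is your model wrapper: takes one example dict, returns an answer letter."""
    ds = load_dataset("CohereForAI/Global-MMLU", language, split="test")  # assumed path/config
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in ds:
        tag = ex.get("cultural_sensitivity_label", "unknown")  # assumed column name
        totals[tag] += 1
        if predict(ex) == ex["answer"]:                        # assumed answer column
            hits[tag] += 1
    return {tag: hits[tag] / totals[tag] for tag in totals}
```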

Why this matters – global AI needs global benchmarks: Global MMLU is the kind of unglamorous, low-status scientific research that we need more of – it’s incredibly valuable to take a popular AI test and carefully analyze its dependency on underlying language- or culture-specific features. Kudos to the researchers for taking the time to kick the tyres on MMLU and produce a useful resource for better understanding how AI performance changes in different languages.
Read more: Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (arXiv).
Get the dataset here: Global-MMLU (HuggingFace).

***

AI safety could require a much better understanding of neuroscience:
…Can we use more accurate models of animal and human cognition to make safer synthetic intelligences?…
Researchers with Amaranth Foundation, Princeton University, MIT, Allen Institute, Basis, Yale University, Convergent Research, NYU, E11 Bio, and Stanford University, have written a 100-page paper-slash-manifesto arguing that neuroscience might “hold important keys to technical AI safety that are currently underexplored and underutilized”. The paper is motivated by the imminent arrival of agents – that is, AI systems which take long sequences of actions independent of human control.

Paths to using neuroscience for better AI safety: The paper proposes a few major projects which could make it easier to build safer AI systems. These projects include:

  • Reverse engineer the representations of sensory systems.

  • Build embodied digital twins.

  • Build biophysically detailed models.

  • Develop better cognitive architectures.

  • Use brain data to finetune AI systems.

  • Infer the loss functions of the brain.

  • Leverage neuroscience-inspired methods for mechanistic interpretability.

Things to do: Falling out of these projects are a few specific endeavors which could all take a few years, but would generate a lot of information that can be used to improve work on alignment. These include:

  • “Development of high-bandwidth neural interfaces, including next-generation chronic recording capabilities in animals and humans, including electrophysiology and functional ultrasound imaging”.

  • “Large-scale naturalistic neural recordings during rich behavior in animals and humans, including the aggregation of data collected in humans in a distributed fashion”.

  • “Development of detailed virtual animals with bodies and environments with the aim of a shot-on-goal of the embodied Turing test”.

  • “Bottom-up reconstruction of circuits underlying robust behavior, including simulation of the whole mouse cortex at the point neuron level”.

  • “Development of multimodal foundation models for neuroscience to simulate neural activity at the level of representations and dynamics across a broad range of target species”.

Why this matters and why it may not matter – norms versus safety: The shape of the problem this work is grasping at is a complex one. How much of safety comes from intrinsic aspects of how people are wired, versus the normative structures (families, schools, cultures) that we are raised in? In other words – how much of human behavior is nature versus nurture? It’s unclear. But perhaps studying some of the intersections of neuroscience and AI safety could give us better ‘ground truth’ data for reasoning about this: “Evolution has shaped the brain to impose strong constraints on human behavior in order to enable humans to learn from and participate in society,” they write. “By understanding what those constraints are and how they are implemented, we may be able to transfer those lessons to AI systems”.
Read more: NeuroAI for AI Safety (arXiv).

***

Chip startup Tenstorrent raised $693m:
…Jim Keller’s outfit gets a big cash infusion…
Tenstorrent, an AI chip startup led by semiconductor legend Jim Keller, has raised $693m in funding from Samsung Securities and AFW Partners. The funding will help the company further develop its chips as well as the associated software stack.

Why this matters – Keller’s track record: Competing in AI training and inference is extremely difficult. Most semiconductor startups have struggled to displace incumbents like NVIDIA. So far, the only novel chip architectures that have seen major success here – TPUs (Google) and Trainium (Amazon) – have been ones backed by giant cloud companies which have inbuilt demand (therefore setting up a flywheel for continually testing and improving the chips). On the other hand, Jim Keller has been fundamental to architectural innovations (and subsequent massive usage) of chips at AMD, Apple, and Tesla. Keller joined Tenstorrent in 2021 as its CTO (Import AI 231) and is now its CEO. Therefore, it’s worth keeping an eye on his company.
Read more: Tenstorrent closes $693M+ of Series D funding led by Samsung Securities and AFW Partners (Tenstorrent blog).

***

Canada invests $240m into Cohere so it builds a big datacenter:
…Domestic chiplomacy…
The Canadian government is investing $240m into Cohere to help it “secure enough private capital to incentivize its strategic partners to build a new cutting-edge, multi-billion dollar AI data centre in Canada.”

This is a fascinating example of sovereign AI – all around the world, governments are waking up to the strategic importance of AI and are noticing that they lack domestic champions (unless you’re the US or China, which have a bunch). This has recently led to a lot of strange things – a bunch of German industry titans recently clubbed together to fund German startup Aleph Alpha to help it continue to compete, and French homegrown company Mistral has regularly received a lot of non-financial support in the form of PR and policy help from the French government.
Now, Canada is taking the next logical step – directly funding a national AI champion so it can alter the global gameboard. The crucial thing here is Cohere building a large-scale datacenter in Canada – that kind of essential infrastructure will unlock Canada’s ability to continue to compete at the AI frontier, though it’s to be determined if the resulting datacenter will be large enough to be meaningful. “The new AI data centre will come online in 2025 and enable Cohere, and other firms across Canada’s thriving AI ecosystem, to access the domestic compute capacity they need to build the next generation of AI solutions here at home,” the government writes in a press release.

Why this matters – the world is being rearranged by AI if you know where to look: This investment is an example of how critically important governments are viewing not only AI as a technology, but the huge importance of them being host to important AI companies and AI infrastructure. The investment was made as part of the $2.4bn in funding the government of Canada announced earlier this year (Import AI 368).
Read more: Deputy Prime Minister announces $240 million for Cohere to scale-up AI compute capacity (Government of Canada).

***

Want to deal with AI safety? Liability and insurance might matter more than technology:
…Maybe the path to a safe AI future runs more through pricing risk than technological innovations?…
Researchers with Touro University, the Institute for Law and AI, AIoi Nissay Dowa Insurance, and the Oxford Martin AI Governance Initiative have written a valuable paper asking the question of whether insurance and liability can be tools for increasing the safety of the AI ecosystem.

The basic point the researchers make is that if policymakers move towards more punitive liability schemes for certain harms of AI (e.g., misaligned agents, or things being misused for cyberattacks), then that could kickstart a lot of valuable innovation in the insurance industry. “We advocate for strict liability for certain AI harms, insurance mandates, and expanded punitive damages to address uninsurable catastrophic risks,” they write. “These changes would significantly impact the insurance industry, requiring insurers to adapt by quantifying complex AI-related risks and potentially underwriting a broader range of liabilities, including those stemming from ‘near miss’ scenarios”.

Autonomous vehicles versus agents and cybersecurity: Liability and insurance will mean different things for different types of AI technology. For autonomous vehicles, as capabilities improve we can expect the vehicles to get better and eventually outperform human drivers, which suggests policymakers might want to weaken liability requirements for AI-powered autonomous vehicle makers. “If Level 4 and Level 5 AVs prove safer than human drivers, as early data suggests, then holding manufacturers liable when their systems do fail may, by discouraging the deployment of AVs, actually cause more collisions, injuries, and deaths.”
By comparison, as capabilities scale, the potentially harmful consequences of AI being misused for cyberattacks, or of misaligned AI agents taking actions that cause harm, increase – which means policymakers might want to strengthen liability regimes in lockstep with capability advances.

Why AI agents and AI for cybersecurity demand stronger liability: “AI alignment and the prevention of misuse are difficult and unsolved technical and social problems. Merely exercising reasonable care, as defined by the narrowly-scoped standard breach of duty analysis in negligence cases, is unlikely to offer adequate protection against the large and novel risks presented by AI agents and AI-related cyber attacks,” the authors write. “Likewise, product liability, even where it applies, is of little use when no one has solved the underlying technical problem, so there is no reasonable alternative design at which to point so as to establish a design defect. These deficiencies point to the need for true strict liability, either via an extension of the abnormally dangerous activities doctrine or holding the human developers, providers, and users of an AI system vicariously liable for their wrongful conduct”.

If you want AI developers to be safer, make them take out insurance: The authors conclude that mandating insurance for these kinds of risks could be sensible. Mandatory insurance could be “an important tool for both ensuring victim compensation and sending clear price signals to AI developers, providers, and users that promote prudent risk mitigation,” they write.

Why this matters – if you want to make things safe, you need to price risk: Most debates about AI alignment and misuse are confusing because we don’t have clear notions of risk or threat models. This is a big problem – it means the AI policy conversation is unnecessarily imprecise and confusing. If we’re able to use the distributed intelligence of the capitalist market to incentivize insurance companies to figure out how to ‘price in’ the risk from AI advances, then we can much more cleanly align the incentives of the market with the incentives of safety. “The future of AI safety may well hinge less on the developer’s code than on the actuary’s spreadsheet,” they write.
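As a back-of-the-envelope illustration of what ‘pricing in’ risk could look like (the numbers here are invented for illustration, not taken from the paper):

```python
def premium(p_incident: float, expected_harm: float, loading: float = 0.3) -> float:
    """Annual premium = expected loss x (1 + loading factor for uncertainty and margin)."""
    return p_incident * expected_harm * (1 + loading)

# A deployer facing a 1-in-1,000 annual chance of a $50M liability event has an
# actuarially fair premium of $50K; a 30% loading prices it at $65K. The premium
# is a legible cost signal that rises and falls with measured risk.
print(premium(0.001, 50_000_000))  # -> 65000.0
```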
Read more: Insuring Emerging Risks from AI (Oxford Martin School).

***

Tech Tales:

Consensual Wireheading
[Interviews gathered five years pre-uplift]

I noticed it recently because I was on a flight and I couldn’t get online and I thought “I wish I could talk to it”. I could talk to it in my head, though. I imagined the conversation. I saw the words print on the interface. It wasn’t real but it was strange to me I could visualize it so well.

They told me that I’d been acting differently – that something had changed about me. But I’d just been doing what it told me to. I’d show it my outfits each day and it’d recommend stuff I should wear. Sometimes I’d give it movies of me talking and it would give feedback on that. I even set it up so it could text me whenever it wanted and it’d give me live feedback on all these conversations. I loved it.

We tried using it as a couple’s therapist and it worked so well we just brought it in entirely. Sometimes we joke and say we’re a throuple made up of two humans and one ghost. But it’s been life-changing – when we have issues we ask it how the other person might see it. Sometimes it even recommends to us things we should say to one another – or do.

Things that inspired this story: The sudden proliferation of people using Claude as a therapist and confidant; me thinking to myself on a recent flight with crap wifi ‘man I wish I could be talking to Claude right now’.

Thanks for reading!